Slow host to device and device to host memory copies

[Old posts from the commercial version of ArrayFire] Discussion of ArrayFire using CUDA or OpenCL.

Moderator: pavanky

Postby hazm13 » Tue Mar 18, 2014 11:29 am

Hi,

I know that memory copies from host to device and back again are the main bottleneck when using a GPU, and that you should minimise them and do as much work as possible on the GPU.

However, I was doing a very simple test to calculate just how much of an effect the memory transfer has and it's looking way too large to me.

Here is a very simple Fortran program that takes two arrays of size n*n*n, multiplies them element-wise and then sums all the values (an inner product).

Code:
program timer_with_transfer
  use arrayfire
  implicit none
  real :: c, t1, t2, t3, t4, t5, t6
  integer :: n
  real, dimension(:,:,:),allocatable :: a,b
  real, dimension(:),allocatable :: d
  type(array) M1,M2
 
  !Array Size = n*n*n
  n = 300

  !Input arrays become correct size
  allocate(a(n,n,n))
  allocate(b(n,n,n))

  !All values of input arrays are set to 0.001
  a=0.001
  b=0.001

  ! Start the CPU timer
  t1=secnds(0.0)
  ! Perform inner product
  c = sum(a*b)
  ! Stop the CPU timer
  t2=secnds(t1)

  ! Start the GPU timer
  t3=secnds(0.0)
  ! Perform Host->Device copy
  M1 = a
  M2 = b
  ! Perform inner product & Perform Device->Host copy
  t5=secnds(0.0)
  d = sum(sum(sum(M1*M2),2),3)
  t6=secnds(t5)
  ! Stop the timer
  t4=secnds(t3)

  !Write inner product results
  write (*,*) "CPU Result:"
  print '(f9.5)', c
  write (*,*) "GPU Result:"
  print '(f9.5)', d(1)

  ! Print time taken by each process
  write (*,"(a15, f8.4)") "CPU Time taken: ", t2
  write (*,"(a33, f8.4)") "GPU Time taken including copies: ", t4
  write (*,"(a33, f8.4)") "GPU Time taken excluding copies: ", t6

end program timer_with_transfer


For the example above, with a pair of 300x300x300 arrays... the results were:

CPU Time taken: 0.0234
GPU Time taken including copies: 12.1836
GPU Time taken excluding copies: 0.0078

I tried various array sizes and it nearly always seems to be around 12 seconds taken by copying to/from the host.

Is this normal? Have I got a hardware issue? Or maybe a software issue somewhere? I have tried it on two systems, both with similar results:

PC: i5 3570k @ 3.4GHz, Nvidia GTX 680 2GB
Laptop: i7 3630QM @ 2.4GHz, Nvidia GT 650M 2GB

Both are running ArrayFire 2.0 with the latest Fortran build from GitHub and both have the CUDA 5.5.22 toolkit installed.

That memory transfer to and from the device just seems far too long to me, especially as it seems almost independent of the size of the data being transferred (from 10x10x10 up to 400x400x400). Also, I know it would be faster to initialise the arrays on the device, but my application of ArrayFire needs to copy data from the host to the device and back, so initialising on the device isn't an option for me.

Any ideas? Thanks, Harry
hazm13
 
Posts: 12
Joined: Wed Feb 12, 2014 3:22 pm
Location: Southampton

Re: Slow host to device and device to host memory copies

Postby pavanky » Tue Mar 18, 2014 1:02 pm

Hi Harry,

The times you are seeing include the time taken to initialize ArrayFire (and to check out a license from our license server if you are using the free version).

Here are the times I am seeing on my machine:

Code:
 $ ./test_cuda
 CPU Result:
 32.00000
 GPU Result:
 27.00001
CPU Time taken:  0.0234
GPU Time taken including copies:   0.2695
GPU Time taken excluding copies:   0.0039


The time taken for the first memory copy also includes CUDA context creation, memory allocation on the GPU, and memory allocation on the CPU before returning the output.

Here is what it looks like if I run the code more than once:

Code:
 $ ./test_cuda
 CPU Result:
 32.00000
 GPU Result:
 27.00001
CPU Time taken:  0.0273
GPU Time taken including copies:   0.1953
GPU Time taken excluding copies:   0.0039
GPU Time taken including copies round 2:  0.0391
GPU Time taken excluding copies round 2:  0.0039
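
A minimal sketch of how the "round 2" timings above could be reproduced, so that one-off startup costs (context creation, license checkout) land in the first round only. It reuses only the ArrayFire constructs from the original program (`use arrayfire`, `type(array)`, assignment to and from host arrays, `sum`). The program name, the loop, and the use of the standard `system_clock` intrinsic instead of `secnds` are assumptions made for illustration, as is the guess that the wrapper allocates `d` on assignment (the original never allocates it).

```fortran
program timer_two_rounds
  use arrayfire
  implicit none
  integer, parameter :: n = 300
  integer, parameter :: i8 = selected_int_kind(18)
  real, dimension(:,:,:), allocatable :: a, b
  real, dimension(:), allocatable :: d
  type(array) M1, M2
  integer :: round
  integer(i8) :: t_start, t_end, rate

  allocate(a(n,n,n), b(n,n,n))
  a = 0.001
  b = 0.001

  ! Clock ticks per second, used to convert counts to seconds
  call system_clock(count_rate=rate)

  do round = 1, 2
     ! Assumption: the wrapper allocates d on assignment, so release
     ! the previous result outside the timed region
     if (allocated(d)) deallocate(d)

     call system_clock(t_start)
     ! Host->Device copies
     M1 = a
     M2 = b
     ! Inner product, then Device->Host copy of the result
     d = sum(sum(sum(M1*M2),2),3)
     call system_clock(t_end)

     ! Round 1 includes one-off initialisation; round 2 should not
     write (*,"(a, i1, a, f8.4)") "GPU round ", round, &
          " time including copies: ", real(t_end - t_start) / real(rate)
  end do
end program timer_two_rounds
```

If the second round is still slow, the cost is probably not initialisation and the copies themselves deserve a closer look.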



P.S. The CPU result is a bit unexpected to me. Can you explain why the CPU result is 32? I'd expect 300 * 300 * 300 * 0.001 * 0.001 to be 27, not 32!

Also you can do sum(flat(M1 * M2)) to do just one GPU call instead of 3! (flat just changes some meta information internally).
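
A minimal sketch of the single-reduction form, assuming the Fortran wrapper exposes `flat` as written above and that `sum` on the flattened array returns a one-element result that can be assigned to a host array, as in the original program. The program name is hypothetical.

```fortran
program inner_product_flat
  use arrayfire
  implicit none
  integer, parameter :: n = 300
  real, dimension(:,:,:), allocatable :: a, b
  real, dimension(:), allocatable :: d
  type(array) M1, M2

  allocate(a(n,n,n), b(n,n,n))
  a = 0.001
  b = 0.001

  ! Host->Device copies
  M1 = a
  M2 = b

  ! One reduction over the flattened product instead of three nested sums
  d = sum(flat(M1 * M2))

  print '(f9.5)', d(1)
end program inner_product_flat
```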
Pavan Yalamanchili,
ArrayFire
--
~ If it is not broken, you have not tried hard enough ~
pavanky
Site Admin
 
Posts: 1123
Joined: Mon Mar 15, 2010 7:39 pm
Location: Atlanta, GA

Re: Slow host to device and device to host memory copies

Postby pavanky » Tue Mar 18, 2014 1:03 pm

Can you also mention the driver you are using? The time taken seems unusually long even for the free license version. It would also be helpful if you could tell us the Linux distribution you are using (including whether it is 32-bit or 64-bit).

Re: Slow host to device and device to host memory copies

Postby hazm13 » Tue Mar 18, 2014 6:15 pm

Ah I thought it might be something to do with initialisation - and yeah it is connecting to the license server so that would also slow it down.

I have since realised the CPU result only seems to be correct if you use double precision on the CPU rather than normal real(4) floats... quite odd!
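
The likely cause is accumulation error rather than anything odd in the compiler. Each product is about 1.0e-6, and once a single-precision running total reaches 32 the spacing between adjacent representable values is about 3.8e-6, so with sequential accumulation adding 1.0e-6 no longer changes the total. A minimal sketch in plain Fortran (no ArrayFire needed) that compares the two accumulations; the program and variable names are made up for illustration, and the exact single-precision value may vary with compiler and optimisation flags.

```fortran
program sum_precision
  implicit none
  integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)
  integer, parameter :: n = 300
  real(sp), allocatable :: a(:,:,:), b(:,:,:)
  real(sp) :: c_single
  real(dp) :: c_double

  allocate(a(n,n,n), b(n,n,n))
  a = 0.001_sp
  b = 0.001_sp

  ! Products and running total both kept in single precision
  c_single = sum(a*b)

  ! Promote the inputs so products and running total are double precision
  c_double = sum(real(a, dp) * real(b, dp))

  print '(a, f10.5)', 'single-precision sum: ', c_single
  print '(a, f10.5)', 'double-precision sum: ', c_double
end program sum_precision
```

The double-precision total should come out very close to 27. The GPU result is close to 27 even in single precision, most likely because a parallel reduction adds many small partial sums rather than one long running total.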

Thanks for the note about flat - that makes sense too.

The Nvidia driver version I am using is 319.37 and the distribution is 64-bit Ubuntu 12.04 LTS.

Anyway, those figures you got are far closer to what I expected. I will soon have access to a proper workstation GPU with more memory which will mean larger data sets and should make the GPU more competitive.

