ArrayFire programming for Optimization

[Old posts from the commercial version of ArrayFire] Discussion of ArrayFire using CUDA or OpenCL.

Moderator: pavanky

ArrayFire programming for Optimization

Postby jepsen » Thu Jan 09, 2014 9:22 am

Hello,

I am currently looking at different ways of automatically generating code for GPU execution, and I came across ArrayFire.

To gain some experience with ArrayFire, I decided to write my own version of the matmul function, a matrix-matrix multiplication, which is shown below.
Code:
  // Each gfor instance k computes column k of C in parallel;
  // the inner loop fills the rows of that column.
  gfor (array k, wA) {
    for (unsigned i = 0; i < wA; i++) {
      // Dot product of row i of A with column k of B
      C(i, k) = sum(A(i, span).T() * B(span, k));
    }
  }


My first question is this: does ArrayFire perform any GPU-specific optimizations on the generated CUDA/OpenCL code, for example by using shared memory or by coalescing memory accesses? If it does, I would like to know which optimizations it performs, and whether there is a particular way to write programs so as to enable as many optimizations as possible.

My second question is about the indices of the gfor loop. It seems that a gfor loop has only a single index available for expressing data parallelism. Take, for example, the N-body simulation I have written below (only the force calculation is shown):
Code:
  gfor (array k, N) {
    // Quantities for body k must stay arrays: they depend on the gfor
    // index, so they cannot be read back to the host with scalar<float>()
    array a_x = P(0, k);
    array a_y = P(1, k);
    array a_m = M(k);
    array f_x = constant(0, 1);
    array f_y = constant(0, 1);
    for (unsigned j = 0; j < N; j++) {
      // Body j is the same for every instance, so host scalars are fine here
      float b_x = P(0, j).scalar<float>();
      float b_y = P(1, j).scalar<float>();
      float b_m = M(j).scalar<float>();

      array r_x = b_x - a_x;
      array r_y = b_y - a_y;

      array d = r_x * r_x + r_y * r_y;
      array mask1 = (k == constant(j, 1));          // true in the instance where k == j
      array deno = sqrt(d * d * d) + mask1.as(f32); // keep the k == j term finite
      deno = (a_m * b_m / deno) * (!mask1).as(f32); // zero out the self-interaction
      f_x += deno * r_x;
      f_y += deno * r_y;
    }
    F(0, k) = f_x;
    F(1, k) = f_y;
  }


The gfor loop uses the array k to define the parallelism. If I were to parallelize the j-loop as well, would it be possible to use both k and j to define the parallelism, for instance with something like a gfor( array(k,j), N, N ) loop? Or is one restricted to 1D parallelization at the moment? I am asking because the masking in my N-body simulation makes use of both the k and j indices, so I need access to both of them.

Best Regards,
Jacob
jepsen
 
Posts: 1
Joined: Wed Jan 08, 2014 7:55 am

Re: ArrayFire programming for Optimization

Postby shehzan » Fri Jan 10, 2014 1:46 pm

Hi Jacob,

Those are great questions.
In general, ArrayFire is designed to give you the best performance possible. Whether it is matmul, image processing, or anything else, we work hard to squeeze every millisecond of performance out of the hardware.
Coming to your matmul example: gfor is one way to write a matrix multiply, but you will not get the same performance as the built-in function. Internally, matmul uses a combination of cuBLAS and our own kernels to get the best performance. You can test this yourself with the timeit function; see the "blas" example in the benchmarks folder of ArrayFire.
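For reference, a minimal benchmark sketch along those lines (the matrix size and wrapper function names here are just for illustration; it assumes the same randu, matmul, gfor, and timeit API used elsewhere in this thread):

Code:
  #include <arrayfire.h>
  #include <cstdio>
  using namespace af;

  static const int n = 512;   // arbitrary size, for illustration only
  static array A, B, C;

  // Built-in matmul (backed by cuBLAS / tuned kernels internally)
  static void libMatmul() { C = matmul(A, B); }

  // Hand-written gfor version from the post above
  static void gforMatmul() {
    gfor (array k, n) {
      for (int i = 0; i < n; i++)
        C(i, k) = sum(A(i, span).T() * B(span, k));
    }
  }

  int main() {
    A = randu(n, n);
    B = randu(n, n);
    printf("matmul: %g s\n", timeit(libMatmul));
    printf("gfor  : %g s\n", timeit(gforMatmul));
    return 0;
  }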

Coming to your question about nested or multivariate gfor loops: we do not currently have that functionality. gfor supports only a single index variable.
You could, however, look at ways of expressing the parallelism with multidimensional arrays instead of a second gfor index, and see how the results turn out; a rough sketch is below.
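As a starting point, here is one way your force calculation could be phrased entirely with 2D arrays, so that both the k and the j dimensions are evaluated in parallel. This is an illustrative sketch, not a tuned implementation: it assumes P is 2 x N, M is N x 1, and F is 2 x N as in your code, uses only tile, seq, sum, matmul, and T(), and builds N x N intermediates, so it trades memory for parallelism.

Code:
  // Pairwise positions: xj(k, j) = x of body j, xk(k, j) = x of body k
  array xj = tile(P(0, span), N, 1);
  array yj = tile(P(1, span), N, 1);
  array xk = xj.T();
  array yk = yj.T();

  array r_x = xj - xk;                          // separation of every (k, j) pair
  array r_y = yj - yk;
  array d   = r_x * r_x + r_y * r_y;

  // Index grids stand in for the k and j loop variables used in the masking
  array jj   = tile(array(seq(N)).T(), N, 1);   // jj(k, j) = j
  array mask = (jj == jj.T());                  // true on the diagonal, i.e. k == j

  array deno = sqrt(d * d * d) + mask.as(f32);  // keep the k == j terms finite
  array s    = (matmul(M, M.T()) / deno) * (!mask).as(f32); // m_k * m_j / |r|^3, self term zeroed

  F(0, span) = sum(s * r_x, 1).T();             // reduce over j for each body k
  F(1, span) = sum(s * r_y, 1).T();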
----
Shehzan
Developer
AccelerEyes
shehzan
 
Posts: 121
Joined: Tue Feb 12, 2013 7:20 pm


