I am currently looking at different ways of automatically generating code for GPU exection and I came across ArrayFire.

To gain some experience with ArrayFire, I decided to make my own version of the matmul function, a matrix-matrix multiplication, which shown below.

- Code: Select all
`gfor (array k, wA) {`

for (unsigned i = 0; i < wA; i++) {

C(i, k) = sum(A(k, span) * B(i, span));

}

}

My first question is this: Does the ArrayFire compiler perform any optimizations specific to the GPU hardware on the generated CUDA/OpenCL code? For example by optimizing the code by using the shared memory, or by coalescing the memory accesses? If it does I would like to know which optimizations it is performing, and if there is a specific way to write your programs in order to enable as many optimizations as possible.

My second question is about the indices of the gfor loop. It seems to me that the gfor loop will only have a single index that one could use to define data parallelism in your program. Take for example the NBody simulation that I have written below (only the force calculation is shown)

- Code: Select all
`gfor (array k, N) {`

float a_x = P(0, 0).scalar<float>();

float a_y = P(1, 0).scalar<float>();

float a_m = M(0).scalar<float>();

float f_x = 0.0;

float f_y = 0.0;

for (unsigned j = 0; j < N; j++) {

float b_x = P(0, j).scalar<float>();

float b_y = P(1, j).scalar<float>();

float b_m = M(j).scalar<float>();

float r_x = b_x - a_x;

float r_y = b_y - a_y;

float d = r_x * r_x + r_y * r_y;

array ja = (constant(j,1));

array mask1 = (k == ja);

float deno = sqrt(d * d * d) + mask1.as(f32).scalar<float>();

deno = (a_m*b_m/deno) * (!mask1).as(f32).scalar<float>();

f_x += (deno * r_x) ;

f_y += (deno * r_y) ;

}

F(0, k) = f_x;

F(1, k) = f_y;

}

The gfor loop uses the array k to define the parallelism. If I were to parallelize also the j-loop, is it then possible to use both k and j to define the parallelism? For instance by using a gfor( array(k,j), N,N ) kind of loops? Or are you restricted to 1D parallelization at the moment? I am asking since I am doing some masking in the NBody simulation that makes use of the k and j indices and I need to be able to have both of them.

Best Regards,

Jacob