Tuesday, 19 March 2013

If you are thinking that I am going to use CUDA, AMD APP, OpenCL, etc., then you are wrong! Why lock yourself into vendor-dependent tools when you can enjoy freedom?

In this tutorial, I will explain how you can use your graphics card to accelerate your programs in a vendor-independent way (meaning the same program will run on cards from nVidia, ATi, ASUS etc.). I will use C++ AMP, which is a really simple C++ extension from Microsoft and supports every DirectX 11 compatible card on the market. Programming older cards (compatible with DX9, DX10, or OpenGL) is quite tricky... I will post about that after some time.

Software Requirements
Visual Studio 2012 Professional
DirectX SDK
Windows 7

Hardware Requirement
Any DirectX 11 compatible graphics card

Prerequisite Knowledge
C++ (Templates, STL, Lambdas)
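
Not sure whether your card is DirectX 11 compatible? A quick way to check is to ask the C++ AMP runtime itself. The snippet below is just a minimal sketch (it only assumes the amp.h header that ships with Visual Studio 2012); any non-emulated accelerator it prints will work for this tutorial:

#include <amp.h>
#include <iostream>

int main()
{
    // List every accelerator the C++ AMP runtime can see
    for (const auto& acc : concurrency::accelerator::get_all())
    {
        std::wcout << acc.description
                   << (acc.is_emulated ? L" (emulated)" : L" (hardware)")
                   << std::endl;
    }
    return 0;
}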

In this tutorial, I am going to use matrix multiplication because

  1. It is time consuming,
  2. It is used in many fields of computer science.

First, I will present the naive method, which is straightforward:

    for(int k=0; k<M; ++k)
    {
        for(int j=0; j<W; ++j)
        {
            _type result = 0;

            for(int i=0; i<N; ++i)
            {
                int idx_a = k * N + i;
                int idx_b = i * W + j;

                result += va[idx_a] * vb[idx_b];
            }

            vresult[k * W + j] = result;
        }
    }

Since the time complexity of the above algorithm is O(N^3), it will consume too much time if we try to multiply two 500x500 matrices (roughly 500^3 = 125 million multiply-add operations). Let us try to make it a bit faster...

As you already know, any element [i,j] of the final matrix can be computed independently of every other element. Therefore, we can calculate all the elements of the final matrix in parallel, and that should be quite fast... right?

Here is where C++ AMP comes into the picture!

In the original algorithm, the two outer loops simply visit every element of the result matrix, while the innermost loop performs the actual calculation. For parallelism, we replace those two outer loops with the parallel_for_each function from C++ AMP and keep the innermost loop as the body of the kernel.

The modified algorithm will be as shown below:


parallel_for_each(av_c.extent, [=](index<2> idx) restrict(amp)
        {
            _type result = 0;

            for(int i = 0; i < av_a.extent[1]; ++i)
            {
                index<2> idx_a(idx[0], i);
                index<2> idx_b(i, idx[1]);

                result += av_a[idx_a] * av_b[idx_b];
            }

            av_c[idx] = result;
        });

The first parameter of the parallel_for_each() call, av_c.extent, is the compute domain: one thread is launched for every element of the result matrix. The second parameter is a lambda that performs the actual calculation; compare it with the innermost loop of the naive method presented at the start.
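
If the 2D version looks dense, here is the same pattern in its simplest possible form, a 1D vector addition. This is only an illustrative sketch of mine (the names vector_add, av_a, etc. are not part of the matrix code):

#include <amp.h>
#include <vector>

using namespace concurrency;

void vector_add(const std::vector<float>& a, const std::vector<float>& b, std::vector<float>& c)
{
    // Wrap the host vectors so the runtime can move them to the accelerator
    array_view<const float, 1> av_a((int)a.size(), a);
    array_view<const float, 1> av_b((int)b.size(), b);
    array_view<float, 1> av_c((int)c.size(), c);
    av_c.discard_data();   // c holds no useful data yet, so skip the copy-in

    // First argument: compute domain (one thread per element of av_c)
    // Second argument: the restrict(amp) kernel executed by every thread
    parallel_for_each(av_c.extent, [=](index<1> idx) restrict(amp)
    {
        av_c[idx] = av_a[idx] + av_b[idx];
    });

    av_c.synchronize();    // copy the result back into c
}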

You can use the full code given below for benchmarking. This code gives an approximately 35x speedup for a 500x500 matrix multiplication... pretty cool, isn't it?
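
The listing itself does not contain any timing code, so here is one way you might measure both versions. This is only a sketch: it assumes the mxm_single_cpu and mxm_amp_simple functions from the listing below plus the standard <chrono> header, and it does a warm-up GPU call first because the very first C++ AMP kernel launch also pays a one-time JIT and device initialization cost:

#include <chrono>

template<typename Fn>
double time_ms(Fn fn)
{
    auto start = std::chrono::high_resolution_clock::now();
    fn();
    auto stop = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Usage (inside main, after the vectors are initialized):
// mxm_amp_simple(M, N, W, v_a, v_b, v_c_simple);   // warm-up: JIT + device init
// double cpu_ms = time_ms([&] { mxm_single_cpu(M, N, W, v_a, v_b, v_ref); });
// double gpu_ms = time_ms([&] { mxm_amp_simple(M, N, W, v_a, v_b, v_c_simple); });
// printf("CPU: %.2f ms, GPU: %.2f ms, speedup: %.1fx\n", cpu_ms, gpu_ms, cpu_ms / gpu_ms);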

#include <amp.h>
#include <iostream>
#include <vector>
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <assert.h>
#include <conio.h>

using namespace concurrency;

#define DATA_TYPE float

//----------------------------------------------------------------------------
// Generate random data
//----------------------------------------------------------------------------
template<typename _type>
void initialize_array(std::vector<_type> &v_data, unsigned size)
{
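    // Fill v_data with pseudo-random values in the range [0, 100)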
    for(unsigned i=0; i<size; ++i)
    {
        v_data[i] = (_type)((_type)rand() * 100 / (_type)(RAND_MAX + 1));
    }
}

//----------------------------------------------------------------------------
// Implement matrix multiplication on CPU - N cube algorithm
//----------------------------------------------------------------------------
template<typename _type>
void mxm_single_cpu(int M, int N, int W, const std::vector<_type>& va, const std::vector<_type>& vb, std::vector<_type>& vresult)
{
    if ((va.size() != M*N) || (vb.size() != N*W) || (vresult.size() != M*W))
        throw "Expected matrix dimension result(M*W) = a(M*N) * b(N*W)";

    for(int k=0; k<M; ++k)
    {
        for(int j=0; j<W; ++j)
        {
            _type result = 0;

            for(int i=0; i<N; ++i)
            {
                int idx_a = k * N + i;
                int idx_b = i * W + j;

                result += va[idx_a] * vb[idx_b];
            }

            vresult[k * W + j] = result;
        }
    }
}

//----------------------------------------------------------------------------
// Implement simple matrix multiplication on GPU using C++ AMP
//----------------------------------------------------------------------------
template<typename _type>
void mxm_amp_simple(int M, int N, int W, const std::vector<_type>& va, const std::vector<_type>& vb, std::vector<_type>& vresult)
{
    if ((va.size() != M*N) || (vb.size() != N*W) || (vresult.size() != M*W))
        throw "Expected matrix dimension result(M*W) = a(MxN) * b(N*W)";

    extent<2> e_a(M, N), e_b(N, W), e_c(M, W);

    // Copy in
    array_view<const _type, 2> av_a(e_a, va);
    array_view<const _type, 2> av_b(e_b, vb);
    array_view<_type, 2> av_c(e_c, vresult);
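    // discard_data(): the current contents of vresult need not be copied to the
    // accelerator, because av_c is completely overwritten by the kernel below.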
    av_c.discard_data();

    // Compute - the two outer for loops of the CPU version are replaced by a parallel_for_each
    parallel_for_each(av_c.extent, [=](index<2> idx) restrict(amp)
        {
            _type result = 0;

            for(int i = 0; i < av_a.extent[1]; ++i)
            {
                index<2> idx_a(idx[0], i);
                index<2> idx_b(i, idx[1]);

                result += av_a[idx_a] * av_b[idx_b];
            }

            av_c[idx] = result;
        });
    // Explicitly synchronize to copy the result back into vresult
    av_c.synchronize();
}

template<typename _type>
bool verify(std::vector<_type>& v_res, std::vector<_type>& v_ref, int len)
{
    bool passed = true;

    for (int i = 0; i < len; ++i)
    {
        if (v_res[i] != v_ref[i])
        {
             printf("v_res[%d] = %f, v_ref[%d] = %f\n", i, v_res[i], i, v_ref[i]);
             passed = false;
             break;
        }
    }

    return passed;
}

template<>
bool verify(std::vector<float>& v_res, std::vector<float>& v_ref, int len)
{
    bool passed = true;

    for (int i = 0; i < len; ++i)
    {
        if (fabs(v_res[i] - v_ref[i]) > 0.01)
        {
             printf("v_res[%d] = %f, v_ref[%d] = %f\n", i, v_res[i], i, v_ref[i]);
             passed = false;
             break;
        }
    }

    return passed;
}

int main()
{
    accelerator default_device;
    std::wcout << L"Using device : " << default_device.get_description() << std::endl;
    if (default_device == accelerator(accelerator::direct3d_ref))
        std::cout << "WARNING!! Running on very slow emulator! Only use this accelerator for debugging." << std::endl;

    srand(2012);

    const int M = 100;
    const int N = 100;
    const int W = 100;
 
    std::vector<DATA_TYPE> v_a(M * N);
    std::vector<DATA_TYPE> v_b(N * W);
    std::vector<DATA_TYPE> v_c_simple(M * W);
    std::vector<DATA_TYPE> v_c_tiled(M * W);
    std::vector<DATA_TYPE> v_ref(M * W);

    initialize_array(v_a, M * N);
    initialize_array(v_b, N * W);

    assert((M!=0) && (W!=0) && (N!=0));

    printf("Matrix dimension C(%d x %d) = A(%d x %d) * B(%d x %d)\n", M, W, M, N, N, W);
    printf("CPU(single core) exec ");
    mxm_single_cpu(M, N, W, v_a, v_b, v_ref);
    printf("completed.\n");

    printf("AMP Simple ");
    mxm_amp_simple(M, N, W, v_a, v_b, v_c_simple);
    printf("completed.\n");
    printf("\t%s\n\n", verify(v_c_simple, v_ref, M * W) ? "Data matches" : "Data mismatch");

    _getch();
    return 0;
}

In a future post, I will show you how to program older cards (and that will work on both Windows and Linux..:D)!! Stay tuned!!

