GPU support¶
The choice between CPU and GPU backends is performed by specifying ccl_stream_type value at the moment when ccl stream object is created:
For GPU backend you should specify
ccl_stream_syclas the first argument.For collective operations, which operate on SYCL* stream, C version of oneCCL API expects communication buffers to be
sycl::buffer*objects cast tovoid*.
The example below demonstrates these concepts.
Example¶
Consider simple allreduce example for GPU.
Create GPU ccl stream object:
C version of oneCCL API:
ccl_stream_create(ccl_stream_sycl, &q, &stream);
C++ version of oneCCL API:
ccl::stream_t stream = ccl::environment::instance().create_stream(cc::stream_type::sycl, &q);
qis an object of typesycl::queue.To illustrate the
ccl_allreduceexecution, initializesendbuf(in real scenario it is provided by application):auto host_acc_sbuf = sendbuf.get_access<mode::write>(); for (i = 0; i < COUNT; i++) { host_acc_sbuf[i] = rank; }
For demonstration purposes only, modify the
sendbufon the GPU side:q.submit([&](cl::sycl::handler& cgh) { auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh); cgh.parallel_for<class allreduce_test_sbuf_modify>(range<1>{COUNT}, [=](item<1> id) { dev_acc_sbuf[id] += 1; }); });
ccl_allreduceinvocation performs reduction of values from all processes and then distributes the result to all processes. In this case, the result is an array with the size equal to the number of processes (\(p\)), where all elements are equal to the sum of arithmetical progression:\[p \cdot (p - 1) / 2\]C version of oneCCL API:
ccl_allreduce(&sendbuf, &recvbuf, COUNT, ccl_dtype_int, ccl_reduction_sum, NULL, /* attr */ NULL, /* comm */ stream, &request); ccl_wait(request);
C++ version of oneCCL API:
comm.allreduce(sendbuf, recvbuf, COUNT, ccl::reduction::sum, nullptr, /* attr */ stream)->wait();
Check the correctness of
ccl_allreduceon the GPU:q.submit([&](handler& cgh) { auto dev_acc_rbuf = recvbuf.get_access<mode::write>(cgh); cgh.parallel_for<class allreduce_test_rbuf_check>(range<1>{COUNT}, [=](item<1> id) { if (dev_acc_rbuf[id] != size*(size+1)/2) { dev_acc_rbuf[id] = -1; } }); });
if (rank == COLL_ROOT) { auto host_acc_rbuf_new = recvbuf.get_access<mode::read>(); for (i = 0; i < COUNT; i++) { if (host_acc_rbuf_new[i] == -1) { cout << "FAILED" << endl; break; } } if (i == COUNT) { cout<<"PASSED"<<endl; } }
Note
When using C version of oneCCL API, it is required to explicitly free the created GPU ccl stream object:
ccl_stream_free(stream);
For C++ version of oneCCL API this will be performed implicitly.