GPU support¶
The choice between CPU and GPU backends is performed by specifying ccl_stream_type
value at the moment when ccl stream object is created:
For GPU backend you should specify
ccl_stream_sycl
as the first argument.For collective operations, which operate on SYCL* stream, C version of oneCCL API expects communication buffers to be
sycl::buffer*
objects cast tovoid*
.
The example below demonstrates these concepts.
Example¶
Consider simple allreduce
example for GPU.
Create GPU ccl stream object:
C version of oneCCL API:
ccl_stream_create(ccl_stream_sycl, &q, &stream);
C++ version of oneCCL API:
ccl::stream_t stream = ccl::environment::instance().create_stream(cc::stream_type::sycl, &q);
q
is an object of typesycl::queue
.To illustrate the
ccl_allreduce
execution, initializesendbuf
(in real scenario it is provided by application):auto host_acc_sbuf = sendbuf.get_access<mode::write>(); for (i = 0; i < COUNT; i++) { host_acc_sbuf[i] = rank; }
For demonstration purposes only, modify the
sendbuf
on the GPU side:q.submit([&](cl::sycl::handler& cgh) { auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh); cgh.parallel_for<class allreduce_test_sbuf_modify>(range<1>{COUNT}, [=](item<1> id) { dev_acc_sbuf[id] += 1; }); });
ccl_allreduce
invocation performs reduction of values from all processes and then distributes the result to all processes. In this case, the result is an array with the size equal to the number of processes (\(p\)), where all elements are equal to the sum of arithmetical progression:\[p \cdot (p - 1) / 2\]C version of oneCCL API:
ccl_allreduce(&sendbuf, &recvbuf, COUNT, ccl_dtype_int, ccl_reduction_sum, NULL, /* attr */ NULL, /* comm */ stream, &request); ccl_wait(request);
C++ version of oneCCL API:
comm.allreduce(sendbuf, recvbuf, COUNT, ccl::reduction::sum, nullptr, /* attr */ stream)->wait();
Check the correctness of
ccl_allreduce
on the GPU:q.submit([&](handler& cgh) { auto dev_acc_rbuf = recvbuf.get_access<mode::write>(cgh); cgh.parallel_for<class allreduce_test_rbuf_check>(range<1>{COUNT}, [=](item<1> id) { if (dev_acc_rbuf[id] != size*(size+1)/2) { dev_acc_rbuf[id] = -1; } }); });
if (rank == COLL_ROOT) { auto host_acc_rbuf_new = recvbuf.get_access<mode::read>(); for (i = 0; i < COUNT; i++) { if (host_acc_rbuf_new[i] == -1) { cout << "FAILED" << endl; break; } } if (i == COUNT) { cout<<"PASSED"<<endl; } }
Note
When using C version of oneCCL API, it is required to explicitly free the created GPU ccl stream object:
ccl_stream_free(stream);
For C++ version of oneCCL API this will be performed implicitly.