Here is a trick for optimizing neural network training that gives roughly a 4x speedup on CPU-to-GPU data transfer.
Let's consider an image classification task.
We define the model, then load and transform the data.
In the training loop, we transfer each batch to the GPU and train the network, as in the sketch below.
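A minimal sketch of such a baseline, assuming a torchvision dataset (CIFAR-10 here) and a ResNet-18 classifier; the dataset, model, and hyperparameters are illustrative, not from the original post:

```python
import torch
import torchvision
import torchvision.transforms as T

device = torch.device("cuda")

# ToTensor() converts uint8 pixels to float32 in [0, 1] on the CPU.
transform = T.Compose([T.ToTensor(),
                       T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
dataset = torchvision.datasets.CIFAR10(root="data", train=True,
                                       download=True, transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True)

model = torchvision.models.resnet18(num_classes=10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

for images, labels in loader:
    # float32 batches cross the CPU->GPU bus: 4 bytes per pixel value.
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```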
What's the problem?
If you look at the profiler,
- most of the time goes to the compute kernels (i.e., the training itself),
- but a noticeable amount of time is also spent transferring data from the CPU to the GPU (cudaMemcpyAsync).
This transfer time can easily be reduced.
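A rough sketch of how one might confirm this with torch.profiler; `train_step` is a hypothetical function wrapping one iteration of the loop above, and `loader` is the DataLoader from the previous sketch:

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (images, labels) in enumerate(loader):
        train_step(images, labels)   # copies the batch to the GPU and runs fwd/bwd
        if step >= 10:               # a few iterations are enough for a profile
            break

# cudaMemcpyAsync shows up in this table alongside the compute kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```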
Initially, the dataset consists of pixels stored as 8-bit integers. We convert them to 32-bit floats.
Then we send these float tensors to the GPU. As a result, each batch is 4 times larger in bytes, making the transfer correspondingly heavier.
The solution:
Move the conversion to after the transfer: first copy the 8-bit integers to the GPU, then convert them to floats (and normalize) there.
As a result, the data transfer step speeds up significantly.
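A minimal sketch of the optimized loop, under the same assumptions as the baseline above (it reuses `device`, `model`, `optimizer`, and `criterion`): the DataLoader now yields uint8 tensors via `T.PILToTensor()`, and the float conversion plus normalization happen on the GPU after the copy.

```python
dataset_u8 = torchvision.datasets.CIFAR10(root="data", train=True,
                                          download=True,
                                          transform=T.PILToTensor())  # keeps uint8
loader_u8 = torch.utils.data.DataLoader(dataset_u8, batch_size=256,
                                        shuffle=True, pin_memory=True)

for images, labels in loader_u8:
    # Only 1 byte per pixel value crosses the CPU->GPU bus instead of 4.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # Convert to float and normalize on the GPU (same as Normalize(0.5, 0.5)).
    images = images.float().div_(255).sub_(0.5).div_(0.5)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```

With `pin_memory=True` and `non_blocking=True`, the smaller uint8 copies can also overlap with compute, which usually helps further.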
Of course, this doesn't work everywhere; for example, in NLP we initially deal with float embeddings.
But in cases where it applies, the speedup is very noticeable.