Photo by Olav Ahrens Røtne on Unsplash

MultiCUDA: Multiple Versions of CUDA on One Machine

I manage the server environment for a research department that works a lot with machine learning, and part of that means maintaining the configurations of our GPU servers. The problem is that different researchers have programs or models that depend on different versions of TensorFlow and CUDA, and there is no straightforward way of just installing multiple versions and having everything work.

One way around this is to use Docker to create custom environments for the various necessary configurations. This is probably the way to go if you can, but sadly not all of our users are comfortable enough with Docker yet, so here I am, digging around Linux manuals and environment variables trying to figure out a way to make it work.

The good news is that there is a way. It isn’t especially complicated either, once you know how, but it is a bit finicky.


The example I use is installing CUDA 8.0 on an Ubuntu 16.04 machine that is currently configured to run CUDA 10.0, so that TensorFlow 1.4, which depends on CUDA 8.0, can be used. The principle should be the same for other Linux distributions and CUDA versions as well.

The only dependency I am aware of is an installed NVIDIA driver version that supports all the CUDA versions you want to use. A compatibility chart can be found at https://docs.nvidia.com/deploy/cuda-compatibility/index.html, and as of writing the latest driver is backwards compatible down to CUDA 7.0.
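You can check which driver is currently installed with nvidia-smi, which ships with the driver; something like this should print just the driver version:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader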

Note: By default, different versions of Ubuntu support different versions of CUDA. On 16.04, CUDA 8.0 to 10.2 are supported, while only CUDA 10.0 to 10.2 are supported on Ubuntu 18.04. I assume there are similar constraints for other operating systems.

1. Install wanted CUDA Toolkit versions

$ sudo apt-get install cuda-toolkit-8-0

Installing multiple versions won’t cause any of the previously installed versions to be overwritten, so no need to worry. Each new version you install will overwrite the configuration that tells the operating system which version to use by default, but all versions get installed under /usr/local in separate directories named by their version numbers.
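To confirm this, list the installation directories; with CUDA 8.0 and 10.0 installed you should see /usr/local/cuda-8.0, /usr/local/cuda-10.0 and the /usr/local/cuda symlink:

$ ls -d /usr/local/cuda*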

2. Point symlink /usr/local/cuda to default version

$ cd /usr/local
$ sudo rm cuda
$ sudo ln -s cuda-10.0 cuda

By default, through environment variables, the system will use the version of CUDA that the symlink /usr/local/cuda points to, and this symlink is updated whenever you install a new version as in the previous step. If you have a certain version that you want to use as the default, update the symlink to point to that version’s installation directory.
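To check which version the symlink currently points to, readlink should print the target directory (cuda-10.0 in this example):

$ readlink /usr/local/cuda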

3. Install suitable cuDNN versions for each CUDA using the Library for Linux tar files

$ tar -xzvf cudnn-8.0-linux-x64-v6.0.tgz
$ sudo cp cuda/include/cudnn.h /usr/local/cuda-8.0/include
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda-8.0/lib64
$ sudo chmod a+r /usr/local/cuda-8.0/include/cudnn.h /usr/local/cuda-8.0/lib64/libcudnn*

Make sure to use the Library for Linux cuDNN packages, downloadable from https://developer.nvidia.com/rdp/cudnn-download (registration required). If you use the installer packages instead, the files will not end up in the correct location for each version of CUDA.
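To double-check which cuDNN version ended up with each CUDA installation, you can grep the version macros out of the copied header; this works for cuDNN releases (like the 6.0 used here) that define their version in cudnn.h:

$ grep -A 2 CUDNN_MAJOR /usr/local/cuda-8.0/include/cudnn.h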

4. Add each CUDA lib directory to LD_LIBRARY_PATH in order

$ sudo sh -c 'echo "export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-8.0/lib64:\$LD_LIBRARY_PATH" > /etc/profile.d/cuda.sh'
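Note that on most setups the scripts in /etc/profile.d are only read by new login shells, so either log out and back in, or source the file manually and check the result:

$ source /etc/profile.d/cuda.sh
$ echo $LD_LIBRARY_PATH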

This is where the magic happens!

Luckily, CUDA’s lib files are named with their version numbers appended at the end. TensorFlow will look for lib files named to match the specific version it needs in the locations specified by the LD_LIBRARY_PATH environment variable. This step should make LD_LIBRARY_PATH contain something like this:

/usr/local/cuda/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-8.0/lib64

TensorFlow will first look for lib files located in /usr/local/cuda (the default version), and if they match the version it needs it will use those. However, if they don’t match, it will continue trying to find a match in one of the versioned directories, in the order they are listed, until it finds one.

In other words, if you now run something that depends on a CUDA 8.0 lib file, it will check in /usr/local/cuda, followed by /usr/local/cuda-10.0, without finding the right version. It then checks in /usr/local/cuda-8.0, which has the version of the file it needs, so it uses that.
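You can see the version-suffixed file names this lookup relies on by listing the runtime library in each installation; the first command should show libcudart.so.8.0 and the second libcudart.so.10.0 (exact patch-level names will vary):

$ ls /usr/local/cuda-8.0/lib64/libcudart.so*
$ ls /usr/local/cuda-10.0/lib64/libcudart.so*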

5. Create a conda environment and install the wanted TensorFlow GPU version

 $ conda create -n tf14 python=2.7.6 pip 
 $ conda activate tf14
 $ pip install tensorflow-gpu==1.4

Now all you need is the version of TensorFlow you want to use. Since different users will want to use different versions, you can use something like Anaconda/Miniconda (https://www.anaconda.com/, https://conda.io/en/latest/miniconda.html) to run environments with different versions.
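To illustrate, a second environment on the same machine could run a newer TensorFlow against the default CUDA 10.0 installation; the version pairing below follows TensorFlow’s tested build configurations (linked in the references), but double-check it for the release you need:

$ conda create -n tf114 python=3.6 pip
$ conda activate tf114
$ pip install tensorflow-gpu==1.14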

6. Test that it works

$ python -c "import tensorflow; print(tensorflow.__version__); print(tensorflow.test.gpu_device_name()); print(tensorflow.test.is_gpu_available())"

This is just a simple Python command that checks the version of TensorFlow and whether it has access to the GPU resources, so run it to make sure the configuration is correct.
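On Linux you can go one step further and check which CUDA runtime the import actually loaded by grepping the process’s memory maps; in the tf14 environment this should point at /usr/local/cuda-8.0/lib64/libcudart.so.8.0:

$ python -c "import tensorflow; print(open('/proc/self/maps').read())" | grep libcudart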


Disclaimer: This seems to work well enough for us, but it is very likely that I have missed some details about the configuration that makes certain cases fail. Feel free to leave a comment if you notice any.


References

https://docs.nvidia.com/deploy/cuda-compatibility/index.html

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html

https://developer.nvidia.com/rdp/cudnn-download

https://help.ubuntu.com/community/EnvironmentVariables#Persistent_environment_variables

https://www.anaconda.com/

https://conda.io/en/latest/miniconda.html

https://www.tensorflow.org/install/source#tested_build_configurations
