I've been struggling with the problem described below for days and would appreciate your help.
What I want to do is run TensorFlow with GPU support in Docker on Ubuntu.
My GPU is a GeForce GTX 1070, and my OS is Ubuntu 22.04.3 LTS.
I've installed Docker:
$ docker --version
Docker version 26.1.1, build 4cf5afa
Before starting the steps below, I removed every nvidia and cuda package:
$ sudo apt-get -y --purge remove nvidia*
$ sudo apt-get -y --purge remove cuda*
$ sudo apt-get -y --purge remove cudnn*
$ sudo apt-get -y --purge remove libnvidia*
$ sudo apt-get -y --purge remove libcuda*
$ sudo apt-get -y --purge remove libcudnn*
$ sudo apt-get autoremove
$ sudo apt-get autoclean
$ sudo apt-get update
$ sudo rm -rf /usr/local/cuda*
$ pip uninstall tensorflow-gpu
Afterward, I installed the NVIDIA driver:
$ sudo apt install nvidia-driver-535
And nvidia-smi works fine.
$ nvidia-smi
Thu May 2 18:10:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
...
The next thing I did was install CUDA Toolkit 12.2 Update 2, following the official instructions.
I believe CUDA Toolkit 12.2 Update 2 and driver 535.104.05 are compatible, according to:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
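That compatibility claim boils down to a version comparison. As a sketch, the check looks like this; the minimum-driver tuple is my reading of the release-notes table linked above, so treat it as an assumption and verify it there:

```python
# Sanity check: does the installed driver meet the toolkit's minimum?
# The minimum-driver tuple below is an assumption read off NVIDIA's
# release-notes table (CUDA 12.2 Update 2, Linux x86_64).
CUDA_12_2_U2_MIN_DRIVER = (535, 104, 5)

def meets_minimum(driver_version: str) -> bool:
    """Compare a dotted driver version string against the minimum tuple."""
    parts = tuple(int(p) for p in driver_version.split("."))
    return parts >= CUDA_12_2_U2_MIN_DRIVER

print(meets_minimum("535.104.05"))  # True: 535.104.05 satisfies the minimum
```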
Then I installed the NVIDIA Container Toolkit as follows:
$ curl https://get.docker.com | sh \
    && sudo systemctl --now enable docker
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
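For reference, my understanding is that `sudo nvidia-ctk runtime configure --runtime=docker` registers the NVIDIA runtime in `/etc/docker/daemon.json`, roughly like this (a sketch; the exact contents on a given machine may differ):

```json
{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
```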
Next, I pulled a Docker image and started a container:
$ docker pull tensorflow/tensorflow:latest-gpu
$ docker container run --rm --gpus all -it --name tf --mount type=bind,source=/home/(myname)/docker/tensorflow,target=/bindcont tensorflow/tensorflow:latest-gpu bash
Inside the Docker container:
root@a887e2a18124:/# python
Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-05-02 09:32:46.211605: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-05-02 09:32:46.238888: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.config.list_physical_devices()
2024-05-02 09:32:55.124912: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2024-05-02 09:32:55.124931: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:134] retrieving CUDA diagnostic information for host: 226046be5f09
2024-05-02 09:32:55.124934: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:141] hostname: 226046be5f09
2024-05-02 09:32:55.124963: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] libcuda reported version is: 545.23.6
2024-05-02 09:32:55.124975: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] kernel reported version is: 535.104.5
2024-05-02 09:32:55.124977: E external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:251] kernel version 535.104.5 does not match DSO version 545.23.6 -- cannot find working devices in this configuration
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
It seems the driver version and the CUDA (libcuda) version are inconsistent, but I installed driver version 535, not 545, as shown above. And I removed everything before installing driver 535.
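The direction of the mismatch matters here. As a minimal sketch (version strings copied from the log above), the check the driver performs amounts to:

```python
def parse(v: str) -> tuple[int, ...]:
    """Turn an NVIDIA version string like '535.104.5' into a comparable tuple."""
    return tuple(int(p) for p in v.split("."))

kernel_module = parse("535.104.5")  # "kernel reported version" in the log
user_libcuda = parse("545.23.6")    # "libcuda reported version" in the log

# The user-space libcuda seen inside the container is newer than the host's
# kernel module; bridging that gap requires the forward-compatibility
# package, which GeForce cards do not support -- hence the
# CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE error.
print(user_libcuda > kernel_module)  # True
```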
Could anyone suggest what is wrong and what I should do?
My problem has not been solved yet.
I removed everything again and reinstalled NVIDIA driver 545.
Then I followed the instructions at https://github.com/NVIDIA/nvidia-docker (deprecated) and
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
This time I didn't install the CUDA Toolkit, only the NVIDIA Container Toolkit.
nvidia-smi now reports:
NVIDIA-SMI 545.29.06
Driver Version 545.29.06
CUDA Version 12.3
Then I ran a container:
$ docker container run --rm -it --name tf --mount type=bind,source=/home/susumu/docker/tensorflow,target=/bindcont tensorflow/tensorflow:2.15.0rc1-gpu bash
When I ran sample.py, I got
# python sample.py
2024-05-02 13:46:01.669548: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-02 13:46:01.689375: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-02 13:46:01.689395: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-02 13:46:01.690008: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-02 13:46:01.693281: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-02 13:46:01.693384: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-02 13:46:02.374705: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: UNKNOWN ERROR (34)
tf.Tensor(
[[1.]
 [1.]], shape=(2, 1), dtype=float32)
Here, sample.py is as follows:
# cat sample.py
import os
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'
import tensorflow as tf
x = tf.ones(shape=(2, 1))
print(x)