Using TensorFlow with GPU in Docker on Ubuntu

I've been struggling with the problem described below for many days and would like your help.
What I want to do is to use TensorFlow with a GPU in Docker on Ubuntu.
My GPU is a GeForce GTX 1070, and my OS is Ubuntu 22.04.3 LTS.

I've installed Docker

$ docker --version

Docker version 26.1.1, build 4cf5afa

Before I started the following, I removed every NVIDIA and CUDA package.

$ sudo apt-get -y --purge remove nvidia*
$ sudo apt-get -y --purge remove cuda*
$ sudo apt-get -y --purge remove cudnn*
$ sudo apt-get -y --purge remove libnvidia*
$ sudo apt-get -y --purge remove libcuda*
$ sudo apt-get -y --purge remove libcudnn*
$ sudo apt-get autoremove
$ sudo apt-get autoclean
$ sudo apt-get update
$ sudo rm -rf /usr/local/cuda*
$ pip uninstall tensorflow-gpu

Afterward, I installed the NVIDIA driver.

$ sudo apt install nvidia-driver-535

And nvidia-smi works fine.

$ nvidia-smi

Thu May 2 18:10:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
...

The next thing I did was to install CUDA Toolkit 12.2 Update 2, following the instructions at the link below.

https://developer.nvidia.com/cuda-12-2-2-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

I think CUDA Toolkit 12.2 Update 2 and driver 535.104.05 are compatible according to the info shown below.

https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
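
Looking back, I probably should have checked at this point whether the local deb installer pulled in any extra driver or compatibility packages of its own. I assume something like this would list everything NVIDIA/CUDA-related that actually got installed:

$ dpkg -l | grep -E 'nvidia|cuda'   # every installed NVIDIA/CUDA package with its version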

And then I installed the NVIDIA Container Toolkit as shown below.

$ curl https://get.docker.com | sh \
    && sudo systemctl --now enable docker
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
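
I did not do it at the time, but I believe the usual sanity check at this point is to run a throwaway container with --gpus all and confirm that nvidia-smi works inside it (the ubuntu image below is just an example; any small image should do):

$ docker run --rm --gpus all ubuntu nvidia-smi   # should report the same driver and CUDA versions as on the host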

And next, I pulled a Docker image and started a container.

$ docker pull tensorflow/tensorflow:latest-gpu
$ docker container run --rm --gpus all -it --name tf --mount type=bind,source=/home/(myname)/docker/tensorflow,target=/bindcont tensorflow/tensorflow:latest-gpu bash

In the Docker container:

root@a887e2a18124:/# python

Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf

2024-05-02 09:32:46.211605: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-05-02 09:32:46.238888: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

>>> tf.config.list_physical_devices()

2024-05-02 09:32:55.124912: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2024-05-02 09:32:55.124931: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:134] retrieving CUDA diagnostic information for host: 226046be5f09
2024-05-02 09:32:55.124934: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:141] hostname: 226046be5f09
2024-05-02 09:32:55.124963: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] libcuda reported version is: 545.23.6
2024-05-02 09:32:55.124975: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] kernel reported version is: 535.104.5
2024-05-02 09:32:55.124977: E external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:251] kernel version 535.104.5 does not match DSO version 545.23.6 -- cannot find working devices in this configuration
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

It seems the driver version and the CUDA version are inconsistent, but I installed driver version 535, not 545, as shown above. And I removed everything before I installed driver 535.
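
If it helps, my understanding (which may be wrong) is that the 535 number comes from the kernel module on the host, while the 545 number comes from whichever libcuda.so the container ends up loading. So I suppose the two could be compared like this; the library path is the usual Ubuntu one and is only an assumption:

$ cat /proc/driver/nvidia/version                          # kernel module version actually loaded on the host
$ ls -l /usr/lib/x86_64-linux-gnu/libcuda*                 # libcuda on the host
$ docker run --rm --gpus all tensorflow/tensorflow:latest-gpu \
      bash -c 'ls -l /usr/lib/x86_64-linux-gnu/libcuda*'   # libcuda as seen inside the container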

Could anyone suggest what is wrong and what I should do?


My problem has not been solved yet.

I removed everything and reinstalled NVIDIA driver 545, and then followed the instructions at https://github.com/NVIDIA/nvidia-docker (deprecated) and
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

This time I didn't install the CUDA Toolkit, only the NVIDIA Container Toolkit.

From nvidia-smi I got:
NVIDIA-SMI 545.29.06
Driver Version: 545.29.06
CUDA Version: 12.3

Then I ran a container

$ docker container run --rm -it --name tf --mount type=bind,source=/home/susumu/docker/tensorflow,target=/bindcont tensorflow/tensorflow:2.15.0rc1-gpu bash
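
Looking at this again, I notice this second run omits the --gpus all flag I used earlier. I am not sure whether that alone explains the messages below, but the GPU-enabled variant of the same command would presumably be:

$ docker container run --rm --gpus all -it --name tf \
      --mount type=bind,source=/home/susumu/docker/tensorflow,target=/bindcont \
      tensorflow/tensorflow:2.15.0rc1-gpu bash   # same command, with the GPU actually passed through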

When I ran sample.py, I got

# python sample.py

2024-05-02 13:46:01.669548: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-02 13:46:01.689375: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-02 13:46:01.689395: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-02 13:46:01.690008: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-02 13:46:01.693281: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-02 13:46:01.693384: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-02 13:46:02.374705: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: UNKNOWN ERROR (34)

tf.Tensor(
[[1.]
 [1.]], shape=(2, 1), dtype=float32)

Here is sample.py:

# cat sample.py
import os
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'
import tensorflow as tf
x = tf.ones(shape=(2, 1))
print(x)
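
As an extra check, I suppose the quickest probe inside the container would be a one-liner that only asks TensorFlow which GPUs it can see (this is just a diagnostic, not part of sample.py; an empty list means no GPU is visible):

# python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"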
