Hi, I am trying to run a Raster Vision pipeline on an NVIDIA GeForce RTX 3050 GPU.
- Ubuntu 22.04
- PyTorch version: 1.12.0+cu116
- CUDA: 12
But when I run the Docker container like this:
sudo docker run --rm --runtime=nvidia --gpus all -it -v ${RV_QUICKSTART_CODE_DIR}:/opt/src/code -v ${RV_QUICKSTART_OUT_DIR}:/opt/data/output quay.io/azavea/raster-vision:pytorch-0.20 /bin/bash
The model does not train and outputs this error:
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
P.S.: Running nvidia-smi shows the GPU's details, so the GPU is recognized.
This is the output I get:
`Skipping 'analyze' command...
python -m rastervision.pipeline.cli run_command /opt/data/output/pipeline-config.json train
Running train command...
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Building datasets ...
2023-03-09 08:53:29:rastervision.core.data.raster_source.rasterio_source: WARNING - Raster block size (2, 650) is too non-square. This can slow down reading. Consider re-tiling using GDAL.
2023-03-09 08:53:29:rastervision.core.data.raster_source.rasterio_source: WARNING - Raster block size (2, 650) is too non-square. This can slow down reading. Consider re-tiling using GDAL.
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Physical CPUs: 12
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Logical CPUs: 16
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Total memory: 15.30 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Size of /opt/data volume: 445.44 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Size of / volume: 445.44 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Python version: 3.9.16 (main, Jan 11 2023, 16:05:54) [GCC 11.2.0]
/bin/sh: 1: nvcc: not found
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO -
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Thu Mar 9 08:53:29 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   37C    P3    14W /  30W |    262MiB /  4096MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Devices:
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, NVIDIA GeForce RTX 3050 Ti Laptop GPU, 525.89.02, 4096 MiB, 262 MiB, 3639 MiB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - PyTorch version: 1.12.1+cu102
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDA available: True
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDA version: 10.2
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDNN version: 7605
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Number of CUDA devices: 1
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Active CUDA Device: GPU 0
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - model=SemanticSegmentationModelConfig(backbone=<Backbone.resnet50: 'resnet50'>, pretrained=True, init_weights=None, load_strict=True, external_def=None) solver=SolverConfig(lr=0.0001, num_epochs=1, test_num_epochs=2, test_batch_sz=4, overfit_num_steps=1, sync_interval=1, batch_sz=2, one_cycle=True, multi_stage=[], class_loss_weights=None, ignore_class_index=None, external_loss_def=None) data=SemanticSegmentationGeoDataConfig(scene_dataset='<1 train_scenes, 1 validation_scenes, 0 test_scenes>', window_opts="method=<GeoDataWindowMethod.random: 'random'> size=300 stride=None padding=None pad_direction='end' size_lims=(300, 301) h_lims=None w_lims=None max_windows=10 max_sample_attempts=100 efficient_aoi_sampling=True") predict_mode=False test_mode=False overfit_mode=False eval_train=False save_model_bundle=True log_tensorboard=True run_tensorboard=False output_uri='/opt/data/output/train'
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Using device: cuda
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - train_ds: 10 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - valid_ds: 10 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - test_ds: 0 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Plotting sample training batch.
2023-03-09 08:53:30:rastervision.pytorch_learner.learner: INFO - Plotting sample validation batch.
2023-03-09 08:53:31:rastervision.pytorch_learner.learner: INFO - epoch: 0
Training: 0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 251, in <module>
    _main()
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 247, in _main
    main()
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 236, in run_command
    _run_command(
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 218, in _run_command
    command_fn()
  File "/opt/src/rastervision_core/rastervision/core/rv_pipeline/rv_pipeline.py", line 154, in train
    backend.train(source_bundle_uri=self.config.source_bundle_uri)
  File "/opt/src/rastervision_pytorch_backend/rastervision/pytorch_backend/pytorch_learner_backend.py", line 120, in train
    learner.main()
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 267, in main
    self.train()
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 1265, in train
    train_metrics = self.train_epoch(
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 1188, in train_epoch
    output = self.train_step(batch, batch_ind)
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/semantic_segmentation_learner.py", line 26, in train_step
    out = self.post_forward(self.model(x))
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torchvision/models/segmentation/_utils.py", line 23, in forward
    features = self.backbone(x)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torchvision/models/_utils.py", line 69, in forward
    x = module(x)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 148, in forward
    self.num_batches_tracked.add_(1)  # type: ignore[has-type]
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
make: *** [/opt/data/output/Makefile:6: 0] Error 1`
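
One check that might narrow this down (a sketch, not a confirmed diagnosis): the log above shows that PyTorch inside the container is a `1.12.1+cu102` build, which differs from the `cu116` version listed at the top. A "no kernel image" error usually means the PyTorch build was not compiled for the GPU's compute capability. The snippet below compares `torch.cuda.get_device_capability()` with `torch.cuda.get_arch_list()`. The helper names `parse_arch` and `build_supports_device` are hypothetical, and the compatibility rule is a simplification (same major version with an equal or lower minor for `sm_` binaries, any equal or lower `compute_` PTX entry).

```python
import re

# Matches entries such as "sm_86", "compute_70" or "sm_90a".
_ARCH_RE = re.compile(r"^(sm|compute)_(\d{2,})[a-z]?$")


def parse_arch(arch):
    """Split 'sm_86' into ('sm', 8, 6); return None for unrecognized entries."""
    match = _ARCH_RE.match(arch)
    if match is None:
        return None
    kind, digits = match.groups()
    # The last digit is the minor version, everything before it is the major.
    return kind, int(digits[:-1]), int(digits[-1])


def build_supports_device(capability, arch_list):
    """Simplified check: can a build with `arch_list` run on `capability`?"""
    major, minor = capability
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed is None:
            continue
        kind, a_major, a_minor = parsed
        # Compiled binaries (sm_XY) run on the same major, equal or higher minor.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX (compute_XY) can be JIT-compiled for any equal or newer device.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("Device capability:", cap)
        print("Architectures in this PyTorch build:", archs)
        print("Build covers this device:", build_supports_device(cap, archs))
    else:
        print("PyTorch with CUDA is not available in this environment.")
```

Running this inside the same container (for example from the `/bin/bash` shell started by the `docker run` command above) would print both values. If the check reports `False`, the image's PyTorch build would be the likely culprit rather than the driver or the host CUDA installation.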