Channel: Active questions tagged ubuntu - Stack Overflow

Locking NVIDIA GPU for User in a Shared Resource System


Requirement


We have a multi-user shared machine with two NVIDIA A6000 GPUs. Multiple users may be logged in at once, and before running anything each user is supposed to check (via nvidia-smi) whether a GPU is already engaged. Sometimes a user forgets this check and launches their code anyway, crashing another user's running script (CUDA out-of-memory errors, etc.).
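For reference, the manual check each user is supposed to run is essentially the following (the `--query-gpu` fields are from the nvidia-smi documentation; the fallback message is only there so the snippet stays runnable on machines without the NVIDIA driver):

```shell
# Show per-GPU memory use and utilization so a user can see if a GPU is busy.
# Falls back to a short message when nvidia-smi is not installed.
gpu_status=$(nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv 2>/dev/null \
  || echo "nvidia-smi not available")
echo "$gpu_status"
```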

Expected Solution


A user should be able to take some form of lock on a GPU, e.g. by specifying the GPU ID, so that the other users can see that the resource is locked. A more desirable solution would also let the user define the duration of the lock, after which it is released automatically if not released manually first. This lock duration should also be visible to other users.

Ideally this lock is taken at the shell/terminal level rather than tied to a specific script. Once the user sets the lock, the GPU (in some advisory sense) belongs to them for that period.
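To make the idea concrete, here is a rough sketch of the kind of advisory lock I have in mind. Everything here is my own invention (the `gpulock` name, the `GPULOCK_DIR` variable, the `owner expiry-epoch` file format), and the check-then-write is not atomic, so a real version would need to wrap it in flock(1):

```shell
#!/usr/bin/env bash
# gpulock: sketch of a cooperative (advisory) GPU lock using files in a
# shared directory. Assumptions: the lock directory is writable by all
# users, GPU IDs match nvidia-smi's indices, and everyone agrees to run
# this before using a GPU -- nothing here actually prevents GPU access.
set -euo pipefail

LOCKDIR="${GPULOCK_DIR:-/tmp/gpu-locks}"
mkdir -p "$LOCKDIR"

lock() {   # lock <gpu-id> <duration-seconds>
  local gpu="$1" dur="$2" f="$LOCKDIR/gpu$1.lock" owner expiry
  if [ -f "$f" ]; then
    read -r owner expiry < "$f"
    if [ "$(date +%s)" -lt "$expiry" ]; then
      echo "GPU $gpu is locked by $owner until $(date -d "@$expiry")" >&2
      return 1
    fi
  fi
  # NOTE: this check-then-write is racy; a real version should hold
  # flock(1) on the lock file while checking and writing.
  echo "$(id -un) $(( $(date +%s) + dur ))" > "$f"
  echo "GPU $gpu locked for ${dur}s"
}

unlock() { # unlock <gpu-id>: release the lock early
  rm -f "$LOCKDIR/gpu$1.lock"
}

status() { # status: show each GPU's lock state, including expiry time
  local f owner expiry
  for f in "$LOCKDIR"/gpu*.lock; do
    [ -e "$f" ] || continue
    read -r owner expiry < "$f"
    if [ "$(date +%s)" -lt "$expiry" ]; then
      echo "$(basename "$f" .lock): locked by $owner until $(date -d "@$expiry")"
    else
      echo "$(basename "$f" .lock): lock by $owner has expired"
    fi
  done
}

"${@:-status}"   # dispatch, e.g.: gpulock lock 0 3600 / gpulock status
```

An expired lock file is simply ignored (and overwritten by the next `lock`), which gives the automatic-release behaviour without needing a daemon; `status` shows the owner and expiry to everyone.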

Please suggest tools or scripts I can use (or create) for this purpose. I searched for existing solutions but could not find one that fits this use case.

In the future, I believe this discussion would help anyone setting up a shared workspace who does not want to rely on manual checks and due diligence, e.g. someone forgetting to check whether a GPU is in use and crashing an already running script.

PS: I am aware that setting up Slurm would enable queued requests, but we do not want to set that up right now and are looking for a lighter, easier solution.

