GPU Settings¶
Configure NVIDIA GPU support for accelerated training.
Overview¶
GPU acceleration significantly speeds up:

- Neural network training
- Foundation model embeddings
- Large dataset processing
Requirements¶
- NVIDIA GPU with CUDA support
- NVIDIA drivers installed
- NVIDIA Container Toolkit (for Docker mode)
Checking GPU Availability¶
Host System¶
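Run `nvidia-smi` on the host to confirm the driver can see your GPU(s):

```bash
nvidia-smi
```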
The output should show your GPU(s) along with the driver version and CUDA version.
In Docker¶
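To confirm a container can reach the GPU, run `nvidia-smi` inside a CUDA base image (the tag below is the one used in the Troubleshooting section; any recent CUDA image works):

```bash
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
```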
Installing NVIDIA Container Toolkit¶
Follow the official guide: NVIDIA Container Toolkit Installation
Quick Install (Ubuntu/Debian)¶
# Add repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verification¶
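After restarting Docker, two quick checks confirm the toolkit is working (the same commands reappear in the Troubleshooting section below):

```bash
# The nvidia runtime should appear in Docker's runtime list
docker info | grep -i runtime

# A CUDA container should be able to see the GPU
docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
```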
Running with GPU¶
Default (GPU Enabled)¶
GPU is detected and used automatically.
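The invocation below is illustrative; it assumes the same `./run.sh` entry point used in the local-mode examples later on this page:

```bash
# No GPU flag needed; the GPU is picked up when present
./run.sh
```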
CPU Only¶
Disables GPU even if available.
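For example (again assuming the `./run.sh` entry point):

```bash
./run.sh --cpu-only
```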
Specific GPUs (Local Mode Only)¶
In local mode, you can use CUDA_VISIBLE_DEVICES:
# Use only GPU 0
CUDA_VISIBLE_DEVICES=0 ./run.sh --local
# Use GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 ./run.sh --local
# Use no GPU (equivalent to --cpu-only)
CUDA_VISIBLE_DEVICES="" ./run.sh --local
In Docker mode, Agentomics uses all available GPUs and does not expose a flag to select a subset.
GPU Memory¶
Monitoring¶
During training, monitor GPU memory:
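A simple approach is to poll `nvidia-smi`; either of the commands below works:

```bash
# Refresh the full nvidia-smi view every second
watch -n 1 nvidia-smi

# Or report only the memory counters, once per second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```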
Out of Memory¶
If you encounter OOM errors:
- Use `--cpu-only` to switch to CPU
- Reduce batch size in generated training scripts
- Use a smaller model architecture
- Use gradient accumulation
Multi-GPU¶
Agentomics-ML supports multi-GPU training:
- Agent-generated scripts may use DataParallel or DistributedDataParallel
- All available GPUs are passed to containers by default
To limit GPUs:
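In local mode, reuse `CUDA_VISIBLE_DEVICES` as shown earlier; for containers you start by hand, use the Docker flags in the next subsection. For example:

```bash
# Local mode: restrict training to GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 ./run.sh --local
```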
Docker GPU Flags¶
When running containers manually, you can limit GPUs with Docker flags:
docker run --gpus all ... # All GPUs
docker run --gpus '"device=0"' ... # Specific GPU
docker run --gpus 2 ... # First 2 GPUs
Troubleshooting¶
"nvidia-smi not found"¶
NVIDIA drivers not installed. Install from: NVIDIA Driver Downloads
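On Ubuntu, one possible route is the `ubuntu-drivers` helper (shown as an example, not the only option):

```bash
sudo ubuntu-drivers autoinstall
sudo reboot
```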
"docker: Error response from daemon: could not select device driver"¶
NVIDIA Container Toolkit not installed or configured. Follow installation steps above.
GPU not detected in container¶
- Verify host GPU works: `nvidia-smi`
- Verify Docker can see GPU: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`
- Check Docker runtime: `docker info | grep -i runtime`
CUDA version mismatch¶
The container uses a specific CUDA version. If your driver is older:
- Update NVIDIA drivers
- Or use `--cpu-only` mode
Performance is slow¶
- Check GPU utilization with `nvidia-smi`
- Ensure you're not CPU-bound (data loading)
- Verify training is actually using the GPU (check `nvidia-smi` during training; see the query below)
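A live utilization query (same `nvidia-smi` tool, different fields) makes it easy to spot an idle GPU during training:

```bash
# Print GPU utilization and memory use once per second
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```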
Local Mode GPU¶
In local mode, the GPU is used automatically if:

- NVIDIA drivers are installed
- PyTorch CUDA is available
Check PyTorch CUDA:
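A shell one-liner using PyTorch's standard `torch.cuda` API is enough:

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```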
Cloud GPU Instances¶
AWS¶
Use GPU instances (p3, p4, g4, or g5 series) with the Deep Learning AMI.
Google Cloud¶
Use GPU instances with the Deep Learning VM image.
Azure¶
Use NC/ND series VMs with GPU support.