RX 7900 XTX (24 GB) handles QLoRA fine-tuning of 7B–13B models comfortably. FlashAttention-2 works via ROCm CK (install separately); the ROCm fork of bitsandbytes is required for quantized training.
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
Install hints
- git clone https://github.com/axolotl-ai-cloud/axolotl && cd axolotl
- pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
- pip install packaging ninja && pip install flash-attn --no-build-isolation
- pip install -e '.[deepspeed]'
- accelerate launch -m axolotl.cli.train examples/llama-3/qlora.yml
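Quick sanity check before a long training run (a sketch, not part of axolotl): confirm the ROCm bitsandbytes build can do the NF4 4-bit load that QLoRA relies on. The model name below is just a small placeholder; any Hugging Face causal LM works.
# Verify 4-bit (NF4) loading through the ROCm bitsandbytes fork
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA-style NF4 quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",                    # placeholder model, swap for your own
    quantization_config=bnb,
    device_map="auto",
)
print(next(model.parameters()).device)      # expect cuda:0 (HIP maps onto the 'cuda' API)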
Works well on RX 7900 XTX with PyTorch ROCm 6.2+. SDXL runs comfortably in 24 GB VRAM. Flux.1 also works but requires careful memory management.
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
Install hints
- Linux: git clone https://github.com/comfyanonymous/ComfyUI && cd ComfyUI
- python -m venv venv && source venv/bin/activate
- pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2
- pip install -r requirements.txt && python main.py --listen
- Windows: the ROCm PyTorch wheels are Linux-only (the HIP SDK alone doesn't enable them); run the Linux steps under WSL2, or install torch-directml and launch with python main.py --directml
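A minimal sketch to confirm the ROCm wheel (and not a CPU build) is what actually got installed into the venv:
import torch
print(torch.__version__)               # should end in +rocm6.2 or similar
print(torch.version.hip)               # HIP runtime version; None on a CPU/CUDA build
print(torch.cuda.is_available())       # True — ROCm devices appear under the 'cuda' API
print(torch.cuda.get_device_name(0))   # e.g. 'AMD Radeon RX 7900 XTX'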
RX 7900 XTX — excellent performance. ExLlamaV2 is one of the fastest GPTQ/EXL2 backends on AMD. Mistral 7B EXL2 4bpw runs at ~80 tok/s. 24 GB allows 34B Q4.
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
Install hints
- pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
- pip install exllamav2
- python test_inference.py -m /path/to/model -p "Hello world"
- Or build from source for the latest features: git clone https://github.com/turboderp/exllamav2 && cd exllamav2 && pip install -e .
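Beyond test_inference.py, a minimal generation sketch with the ExLlamaV2 Python API; paths and sampler settings are placeholders, and the generator API has shifted across releases, so treat the repo's examples/ as canonical.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/exl2-model"   # directory with config.json + *.safetensors
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)                # spreads weights across available VRAM
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
print(generator.generate_simple("Hello world", settings, 128))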
faster-whisper itself targets CUDA (its CTranslate2 backend has no ROCm support); on AMD, either take the openai-whisper or whisperX route with PyTorch + ROCm for GPU inference, or run faster-whisper on CPU with int8 quantization (still fast for short clips).
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
Install hints
- pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
- Verify torch.cuda.is_available() returns True (yes, 'cuda' under ROCm)
- For pure faster-whisper: use device='cpu', compute_type='int8' as fallback
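A short sketch of the CPU int8 fallback with faster-whisper itself (model size and file name are placeholders):
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("clip.wav", beam_size=5)
print(info.language, info.language_probability)
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")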
Compile with GGML_HIP=ON. Runs well on RX 7900 XTX; Q4_K_M models up to 70B fit in 24 GB. Pre-built HIP binaries available in GitHub releases. Vulkan build also works if you prefer to avoid ROCm: cmake -B build -DGGML_VULKAN=ON.
Install hints
- Linux: git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
- HIP (ROCm): cmake -B build -DGGML_HIP=ON && cmake --build build --config Release -j$(nproc)
- Vulkan (no ROCm needed): cmake -B build -DGGML_VULKAN=ON && cmake --build build --config Release -j$(nproc)
- Windows: download pre-built HIP binary from GitHub Releases (look for 'hip' in filename)
- Verify GPU is used: ./build/bin/llama-cli -m model.gguf -p 'hello' --n-gpu-layers 99
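The same build also produces llama-server, which exposes an OpenAI-compatible endpoint; a sketch of querying it from Python, assuming you first start something like ./build/bin/llama-server -m model.gguf --n-gpu-layers 99 (default port 8080):
import json, urllib.request

payload = {
    "messages": [{"role": "user", "content": "Say hello from the 7900 XTX"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])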
Works out of the box on Linux with ROCm 6.x. Tested on RX 7900 XTX (24 GB) running Qwen 2.5 14B and Llama 3.1 8B.
Install hints
- Linux: curl -fsSL https://ollama.com/install.sh | sh
- Windows: download the Ollama installer from https://ollama.com/download/windows (ships HIP libs)
- Verify with: ollama run llama3.1:8b (should hit GPU, not CPU)
- Watch GPU usage live: watch -n 1 rocm-smi
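Ollama also serves a local REST API on port 11434; a minimal Python sketch against the /api/generate route:
import json, urllib.request

payload = {"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])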
Works on RX 7900 XTX with PyTorch ROCm wheels. SDXL and SD 1.5 run well. Flux.1 requires additional setup (install flux dependencies separately).
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
Install hints
- git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui && cd stable-diffusion-webui
- Set TORCH_COMMAND before launch: export TORCH_COMMAND='pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2'
- Linux launch: ./webui.sh
- Windows: ROCm PyTorch wheels are Linux-only; run the webui under WSL2 with the same TORCH_COMMAND, or use the DirectML fork (lshqqytiger/stable-diffusion-webui-directml)
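Once it's running, the webui can be scripted over its REST API if you launch with the --api flag; a sketch against the built-in /sdapi/v1/txt2img route (prompt, size, and output path are placeholders):
import base64, json, urllib.request

payload = {"prompt": "a watercolor fox", "steps": 20, "width": 768, "height": 768}
req = urllib.request.Request(
    "http://127.0.0.1:7860/sdapi/v1/txt2img",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    img_b64 = json.loads(resp.read())["images"][0]   # base64-encoded PNG
with open("out.png", "wb") as f:
    f.write(base64.b64decode(img_b64))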
RX 7900 XTX works well. vLLM pre-allocates GPU memory (90% by default), so 24 GB lets you run 13B–34B models; use --gpu-memory-utilization to tune. Flash attention is supported via ROCm's CK library.
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
Install hints
- python -m venv venv && source venv/bin/activate
- pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm6.2
- python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct
- Verify with: curl http://localhost:8000/v1/models
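Beyond the curl check, a sketch of talking to the server with the openai Python client (pip install openai; the api_key is a dummy since vLLM doesn't check it by default):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "One sentence on ROCm."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)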