Axolotl: the RX 7900 XTX (24 GB) handles QLoRA fine-tuning of 7B–13B models comfortably. FlashAttention-2 works through ROCm CK and must be installed separately. Quantized training requires the ROCm fork of bitsandbytes.
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
Install hints
- git clone https://github.com/axolotl-ai-cloud/axolotl && cd axolotl
- pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
- pip install packaging ninja && pip install flash-attn --no-build-isolation
- pip install -e '.[deepspeed]'
- accelerate launch -m axolotl.cli.train examples/llama-3/qlora.yml
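When a job starts from a Python entry point rather than a shell, the two variables listed under ENV vars can be applied in-process. A minimal sketch, assuming they need to be in the environment before torch is first imported; the helper name apply_rocm_env is hypothetical and not part of Axolotl or PyTorch.

```python
import os

# Values copied from the ENV vars list above.
ROCM_ENV = {
    "HSA_OVERRIDE_GFX_VERSION": "11.0.0",
    "PYTORCH_HIP_ALLOC_CONF": "expandable_segments:True",
}


def apply_rocm_env(env=os.environ):
    """Set each variable unless the caller already exported a value."""
    for key, value in ROCM_ENV.items():
        env.setdefault(key, value)
    return {key: env[key] for key in ROCM_ENV}


if __name__ == "__main__":
    # Assumption: call this before the first `import torch` in the process.
    print(apply_rocm_env())
```

Using setdefault keeps any value already exported in the shell, so the shell remains the place to override these settings.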
ComfyUI works well on the RX 7900 XTX with PyTorch ROCm 6.2+. SDXL runs comfortably in 24 GB VRAM. Flux.1 also works but requires careful memory management.
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
Install hints
- Linux: git clone https://github.com/comfyanonymous/ComfyUI && cd ComfyUI
- python -m venv venv && source venv/bin/activate
- pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2
- pip install -r requirements.txt && python main.py --listen
- Windows (HIP SDK): pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2
ExLlamaV2 performs very well on the RX 7900 XTX and is one of the fastest GPTQ/EXL2 backends on AMD. Mistral 7B EXL2 at 4 bpw runs at ~80 tok/s, and 24 GB of VRAM is enough for a 34B model at Q4.
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
Install hints
- pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
- pip install exllamav2
- python test_inference.py -m /path/to/model -p "Hello world" (test_inference.py ships in the exllamav2 repository, so run it from a clone)
- Or build from source for latest features: git clone https://github.com/turboderp/exllamav2 && cd exllamav2 && pip install -e .
- cmake -DCMAKE_BUILD_TYPE=Release . && cmake --build . --target exl2 --config Release
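The "24 GB allows 34B Q4" figure can be sanity-checked with simple arithmetic: weight size is roughly parameters times bits per weight divided by 8. The sketch below is a rough estimate only; the 4 GB reserve for KV cache and runtime overhead is an assumption, not a measured value, and real Q4 formats often use slightly more than 4 bits per weight.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float = 24.0, reserve_gb: float = 4.0) -> bool:
    """True if weights plus an assumed reserve fit in VRAM."""
    return weight_gb(params_billion, bits_per_weight) + reserve_gb <= vram_gb


if __name__ == "__main__":
    for name, params, bits in [("Mistral 7B", 7, 4), ("34B", 34, 4)]:
        size = weight_gb(params, bits)
        print(f"{name} at {bits} bpw: ~{size:.1f} GB weights, fits={fits(params, bits)}")
```

At 4 bits per weight a 34B model needs about 17 GB for weights, which leaves a few GB of the 24 GB card for context.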
faster-whisper itself targets CUDA. On AMD, either use openai-whisper or whisperX with PyTorch + ROCm, or run faster-whisper on the CPU with int8 quantization (still fast for short clips). For GPU inference on AMD, use the openai-whisper-rocm fork.
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
Install hints
- pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
- Verify torch.cuda.is_available() returns True (yes, 'cuda' under ROCm)
- For pure faster-whisper: use device='cpu', compute_type='int8' as fallback
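The last two hints can be sketched in Python as follows. The first helper reports whether PyTorch sees a GPU (ROCm builds answer through the torch.cuda API and expose a version string in torch.version.hip); the second returns the CPU fallback settings quoted above. The function names are hypothetical, and the model size "small" in the commented usage is only an example.

```python
import importlib.util


def rocm_torch_status():
    """Report whether PyTorch is installed, is a HIP build, and sees a GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False, "hip": None, "gpu": False}
    import torch
    return {
        "installed": True,
        # None on CUDA or CPU-only builds, a version string on ROCm builds.
        "hip": getattr(torch.version, "hip", None),
        "gpu": torch.cuda.is_available(),
    }


def faster_whisper_cpu_kwargs():
    """CPU fallback settings for faster-whisper, as given in the hints."""
    return {"device": "cpu", "compute_type": "int8"}


if __name__ == "__main__":
    print(rocm_torch_status())
    # With faster-whisper installed, the fallback would be used like this:
    # from faster_whisper import WhisperModel
    # model = WhisperModel("small", **faster_whisper_cpu_kwargs())
```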
llama.cpp on the RX 7900 XTX (24 GB VRAM). Q4_K_M models up to roughly 30B fit entirely in VRAM; a 70B Q4_K_M file is around 40 GB, so it needs the dual-GPU setup or partial CPU offload. Tested on ROCm 7.2.3 / Ubuntu 24.04. Benchmark: Gemma 4 26B Q4_K_M ~102 t/s TG (token generation), ~3355 t/s PP (prompt processing).
WARNING: omitting -DGGML_HIP=ON compiles fine but silently falls back to CPU, so always verify that the GPU is used. Set HIP_VISIBLE_DEVICES=0 if you have an iGPU, to prevent ROCm from picking it up as a second device. Add -ngl 99 when running; without it, layers run on the CPU regardless of build flags. A Vulkan build also works if you prefer to avoid ROCm.
EXPERIMENTAL: TBQ4 KV cache + MTP on a community fork enables 64k context in ~20 GB VRAM (38-54 tok/s on Qwen3-27B Q4_K_M). See github.com/DrBearJew/llama.cpp, tree tbq4-rdna3-experiment.
DUAL GPU: Tensor parallelism requires ROCm; Vulkan TP is a WIP and only works for very small contexts. Set HIP_VISIBLE_DEVICES=0,1 and ROCBLAS_USE_HIPBLASLT=1. Use --split-mode tensor --tensor-split 1,1 --flash-attn on. Q8_0 fits in 48 GB (2x24 GB) and gives the best accuracy. MTP (--spec-type draft-mtp --spec-draft-n-max 3) works on dual GPU but limits --parallel to 1 and disables mmproj. Use --no-mmap if system RAM < total VRAM; --ctx-checkpoints 16 manages RAM overhead at 131072 context. See github.com/ggml-org/llama.cpp/pull/22673 for MTP status.
Install hints
- Linux: git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
- HIP (ROCm) — full flags: cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/opt/rocm/bin/hipcc -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc && cmake --build build -j$(nproc)
- Vulkan (no ROCm needed): cmake -B build -DGGML_VULKAN=ON && cmake --build build --config Release -j$(nproc)
- Single GPU run: HIP_VISIBLE_DEVICES=0 ./build/bin/llama-server -m model.gguf -ngl 99 --host 0.0.0.0 --port 8000
- Dual GPU env: export HIP_VISIBLE_DEVICES=0,1 ROCBLAS_USE_HIPBLASLT=1
- Dual GPU run: ./build/bin/llama-server -m model.gguf -ngl 99 --split-mode tensor --tensor-split 1,1 --flash-attn on --ctx-size 131072 --no-mmap --host 0.0.0.0 --port 8000
- Dual GPU + MTP: add --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-ngl 99 --parallel 1 (MTP: no mmproj, parallel locked to 1)
- Windows: download pre-built HIP binary from GitHub Releases (look for 'hip' in filename)
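The run hints above combine several conditional flags, which is easy to get wrong by hand. The sketch below assembles the llama-server argument list from those rules. The function llama_server_args is hypothetical, and every flag is copied from the notes above rather than checked against a particular llama.cpp release.

```python
import shlex


def llama_server_args(model, dual_gpu=False, mtp=False,
                      system_ram_gb=None, total_vram_gb=None):
    """Build a llama-server argument list from the flags described above."""
    # -ngl 99 is always added so that layers are offloaded to the GPU.
    args = ["./build/bin/llama-server", "-m", model, "-ngl", "99"]
    if dual_gpu:
        args += ["--split-mode", "tensor", "--tensor-split", "1,1",
                 "--flash-attn", "on"]
    if mtp:
        # Per the notes above, MTP limits --parallel to 1.
        args += ["--spec-type", "draft-mtp", "--spec-draft-n-max", "3",
                 "--spec-draft-ngl", "99", "--parallel", "1"]
    if (system_ram_gb is not None and total_vram_gb is not None
            and system_ram_gb < total_vram_gb):
        # Rule from the notes: use --no-mmap when system RAM < total VRAM.
        args.append("--no-mmap")
    args += ["--host", "0.0.0.0", "--port", "8000"]
    return args


if __name__ == "__main__":
    # Hypothetical machine: 32 GB system RAM, two 24 GB cards.
    cmd = llama_server_args("model.gguf", dual_gpu=True,
                            system_ram_gb=32, total_vram_gb=48)
    print(shlex.join(cmd))
```

The environment variables (HIP_VISIBLE_DEVICES, ROCBLAS_USE_HIPBLASLT) are deliberately left to the shell, as in the hints.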
Ollama works out of the box on Linux with ROCm 6.x. Tested on RX 7900 XTX (24 GB) running Qwen 2.5 14B and Llama 3.1 8B.
Install hints
- Linux: curl -fsSL https://ollama.com/install.sh | sh
- Windows: download the Ollama installer from https://ollama.com/download/windows (ships HIP libs)
- Verify with: ollama run llama3.1:8b, then run ollama ps in a second terminal (the PROCESSOR column should report GPU, not CPU)
- Watch GPU usage live: watch -n 1 rocm-smi
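The GPU check can also be scripted against Ollama's local REST API. This sketch assumes the default address http://localhost:11434 and relies on the size and size_vram fields that the /api/ps endpoint reports for each loaded model; the helper names are hypothetical.

```python
import json
import urllib.request


def gpu_share(ps_payload):
    """Map each loaded model name to the fraction of its bytes held in VRAM."""
    shares = {}
    for model in ps_payload.get("models", []):
        size = model.get("size", 0)
        vram = model.get("size_vram", 0)
        shares[model.get("name", "?")] = vram / size if size else 0.0
    return shares


def fetch_ps(base_url="http://localhost:11434"):
    """Fetch the list of currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        payload = fetch_ps()
    except OSError as err:
        print(f"Ollama not reachable: {err}")
    else:
        for name, share in gpu_share(payload).items():
            print(f"{name}: {share:.0%} in VRAM")
```

A share well below 100% suggests that part of the model was placed in system RAM and is running on the CPU.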
AUTOMATIC1111 stable-diffusion-webui works on the RX 7900 XTX with PyTorch ROCm wheels. SDXL and SD 1.5 run well. Flux.1 requires additional setup (install flux dependencies separately).
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
Install hints
- git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui && cd stable-diffusion-webui
- Set TORCH_COMMAND before launch: export TORCH_COMMAND='pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2'
- Linux launch: ./webui.sh
- Windows: set TORCH_COMMAND=pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2 && webui-user.bat
vLLM works well on the RX 7900 XTX. It pre-allocates GPU memory (90% by default), so 24 GB lets you run 13–34B models. Use --gpu-memory-utilization to tune the fraction. FlashAttention is supported via ROCm's CK library.
ENV vars
export HSA_OVERRIDE_GFX_VERSION=11.0.0
Install hints
- python -m venv venv && source venv/bin/activate
- pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm6.2
- python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct
- Verify with: curl http://localhost:8000/v1/models
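To see what the 90% pre-allocation means on a 24 GB card, the budget can be worked out directly: 24 GB × 0.90 = 21.6 GB, from which the model weights are subtracted. The sketch below is plain arithmetic, not a vLLM API; the 2 bytes per parameter used for 16-bit weights is an assumption, and the remainder is shared by the KV cache and activation buffers.

```python
def vllm_budget_gb(total_vram_gb=24.0, gpu_memory_utilization=0.90):
    """GPU memory vLLM will claim for a given utilization fraction."""
    return total_vram_gb * gpu_memory_utilization


def headroom_gb(weights_gb, total_vram_gb=24.0, gpu_memory_utilization=0.90):
    """Memory left after weights; negative means the model does not fit."""
    return vllm_budget_gb(total_vram_gb, gpu_memory_utilization) - weights_gb


if __name__ == "__main__":
    # Assumption: 16-bit weights take about 2 GB per billion parameters.
    for name, params_billion in [("8B", 8), ("13B", 13)]:
        weights = params_billion * 2.0
        print(f"{name} at 16-bit: {weights:.0f} GB weights, "
              f"headroom {headroom_gb(weights):.1f} GB")
```

A negative headroom for a 13B model at 16-bit shows why the larger models in the 13–34B range need quantized weights on a single 24 GB card.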