rocmate

gfx1100 — RX 7900 XT/XTX

Chip: gfx1100  ·  8 tool(s) with data

Axolotl ✅ tested ROCm 6.2

RX 7900 XTX (24 GB) handles QLoRA fine-tuning of 7B–13B models comfortably. Flash-attention 2 works via ROCm CK (install separately). The bitsandbytes ROCm fork is required for quantized training.

ENV vars

  • export HSA_OVERRIDE_GFX_VERSION=11.0.0
  • export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True

Install hints

  • git clone https://github.com/axolotl-ai-cloud/axolotl && cd axolotl
  • pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
  • pip install packaging ninja && pip install flash-attn --no-build-isolation
  • pip install -e '.[deepspeed]'
  • accelerate launch -m axolotl.cli.train examples/llama-3/qlora.yml
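
Before launching a long fine-tuning run, it can help to confirm that the shell actually exported the variables listed under ENV vars. The following is a minimal sketch, not part of Axolotl: the names EXPECTED_ENV and env_problems are hypothetical, and the expected values are simply copied from the list above.

```python
import os

# Expected values copied from the ENV vars list above (hypothetical helper, not an Axolotl API).
EXPECTED_ENV = {
    "HSA_OVERRIDE_GFX_VERSION": "11.0.0",
    "PYTORCH_HIP_ALLOC_CONF": "expandable_segments:True",
}


def env_problems(env=None, expected=EXPECTED_ENV):
    """Return a list of human-readable problems; an empty list means the env matches."""
    env = os.environ if env is None else env
    problems = []
    for name, want in expected.items():
        got = env.get(name)
        if got is None:
            problems.append(f"{name} is not set (expected {want})")
        elif got != want:
            problems.append(f"{name}={got} (expected {want})")
    return problems


if __name__ == "__main__":
    # Print each problem, or a single confirmation line when nothing is wrong.
    for line in env_problems() or ["environment matches the expected values"]:
        print(line)
```

Run it in the same shell that will run accelerate launch, since exported variables do not carry over to other terminals.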
ComfyUI ✅ tested ROCm 6.2

Works well on RX 7900 XTX with PyTorch ROCm 6.2+. SDXL runs comfortably in 24 GB VRAM. Flux.1 also works but requires careful memory management.

ENV vars

  • export HSA_OVERRIDE_GFX_VERSION=11.0.0
  • export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True

Install hints

  • Linux: git clone https://github.com/comfyanonymous/ComfyUI && cd ComfyUI
  • python -m venv venv && source venv/bin/activate
  • pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2
  • pip install -r requirements.txt && python main.py --listen
  • Windows (HIP SDK): pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2
ExLlamaV2 ✅ tested ROCm 6.2

Excellent performance on the RX 7900 XTX. ExLlamaV2 is one of the fastest GPTQ/EXL2 backends on AMD: Mistral 7B EXL2 at 4 bpw runs at ~80 tok/s, and 24 GB of VRAM is enough for 34B models at 4-bit quantization.

ENV vars

  • export HSA_OVERRIDE_GFX_VERSION=11.0.0

Install hints

  • pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
  • pip install exllamav2
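  • Or build from source for latest features: git clone https://github.com/turboderp/exllamav2 && cd exllamav2 && pip install -e .
  • Test from the repo checkout (test_inference.py ships in the repository, not in the pip package): python test_inference.py -m /path/to/model -p "Hello world"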
  • cmake -DCMAKE_BUILD_TYPE=Release . && cmake --build . --target exl2 --config Release
faster-whisper ✅ tested ROCm 6.2

faster-whisper itself targets CUDA. For GPU inference on AMD, use the openai-whisper or whisperX route with PyTorch + ROCm, or the openai-whisper-rocm fork. Alternatively, run faster-whisper on CPU with int8 quantization, which is still fast for short clips.

ENV vars

  • export HSA_OVERRIDE_GFX_VERSION=11.0.0

Install hints

  • pip install torch torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
  • Verify torch.cuda.is_available() returns True (yes, 'cuda' under ROCm)
  • For pure faster-whisper: use device='cpu', compute_type='int8' as fallback
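
The two checks above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names rocm_gpu_visible and faster_whisper_cpu_kwargs are hypothetical, and the fallback settings are the ones given in the last hint.

```python
def rocm_gpu_visible() -> bool:
    """Return True when a ROCm build of PyTorch can see at least one GPU."""
    try:
        import torch
    except ImportError:
        return False
    # ROCm builds set torch.version.hip; the device API is still called 'cuda'.
    is_rocm_build = getattr(torch.version, "hip", None) is not None
    return is_rocm_build and torch.cuda.is_available()


def faster_whisper_cpu_kwargs() -> dict:
    """CPU fallback settings for faster-whisper, as suggested in the hints above."""
    return {"device": "cpu", "compute_type": "int8"}


if __name__ == "__main__":
    print("ROCm GPU visible to PyTorch:", rocm_gpu_visible())
    print("faster-whisper fallback settings:", faster_whisper_cpu_kwargs())
```

With faster-whisper installed, the fallback would typically be passed as WhisperModel("small", **faster_whisper_cpu_kwargs()), where the model size "small" is only an example. A True result from rocm_gpu_visible() helps the PyTorch-based Whisper routes, not faster-whisper itself.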
llama.cpp ✅ tested ROCm 7.2.3

RX 7900 XTX, 24 GB VRAM. Tested on ROCm 7.2.3 / Ubuntu 24.04. Q4_K_M models up to roughly 30B fit entirely in VRAM; a 70B Q4_K_M file is around 40 GB, so it needs a second GPU or partial CPU offload. Benchmark: Gemma 4 26B Q4_K_M ~102 t/s TG (token generation), ~3355 t/s PP (prompt processing).

WARNING: omitting -DGGML_HIP=ON compiles fine but silently falls back to CPU, so always verify that the GPU is used. Add -ngl 99 when running; without it, layers run on CPU regardless of build flags. Set HIP_VISIBLE_DEVICES=0 if you have an iGPU, to prevent ROCm from picking it up as a second device. A Vulkan build also works if you prefer to avoid ROCm.

EXPERIMENTAL: TBQ4 KV cache + MTP on a community fork enables 64k context in ~20 GB VRAM (38-54 tok/s on Qwen3-27B Q4_K_M). See github.com/DrBearJew/llama.cpp, tree tbq4-rdna3-experiment.

DUAL GPU: Tensor parallelism requires ROCm; Vulkan TP is a work in progress and only works for very small contexts. Set HIP_VISIBLE_DEVICES=0,1 and ROCBLAS_USE_HIPBLASLT=1, and run with --split-mode tensor --tensor-split 1,1 --flash-attn on. Q8_0 fits in 48 GB (2x24 GB) and gives the best accuracy. MTP (--spec-type draft-mtp --spec-draft-n-max 3) works on dual GPU but limits --parallel to 1 and disables mmproj. Use --no-mmap if system RAM < total VRAM; --ctx-checkpoints 16 manages RAM overhead at 131072 context. See github.com/ggml-org/llama.cpp/pull/22673 for MTP status.
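
A rough way to sanity-check whether a given GGUF will fit is to multiply the parameter count by the quantization's bits per weight. The sketch below is an estimate only: the bits-per-weight figures are approximate, the 2 GB headroom for KV cache and compute buffers is an arbitrary assumption that grows with context size, and the function names are hypothetical.

```python
# Approximate bits per weight for two common GGUF quantizations (rough figures, not exact).
APPROX_BPW = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def approx_weights_gb(params_billion: float, quant: str) -> float:
    """Estimate the weight size in GB (10^9 bytes): params * bits-per-weight / 8."""
    return params_billion * APPROX_BPW[quant] / 8


def fits_in_vram(params_billion: float, quant: str, vram_gb: float,
                 headroom_gb: float = 2.0) -> bool:
    """True when the estimated weights plus an assumed headroom fit in vram_gb."""
    return approx_weights_gb(params_billion, quant) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for params, quant, vram in [(26, "Q4_K_M", 24), (27, "Q8_0", 24), (27, "Q8_0", 48)]:
        size = approx_weights_gb(params, quant)
        print(f"{params}B {quant}: ~{size:.1f} GB, fits in {vram} GB: "
              f"{fits_in_vram(params, quant, vram)}")
```

Treat the result as a first filter. Long contexts such as 131072 tokens need far more than the assumed headroom, so check actual usage with rocm-smi once the model is loaded.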

Install hints

  • Linux: git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
  • HIP (ROCm) — full flags: cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/opt/rocm/bin/hipcc -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc && cmake --build build -j$(nproc)
  • Vulkan (no ROCm needed): cmake -B build -DGGML_VULKAN=ON && cmake --build build --config Release -j$(nproc)
  • Single GPU run: HIP_VISIBLE_DEVICES=0 ./build/bin/llama-server -m model.gguf -ngl 99 --host 0.0.0.0 --port 8000
  • Dual GPU env: export HIP_VISIBLE_DEVICES=0,1 ROCBLAS_USE_HIPBLASLT=1
  • Dual GPU run: ./build/bin/llama-server -m model.gguf -ngl 99 --split-mode tensor --tensor-split 1,1 --flash-attn on --ctx-size 131072 --no-mmap --host 0.0.0.0 --port 8000
  • Dual GPU + MTP: add --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-ngl 99 --parallel 1 (MTP: no mmproj, parallel locked to 1)
  • Windows: download pre-built HIP binary from GitHub Releases (look for 'hip' in filename)
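
The single-GPU, dual-GPU, and MTP run lines above differ only in a few flags, so they can be assembled from one helper. This is a sketch: llama_server_cmd is a hypothetical name, and every flag is copied from the hints above rather than verified against a particular llama.cpp build.

```python
def llama_server_cmd(model, dual_gpu=False, mtp=False, ctx_size=None):
    """Return (env, argv) for llama-server, using only flags quoted in the hints above."""
    env = {"HIP_VISIBLE_DEVICES": "0,1" if dual_gpu else "0"}
    # -ngl 99 offloads all layers to the GPU.
    argv = ["./build/bin/llama-server", "-m", model, "-ngl", "99"]
    if dual_gpu:
        env["ROCBLAS_USE_HIPBLASLT"] = "1"
        argv += ["--split-mode", "tensor", "--tensor-split", "1,1", "--flash-attn", "on"]
    if ctx_size is not None:
        argv += ["--ctx-size", str(ctx_size)]
    if mtp:
        # Per the notes above, MTP locks --parallel to 1 and disables mmproj.
        argv += ["--spec-type", "draft-mtp", "--spec-draft-n-max", "3",
                 "--spec-draft-ngl", "99", "--parallel", "1"]
    argv += ["--host", "0.0.0.0", "--port", "8000"]
    return env, argv


if __name__ == "__main__":
    env, argv = llama_server_cmd("model.gguf", dual_gpu=True, ctx_size=131072)
    print(" ".join(f"{k}={v}" for k, v in env.items()), " ".join(argv))
```

To launch the server, merge the returned env into the current environment, for example subprocess.run(argv, env={**os.environ, **env}).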
Ollama ✅ tested ROCm 6.3

Works out of the box on Linux with ROCm 6.x. Tested on RX 7900 XTX (24 GB) running Qwen 2.5 14B and Llama 3.1 8B.

Install hints

  • Linux: curl -fsSL https://ollama.com/install.sh | sh
  • Windows: download the Ollama installer from https://ollama.com/download/windows (ships HIP libs)
  • Verify with: ollama run llama3.1:8b (should hit GPU, not CPU)
  • Watch GPU usage live: watch -n 1 rocm-smi
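
Besides watching rocm-smi, the GPU check can be scripted against Ollama's local HTTP API. The sketch below assumes the default port 11434 and that the /api/ps response lists each loaded model with size and size_vram fields; treat those field names as an assumption to confirm against the Ollama API documentation. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def gpu_fractions(ps_payload: dict) -> dict:
    """Map model name -> fraction of the loaded model that is resident in VRAM."""
    fractions = {}
    for model in ps_payload.get("models", []):
        size = model.get("size", 0)
        vram = model.get("size_vram", 0)
        fractions[model.get("name", "?")] = (vram / size) if size else 0.0
    return fractions


def fetch_ps(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of currently loaded models from a running Ollama server."""
    with urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for name, frac in gpu_fractions(fetch_ps()).items():
        print(f"{name}: {frac:.0%} in VRAM")
```

A fraction of 1.0 means the model sits entirely on the GPU; anything lower means part of it was offloaded to system RAM. Run ollama run llama3.1:8b first so that a model is loaded when the script queries the server.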
Stable Diffusion WebUI ✅ tested ROCm 6.2

Works on RX 7900 XTX with PyTorch ROCm wheels. SDXL and SD 1.5 run well. Flux.1 requires additional setup (install flux dependencies separately).

ENV vars

  • export HSA_OVERRIDE_GFX_VERSION=11.0.0
  • export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True

Install hints

  • git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui && cd stable-diffusion-webui
  • Set TORCH_COMMAND before launch: export TORCH_COMMAND='pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2'
  • Linux launch: ./webui.sh
  • Windows: set TORCH_COMMAND=pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2 && webui-user.bat
vLLM ✅ tested ROCm 6.2

RX 7900 XTX works well. vLLM pre-allocates a fixed share of GPU memory (90% by default), so 24 GB lets you run 13–34B models (at these sizes, in quantized form). Use --gpu-memory-utilization to tune that share. Flash-attention is supported via ROCm's CK library.

ENV vars

  • export HSA_OVERRIDE_GFX_VERSION=11.0.0

Install hints

  • python -m venv venv && source venv/bin/activate
  • pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm6.2
  • python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct
  • Verify with: curl http://localhost:8000/v1/models
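
The same verification can be done from Python when curl is not handy. This is a minimal sketch that assumes the server from the previous hint is listening on localhost:8000 and returns the usual OpenAI-style list, with a data array of objects that each carry an id. The names model_ids and fetch_models are hypothetical.

```python
import json
from urllib.request import urlopen


def model_ids(models_payload: dict) -> list:
    """Extract the served model names from a /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_models(base_url: str = "http://localhost:8000") -> dict:
    """Query the OpenAI-compatible model list of a running vLLM server."""
    with urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print("Models served:", model_ids(fetch_models()))
```

If the call fails with a connection error, the server is probably still loading weights; vLLM only opens the port once the model is ready.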