rocmate — gfx1151 (Radeon 8060S / 8050S (Strix Halo))

Chip: gfx1151 · 8 tools with data

Axolotl 🟡 partial ROCm 7.x

Strix Halo: QLoRA fine-tuning of 7B models should fit in the shared memory pool. FlashAttention-2 is available through ROCm Composable Kernel (CK), and the ROCm fork of bitsandbytes is required. Expect lower throughput than on a discrete GPU because of unified-memory bandwidth.

ENV vars

export HSA_OVERRIDE_GFX_VERSION=11.5.1
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True

Install hints

git clone https://github.com/axolotl-ai-cloud/axolotl && cd axolotl
pip install torch --index-url https://download.pytorch.org/whl/rocm7.2
Use gradient_checkpointing: true and micro_batch_size: 1 in your config
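
The two config settings above can be kept in a small fragment and merged into a full Axolotl config. This is a minimal sketch, assuming a QLoRA run: only gradient_checkpointing and micro_batch_size come from the hint, while the file name, the adapter and load_in_4bit keys, and the gradient_accumulation_steps value are illustrative assumptions. A real config also needs a base model and a dataset.

```shell
# Write a hypothetical Axolotl config fragment with memory-saving settings.
# Only gradient_checkpointing and micro_batch_size come from the hint above;
# the remaining keys and values are assumptions for a QLoRA run.
write_axolotl_fragment() {
  out="${1:-qlora-gfx1151.yml}"
  cat > "$out" <<'EOF'
adapter: qlora
load_in_4bit: true
micro_batch_size: 1
gradient_accumulation_steps: 4
gradient_checkpointing: true
EOF
  # Print the path so callers can pass it on to their own tooling.
  echo "$out"
}
```

Calling write_axolotl_fragment with no argument writes qlora-gfx1151.yml in the current directory; pass a path to write elsewhere.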

ComfyUI 🟡 partial ROCm 7.x

Strix Halo: works via PyTorch ROCm 7.x. The APU's shared memory limits model size: SD 1.5 and SDXL should work, while Flux.1 may be a tight fit. Use --lowvram to conserve memory.

ENV vars

export HSA_OVERRIDE_GFX_VERSION=11.5.1
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True

Install hints

git clone https://github.com/comfyanonymous/ComfyUI && cd ComfyUI
python -m venv venv && source venv/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm7.2
pip install -r requirements.txt && python main.py --listen --lowvram
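
The ENV vars above do not have to be exported for the whole shell session. As a sketch, a small wrapper can scope them to a single process; the function name with_gfx1151_env is a hypothetical helper, not part of ComfyUI or ROCm.

```shell
# Run one command with the gfx1151 ENV vars set only for that process.
# with_gfx1151_env is a hypothetical helper name used for illustration.
with_gfx1151_env() {
  env HSA_OVERRIDE_GFX_VERSION=11.5.1 \
      PYTORCH_HIP_ALLOC_CONF=expandable_segments:True \
      "$@"
}
```

For example, with_gfx1151_env python main.py --listen --lowvram launches ComfyUI with both variables set and leaves the parent shell's environment untouched.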

ExLlamaV2 🟡 partial ROCm 7.x

Strix Halo: should work via PyTorch ROCm 7.x. The GPTQ/EXL2 kernels may need rocBLAS from ROCm 7.x. APU unified memory limits model size to roughly 7B at Q4 with a reasonable context length.

ENV vars

export HSA_OVERRIDE_GFX_VERSION=11.5.1

Install hints

pip install torch --index-url https://download.pytorch.org/whl/rocm7.2
pip install exllamav2
Reduce max_seq_len if VRAM is exhausted.
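
To compare a model against the memory you have available, a rough lower bound on weight size is parameters × bits per weight / 8. The sketch below is only that arithmetic: it ignores the KV cache (which grows with max_seq_len), activations, and quantization metadata, so real usage is higher. The function name weights_gib is a hypothetical helper.

```shell
# Approximate weight footprint in GiB for a quantized model.
# $1 = parameters in billions, $2 = bits per weight.
# Lower bound only: excludes KV cache, activations, and quantization overhead.
weights_gib() {
  awk -v p="$1" -v b="$2" \
    'BEGIN { printf "%.2f\n", p * 1e9 * b / 8 / 1073741824 }'
}

weights_gib 7 4   # prints 3.26
```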

faster-whisper ✅ tested ROCm 7.13

Strix Halo (Radeon 8060S): tested successfully with insanely-fast-whisper-rocm on ROCm 7.13. It uses the native HuggingFace Transformers pipeline with PyTorch SDPA on HIP. The model loads to the GPU (the cuda device maps to ROCm HIP); hip=7.13.26154 was detected and SDPA was used. Supports word timestamps, VAD, and Demucs denoising.

Install hints

git clone https://github.com/beecave-homelab/insanely-fast-whisper-rocm && cd insanely-fast-whisper-rocm
mamba run -n therock pip install fastapi uvicorn python-multipart python-dotenv pyyaml click pydub ffmpeg-python soundfile rich huggingface-hub accelerate optimum gradio typer demucs openai-whisper
Also run: mamba install -n therock -c conda-forge ffmpeg -y
Download model: huggingface-cli download openai/whisper-large-v3-turbo
CLI: mamba run -n therock python -m insanely_fast_whisper_rocm.cli transcribe audio.wav --model openai/whisper-large-v3-turbo --language en --device cuda --timestamp-type word
Use --timestamp-type word (or chunk/sentence) to avoid post-processing bugs with None chunks
Tip: processing_time_seconds in output JSON shows actual GPU inference time
API server: mamba run -n therock python -m insanely_fast_whisper_rocm

llama.cpp ✅ tested ROCm 7.x

Strix Halo: compile with GGML_HIP=ON under ROCm 7.x. The Vulkan backend (GGML_VULKAN=ON) also works as a fallback. rocWMMA tuning may improve performance. Because the APU uses unified memory, monitor total system RAM usage, not just VRAM.

Install hints

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build -DGGML_HIP=ON && cmake --build build --config Release -j$(nproc)
Vulkan fallback: cmake -B build -DGGML_VULKAN=ON
Verify GPU: ./build/bin/llama-cli -m model.gguf -p 'hello' --n-gpu-layers 99
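
The choice between the HIP build and the Vulkan fallback can be scripted. This is a sketch under one assumption: that finding hipcc on PATH is a good enough sign that ROCm is installed. Many ROCm installs keep hipcc under /opt/rocm/bin without adding it to PATH, so adjust the check to your setup. The function name pick_llama_backend is hypothetical.

```shell
# Print the CMake flag for the llama.cpp backend to build.
# Assumption: hipcc on PATH means a usable ROCm toolchain is installed.
pick_llama_backend() {
  if command -v hipcc >/dev/null 2>&1; then
    echo "-DGGML_HIP=ON"
  else
    echo "-DGGML_VULKAN=ON"
  fi
}

# Usage (not run here):
#   cmake -B build "$(pick_llama_backend)"
#   cmake --build build --config Release -j"$(nproc)"
```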

Ollama ✅ tested ROCm 7.x

Radeon 8060S / 8050S (Strix Halo APU): works with ROCm 7.x, and Ollama detects the iGPU natively. The unified memory architecture means there is no dedicated VRAM; up to ~10 GB is usable for models, depending on system RAM configuration.

Install hints

curl -fsSL https://ollama.com/install.sh | sh
Verify with: ollama run llama3.1:8b --verbose (prints timing stats; confirm GPU offload with ollama ps or the server log)
Monitor via: watch -n 1 rocm-smi
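
While a model is loaded, ollama ps lists it with a PROCESSOR column. A small filter, sketched below, turns that into a yes/no check. The exact column layout is an assumption about current Ollama releases, and ollama_on_gpu is a hypothetical helper that only looks for the word GPU in the listing.

```shell
# Succeed if the `ollama ps` listing on stdin reports GPU for a loaded model.
# Assumption: the PROCESSOR column contains text such as "100% GPU".
ollama_on_gpu() {
  grep -q 'GPU'
}

# Usage (requires a running Ollama server with a loaded model):
#   ollama ps | ollama_on_gpu && echo "GPU in use"
```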

Stable Diffusion WebUI 🟡 partial ROCm 7.x

Strix Halo: works via PyTorch ROCm 7.x. APU shared memory may limit batch size and resolution. SD 1.5 runs comfortably; SDXL needs --medvram.

ENV vars

export HSA_OVERRIDE_GFX_VERSION=11.5.1
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True

Install hints

export TORCH_COMMAND='pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm7.2'
./webui.sh --medvram

vLLM ✅ tested ROCm 7.13 nightly (TheRock)

Strix Halo (gfx1151): use Kyuz0's Strix Halo vLLM toolbox/container. The stable image builds vLLM and PyTorch against TheRock ROCm gfx1151 nightlies and includes Strix-specific patches for amdsmi, AITER/FlashAttention, Triton, and APU unified-memory accounting. Smoke-tested locally on a Radeon 8060S with docker.io/kyuz0/vllm-therock-gfx1151:stable: PyTorch ROCm matmul worked, vLLM 0.19.2rc1.dev113+g6aa057c9d served facebook/opt-125m through the OpenAI API, and completions returned successfully. If startup reports free unified memory below the requested allocation, lower --gpu-memory-utilization. A Linux kernel with the gfx1151 CWSR fix and sane GTT/TTM limits is still recommended.

Install hints

git clone https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes && cd amd-strix-halo-vllm-toolboxes
Recommended: ./refresh_toolbox.sh stable
Manual toolbox: toolbox create vllm --image docker.io/kyuz0/vllm-therock-gfx1151:stable -- --device /dev/dri --device /dev/kfd --group-add video --group-add render --security-opt seccomp=unconfined
Enter and launch: toolbox enter vllm && start-vllm
Direct serve example: vllm serve <model> --gpu-memory-utilization 0.5 --attention-backend TRITON_ATTN --mm-encoder-attn-backend TRITON_ATTN
If vLLM refuses to start with a free-memory error, reduce --gpu-memory-utilization (for example 0.35 on busy 128 GB systems).
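
Once vllm serve is running, a single completion request is enough to repeat the smoke test described above. This sketch assumes the server listens on vLLM's default port 8000. VLLM_URL is a hypothetical override variable, and both function names are illustrative. The payload builder does no JSON escaping, so keep prompts free of quotes and backslashes.

```shell
# Build a minimal OpenAI-style completion request body.
# $1 = model name, $2 = prompt, $3 = max tokens (default 16).
# No JSON escaping is done; use simple prompts only.
completion_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' \
    "$1" "$2" "${3:-16}"
}

# Send the request to a running vLLM server.
# VLLM_URL is a hypothetical override; the default assumes port 8000.
vllm_smoke_test() {
  curl -sS "${VLLM_URL:-http://localhost:8000}/v1/completions" \
    -H 'Content-Type: application/json' \
    -d "$(completion_payload "$1" "${2:-Hello}")"
}
```

For example, vllm_smoke_test facebook/opt-125m should return a JSON completion if the server started correctly.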