<aside> 🧠

Gemma 4 on an M2 Mac (32GB): Setup + Context Window Guide

Get a fast local LLM workflow on Apple Silicon—and understand why context length can make performance fall off a cliff.

</aside>

Apple’s Unified Memory means your CPU and GPU share the same 32GB pool. Because the KV cache grows linearly with context length, context size competes with the model weights for that same memory, making it just as important a knob.
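To see why context length matters so much, here is a back-of-envelope KV-cache estimate. The architecture numbers below (layer count, KV heads, head dimension) are illustrative assumptions, not official Gemma 4 specs; plug in real values if you have them.

```python
# Rough KV-cache size estimate for a decoder-only transformer.
# Per layer, the cache holds one K and one V tensor of shape
# (ctx_len, n_kv_heads, head_dim), stored here as fp16 (2 bytes).
def kv_cache_bytes(ctx_len, n_layers=48, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * dtype_bytes

for ctx in (4096, 32768):
    print(f"{ctx:>6} tokens ≈ {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

Under these assumed dimensions, going from a 4K to a 32K window multiplies the cache by 8x — several extra gigabytes carved out of the same 32GB the weights live in.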


🛠 Step 1: The Setup (Recommended: Ollama)

Ollama is currently one of the fastest ways to run Gemma 4 on macOS because it ships Metal-accelerated inference (via its llama.cpp backend) out of the box — no GPU configuration required.

1) Install Ollama
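If you prefer the command line over the .dmg from ollama.com, Homebrew is a common route (both work; this is just one option):

```shell
# Install the Ollama CLI via Homebrew
brew install ollama

# Start the server (the .dmg app does this automatically;
# the CLI install needs it running in the background)
ollama serve &
```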

2) Run the model

Open Terminal and run:

ollama run gemma4:26b

Tip: The 26B MoE variant is a great fit for 32GB machines because it activates only about 4B parameters per token, making it surprisingly fast for its size.
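A quick sanity check that 26B parameters actually fit in 32GB. The quantization bit-width here is an assumption (a typical 4-bit scheme with scale overhead, e.g. a Q4_K-style quant); your download may differ.

```python
# Back-of-envelope weight footprint for a 26B-parameter model.
# 4.5 bits/weight approximates 4-bit quantization plus scale metadata
# (an assumption for illustration, not the exact Gemma 4 quant).
params = 26e9
bits_per_weight = 4.5
weights_gib = params * bits_per_weight / 8 / 2**30
print(f"weights ≈ {weights_gib:.1f} GiB")
```

Roughly 14 GiB of weights leaves real headroom on a 32GB machine — but a large KV cache plus macOS itself can eat most of the rest, which is why the context setting below matters.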

3) Adjust context length (critical)

By default, Ollama starts with a small context window (a few thousand tokens, depending on version), regardless of what the model supports. To raise it, create a file named Modelfile containing:

FROM gemma4:26b
PARAMETER num_ctx 32768

Then build and run the customized model:

ollama create gemma4-32k -f Modelfile
ollama run gemma4-32k
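You can confirm the parameter took effect with `ollama show` (the exact output format varies by Ollama version):

```shell
# Print the custom model's parameters; num_ctx should read 32768
ollama show gemma4-32k --parameters
```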

4) (Optional) Run Open WebUI (non-conflicting port)

Run this in your terminal to start Open WebUI with a non-conflicting port:

docker run -d \
  -p 55001:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
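The container reaches Ollama on your Mac through `host.docker.internal`, so both ends need to be up. A quick way to check each side (assumes Ollama is on its default port 11434):

```shell
# Ollama's API should list your models on the host...
curl -s http://localhost:11434/api/tags

# ...and Open WebUI should load on the remapped port
open http://localhost:55001
```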