<aside> 🧠
Gemma 4 on an M2 Mac (32GB): Setup + Context Window Guide
Get a fast local LLM workflow on Apple Silicon—and understand why context length can make performance fall off a cliff.
</aside>
Apple’s Unified Memory means your CPU and GPU share the same 32GB pool, so the context window’s KV cache competes directly with the model weights for memory.
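To see why context length matters so much, here is a rough KV-cache size estimator. The layer/head numbers below are placeholders for illustration only, not Gemma 4’s actual architecture (which isn’t specified here):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """KV cache size: two tensors (K and V) per layer, each of shape
    [n_ctx, n_kv_heads * head_dim], at fp16 = 2 bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Hypothetical architecture, for illustration only:
layers, kv_heads, hdim = 48, 8, 128
for ctx in (4_096, 32_768):
    gib = kv_cache_bytes(layers, kv_heads, hdim, ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB KV cache")
```

With these made-up numbers, the cache grows linearly with context: 8× the tokens means 8× the memory, which is why a long context can push a 32GB machine into swapping.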
Ollama is currently one of the easiest ways to run Gemma 4 on macOS: it runs models on the Apple Silicon GPU via llama.cpp’s Metal backend (not Apple’s MLX framework, which is a separate runtime).
Open Terminal and run:
ollama run gemma4:26b
Tip: The 26B MoE variant is a good fit for 32GB machines because it activates only about 4B parameters per token, making it surprisingly fast.
By default, Ollama starts with a small context window, well below the model’s maximum. To increase it, create a Modelfile:
FROM gemma4:26b
PARAMETER num_ctx 32768
Then build and run the customized model (keep in mind that a larger num_ctx means a larger KV cache, drawn from the same 32GB pool as the weights):
ollama create gemma4-32k -f Modelfile
ollama run gemma4-32k
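Besides the interactive CLI, Ollama exposes a local HTTP API (on port 11434 by default). A minimal sketch of a non-streaming request to the model created above — the prompt is just an example, and per-request options like num_ctx can override the Modelfile:

```python
import json
import urllib.request

def build_payload(prompt, model="gemma4-32k", num_ctx=32768):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one complete response
        "options": {"num_ctx": num_ctx},  # per-request context override
    }

def generate(prompt, url="http://localhost:11434/api/generate"):
    """Send the request to a locally running Ollama server."""
    data = json.dumps(build_payload(prompt)).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

This is handy for scripting batch jobs against the same model your chat UI uses.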
Run this in your terminal to start Open WebUI on a port that won’t conflict with other local services:
docker run -d \
-p 55001:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Once the container is up, open http://localhost:55001 in your browser.
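If the page doesn’t load, a quick TCP probe confirms whether both services are actually listening (ports as configured above; 11434 is Ollama’s default):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in [("Ollama", 11434), ("Open WebUI", 55001)]:
        status = "up" if port_open("localhost", port) else "down"
        print(f"{name} (port {port}): {status}")
```

If Ollama is down, Open WebUI will start but show no models; if the port probe fails only for 55001, check `docker ps` for the container state.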