Local-First AI

Ollama Modelfile Generator

Build a custom Ollama model visually. Set a system prompt, tune every parameter with sliders, and watch the live Modelfile preview update in real time. Download and run it in seconds.

Run your model in a terminal:

    ollama create my-assistant -f ./Modelfile
    ollama run my-assistant
    ollama list

What is an Ollama Modelfile?

A Modelfile is Ollama's version of a Dockerfile — a plain text configuration file that defines a custom AI model. It tells Ollama which base model to build from, what system prompt to use, and how to set inference parameters like temperature and context length.

Once you have a Modelfile, you create your custom model with one command: ollama create my-model -f Modelfile. From then on, ollama run my-model launches your personalized assistant immediately — no cloud, no API key, no token cost.
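
A minimal sketch of one, assuming llama3.2 is already pulled as the base model (the prompt and values are illustrative):

    # hypothetical example; any pulled model can serve as the base
    FROM llama3.2
    SYSTEM """You are a concise technical assistant. Answer in plain language."""
    PARAMETER temperature 0.3
    PARAMETER num_ctx 4096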

Key Modelfile directives

  • FROM — specifies the base model (required). Must be a model already pulled via ollama pull.
  • SYSTEM — the hidden system prompt sent before every conversation.
  • PARAMETER — sets inference parameters like temperature, context window, and stop tokens.
  • MESSAGE — adds few-shot examples to teach the model its expected input/output format (see the sketch after this list).
  • TEMPLATE — overrides the default prompt template. Rarely needed unless you know the model's exact format.
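
As a sketch of how the first four directives combine (TEMPLATE is omitted, since the base model's built-in template is usually correct), a hypothetical Modelfile might read:

    FROM llama3.2
    SYSTEM """You review git commit messages and suggest improvements."""
    PARAMETER temperature 0.4
    # one few-shot pair teaching tone and format; the wording is illustrative
    MESSAGE user fixed stuff
    MESSAGE assistant Too vague. Try: fix: handle empty input in the config parser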

Ollama parameter reference

| Parameter | Default | Range | What it does |
| --- | --- | --- | --- |
| temperature | 0.8 | 0.0 – 2.0 | Higher = more random/creative; lower = more focused/deterministic. Set to 0 for fully deterministic output. |
| top_p | 0.9 | 0.0 – 1.0 | Nucleus sampling: only tokens in the top P share of probability mass are considered. Lower = more conservative. |
| top_k | 40 | 1 – 200 | Limits the token pool to the top K candidates at each step. Lower = less variety; higher = more diverse output. |
| num_ctx | 2048 | 512 – 131072 | Context window in tokens. Larger = more conversation history, but significantly more VRAM. |
| num_predict | 128 | -1 – ∞ | Maximum number of tokens to generate per response. -1 means unlimited (the model decides when to stop). |
| repeat_penalty | 1.1 | 0.5 – 2.0 | Penalizes repeated tokens to reduce loops and repetitive output. 1.0 disables it entirely. |
| seed | 0 | any int | Random seed. 0 = random each run; any other value gives the same output for the same input, useful for reproducibility. |
| stop | (model default) | string(s) | Stop sequences: strings that halt generation immediately when produced. |
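
For instance, a reproducible, low-randomness setup might combine several of these (values are illustrative):

    FROM llama3.2
    # greedy decoding: always pick the most likely token
    PARAMETER temperature 0
    # a fixed seed makes runs repeatable even at higher temperatures
    PARAMETER seed 42
    # cap each response at 512 tokens
    PARAMETER num_predict 512
    PARAMETER stop "###"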

Frequently Asked Questions

What is an Ollama Modelfile?

An Ollama Modelfile is a configuration file — similar to a Dockerfile — that defines a custom AI model. It specifies the base model, system prompt, inference parameters, and optional few-shot examples. Run ollama create my-model -f Modelfile to build it and ollama run my-model to start chatting.

What does temperature do in Ollama?

Temperature controls output randomness. Values from 0.0 to 0.5 make responses more deterministic and precise, ideal for code generation and factual Q&A. Values from 0.8 to 1.5 produce more creative, varied output, good for writing and brainstorming. Ollama's default is 0.8. Set it to 0 for fully reproducible output.
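
In a Modelfile this is a single PARAMETER line; two illustrative settings (a Modelfile would use just one):

    # focused: code generation, factual Q&A
    PARAMETER temperature 0.2

    # creative: writing, brainstorming
    PARAMETER temperature 1.2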

What is top_p in Ollama?

top_p (nucleus sampling) restricts which tokens the model considers at each step. With top_p 0.9, only tokens that together account for 90% of the probability mass are eligible — cutting out the long tail of unlikely words. Lowering it makes output more conservative; raising it toward 1.0 opens up more variety. In practice, tune temperature first.
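
If you do lower it, it is one line; a conservative setting as a sketch:

    # consider only tokens covering the top 50% of probability mass
    PARAMETER top_p 0.5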

What is num_ctx and how much RAM does it need?

num_ctx sets the context window — how many tokens the model can see at once. The default is 2048 (~1500 words). Increasing it lets the model handle longer documents, but VRAM usage grows roughly linearly. A 7B-parameter model typically needs ~1 GB of extra VRAM per 4096-token increase in context length. For local hardware, 4096–8192 is a safe middle ground.
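
For example, raising the window into that range (by the estimate above, roughly 1.5 GB more VRAM than the 2048 default on a 7B model):

    # 8192-token context window; budget extra VRAM accordingly
    PARAMETER num_ctx 8192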

How do I run a Modelfile I've downloaded?

Make sure Ollama is installed and the base model is pulled first: ollama pull llama3.2. Then build your custom model: ollama create my-model -f ./Modelfile. Start chatting: ollama run my-model. To see all your models: ollama list.
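
End to end, assuming llama3.2 as the base and a Modelfile in the current directory:

    ollama pull llama3.2                      # fetch the base model
    ollama create my-model -f ./Modelfile     # build the custom model
    ollama run my-model                       # start chatting
    ollama list                               # confirm it appears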

What are few-shot messages in a Modelfile?

MESSAGE directives let you provide example conversations before the first user message. The model uses these to infer the expected response style, format, and tone. For example, adding a USER message of "Summarize this in 3 bullets:" followed by an ASSISTANT response showing the exact bullet format you want is highly effective for consistent output.
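
That example, written as Modelfile directives (the wording is illustrative; the multi-line value assumes MESSAGE accepts the same triple-quote syntax as SYSTEM):

    MESSAGE user Summarize this in 3 bullets: Ollama runs language models locally.
    MESSAGE assistant """- Runs models on your own machine
    - No cloud or API key required
    - One command to start chatting"""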

What are stop tokens and when should I use them?

Stop tokens are strings that cause generation to halt immediately when produced. Common examples: </s>, <|end|>, Human:, ###. Use them to prevent the model from writing the next turn of a conversation itself, or to terminate output at a logical boundary in your application.
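
Each stop sequence gets its own PARAMETER line, and several can be set at once; for example:

    # halt when the model tries to write the user's next turn
    PARAMETER stop "Human:"
    # halt at a section boundary
    PARAMETER stop "###"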

What is the difference between temperature and top_p?

Both control output diversity but operate differently. Temperature scales the entire probability distribution (high temperature flattens it, making all tokens more equally likely). top_p hard-cuts the token pool at a cumulative probability threshold. They interact: high temperature plus low top_p can still produce focused output. The safest approach is to tune temperature first and leave top_p at 0.9 unless you need finer control.
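
As a sketch of that interaction, a configuration that stays focused despite a high temperature:

    # flatten the distribution...
    PARAMETER temperature 1.5
    # ...but restrict sampling to the top 40% of probability mass
    PARAMETER top_p 0.4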