Local-First AI

Ollama Modelfile Generator

Build a custom Ollama model visually. Set a system prompt, tune every parameter with sliders, and watch the live Modelfile preview update in real time. Download and run it in seconds.

Run your model in a terminal:

    ollama create my-assistant -f ./Modelfile
    ollama run my-assistant
    ollama list

What is an Ollama Modelfile?

A Modelfile is Ollama's version of a Dockerfile — a plain text configuration file that defines a custom AI model. It tells Ollama which base model to build from, what system prompt to use, and how to set inference parameters like temperature and context length.

Once you have a Modelfile, you create your custom model with one command: ollama create my-model -f Modelfile. From then on, ollama run my-model launches your personalized assistant immediately — no cloud, no API key, no token cost.
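
A minimal sketch of one, assuming llama3.2 is already pulled as the base model (the prompt and values are illustrative):

    # hypothetical example; any pulled model can serve as the base
    FROM llama3.2
    SYSTEM """You are a concise technical assistant. Answer in plain language."""
    PARAMETER temperature 0.3
    PARAMETER num_ctx 4096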

Key Modelfile directives

  • FROM — specifies the base model (required). Must be a model already pulled via ollama pull.
  • SYSTEM — the hidden system prompt sent before every conversation.
  • PARAMETER — sets inference parameters like temperature, context window, and stop tokens.
  • MESSAGE — adds few-shot examples to teach the model its expected input/output format (see the sketch after this list).
  • TEMPLATE — overrides the default prompt template. Rarely needed unless you know the model's exact format.
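
As a sketch of how the first four directives combine (TEMPLATE is omitted, since the base model's built-in template is usually correct), a hypothetical Modelfile might read:

    FROM llama3.2
    SYSTEM """You review git commit messages and suggest improvements."""
    PARAMETER temperature 0.4
    # one few-shot pair teaching tone and format; the wording is illustrative
    MESSAGE user fixed stuff
    MESSAGE assistant Too vague. Try: fix: handle empty input in the config parser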

Ollama parameter reference

| Parameter | Default | Range | What it does |
| --- | --- | --- | --- |
| temperature | 0.8 | 0.0 – 2.0 | Higher = more random/creative; lower = more focused/deterministic. Set to 0 for fully deterministic output. |
| top_p | 0.9 | 0.0 – 1.0 | Nucleus sampling: only tokens in the top P share of probability mass are considered. Lower = more conservative. |
| top_k | 40 | 1 – 200 | Limits the token pool to the top K candidates at each step. Lower = less variety; higher = more diverse output. |
| num_ctx | 2048 | 512 – 131072 | Context window in tokens. Larger = more conversation history, but significantly more VRAM. |
| num_predict | 128 | -1 – ∞ | Maximum number of tokens to generate per response. -1 means unlimited (the model decides when to stop). |
| repeat_penalty | 1.1 | 0.5 – 2.0 | Penalizes repeated tokens to reduce loops and repetitive output. 1.0 disables it entirely. |
| seed | 0 | any int | Random seed. 0 = random each run; any other value gives the same output for the same input, useful for reproducibility. |
| stop | (model default) | string(s) | Stop sequences: strings that halt generation immediately when produced. |
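
For instance, a reproducible, low-randomness setup might combine several of these (values are illustrative):

    FROM llama3.2
    # greedy decoding: always pick the most likely token
    PARAMETER temperature 0
    # a fixed seed makes runs repeatable even at higher temperatures
    PARAMETER seed 42
    # cap each response at 512 tokens
    PARAMETER num_predict 512
    PARAMETER stop "###"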

Frequently Asked Questions

What is an Ollama Modelfile?

An Ollama Modelfile is a configuration file — similar to a Dockerfile — that defines a custom AI model. It specifies the base model, system prompt, inference parameters, and optional few-shot examples. Run ollama create my-model -f Modelfile to build it and ollama run my-model to start chatting.

What does temperature do in Ollama?

Temperature controls output randomness. Values from 0.0 to 0.5 make responses more deterministic and precise, ideal for code generation and factual Q&A. Values from 0.8 to 1.5 produce more creative, varied output, good for writing and brainstorming. Ollama's default is 0.8. Set it to 0 for fully reproducible output.
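
In a Modelfile this is a single PARAMETER line; two illustrative settings (a Modelfile would use just one):

    # focused: code generation, factual Q&A
    PARAMETER temperature 0.2

    # creative: writing, brainstorming
    PARAMETER temperature 1.2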

What is top_p in Ollama?

top_p (nucleus sampling) restricts which tokens the model considers at each step. With top_p 0.9, only tokens that together account for 90% of the probability mass are eligible — cutting out the long tail of unlikely words. Lowering it makes output more conservative; raising it toward 1.0 opens up more variety. In practice, tune temperature first.
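
If you do lower it, it is one line; a conservative setting as a sketch:

    # consider only tokens covering the top 50% of probability mass
    PARAMETER top_p 0.5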

What is num_ctx and how much RAM does it need?

num_ctx sets the context window — how many tokens the model can see at once. The default is 2048 (~1500 words). Increasing it lets the model handle longer documents, but VRAM usage grows roughly linearly. A 7B-parameter model typically needs ~1 GB of extra VRAM per 4096-token increase in context length. For local hardware, 4096–8192 is a safe middle ground.
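
For example, raising the window into that range (by the estimate above, roughly 1.5 GB more VRAM than the 2048 default on a 7B model):

    # 8192-token context window; budget extra VRAM accordingly
    PARAMETER num_ctx 8192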

How do I run a Modelfile I've downloaded?

Make sure Ollama is installed and the base model is pulled first: ollama pull llama3.2. Then build your custom model: ollama create my-model -f ./Modelfile. Start chatting: ollama run my-model. To see all your models: ollama list.
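
End to end, assuming llama3.2 as the base and a Modelfile in the current directory:

    ollama pull llama3.2                      # fetch the base model
    ollama create my-model -f ./Modelfile     # build the custom model
    ollama run my-model                       # start chatting
    ollama list                               # confirm it appears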

What are few-shot messages in a Modelfile?

MESSAGE directives let you provide example conversations before the first user message. The model uses these to infer the expected response style, format, and tone. For example, adding a USER message of "Summarize this in 3 bullets:" followed by an ASSISTANT response showing the exact bullet format you want is highly effective for consistent output.
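
That example, written as Modelfile directives (the wording is illustrative; the multi-line value assumes MESSAGE accepts the same triple-quote syntax as SYSTEM):

    MESSAGE user Summarize this in 3 bullets: Ollama runs language models locally.
    MESSAGE assistant """- Runs models on your own machine
    - No cloud or API key required
    - One command to start chatting"""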

What are stop tokens and when should I use them?

Stop tokens are strings that cause generation to halt immediately when produced. Common examples: </s>, <|end|>, Human:, ###. Use them to prevent the model from writing the next turn of a conversation itself, or to terminate output at a logical boundary in your application.
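
Each stop sequence gets its own PARAMETER line, and several can be set at once; for example:

    # halt when the model tries to write the user's next turn
    PARAMETER stop "Human:"
    # halt at a section boundary
    PARAMETER stop "###"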

What is the difference between temperature and top_p?

Both control output diversity but operate differently. Temperature scales the entire probability distribution (high temperature flattens it, making all tokens more equally likely). top_p hard-cuts the token pool at a cumulative probability threshold. They interact: high temperature plus low top_p can still produce focused output. The safest approach is to tune temperature first and leave top_p at 0.9 unless you need finer control.
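
As a sketch of that interaction, a configuration that stays focused despite a high temperature:

    # flatten the distribution...
    PARAMETER temperature 1.5
    # ...but restrict sampling to the top 40% of probability mass
    PARAMETER top_p 0.4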