4. Logits: Making Predictions
Concept: Logits are raw numerical scores the model assigns to each possible next token before making its final selection.
First phase: The model splits the prompt into tokens, converts the tokens to embeddings, and processes the sequence of embeddings through its layers (e.g., transformer blocks), which use attention mechanisms to understand relationships and context. The model then produces logits—raw scores for every possible next token. These logits are converted to probabilities using the softmax function (see Wikipedia). This calculation phase is deterministic—identical inputs always produce the same probability distribution.
Second phase: The model selects a token from this distribution, either deterministically (always choosing the highest-probability token, if configured to do so) or with controlled randomness that balances accuracy with creativity, depending on the sampling parameters.
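To make the two phases concrete, here is a minimal Python sketch with made-up logit values for a handful of candidate tokens (illustrative only; a real model scores every token in its vocabulary):

```python
import math
import random

# Hypothetical logits for a few candidate next tokens (illustrative values only;
# a real model produces one logit per token in a vocabulary of tens of thousands).
logits = {"Paris": 2.0, "Lyon": 0.5, "Nice": 0.1, "banana": -4.0}

# Phase 1 (deterministic): softmax turns logits into a probability distribution.
max_logit = max(logits.values())  # subtract the max for numerical stability
exps = {tok: math.exp(score - max_logit) for tok, score in logits.items()}
total = sum(exps.values())
probs = {tok: e / total for tok, e in exps.items()}

# Phase 2 (selection): greedy decoding vs. sampling from the distribution.
greedy = max(probs, key=probs.get)                                      # always "Paris"
sampled = random.choices(list(probs), weights=list(probs.values()))[0]  # usually "Paris"

print(probs)
print("greedy:", greedy, "| sampled:", sampled)
```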
Everyday Example: When completing "The capital of France is ____," a model assigns high scores to relevant answers like "Paris" and low scores to irrelevant options like "banana."
Input: "The capital of France is"
How Token Selection Works
| Rank (k) | Token | Raw Logit | Base Probability |
|---|---|---|---|
| 1 | "Paris" | 8.2 | 80% |
| 2 | "Lyon" | 4.6 | 10% |
| 3 | "Nice" | 3.9 | 5% |
| 4 | "Marseille" | 3.2 | 3% |
| 5 | "banana" | -5.0 | 0.1% |
| 6+ | Other tokens | varies | 1.9% |
Temperature
Modifies the probability distribution itself. Lower temperatures make the model more deterministic, leading to predictable outputs, while higher temperatures introduce more randomness and creativity. The allowed range depends on the provider and model—check your API documentation.
- Low (0.2): Makes likely tokens even more likely
- High (1.0): Makes distribution more uniform
With temperature 0.2, "Paris" might be 95% likely
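As a rough sketch of the usual mechanism, temperature divides the logits before softmax is applied: values below 1 sharpen the distribution, values above 1 flatten it (the logits below are hypothetical):

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# Hypothetical logits, not taken from a real model.
logits = {"Paris": 2.0, "Lyon": 0.5, "Nice": 0.1, "banana": -4.0}

for temperature in (0.2, 1.0, 2.0):
    # Dividing logits by a temperature < 1 exaggerates the gaps between scores,
    # making the top token even more dominant; a temperature > 1 shrinks the gaps.
    scaled = {t: v / temperature for t, v in logits.items()}
    print(temperature, {t: round(p, 3) for t, p in softmax(scaled).items()})
```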
topP (also called Nucleus Sampling)
Uses cumulative probability. The model sorts all candidate next tokens by probability (highest to lowest), then keeps the smallest set whose cumulative probability reaches at least the topP value (e.g., 0.9 keeps the top tokens that together cover at least 90% of the probability mass).
- topP = 0.9: Only "Paris" and "Lyon" considered (90% cumulative already reaches the threshold)
- topP = 0.8: Only "Paris" considered (80% cumulative)
It's more flexible than topK because it dynamically adjusts the number of candidate tokens based on their probabilities
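A minimal sketch of topP filtering, using the rounded probabilities from the table above and assuming the common convention of keeping tokens until the cumulative probability reaches the threshold:

```python
# Rounded base probabilities from the table above (the "6+" bucket is
# treated as a single catch-all entry here for simplicity).
probs = [("Paris", 0.80), ("Lyon", 0.10), ("Nice", 0.05),
         ("Marseille", 0.03), ("other", 0.019), ("banana", 0.001)]

def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose
    cumulative probability reaches at least top_p."""
    ranked = sorted(probs, key=lambda x: x[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    total = sum(p for _, p in kept)
    return [(token, p / total) for token, p in kept]

print(top_p_filter(probs, 0.9))  # keeps "Paris" and "Lyon"
print(top_p_filter(probs, 0.8))  # keeps only "Paris"
```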
topK
Considers only the K most likely tokens
- topK = 3: Only "Paris", "Lyon", "Nice" considered
- topK = 1: Only "Paris" considered
Uses a fixed number of candidates regardless of their probabilities
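A corresponding sketch for topK, again starting from the table's rounded probabilities:

```python
probs = [("Paris", 0.80), ("Lyon", 0.10), ("Nice", 0.05),
         ("Marseille", 0.03), ("other", 0.019), ("banana", 0.001)]

def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens and renormalize."""
    ranked = sorted(probs, key=lambda x: x[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return [(token, p / total) for token, p in ranked]

print(top_k_filter(probs, 3))  # "Paris", "Lyon", "Nice"
print(top_k_filter(probs, 1))  # "Paris" only
```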
💡Tip: For most use cases, set either temperature or topP—not both. Controlling both can lead to unpredictable or unstable results, as both parameters affect randomness in different ways.
Combined Effect: These parameters work together to control selection. Temperature modifies the distribution, then topP and topK filter which tokens can be selected from the modified distribution.
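Putting it together, here is a rough end-to-end sketch of one plausible sampling pipeline; providers differ in the exact order and details of these steps, and the logits are hypothetical:

```python
import math
import random

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    # 1. Temperature reshapes the distribution.
    scaled = {t: v / temperature for t, v in logits.items()}
    ranked = sorted(softmax(scaled).items(), key=lambda x: x[1], reverse=True)

    # 2. topK keeps only the K highest-probability tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # 3. topP keeps the smallest prefix whose cumulative probability
    #    reaches top_p (applied here to the already-filtered list).
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # 4. Renormalize what remains and sample one token.
    total = sum(p for _, p in ranked)
    tokens, weights = zip(*[(t, p / total) for t, p in ranked])
    return random.choices(tokens, weights=weights)[0]

# Hypothetical logits for the France example.
logits = {"Paris": 4.0, "Lyon": 2.0, "Nice": 1.5, "Marseille": 1.0, "banana": -6.0}
print(sample_next_token(logits, temperature=0.7, top_k=4, top_p=0.9))
```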
Practical Application: The temperature, topP, and topK parameters control the trade-off between creativity and predictability, letting you balance deterministic, factual outputs against more varied, creative responses.