TL;DR: We ran 1,140 LinkedIn post generations across 19 configurations to test min_p sampling. The headline finding: min_p prevents quality collapse at high temperatures - and this effect is massive (Cohen's d = 2.57, p < 0.001).
The Problem: High Temperature = Chaos
If you've ever cranked up an LLM's temperature hoping for more creative outputs, you've probably experienced the disappointment that follows. At temperature 2.0, most models produce word salad. The outputs become incoherent, repetitive, or just plain nonsense.
This is the creativity-quality tradeoff that's plagued LLM users since GPT-2. Want more creative outputs? Prepare to sift through garbage.
Or so we thought.
Enter Min_p: A Better Way to Sample
In 2024, researchers proposed a simple fix: instead of sampling from all possible next tokens (which includes garbage at high temperatures), only sample from tokens whose probability is at least some fraction of the top token's probability.
That fraction is min_p.
The idea is elegant: at high temperatures, the probability distribution flattens out, giving nonsensical tokens a fighting chance. Min_p acts as a dynamic filter - it says "I don't care how flat the distribution is, only consider tokens that are at least X% as likely as the best option."
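Concretely, the filter fits in a few lines. This is an illustrative reimplementation of the idea, not the exact kernel any inference engine uses:

```python
import numpy as np

def min_p_filter(probs, min_p=0.1):
    """Zero out tokens whose probability is below min_p times the
    top token's probability, then renormalize what's left."""
    probs = np.asarray(probs, dtype=float)
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

# A flattened distribution, like what high temperature produces:
probs = [0.30, 0.25, 0.20, 0.15, 0.06, 0.04]
filtered = min_p_filter(probs, min_p=0.2)
# Threshold = 0.2 * 0.30 = 0.06, so only the 0.04 token is dropped
```

Note that the threshold scales with the top token's probability: when the model is confident, min_p prunes aggressively; when the distribution is flat, it still leaves several plausible candidates in play.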
The Experiment
We took 60 PLG startup founders (Notion, Figma, PostHog, etc.) and analyzed their LinkedIn writing styles. Then we generated posts for each founder across 19 different configurations:
- Claude Opus 4.5 at temperatures 0.7 and 1.0 (the Anthropic API caps temperature at 1.0)
- Kimi K2.5 at temperatures 0.7, 1.0, 1.5, 2.0, and 3.0
- Various min_p thresholds: 0.05, 0.1, and 0.2
Each post was scored by GPT-5.2 on creativity and quality (1-10 scale). That's 1,140 total generations. (The LLM evaluator prompts are quality_score_linkedin and creativity_score_linkedin.)
Finding #1: Min_p Rescues High Temperatures
This is the headline result, and it's not even close. At temperature 2.0, min_p prevents catastrophic quality collapse.
| Condition | Creativity | Quality |
|---|---|---|
| kimi-t2.0-no-minp | 2.28 | 1.60 |
| kimi-t2.0-minp0.2 | 6.75 | 7.12 |
| Improvement | +4.47 | +5.52 |
The statistical test confirmed what our eyes already told us:
- t = -14.08
- p < 0.001
- Cohen's d = 2.57 (anything above 0.8 is "large" - this is off the charts)
Min_p doesn't just help. It completely rescues outputs that would otherwise be unusable.
Finding #2: Entropy Predicts Quality
Why does this work? Entropy - how "surprised" the model is by its own word choices - is the strongest predictor of post quality (r = -0.746, p < 0.0001).
We calculated entropy from logprobs returned by Together AI (Kimi only - Claude doesn't support logprobs). The correlation is striking:
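For reference, here's roughly how per-token logprobs become a single entropy score. The renormalization step is an approximation, assuming the API returns only the top-k candidates per position:

```python
import math

def mean_token_entropy(positions):
    """Average Shannon entropy (in nats) over token positions, where each
    position is a list of logprobs for the top-k candidate tokens."""
    entropies = []
    for logprobs in positions:
        probs = [math.exp(lp) for lp in logprobs]
        total = sum(probs)  # renormalize: top-k doesn't cover the full vocab
        probs = [p / total for p in probs]
        entropies.append(-sum(p * math.log(p) for p in probs))
    return sum(entropies) / len(entropies)

# One confident position, one flat one:
sample = [
    [math.log(0.90), math.log(0.05), math.log(0.05)],
    [math.log(0.40), math.log(0.30), math.log(0.30)],
]
score = mean_token_entropy(sample)  # ~0.74: the flat position dominates
```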
| Metric | Value |
|---|---|
| Pearson r (entropy vs. quality) | -0.746 |
| Variance explained | 55.6% |
| Sample size | 170 traces |
The Quality Cliff
Quality stays rock-solid below entropy ~0.75, then collapses rapidly:
| Entropy range | Avg Quality | % scoring >= 8 |
|---|---|---|
| 0.00 - 0.75 | 8.87 | 97% |
| 0.75 - 1.00 | 7.79 | 71% |
| 1.50+ | 3.71 | 21% |
Min_p Controls Entropy
Here's the mechanism: min_p dramatically reduces entropy at high temperatures.
- temp=1.5: entropy drops from 0.965 to 0.328 (-66%) with min_p=0.1
- temp=2.0: entropy drops from 1.517 to 0.477 (-69%) with min_p=0.2
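A toy distribution shows the same mechanism: raising temperature flattens the softmax and inflates entropy, and applying min_p prunes the tail and pulls it back down. These are synthetic logits, not the experiment's data:

```python
import numpy as np

def softmax(logits, temperature):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def apply_min_p(p, min_p):
    kept = np.where(p >= min_p * p.max(), p, 0.0)
    return kept / kept.sum()

logits = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0])
for t in (1.0, 2.0):
    p = softmax(logits, t)
    print(f"t={t}: entropy {entropy(p):.2f} -> {entropy(apply_min_p(p, 0.2)):.2f}")
```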
Finding #3: At Normal Temperatures, Models Are Equivalent
Here's a surprising result: when we compared Claude Opus 4.5 and Kimi K2.5 at the same temperature (1.0), there was no significant difference.
| Model | Creativity | Quality |
|---|---|---|
| Opus 4.5 (t=1.0) | 6.93 | 7.63 |
| Kimi K2.5 (t=1.0) | 6.80 | 7.30 |
Statistical test: t = 0.64, p = 0.52 (not significant)
This suggests that for standard creative tasks, model choice matters less than you might think. The real differentiator is how you configure the sampling parameters.
Why Kimi Appears "More Creative"
Kimi t1.5+minp0.1 (7.18 creativity) beats Opus t1.0 (6.93), but this isn't because Kimi is inherently more creative. It's because:
- Opus is limited to temperature 0-1 (Anthropic API constraint)
- Kimi supports temperature 0-3+ (via Together AI)
Kimi's advantage is access to higher temperatures, not superior creativity at the same settings.
Finding #4: Higher Temps Need Higher Min_p
The sweet spot: Temperature 1.5 with min_p=0.1 produced the highest creativity (7.18) while maintaining excellent quality (7.47).
| Temperature | Optimal min_p | Creativity | Quality |
|---|---|---|---|
| 1.0 | None needed | 6.80 | 7.30 |
| 1.5 | 0.1 | 7.18 | 7.47 |
| 2.0 | 0.2 | 6.75 | 7.12 |
| 3.0 | 0.2 (insufficient) | 4.75 | 4.03 |
At temperature 3.0, even min_p=0.2 isn't enough to fully maintain quality. There's a limit to how far you can push it.
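The table above collapses into a simple rule of thumb. This hypothetical helper just encodes the thresholds we measured; `recommended_min_p` is our name, not a library function:

```python
def recommended_min_p(temperature: float):
    """Min_p setting suggested by our results for a given temperature."""
    if temperature <= 1.0:
        return None  # no filtering needed at or below t=1.0
    if temperature <= 1.5:
        return 0.1
    if temperature <= 2.0:
        return 0.2
    # Above t=2.0, even min_p=0.2 couldn't maintain quality
    raise ValueError(f"temperature {temperature} degrades quality even with min_p")
```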
Practical Takeaways
For Production Use
| Scenario | Configuration | Expected Quality |
|---|---|---|
| Safe default | t=1.0, no min_p | 7.3 |
| Maximum creativity | t=1.5, min_p=0.1 | 7.5 |
| Experimental | t=2.0, min_p=0.2 | 7.1 |
What to Avoid
- Temperature >= 2.0 without min_p (complete quality collapse)
- Temperature 3.0 even with min_p (still degraded)
- Using min_p at temperature <= 1.0 (no benefit)
Production Guardrail: Entropy Threshold
Since Together AI returns logprobs, we can calculate entropy per post and auto-retry if it exceeds a threshold:
| Threshold | Posts kept | Avg quality | % >= 8 |
|---|---|---|---|
| entropy < 0.8 | 58% | 8.87 | 97% |
With kimi-t1.5-minp0.1 (avg entropy = 0.328), almost no posts would be rejected - the guardrail is a safety net for edge cases.
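A minimal sketch of that guardrail, assuming a `generate` callable (hypothetical) that returns the post text plus its mean entropy computed from the returned logprobs:

```python
ENTROPY_THRESHOLD = 0.8  # below this, 97% of our posts scored >= 8
MAX_RETRIES = 3

def generate_with_guardrail(prompt, generate, max_retries=MAX_RETRIES):
    """Retry until a generation passes the entropy guardrail;
    fall back to the least-chaotic attempt if none does."""
    best_text, best_entropy = None, float("inf")
    for _ in range(max_retries):
        text, ent = generate(prompt)
        if ent < ENTROPY_THRESHOLD:
            return text  # passes the guardrail
        if ent < best_entropy:
            best_text, best_entropy = text, ent
    return best_text
```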
Does This Replicate Prior Research?
Yes. Our findings align with the original min_p paper:
Nguyen et al. (2024). "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs." arXiv:2407.01082
What we add is validation in a different context:
- Domain: LinkedIn posts vs. creative writing benchmarks
- Model: Kimi K2.5 vs. Mistral/Llama
- Evaluation: LLM-as-Judge (GPT-5.2) vs. human evaluation
The core finding holds: min_p prevents quality collapse at high temperatures.
Cost Comparison
One more thing worth mentioning: we ran 17 conditions on Kimi K2.5 for ~$20, and only 2 conditions on Opus for ~$45.
Kimi achieved comparable results at roughly 1/10th the per-token cost.
What's Next?
This experiment answered one question definitively: min_p works, and works dramatically at high temperatures.
But it opened new questions:
- Would fine-tuning on creative examples achieve similar results?
- Does the optimal min_p vary by task type?
- Can we combine min_p with other sampling techniques (top_k, repetition penalty)?
For now: if you're generating creative content and your model supports it, try temperature 1.5 with min_p=0.1. The results might surprise you.
Experiment run: February 2026 | 19 conditions x 60 founders = 1,140 generations | Evaluation: GPT-5.2 LLM-as-Judge | Dataset: minp-linkedin-experiment-v3