
We Tested 1,140 LinkedIn Posts to Find the Optimal Creativity Settings

Muhamed & Davide
February 10, 2026

TL;DR: We ran 1,140 LinkedIn post generations across 19 configurations to test min_p sampling. The headline finding: min_p prevents quality collapse at high temperatures - and this effect is massive (Cohen's d = 2.57, p < 0.001).


The Problem: High Temperature = Chaos

If you've ever cranked up an LLM's temperature hoping for more creative outputs, you've probably experienced the disappointment that follows. At temperature 2.0, most models produce word salad. The outputs become incoherent, repetitive, or just plain nonsense.

This is the creativity-quality tradeoff that's plagued LLM users since GPT-2. Want more creative outputs? Prepare to sift through garbage.

Or so we thought.


Enter Min_p: A Better Way to Sample

In 2024, researchers proposed a simple fix: instead of sampling from all possible next tokens (which includes garbage at high temperatures), only sample from tokens whose probability is at least some fraction of the top token's probability.

That fraction is min_p.

The idea is elegant: at high temperatures, the probability distribution flattens out, giving nonsensical tokens a fighting chance. Min_p acts as a dynamic filter - it says "I don't care how flat the distribution is, only consider tokens that are at least X% as likely as the best option."
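In code, the rule is only a few lines. Here is a minimal, standalone sketch of min_p filtering as described above (not how any particular inference engine implements it):

```python
import math

def min_p_filter(logprobs, min_p=0.1):
    """Keep only tokens whose probability is at least min_p times the
    top token's probability, then renormalize the survivors.

    `logprobs` maps token -> log-probability (natural log)."""
    probs = {tok: math.exp(lp) for tok, lp in logprobs.items()}
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# A flat, high-temperature distribution: several plausible tokens
# plus one junk candidate that temperature scaling inflated.
flat = {"the": math.log(0.20), "of": math.log(0.18),
        "cat": math.log(0.15), "zxq": math.log(0.01)}
filtered = min_p_filter(flat, min_p=0.1)
# "zxq" is dropped (0.01 < 0.1 * 0.20); the rest are renormalized.
```

Note that the threshold scales with the top token's probability, which is what makes the filter dynamic: a confident distribution keeps few candidates, a flat one keeps many, but outright junk is cut in both cases.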


The Experiment

We took 60 PLG startup founders (Notion, Figma, PostHog, etc.) and analyzed their LinkedIn writing styles. Then we generated posts for each founder across 19 different configurations:

  • Claude Opus 4.5 at temperatures 0.7 and 1.0 (the Anthropic API caps temperature at 1.0)
  • Kimi K2.5 at temperatures 0.7, 1.0, 1.5, 2.0, and 3.0
  • Various min_p thresholds: 0.05, 0.1, and 0.2

Each post was scored by GPT-5.2 on creativity and quality (1-10 scale), for 1,140 total generations. (Evaluator prompts: quality_score_linkedin, creativity_score_linkedin.)


Finding #1: Min_p Rescues High Temperatures

This is the headline result, and it's not even close. At temperature 2.0, min_p prevents catastrophic quality collapse.

| Condition | Creativity | Quality |
|---|---|---|
| kimi-t2.0-no-minp | 2.28 | 1.60 |
| kimi-t2.0-minp0.2 | 6.75 | 7.12 |
| Improvement | +4.47 | +5.52 |

The statistical test confirmed what our eyes already told us:

  • t = -14.08
  • p < 0.001
  • Cohen's d = 2.57 (anything above 0.8 is "large" - this is off the charts)

Min_p doesn't just help. It completely rescues outputs that would otherwise be unusable.


Finding #2: Entropy Predicts Quality

Why does this work? Entropy - how "surprised" the model is by its own word choices - is the strongest predictor of post quality (r = -0.746, p < 0.0001).

We calculated entropy from logprobs returned by Together AI (Kimi only - Claude doesn't support logprobs). The correlation is striking:

| Metric | Value |
|---|---|
| Pearson r (entropy vs. quality) | -0.746 |
| Variance explained (r^2) | 55.6% |
| Sample size | 170 traces |
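Per-post entropy can be computed directly from the logprobs the API returns. A minimal sketch, assuming each position comes back with a list of log-probabilities for the top candidate tokens (the shape Together-style APIs return); renormalizing over only the returned candidates makes this a truncated estimate of the true entropy:

```python
import math

def mean_token_entropy(top_logprobs_per_position):
    """Average Shannon entropy (in nats) across token positions.

    Each element of `top_logprobs_per_position` is a list of
    log-probabilities for the top candidate tokens at that position.
    We renormalize over the returned candidates, so this is a
    truncated-entropy estimate, not the full-vocabulary entropy."""
    entropies = []
    for lps in top_logprobs_per_position:
        probs = [math.exp(lp) for lp in lps]
        total = sum(probs)
        probs = [p / total for p in probs]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies)

# A peaked position (model is confident) vs. a flat one (model is
# guessing among near-equals).
peaked = [math.log(0.97), math.log(0.02), math.log(0.01)]
flat = [math.log(0.34), math.log(0.33), math.log(0.33)]
print(mean_token_entropy([peaked]))  # ~0.15 nats
print(mean_token_entropy([flat]))    # ~1.10 nats, near log(3)
```

Low values mean the model rarely hesitated over its word choices; the quality cliff below sits around 0.75-0.8 on this scale.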

The Quality Cliff

Quality stays rock-solid below entropy ~0.75, then collapses rapidly:

| Entropy range | Avg quality | % scoring >= 8 |
|---|---|---|
| 0.00 - 0.75 | 8.87 | 97% |
| 0.75 - 1.00 | 7.79 | 71% |
| 1.50+ | 3.71 | 21% |

Min_p Controls Entropy

Here's the mechanism: min_p dramatically reduces entropy at high temperatures.

  • temp=1.5: entropy drops from 0.965 to 0.328 (-66%) with min_p=0.1
  • temp=2.0: entropy drops from 1.517 to 0.477 (-69%) with min_p=0.2

Finding #3: At Normal Temperatures, Models Are Equivalent

Here's a surprising result: when we compared Claude Opus 4.5 and Kimi K2.5 at the same temperature (1.0), there was no significant difference.

| Model | Creativity | Quality |
|---|---|---|
| Opus 4.5 (t=1.0) | 6.93 | 7.63 |
| Kimi K2.5 (t=1.0) | 6.80 | 7.30 |

Statistical test: t = 0.64, p = 0.52 (not significant)

This suggests that for standard creative tasks, model choice matters less than you might think. The real differentiator is how you configure the sampling parameters.

Why Kimi Appears "More Creative"

Kimi t1.5+minp0.1 (7.18 creativity) beats Opus t1.0 (6.93), but this isn't because Kimi is inherently more creative. It's because:

  • Opus is limited to temperature 0-1 (Anthropic API constraint)
  • Kimi supports temperature 0-3+ (via Together AI)

Kimi's advantage is access to higher temperatures, not superior creativity at the same settings.


Finding #4: Higher Temps Need Higher Min_p

The sweet spot: Temperature 1.5 with min_p=0.1 produced the highest creativity (7.18) while maintaining excellent quality (7.47).

| Temperature | Optimal min_p | Creativity | Quality |
|---|---|---|---|
| 1.0 | None needed | 6.80 | 7.30 |
| 1.5 | 0.1 | 7.18 | 7.47 |
| 2.0 | 0.2 | 6.75 | 7.12 |
| 3.0 | 0.2 (insufficient) | 4.75 | 4.03 |

At temperature 3.0, even min_p=0.2 isn't enough to fully maintain quality. There's a limit to how far you can push it.


Practical Takeaways

For Production Use

| Scenario | Configuration | Expected quality |
|---|---|---|
| Safe default | t=1.0, no min_p | 7.3 |
| Maximum creativity | t=1.5, min_p=0.1 | 7.5 |
| Experimental | t=2.0, min_p=0.2 | 7.1 |
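These scenarios translate directly into request parameters. A sketch as plain config dicts, assuming an OpenAI-compatible request format; note that `min_p` support varies by provider (Together AI accepts it for Kimi, the Anthropic API does not):

```python
# Sampling presets from the table above. Parameter names follow the
# OpenAI-compatible style used by Together AI; check your provider's
# docs before assuming min_p is accepted.
PRESETS = {
    "safe_default":       {"temperature": 1.0},
    "maximum_creativity": {"temperature": 1.5, "min_p": 0.1},
    "experimental":       {"temperature": 2.0, "min_p": 0.2},
}

def sampling_params(scenario):
    """Return a fresh copy of the preset, ready to merge into a request."""
    return dict(PRESETS[scenario])
```

Keeping the presets in one place makes it easy to A/B the "maximum creativity" setting against the safe default without touching the rest of the generation code.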

What to Avoid

  • Temperature >= 2.0 without min_p (complete quality collapse)
  • Temperature 3.0 even with min_p (still degraded)
  • Using min_p at temperature <= 1.0 (no benefit)

Production Guardrail: Entropy Threshold

Since Together AI returns logprobs, we can calculate entropy per post and auto-retry if it exceeds a threshold:

| Threshold | Posts kept | Avg quality | % >= 8 |
|---|---|---|---|
| entropy < 0.8 | 58% | 8.87 | 97% |

With kimi-t1.5-minp0.1 (avg entropy = 0.328), almost no posts would be rejected - the guardrail is a safety net for edge cases.
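The guardrail can be sketched as a simple retry loop. Here `generate` is a hypothetical callable standing in for one API call plus the entropy computation on its returned logprobs; it is an assumption for illustration, not part of any real client library:

```python
ENTROPY_THRESHOLD = 0.8  # posts above this collapsed in quality
MAX_RETRIES = 3

def generate_with_guardrail(generate, max_retries=MAX_RETRIES):
    """Retry until mean token entropy falls under the threshold.

    `generate` is any callable returning (text, mean_entropy). If no
    attempt passes, fall back to the lowest-entropy attempt rather
    than failing outright."""
    best = None
    for _ in range(max_retries):
        text, entropy = generate()
        if entropy < ENTROPY_THRESHOLD:
            return text
        if best is None or entropy < best[1]:
            best = (text, entropy)
    return best[0]
```

With the recommended kimi-t1.5-minp0.1 setting averaging entropy 0.328, the loop almost always returns on the first attempt, so the retries cost nothing in the common case.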


Does This Replicate Prior Research?

Yes. Our findings align with the original min_p paper:

Nguyen et al. (2024). "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs." arXiv:2407.01082

What we add is validation in a different context:

  • Domain: LinkedIn posts vs. creative writing benchmarks
  • Model: Kimi K2.5 vs. Mistral/Llama
  • Evaluation: LLM-as-Judge (GPT-5.2) vs. human evaluation

The core finding holds: min_p prevents quality collapse at high temperatures.


Cost Comparison

One more thing worth mentioning: we ran 17 conditions on Kimi K2.5 for ~$20, and only 2 conditions on Opus for ~$45.

Kimi achieved comparable results at roughly 1/10th the per-token cost.


What's Next?

This experiment answered one question definitively: min_p works, and works dramatically at high temperatures.

But it opened new questions:

  • Would fine-tuning on creative examples achieve similar results?
  • Does the optimal min_p vary by task type?
  • Can we combine min_p with other sampling techniques (top_k, repetition penalty)?

For now: if you're generating creative content and your model supports it, try temperature 1.5 with min_p=0.1. The results might surprise you.


Experiment run: February 2026 | 19 conditions x 60 founders = 1,140 generations | Evaluation: GPT-5.2 LLM-as-Judge | Dataset: minp-linkedin-experiment-v3

Growth Marketing Blog - AI Insights & Strategies | Luce