I Implemented the Self-Consistency Paper From Scratch. Here Is Where It Helps and Where It Does Not.

Self-consistency (Wang et al., 2022) is cited in 8,000 papers and used in almost zero production systems I know. I implemented it from scratch on two real tasks. On one it was a 9-point accuracy lift. On the other it was expensive theater. Here is why.

The Setup

Self-consistency is simple: instead of greedy decoding (one response at temperature 0), sample N responses at high temperature, then majority-vote the answers. Wang et al. showed this improves reasoning accuracy on math and logic tasks without any fine-tuning.

I implemented it from scratch on two tasks:

1. GSM8K subset — 200 grade-school math word problems (the "easy" reasoning benchmark)

2. Customer intent classification — 200 support tickets from a production dataset, same one from my synthetic data article

Model: GPT-4o-mini (cheap enough to run N=10 samples per question). N values tested: 1, 3, 5, 10, 20. Temperature: 0.7 for sampling runs, 0 for the baseline.

import openai

  from collections import Counter

  from concurrent.futures import
  ThreadPoolExecutor

  import re

   

  client = openai.OpenAI()

   

  def sample_once(prompt: str, temperature:
  float = 0.7) -> str:

      rsp
  = client.chat.completions.create(

          model="gpt-4o-mini",

         
  messages=[{"role": "user", "content":
  prompt}],

         
  temperature=temperature,

         
  max_tokens=512,

      )

     
  return rsp.choices[0].message.content

   

  def extract_answer(response: str) -> str:

     
  """Pull the final numeric answer from a
  chain-of-thought response."""

      #
  GSM8K standard: answer appears after "#### "

     
  match = re.search(r"####\s*([\d,]+)", response)

      if
  match:

         
  return match.group(1).replace(",", "")

      #
  Fallback: last number in the response

      numbers
  = re.findall(r"\b\d+\.?\d*\b", response)

     
  return numbers[-1] if numbers else ""

   

  def self_consistency(prompt: str, n: int = 10)
  -> str:

     
  """Run self-consistency voting over N
  samples."""

      with
  ThreadPoolExecutor(max_workers=min(n, 10)) as pool:

         
  responses = list(pool.map(lambda _: sample_once(prompt), range(n)))

     
  answers = [extract_answer(r) for r in responses]

     
  answers = [a for a in answers if a] 
  # filter empty

      if
  not answers:

         
  return ""

     
  most_common, _ = Counter(answers).most_common(1)[0]

     
  return most_common

The ThreadPoolExecutor is critical — without parallel sampling, N=10 takes 10x the wall time and makes the method impractical. With 10 workers, N=10 takes about the same time as N=2 in serial.

Task 1: Math Reasoning (GSM8K) — Where It Worked

This is the task the paper was designed for. Here's what I measured:

| N (samples) | Accuracy | Cost / 200 questions | |---|---|---| | 1 (greedy, temp=0) | 81.5% | $0.42 | | 1 (temp=0.7) | 79.0% | $0.42 | | 3 | 86.2% | $1.26 | | 5 | 88.0% | $2.10 | | 10 | 90.5% | $4.20 | | 20 | 91.0% | $8.40 |

N=5 gets you most of the gain. N=10 is the sweet spot — 9.0 points over greedy baseline, 10x the cost. N=20 adds 0.5 points for 2x more cost. The marginal returns past N=10 are not worth it.

This matches Wang et al.'s finding almost exactly: they reported ~8-10 point gains on GSM8K with similar N values. The method works here because wrong reasoning paths tend to reach different wrong answers, while correct paths tend to converge.

Task 2: Classification — Where It Didn't

For the intent classification task:

| N (samples) | Accuracy | Cost / 200 tickets | |---|---|---| | 1 (greedy) | 87.4% | $0.09 | | 3 | 87.8% | $0.27 | | 10 | 88.1% | $0.90 |

A 0.7-point lift for a 10x cost increase. Not a reasonable trade-off.

Why? Classification tasks with a small, well-defined label space don't benefit from self-consistency because the model is almost never "confused between different reasoning paths" — it's just pattern-matching to one of 12 categories. The variance in sampling is low, so majority voting over high-temperature samples degrades more than it helps. On any given ticket that the model gets wrong, it tends to get it wrong consistently across all N samples — which self-consistency can't fix.

What Surprised Me

The temperature baseline matters. Running N=1 at temperature 0.7 (without voting) performs worse than greedy (79% vs 81.5%). Temperature adds variance that voting harnesses — but without voting, it's just noise. This is obvious in retrospect but easy to miss if you're testing self-consistency against a temp=0 baseline without also checking temp=0.7/N=1.

Also: the answer extraction regex is fragile. GSM8K is designed with the #### answer format to make this easier, but real-world tasks rarely have that. Building a reliable answer extractor for free-form responses is its own engineering problem that the paper largely sidesteps.

The Latent Space podcast covered self-consistency in the context of OpenAI's o1 reasoning models — it's essentially what chain-of-thought sampling is doing internally. Worth listening to if you want the longer-form take.

When to Use It

Use self-consistency when:

• The task requires multi-step reasoning (math, logic, code debugging)

• Wrong answers vary (different wrong paths reach different wrong answers)

• You can afford 5-10x the inference cost

• Latency is not the primary constraint

Don't use it when:

• The task is classification or extraction (low variance in errors)

• Latency matters — parallel sampling helps, but N=10 still has overhead

• Your accuracy ceiling is the model's knowledge limit, not its reasoning consistency

Next Steps

• Test self-consistency on SWE-bench lite tasks — multi-step code repair is a better fit than classification

• Implement a confidence threshold: only use voting when N=1 confidence is below 0.8

• Compare against best-of-N with a reward model (Process Reward Model style) — should be more accurate but requires more infrastructure

• Full implementation and both datasets at github.com/rexcircuit/self-consistency-bench

DIAGRAM_HINT: Dual line chart showing accuracy vs. N (number of samples) for GSM8K math and intent classification, with a secondary axis showing cost per 200 questions, illustrating the diverging value curves.