Chain-of-thought reasoning models develop stronger position bias, a tendency to favor answer options based on where they appear in a list, the longer they reason. That is the finding of new research that challenges the assumption that more deliberate AI reasoning reduces shallow heuristic biases.

The study tested thirteen reasoning configurations on the MMLU, ARC-Challenge, and GPQA benchmarks. The models included the 671-billion-parameter DeepSeek-R1, 7B and 8B models distilled from R1, and base models prompted with chain-of-thought.

Twelve of the thirteen configurations showed a positive correlation between reasoning trajectory length and Position Bias Score (PBS), with coefficients ranging from 0.11 to 0.41 after controlling for accuracy. For every open-weight reasoning model, bias increased monotonically from the shortest to the longest length quartile.

Truncation experiments provide causal evidence

To test for a causal link, the researchers cut reasoning trajectories off at different points and had the model resume from there. The later the truncation point, the more often the continuation shifted toward the position-preferred option: for R1-Qwen-7B, the share rose from 16% to 32% across truncation-position buckets.
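The bookkeeping behind such a truncation probe might look roughly like the sketch below. It assumes each probe is recorded as a pair (fraction of the trajectory kept, whether the resumed continuation ended on the position-preferred option) and that buckets are equal-width bins over that fraction. The helper names, the bucketing scheme, and the token-level cut are all hypothetical; the step that actually resumes generation depends on the model and serving stack and is left as a comment.

```python
from collections import defaultdict


def truncate(tokens, frac):
    """Keep the first `frac` share of a reasoning trajectory's tokens.

    Hypothetical helper: the prefix would then be fed back to the model
    to generate a continuation (model call not shown).
    """
    return tokens[:int(len(tokens) * frac)]


def shift_rate_by_bucket(probes, n_buckets=4):
    """Share of continuations that land on the position-preferred option,
    per truncation-position bucket.

    probes: iterable of (cut_fraction, chose_preferred) pairs, where
    cut_fraction in [0, 1] is where the trajectory was cut. Returns one
    rate per bucket (earliest first), or None for an empty bucket.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for frac, chose_preferred in probes:
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"cut fraction out of range: {frac}")
        # Equal-width bins; frac == 1.0 falls into the last bucket.
        b = min(int(frac * n_buckets), n_buckets - 1)
        totals[b] += 1
        hits[b] += int(chose_preferred)
    return [hits[b] / totals[b] if totals[b] else None for b in range(n_buckets)]
```

A rising sequence of rates from the first bucket to the last is the pattern the article reports; the numbers themselves would come from real model continuations, not from this code.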

The 671B DeepSeek-R1 model showed an aggregate PBS of just 0.019, yet length-driven bias still emerged in its longest quartile (PBS = 0.071). This suggests that high accuracy limits how much of the bias is expressed rather than eliminating the underlying mechanism.

Position bias in direct answering, without chain-of-thought, proved distinct from reasoning-driven bias and followed different patterns across models. Rather than reducing bias, chain-of-thought reasoning replaced the baseline bias with bias that accumulates as the trajectory grows longer.

The findings challenge standard evaluation practices that assume reasoning models are order-robust, that is, insensitive to the order of answer options, in multiple-choice settings. The research also provides diagnostic tools for auditing position bias, including the PBS metric, commitment change points, and truncation probes.
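The article names commitment change points without defining them. One plausible reading, assumed here purely for illustration, is the step after which the model's interim answer settles on its final answer and never changes again. The sketch below computes that from a hypothetical list of interim answers extracted at each reasoning step (`None` where no answer is stated yet); the function name and the extraction format are not from the paper.

```python
from typing import Optional, Sequence


def commitment_change_point(interim: Sequence[Optional[str]]) -> Optional[int]:
    """Index of the first step from which the interim answer equals the
    final answer and stays unchanged to the end.

    Assumed definition for illustration only. Returns None if the
    trajectory is empty or ends without an answer.
    """
    if not interim or interim[-1] is None:
        return None
    final = interim[-1]
    cp = len(interim) - 1
    # Walk backwards while earlier steps already held the final answer.
    while cp > 0 and interim[cp - 1] == final:
        cp -= 1
    return cp


if __name__ == "__main__":
    steps = ["A", None, "B", "B", "C", "C"]
    cp = commitment_change_point(steps)
    # Normalizing by length makes trajectories of different sizes comparable.
    print(cp, cp / len(steps))
```

Under this reading, a change point that drifts later in long trajectories, landing on a position-favored option, would be one way to spot the length-accumulated bias described above.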

The paper argues that, however strong their reasoning, such models should not be treated as inherently free of position bias in multiple-choice evaluation pipelines.