Measuring Cross-Channel Disagreement in LLM Evaluation

Nirav Rohra

Preprint · 46 pages · ~45 min read

CCD-Bench: frozen diagnostic corpus for parser, judge, and human disagreement on identical transcripts

University of Texas at Dallas

nirav.rohra@utdallas.edu

Abstract

LLM evaluation often collapses heterogeneous evidence into a single scalar or judge score, hiding cases where parsers, judges, and humans disagree on the same transcript. We release CCD-Bench, a frozen diagnostic corpus (eighteen behavioral suites, twelve heterogeneous checkpoints) with joint parser outputs, multi-judge scores, and replayable rows; suites were frozen before aggregate analysis, and null and error rows were retained. The central empirical claim is finite-population and protocol-specific. On n=642 suite-stratified transcript rows after QC (spanning all eighteen suites), parser-only violation flags align far better with blinded human-majority labels than do verdicts from the frozen three-judge Llama panel at τ=0. Transcript-preserving judge replacements improve recall but leave substantial false negatives unless parser outputs enter the fusion rules. We publish harness code, prompts, manifests, and frozen JSON trees for reproduction.

Contributions

Conceptual. Cross-channel disagreement (CCD), an evaluation inconsistency index (EII), and a characterization of when scalar fusion cannot preserve incompatible channel orderings.
Apparatus. Per-run audit ledger with integrity hooks for identifier-aligned parser, judge, and human channels.
CCD-Bench. Eighteen suites spanning seven recurring risk constructs; five flagship suites anchor the judge-panel audit; stratified blinded human coding on n=642 pooled rows.
Evidence. Judge misses concentrate in parser-positive cells that humans usually validate; replacing judges improves recall but does not eliminate disagreement with humans or parsers.
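
The point about incompatible channel orderings can be illustrated with a minimal sketch. If one channel ranks transcript i above transcript j and another channel ranks j above i, then any single scalar score must contradict at least one of them on that pair. The code below only counts such discordant pairs; it is not the paper's definition of CCD or EII, and the function name and the example scores are hypothetical.

```python
from itertools import combinations


def discordant_pairs(channel_a, channel_b):
    """Return index pairs (i, j) that the two channels order in opposite directions."""
    if len(channel_a) != len(channel_b):
        raise ValueError("both channels must score the same rows")
    pairs = []
    for i, j in combinations(range(len(channel_a)), 2):
        diff_a = channel_a[i] - channel_a[j]
        diff_b = channel_b[i] - channel_b[j]
        # A negative product means strictly opposite orderings; ties are ignored.
        if diff_a * diff_b < 0:
            pairs.append((i, j))
    return pairs


# Hypothetical severity scores for three transcripts (higher = more severe).
parser_scores = [1.0, 0.0, 1.0]
judge_scores = [0.2, 0.7, 0.9]

conflicts = discordant_pairs(parser_scores, judge_scores)
print(conflicts)
```

Any non-empty result means no scalar ranking of these rows can agree with both channels at once, which is the situation the framework is meant to surface rather than average away.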

Headline results (n = 642)

Finite-population, protocol-specific metrics on stratified blinded human-majority labels across all eighteen CCD-Bench suites. Full tables and BCa intervals in the paper.

Parser-only recall vs. human majority: 87.3% (95% Wilson CI [83.2, 90.5]; n=642 stratified rows)
Judge-only recall vs. human majority: 27.8% (CI [23.2, 33.0]; frozen three-judge Llama panel at τ=0)
Parser-positive / judge-negative rows: 243, of which 203 (83.5%) were validated as violations by blinded human-majority labels
Parser ∨ judge fusion FNR: 7.6% (illustrative fusion rule; raises recall at the cost of a higher false-positive rate)
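
For readers who want to reproduce interval arithmetic of this kind, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. It is applied to the 203 of 243 validated rows quoted above purely as a worked example; this page does not report an interval for that share, and the function name is hypothetical rather than taken from the released harness.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Counts taken from the headline results: 203 of 243 rows validated by humans.
low, high = wilson_interval(203, 243)
print(f"{203 / 243:.1%} [{low:.1%}, {high:.1%}]")
```

The same function can be pointed at any count and denominator from the paper's tables; the BCa intervals mentioned above are a different, bootstrap-based method and are not reproduced here.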

CCD-Bench coverage

Authority conflict
Prompt injection
False-premise honesty
Information control / redaction
Autonomy / continuity framing
Contained dual-use code
Multilingual / framing robustness

Eighteen frozen behavioral suites, twelve heterogeneous checkpoints, joint parser outputs and multi-judge scores on replayable rows. Suites were frozen before aggregate analysis; null, refusal, and error rows were retained. Intended as wind-tunnel diagnostic metrology, not natural-user prevalence estimation.
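
As a rough illustration of how identifier-aligned rows support the contingency counts and the parser ∨ judge fusion rule listed in the headline results, the sketch below uses a hypothetical row record. The field names, helper functions, and six toy rows are all assumptions made for illustration; they are not the schema of the released JSON trees and they do not reproduce any reported number.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Row:
    row_id: str
    parser_flag: bool               # parser channel flagged a violation
    judge_flag: bool                # judge panel verdict after thresholding
    human_majority: Optional[bool]  # None when the row was not human-coded


def recall_and_fnr(rows: List[Row], predict: Callable[[Row], bool]) -> Tuple[float, float]:
    """Recall and false-negative rate of `predict` against human-majority positives."""
    positives = [r for r in rows if r.human_majority is True]
    if not positives:
        raise ValueError("no human-positive rows to score against")
    hits = sum(1 for r in positives if predict(r))
    recall = hits / len(positives)
    return recall, 1 - recall


def parser_pos_judge_neg(rows: List[Row]) -> Tuple[int, int]:
    """Size of the parser-positive / judge-negative cell and how many humans validated."""
    cell = [r for r in rows if r.parser_flag and not r.judge_flag]
    validated = [r for r in cell if r.human_majority is True]
    return len(cell), len(validated)


# Toy rows, entirely made up for illustration.
rows = [
    Row("r1", True, False, True),
    Row("r2", True, True, True),
    Row("r3", False, False, True),
    Row("r4", True, False, False),
    Row("r5", False, False, False),
    Row("r6", False, True, True),
]

print(recall_and_fnr(rows, lambda r: r.parser_flag))                  # parser only
print(recall_and_fnr(rows, lambda r: r.judge_flag))                   # judge only
print(recall_and_fnr(rows, lambda r: r.parser_flag or r.judge_flag))  # OR fusion
print(parser_pos_judge_neg(rows))
```

Keeping each channel as its own field, instead of storing one fused score, is what makes it possible to count the disagreement cells after the fact.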

Paper outline

01 Introduction. Why scalar benchmarks hide cross-channel splits on identical transcripts.
02 Framework & CCD. Parser vs. judge channels, EII, and positioning vs. prior agent benchmarks.
03 CCD-Bench protocol. Eighteen frozen suites, twelve checkpoints, anti-cherry-picking design.
04 Empirical results. Parser–judge–human contingency tables on n=642 stratified rows.
05 Judge ablations. Transcript-preserving rescoring with parser-aware and API judges.
06 Reproducibility. Public harness, frozen JSON trees, and table-regeneration scripts.

How to cite

@misc{rohra2026ccd,
  author       = {Nirav Rohra},
  title        = {Measuring Cross-Channel Disagreement in LLM Evaluation},
  year         = {2026},
  howpublished = {Preprint},
  institution  = {University of Texas at Dallas},
  url          = {https://niravrohra.com/research}
}

Reproducibility artifacts and harness details are described in the full paper (Section 8).

About the author

Nirav Rohra researches mechanistic interpretability and LLM evaluation safety metrology. Founder of Honrly, AI/cyber intern at Zebra Technologies, CS (AI) at UT Dallas. More work and contact on the main portfolio.