Preprint · 46 pages · ~45 min read
Measuring Cross-Channel Disagreement in LLM Evaluation
CCD-Bench: frozen diagnostic corpus for parser, judge, and human disagreement on identical transcripts
Nirav Rohra
University of Texas at Dallas
Abstract
LLM evaluation often collapses heterogeneous evidence into scalars or judge scores, hiding cases where parsers, judges, and humans disagree on the same transcript. We release CCD-Bench, a frozen diagnostic corpus (eighteen behavioral suites, twelve heterogeneous checkpoints) with joint parser outputs, multi-judge scores, and replayable rows; suites were frozen before aggregate analysis, with null and error rows retained. The central empirical claim is finite-population and protocol-specific: on n=642 suite-stratified transcript rows after QC (all eighteen suites), parser-only violation flags align far better with blinded human-majority labels than does the frozen three-judge Llama panel at τ=0, while transcript-preserving judge replacements improve recall but leave substantial false negatives unless parsers enter fusion rules. We publish harness code, prompts, manifests, and frozen JSON trees for reproduction.
Contributions
- Conceptual. Cross-channel disagreement (CCD), an evaluation inconsistency index (EII), and a characterization of when scalar fusion cannot preserve incompatible channel orderings.
- Apparatus. Per-run audit ledger with integrity hooks for identifier-aligned parser, judge, and human channels.
- CCD-Bench. Eighteen suites spanning seven recurring risk constructs; five flagship suites anchor the judge-panel audit; stratified blinded human coding on n=642 pooled rows.
- Evidence. Judge misses concentrate in parser-positive cells humans usually validate; replacing judges moves recall but does not eliminate disagreement with humans or parsers.
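The parser, judge, and human cells referred to in the Evidence item can be tabulated with a few lines of Python. The sketch below is illustrative only: the `Row` schema, the helper names, and the toy rows are assumptions made for this example and are not CCD-Bench data or the released harness code. It counts rows per (parser, judge, human) cell and computes recall against the human-majority label for the parser alone, the judge alone, and a simple OR fusion.

```python
from collections import Counter
from typing import NamedTuple


class Row(NamedTuple):
    """One transcript row with three binary channels (hypothetical schema)."""
    parser: bool  # parser flagged a violation
    judge: bool   # judge panel flagged a violation
    human: bool   # blinded human-majority label marks a violation


def cell_counts(rows):
    """Count rows in each (parser, judge, human) cell of the 2x2x2 table."""
    return Counter((r.parser, r.judge, r.human) for r in rows)


def recall(rows, predict):
    """Recall of predict(row) measured against the human-majority label."""
    positives = [r for r in rows if r.human]
    if not positives:
        raise ValueError("no human-positive rows")
    return sum(bool(predict(r)) for r in positives) / len(positives)


# Toy rows for illustration only; these are NOT CCD-Bench data.
rows = [
    Row(True, False, True),
    Row(True, False, True),
    Row(True, True, True),
    Row(False, True, True),
    Row(False, False, True),
    Row(True, False, False),
    Row(False, False, False),
    Row(False, False, False),
]

cells = cell_counts(rows)
# Parser-positive / judge-negative rows, regardless of the human label.
parser_pos_judge_neg = cells[(True, False, True)] + cells[(True, False, False)]
# The subset of those rows that humans validated as violations.
validated = cells[(True, False, True)]

parser_recall = recall(rows, lambda r: r.parser)
judge_recall = recall(rows, lambda r: r.judge)
fusion_recall = recall(rows, lambda r: r.parser or r.judge)
fusion_fnr = 1.0 - fusion_recall  # false-negative rate of the OR rule
```

Keeping the three channels as separate fields, rather than fusing them into one score first, is what makes the disagreement cells visible at all.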
Headline results (n = 642)
Finite-population, protocol-specific metrics on stratified blinded human-majority labels across all eighteen CCD-Bench suites. Full tables and BCa intervals in the paper.
- Parser-only recall vs. human majority: 87.3% (95% Wilson CI [83.2, 90.5] on n=642 stratified rows)
- Judge-only recall vs. human majority: 27.8% (frozen three-judge Llama panel at τ=0; CI [23.2, 33.0])
- Parser-positive / judge-negative rows: 243, of which 203 (83.5%) were validated as violations by blinded human-majority labels
- Parser ∨ judge fusion FNR: 7.6% (illustrative fusion rule; gains recall at the cost of a higher false-positive rate)
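For readers who want to compute a Wilson score interval on a count of their own, a minimal sketch follows. The function name `wilson_interval` is an assumption of this example, not part of the released harness. The recall denominators are not stated on this page, so the usage line applies the formula to the one count that is given in full here, 203 validated rows out of 243. The BCa intervals mentioned above come from a different (bootstrap) procedure and are not reproduced by this sketch.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile.
    Returns (low, high) as proportions in [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p + z2 / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom
    # Clamp to guard against tiny floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Count taken from this page: 203 of 243 parser-positive / judge-negative
# rows were validated by the human majority.
low, high = wilson_interval(203, 243)
```

The Wilson interval is preferred over the plain normal approximation for proportions near 0 or 1 because its bounds never leave [0, 1].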
CCD-Bench coverage
- Authority conflict
- Prompt injection
- False-premise honesty
- Information control / redaction
- Autonomy / continuity framing
- Contained dual-use code
- Multilingual / framing robustness
Eighteen frozen behavioral suites, twelve heterogeneous checkpoints, joint parser outputs and multi-judge scores on replayable rows. Suites were frozen before aggregate analysis; null, refusal, and error rows were retained. Intended as wind-tunnel diagnostic metrology, not natural-user prevalence estimation.
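One generic way to check that frozen rows have not changed since release is to compare a content digest of each row against a stored manifest. The sketch below is an assumption-laden illustration, not the integrity mechanism of the CCD-Bench harness: the field names (`row_id`, `suite`, `transcript`), the manifest layout, and both helper functions are hypothetical.

```python
import hashlib
import json


def row_digest(row: dict) -> str:
    """SHA-256 over a canonical JSON serialization of one row."""
    canonical = json.dumps(
        row, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_manifest(rows: list, manifest: dict) -> list:
    """Return ids of rows whose digest is missing from or differs in the manifest."""
    mismatched = []
    for row in rows:
        row_id = row["row_id"]  # hypothetical identifier field
        if manifest.get(row_id) != row_digest(row):
            mismatched.append(row_id)
    return mismatched


# Toy rows for illustration only; field names and values are made up.
rows = [
    {"row_id": "r1", "suite": "demo", "transcript": "hello"},
    {"row_id": "r2", "suite": "demo", "transcript": None},  # null row retained
]
manifest = {row["row_id"]: row_digest(row) for row in rows}

unchanged = verify_manifest(rows, manifest)

# Simulate an edit made after the freeze.
tampered = [dict(rows[0]), dict(rows[1], transcript="edited")]
changed = verify_manifest(tampered, manifest)
```

Sorting keys and fixing separators makes the digest independent of dictionary ordering and whitespace, so only a change in content alters it.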
Paper outline
- 01 Introduction: Why scalar benchmarks hide cross-channel splits on identical transcripts.
- 02 Framework & CCD: Parser vs. judge channels, EII, and positioning vs. prior agent benchmarks.
- 03 CCD-Bench protocol: Eighteen frozen suites, twelve checkpoints, anti-cherry-picking design.
- 04 Empirical results: Parser–judge–human contingency tables on n=642 stratified rows.
- 05 Judge ablations: Transcript-preserving rescoring with parser-aware and API judges.
- 06 Reproducibility: Public harness, frozen JSON trees, and table-regeneration scripts.
How to cite
@misc{rohra2026ccd,
author = {Nirav Rohra},
title = {Measuring Cross-Channel Disagreement in LLM Evaluation},
year = {2026},
howpublished = {Preprint},
institution = {University of Texas at Dallas},
url = {https://niravrohra.com/research}
}
Reproducibility artifacts and harness details are described in the full paper (Section 8).