Notebook entry

Why LLMs Need Reject Options

Tags

  • selective prediction
  • uncertainty
  • LLMs
  • calibration

LLM systems are usually evaluated on answer quality conditional on producing an answer. That misses an important deployment question: when should the system decline to answer, request more context, or route to another procedure? Selective prediction is the framework I use for that question. It treats abstention as an operating point, not as an afterthought.

The confidence problem

Token-level probabilities are useful but limited signals. A high sequence likelihood means the completion is plausible under the model distribution; it does not, by itself, imply factual correctness, answerability, or retrievability. Calibration gets harder once the system includes retrieval, multi-step reasoning, or task-specific tools because failure can come from several components, not only the decoder.

Post-hoc filters and retrieval can improve behavior, but they do not define the routing policy. A useful QA stack needs explicit decision rules for coverage, answer confidence, and recoverability: when to answer directly, when to retrieve, and when the current stack is unlikely to produce a reliable answer.

What vision models already know

Selective prediction is not new. In computer vision, reject-option classifiers are evaluated by their coverage/selective-accuracy tradeoff: the model may abstain on some inputs, and if its confidence score ranks errors well, accuracy on the retained predictions rises as coverage decreases.
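To make the tradeoff concrete, here is a minimal sketch of how coverage and selective accuracy can be computed for one reject threshold. The function name and the toy scores are made up for illustration; they are not taken from any particular benchmark or paper.

```python
def coverage_and_selective_accuracy(confidence, correct, threshold):
    """Reject predictions whose confidence is below `threshold`.

    Returns (coverage, selective_accuracy), where coverage is the fraction
    of inputs that are kept and selective accuracy is the accuracy on the
    kept inputs only.
    """
    kept = [ok for score, ok in zip(confidence, correct) if score >= threshold]
    coverage = len(kept) / len(confidence)
    # Selective accuracy is undefined when every input is rejected.
    selective_accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, selective_accuracy


# Toy validation set (illustrative values only).
confidence = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30]
correct = [True, True, True, False, True, False]

for threshold in (0.0, 0.5, 0.7):
    cov, acc = coverage_and_selective_accuracy(confidence, correct, threshold)
    print(f"threshold={threshold:.1f} coverage={cov:.2f} selective_accuracy={acc:.2f}")
```

Sweeping the threshold traces out the coverage/selective-accuracy curve. On this toy data, raising the threshold lowers coverage and raises accuracy on what remains, which is the behavior a useful confidence score should show.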

My PhD work in this area (ISVC 2022, MVA 2025 journal extension) studies per-class reject thresholds rather than global thresholds. The motivation is simple: classifier errors are not evenly distributed across classes. A single threshold can be too conservative for easy classes and too permissive for hard ones. Estimating class-conditional thresholds from validation statistics gives a more direct handle on the coverage/selective-accuracy tradeoff.
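One simple way to picture class-conditional thresholds is sketched below. This is an illustrative procedure, not the method from the papers above: it assumes a held-out validation set and, for each predicted class, picks the lowest confidence threshold at which the accepted predictions of that class reach a target accuracy. All names and numbers are hypothetical.

```python
from collections import defaultdict


def fit_per_class_thresholds(pred_class, confidence, correct, target_accuracy):
    """Pick one reject threshold per predicted class from validation data.

    For each class, return the lowest threshold t such that predictions of
    that class with confidence >= t reach `target_accuracy`. A class that
    never reaches the target gets float("inf"), i.e. it is always rejected.
    """
    by_class = defaultdict(list)
    for cls, score, ok in zip(pred_class, confidence, correct):
        by_class[cls].append((score, ok))

    thresholds = {}
    for cls, items in by_class.items():
        items.sort(key=lambda item: item[0], reverse=True)  # most confident first
        best = float("inf")
        n_correct = 0
        for i, (score, ok) in enumerate(items, start=1):
            n_correct += ok
            # Only evaluate once all predictions tied at this score are counted.
            end_of_tie = i == len(items) or items[i][0] != score
            if end_of_tie and n_correct / i >= target_accuracy:
                best = score  # keeps getting lower as we walk down the list
        thresholds[cls] = best
    return thresholds


# Toy validation predictions: "cat" is an easy class, "dog" a hard one.
pred_class = ["cat", "cat", "cat", "cat", "dog", "dog", "dog", "dog"]
confidence = [0.9, 0.8, 0.6, 0.5, 0.9, 0.8, 0.7, 0.6]
correct = [True, True, True, True, True, False, True, False]

thresholds = fit_per_class_thresholds(pred_class, confidence, correct, 0.9)
print(thresholds)
```

On this toy data the easy class can be accepted down to a confidence of 0.5, while the hard class needs 0.9. A single global threshold would have to sit at one of those values and be either too strict for the easy class or too lenient for the hard one.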

The same design principle applies outside vision: estimate the relevant failure regions and attach explicit actions to them.

Why language is harder

Language introduces a routing problem that is more conditional than standard image classification. The same question can be answerable with the right passage and unanswerable without it. A failure can be due to missing context, insufficient model capacity, retrieval misspecification, prompt ambiguity, or the question being ill-posed. That makes a single “uncertain” label too coarse.

For QA systems, I think the practical distinction is three-way:

  1. Direct answer is likely sufficient.
  2. Retrieval may make the question answerable.
  3. The current model–retriever–corpus stack should abstain.

Collapsing these cases into one uncertainty score throws away information needed for routing. The operating point should depend on the action being considered.

Structured abstention

My current submission studies the second boundary, the one between cases 2 and 3: once direct answering has been ruled out, can we distinguish failures that retrieval can fix from unrecoverable ones? In the evaluated stack, answer-confidence baselines are not sufficient for that decision, so the routing policy needs a separate estimate of stack-relative recoverability.

The broader point is that answer confidence and stack-relative recoverability are different signals. The former estimates whether direct answering should proceed. The latter estimates whether retrieval is likely to change the answerability of the question. Treating both as one scalar confidence loses the distinction the routing policy needs.
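A minimal sketch of a routing rule that keeps the two signals separate might look like the following. The `route` function, the `Action` enum, and the threshold values are hypothetical, and both scores are assumed to be estimated elsewhere and to lie in [0, 1].

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"      # case 1: direct answer is likely sufficient
    RETRIEVE = "retrieve"  # case 2: retrieval may make the question answerable
    ABSTAIN = "abstain"    # case 3: the current stack should abstain


def route(answer_confidence, recoverability,
          answer_threshold=0.8, retrieve_threshold=0.5):
    """Map two separate signals to one of three actions.

    `answer_confidence` gates direct answering. `recoverability` is only
    consulted after direct answering has been ruled out.
    """
    if answer_confidence >= answer_threshold:
        return Action.ANSWER
    if recoverability >= retrieve_threshold:
        return Action.RETRIEVE
    return Action.ABSTAIN


# Two questions with the same low answer confidence but different recoverability.
print(route(answer_confidence=0.3, recoverability=0.9))  # Action.RETRIEVE
print(route(answer_confidence=0.3, recoverability=0.1))  # Action.ABSTAIN
print(route(answer_confidence=0.9, recoverability=0.1))  # Action.ANSWER
```

The first two calls share the same answer confidence and still end up with different actions, which a single scalar score could not express. Each threshold is its own operating point and can be tuned against the cost of the action it gates.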

Why this matters for deployment

The engineering reason to care is that many systems need a controllable coverage/reliability tradeoff. If the cost of a wrong answer is higher than the cost of no answer, the model should expose an abstention policy with measurable operating characteristics: coverage, selective accuracy, reject recall, recoverability, and downstream utility.
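As a worked example of turning costs into an operating point, the sketch below makes several simplifying assumptions: the confidence score is a calibrated probability of being correct, a correct answer costs nothing, and abstaining has a fixed cost. Under those assumptions, answering has expected cost (1 - p) * cost_wrong, so answering is preferable when p >= 1 - cost_abstain / cost_wrong. Reject recall is taken here to mean the fraction of would-be wrong answers that the policy rejects. All names and numbers are illustrative.

```python
def cost_optimal_threshold(cost_wrong, cost_abstain):
    """Confidence threshold above which answering has lower expected cost.

    Assumes calibrated confidence and zero cost for a correct answer.
    """
    if cost_wrong <= 0:
        raise ValueError("cost_wrong must be positive")
    return max(0.0, 1.0 - cost_abstain / cost_wrong)


def reject_recall(confidence, correct, threshold):
    """Fraction of would-be wrong answers that fall below the threshold."""
    wrong_scores = [s for s, ok in zip(confidence, correct) if not ok]
    if not wrong_scores:
        return float("nan")  # undefined when there are no errors to catch
    return sum(s < threshold for s in wrong_scores) / len(wrong_scores)


def average_cost(confidence, correct, threshold, cost_wrong, cost_abstain):
    """Mean realized cost per input under the thresholded policy."""
    total = 0.0
    for score, ok in zip(confidence, correct):
        if score < threshold:
            total += cost_abstain   # abstained
        elif not ok:
            total += cost_wrong     # answered and was wrong
    return total / len(confidence)


# Toy data: a wrong answer is four times as costly as no answer.
confidence = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30]
correct = [True, True, True, False, True, False]

threshold = cost_optimal_threshold(cost_wrong=4.0, cost_abstain=1.0)
print("threshold:", threshold)
print("reject recall:", reject_recall(confidence, correct, threshold))
print("cost with abstention:", average_cost(confidence, correct, threshold, 4.0, 1.0))
print("cost answering everything:", average_cost(confidence, correct, 0.0, 4.0, 1.0))
```

On this toy data the abstaining policy has a lower average cost than answering everything, even though it also rejects one answer that would have been correct. That is the tradeoff the operating characteristics are meant to expose.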

This is why I treat abstention as a modeling and evaluation problem rather than a product-layer patch. The interesting question is not only whether the model can answer more questions, but whether the system can choose the right action for the current input and operating point.

A note on signal and noise

The same selective-prediction framing also applies to research workflows. As the volume of submissions and generated text increases, review systems need better triage: not to replace expert judgment, but to allocate attention toward work with stronger evidence, clearer novelty, and reproducible artifacts.

That is a different problem from QA abstention, but the structure is familiar: define the action space, measure error costs, and calibrate the thresholds instead of pretending every case should receive the same decision.