Why LLMs Need Reject Options
A well-known phenomenon: ask a modern LLM a plausible-sounding question it does not actually know the answer to, and it will produce one anyway — with the same tone and the same confidence it would produce a correct one. The problem is not that the model is sometimes wrong. The problem is that it has no mechanism to notice when it is wrong.
The confidence problem
Current LLMs are trained to always produce an answer. The softmax probabilities we read off the top of the stack are poorly calibrated signals — a high logprob does not mean the model is more likely to be correct, only that the completion is consistent with the training distribution. A human operating under uncertainty hedges, asks a clarifying question, or walks away. A model does none of these things unless we specifically train it to.
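"Poorly calibrated" has a standard quantitative reading: a model's stated confidence should match its empirical accuracy. A common diagnostic is expected calibration error (ECE); a minimal sketch on made-up toy numbers, just to show what overconfidence looks like as a number:

```python
# Illustrative: expected calibration error (ECE) on toy predictions.
# Bins predictions by confidence and compares average confidence to
# accuracy within each bin; a well-calibrated model has ECE near zero.
# All numbers below are invented for demonstration.

def expected_calibration_error(confidences, correct, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)
    return ece

# An overconfident model: ~92% average confidence, 50% accuracy.
confs = [0.95, 0.90, 0.92, 0.88, 0.97, 0.91]
right = [1, 0, 1, 0, 1, 0]
print(round(expected_calibration_error(confs, right), 3))  # → 0.422
```

A gap of 0.42 between stated confidence and realized accuracy is exactly the failure mode the paragraph describes: high logprob, no corresponding reliability.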
Post-hoc patches help but do not solve this. Retrieval-augmented generation adds evidence; output filters catch surface failures; chain-of-thought can expose reasoning gaps. None of them give the model a principled way to say “this is not a question I should answer.” The core design is still an answer-producing machine with no off switch.
What vision models already know
Selective prediction is not a new idea. In computer vision we have had reject-option classifiers for decades — the classifier is allowed to return “no prediction” on examples where it would otherwise be unreliable, and the system pays a controlled coverage cost in exchange for higher accuracy on the calls it does make.
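The mechanics are simple enough to fit in a few lines. A minimal sketch of a global-threshold reject option, with toy scores and a hypothetical threshold tau, showing the coverage-for-accuracy trade:

```python
# Illustrative reject-option classifier: abstain whenever the top
# confidence score falls below a global threshold tau, then report the
# coverage / selective-accuracy trade-off. Scores and labels are toy data.

def selective_stats(scores, preds, labels, tau):
    kept = [(p, y) for s, p, y in zip(scores, preds, labels) if s >= tau]
    coverage = len(kept) / len(scores)
    sel_acc = sum(p == y for p, y in kept) / len(kept) if kept else float("nan")
    return coverage, sel_acc

scores = [0.99, 0.55, 0.80, 0.97, 0.60, 0.90]
preds  = ["cat", "dog", "cat", "dog", "cat", "dog"]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

print(selective_stats(scores, preds, labels, tau=0.0))   # answer everything
print(selective_stats(scores, preds, labels, tau=0.75))  # trade coverage for accuracy
```

At tau=0 the system answers everything and eats every error; at tau=0.75 it abstains on the two low-confidence calls (which happen to be the wrong ones) and is perfect on what remains — the controlled coverage cost in miniature.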
My PhD work on this problem (ISVC 2022, MVA 2025 journal extension) was about learning per-class reject thresholds rather than global ones. The key observation is that errors in a classifier’s decision space are not uniformly distributed — they concentrate in specific regions, usually near class boundaries where several classes overlap. If you can identify those regions, you can build a system that predicts on the clean regions and abstains on the dirty ones. You lose some coverage, but you raise selective accuracy substantially, and you do it in a way that is calibrated and controllable at deployment time.
That is the principle I want to carry forward: partition the decision space honestly, and act differently on each side.
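A minimal sketch of the per-class idea — the thresholds here are hand-picked placeholders, not the learned values from the papers, which would be fit on held-out validation data:

```python
# Per-class reject thresholds: each predicted class gets its own bar.
# Classes whose decision region is messier (here "dog", hypothetically)
# get a stricter threshold; cleaner classes keep more coverage.
# Threshold values are invented for illustration.

PER_CLASS_TAU = {"cat": 0.60, "dog": 0.85}

def decide(pred_class, score):
    tau = PER_CLASS_TAU.get(pred_class, 0.90)  # conservative default
    return pred_class if score >= tau else "REJECT"

print(decide("cat", 0.70))  # above cat's bar  → "cat"
print(decide("dog", 0.70))  # same score, below dog's bar → "REJECT"
```

The same confidence score routes to different actions depending on which region of the decision space it lands in — which is the whole point of partitioning honestly.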
Why language is harder
In vision, uncertainty is mostly spatial or distributional — the image is blurry, the object is out of distribution, the boundary between classes is genuinely ambiguous. In language it is harder, because uncertainty is context-dependent and compositional. A question that is trivial given the right passage can be impossible without it. A question that is impossible for a 7B model can be straightforward for a larger one. The same surface string routes through wildly different difficulty depending on what the model has already seen and what it can still retrieve.
This matters because “I don’t know” is a coarse label. It hides a three-way distinction that an operational system actually needs to make:
- The model could answer if it thought longer — more reasoning, more chain-of-thought, a better decoding strategy.
- The model could answer if it had more context — something retrieval could plausibly fetch.
- The question is out of reach — ill-posed, unanswerable, or beyond what the current stack can produce no matter how hard it works.
Collapsing these three into a single “uncertain” bucket throws away actionable information. A system that knows the difference can route: keep reasoning, trigger retrieval, or refuse.
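The routing logic above can be sketched in a few lines. The signal names and thresholds here are hypothetical placeholders, not the actual classifiers from any paper; the point is only that each branch is a separate decision with its own signal and operating point:

```python
# A routing sketch for the three-way distinction: each branch is gated
# by its own (hypothetical) signal and threshold rather than one shared
# scalar confidence. All names and values are illustrative.

def route(answer_conf, recoverable_score, in_scope_score,
          tau_answer=0.8, tau_recover=0.5, tau_scope=0.3):
    if in_scope_score < tau_scope:
        return "refuse"          # out of reach: ill-posed or beyond the stack
    if answer_conf >= tau_answer:
        return "answer"          # confident enough to answer directly
    if recoverable_score >= tau_recover:
        return "retrieve"        # more context could plausibly rescue this
    return "keep_reasoning"      # more compute may close the gap

print(route(0.9, 0.2, 0.9))  # answer
print(route(0.4, 0.7, 0.9))  # retrieve
print(route(0.4, 0.2, 0.9))  # keep_reasoning
print(route(0.4, 0.7, 0.1))  # refuse
```

Collapsing the three gates into one threshold on `answer_conf` would reproduce exactly the single "uncertain" bucket the paragraph argues against.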
Structured abstention
My EMNLP 2026 submission is about that second distinction — once the direct-answer branch has already failed, is the signal that says “retrieve could rescue this” the same signal that says “this should be rejected”? It is not. Cheap answer confidence, which works reasonably well for the first cut, degrades to near chance for the retrieve-versus-abstain decision once you condition on having already decided not to answer directly. Retriever-side signals help but collapse at usable operating points. The pattern is the same one the vision work predicted: these are different decision regions and they need to be learned separately.
The broader point is that we should stop expecting a single scalar confidence to do everything. Answer confidence is one decision. Recoverability is another. Out-of-scope detection is a third. Each of them has a different shape, a different signal, and a different operating point, and a production system has to handle all of them cleanly.
Why this matters for deployment
Structured abstention is not a nice-to-have for high-stakes settings. In defense and intelligence analysis, a confidently fabricated answer can drive a real decision before anyone has a chance to double-check it. In medical decision support, overconfident wrong answers erode clinician trust faster than any correct answer can rebuild it. In any setting where the cost of a wrong answer exceeds the cost of no answer, a model without a principled reject option is the wrong tool.
The research direction I care about most is not making models more accurate on average — that problem is already well funded. It is making models that know when they should not answer at all, and that can tell you why.
A note on signal and noise
The barrier to producing a paper has dropped dramatically in the last two years. LLM-assisted writing, easier access to compute, lower friction to every stage of the research pipeline — all good things in isolation. The practical consequence is a flood of incremental and redundant work that overwhelms reviewers and buries the papers that actually move the field. The review system was not designed for this volume, and it shows.
I think the fix is higher review standards and better triage, not fewer submissions. LLM-assisted reviewing — used to filter and prioritize, not to replace human judgment — is one tool that could help. The community itself could benefit from the same principle this post is about: the ability to say “this does not clear the bar” with less friction and more consistency. Selective acceptance is, in a sense, a reject-option problem too.