Research Scientist / Applied Scientist · LLM evaluation & reliability

Nick Kashani Motlagh

I build models that know when not to answer.

  • Answer
  • Abstain
  • Refine

PhD candidate at Ohio State's Computer Vision Lab, defending August 2026. My dissertation, Answering Under Uncertainty, studies three places where direct answering breaks down: when a prediction is unreliable (abstention), when an input is ambiguous relative to the available evidence (evidence use), and when a draft answer should get a second look before it is returned (revision).

Now · Current ARR manuscript: when should a QA system trust its draft answer, revise it with retrieved evidence, or decline to answer? Title and details withheld during anonymous review.

  • Peer-reviewed: 4 papers
  • QA examples: 25K+
  • Award: ISVC Best Paper
  • Job market: Aug 2026

PhD dissertation · defending August 2026

Answering Under Uncertainty: Abstention, Ambiguity, and Revision

The dissertation addresses three places where answering breaks down: unreliable output confidence, input ambiguity relative to available evidence, and uncertainty about whether revising a draft answer with retrieved evidence will make it better or worse.

  1. Output uncertainty

    Is the current prediction reliable enough to return?

    Natural reject option

    Abstention when no rejection cost or coverage target is given: per-class thresholds that maximize selected accuracy while requiring the rejected region to behave like genuine confusion.

    Abstain · ISVC 2022 Best Paper · MVA 2025 journal extension

  2. Input ambiguity

    Does available evidence move the model toward the intended meaning?

    Measuring evidence use

    ImageCoMMuTE-style metrics for multimodal translation: does the correct image lower the model's uncertainty for the correct translation, relative to a misleading image? The metrics test image dependence directly instead of inferring it from aggregate scores.

    Use evidence · WMT 2024

  3. Post-answer revision

    Will a second look make the answer better or worse?

    When revision helps

    Compares a model's direct answer with its evidence-revised answer on the same questions, so routing policies can weigh the chance that revision fixes a wrong answer against the chance it breaks a right one.

    Revise · ARR submission in preparation

Current manuscript

Retrieval-augmented selective QA — title withheld for review

An ARR submission on retrieval-augmented selective QA. The work measures when revising a draft answer with retrieved evidence makes it better and when it makes it worse, then uses those measurements to decide whether the system should answer, revise, or abstain instead of relying on confidence alone.

No acceptance is claimed here. The working title and manuscript details are withheld to preserve anonymous review.

ARR submission in preparation

Built

Code and data.

Training code, evaluation harnesses, calibration utilities, and datasets tied to the papers.

Repo

learning-idk

Companion code for ISVC 2022 / MVA 2025: per-class reject-option classification with binomial threshold search.

  • Python
  • PyTorch
  • selective prediction
  • Tune per-class thresholds on ImageNet, iNaturalist, or custom datasets.
  • Export coverage/accuracy curves and selective-accuracy plots.
  • Apply learned reject thresholds to existing PyTorch classifier outputs.
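
A minimal sketch of the last step above, applying already-learned per-class reject thresholds to classifier outputs and reporting coverage and selective accuracy. It is not the learning-idk API: the function names, the toy probabilities, and the threshold values are all hypothetical, and the binomial threshold search itself is not shown.

```python
from typing import List, Optional, Sequence, Tuple


def apply_reject_thresholds(
    probs: Sequence[Sequence[float]],
    thresholds: Sequence[float],
) -> List[Optional[int]]:
    """Return the predicted class per example, or None when it is rejected.

    An example is rejected when its top probability falls below the
    threshold stored for its predicted class.
    """
    decisions: List[Optional[int]] = []
    for row in probs:
        pred = max(range(len(row)), key=lambda k: row[k])
        decisions.append(pred if row[pred] >= thresholds[pred] else None)
    return decisions


def coverage_and_selective_accuracy(
    decisions: Sequence[Optional[int]],
    labels: Sequence[int],
) -> Tuple[float, float]:
    """Coverage = fraction answered; selective accuracy = accuracy on answered."""
    answered = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(answered) / len(labels) if labels else 0.0
    if not answered:
        return coverage, 0.0
    correct = sum(1 for d, y in answered if d == y)
    return coverage, correct / len(answered)


# Toy softmax outputs for a 3-class problem (illustrative values only).
probs = [
    [0.90, 0.05, 0.05],  # confident class 0, kept
    [0.40, 0.35, 0.25],  # class 0 below its threshold, rejected
    [0.10, 0.80, 0.10],  # confident class 1, kept
    [0.20, 0.25, 0.55],  # class 2 below its threshold, rejected
]
labels = [0, 1, 1, 2]
thresholds = [0.5, 0.6, 0.7]  # hypothetical per-class thresholds

decisions = apply_reject_thresholds(probs, thresholds)
coverage, selective_accuracy = coverage_and_selective_accuracy(decisions, labels)
print(decisions, coverage, selective_accuracy)
```

With a single global threshold every class would share one cutoff; keeping one threshold per predicted class is what lets harder classes reject more aggressively than easy ones.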

Repo

calibration

PyTorch calibration utilities for histogram binning, global temperature scaling, and class-wise temperature scaling.

  • Python
  • PyTorch
  • calibration
  • Run global and class-wise temperature scaling on classifier logits.
  • Produce reliability diagrams and ECE / class-wise ECE reports.
  • Import as a dependency when reproducing the imagery-aware contrastive MT evaluation code.
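
For readers unfamiliar with the terms, here is a small standard-library sketch of global temperature scaling and expected calibration error (ECE). It does not reproduce the calibration repo's interface: all names are hypothetical, and a grid search over candidate temperatures stands in for the gradient-based fit that temperature scaling normally uses. Class-wise scaling and histogram binning are not shown.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logits, labels, temperature: float) -> float:
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logits, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logits, labels, candidates: Sequence[float]) -> float:
    """Pick the candidate temperature with the lowest validation NLL (grid search)."""
    return min(candidates, key=lambda t: mean_nll(logits, labels, t))


def expected_calibration_error(logits, labels, temperature: float = 1.0, n_bins: int = 10) -> float:
    """ECE: bin-size-weighted average of |accuracy - confidence| over confidence bins."""
    conf_sum = [0.0] * n_bins
    correct_sum = [0.0] * n_bins
    for row, y in zip(logits, labels):
        p = softmax(row, temperature)
        pred = max(range(len(p)), key=lambda k: p[k])
        b = min(int(p[pred] * n_bins), n_bins - 1)
        conf_sum[b] += p[pred]
        correct_sum[b] += float(pred == y)
    return sum(abs(c - a) for c, a in zip(conf_sum, correct_sum)) / len(labels)


# Toy validation set: overconfident binary logits, 5 of 6 predictions correct.
logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0], [4.0, 0.0]]
labels = [0, 0, 1, 1, 1, 0]

best_t = fit_temperature(logits, labels, candidates=[0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0])
ece_before = expected_calibration_error(logits, labels, temperature=1.0)
ece_after = expected_calibration_error(logits, labels, temperature=best_t)
print(best_t, ece_before, ece_after)
```

Dividing logits by a temperature above 1 softens the probabilities without changing the predicted class, so accuracy is untouched while confidence moves toward the observed accuracy.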

Repo

construction-site-satellite-imagery-collection

Companion code for OpenStreetMap-guided temporal satellite imagery collection and annotation.

  • Python
  • OpenStreetMap
  • remote sensing
  • Generate OpenStreetMap-guided download manifests for temporal imagery.
  • Label construction phases with the bundled annotation app.
  • Export train/val/test splits for change-detection baselines.
Selected work

Papers and reports.

Peer-reviewed papers, public code and data, and current manuscript notes with careful status labels.

ARR submission in preparation / 2026

Retrieval-Augmented Selective QA (Title Withheld for Anonymous Review)

N. Kashani Motlagh and collaborators

Measures when evidence-based revision fixes a draft answer and when it breaks one, so routing policies can decide to answer, revise, or abstain.

  • Compares a model's direct answer with its evidence-revised answer on the same questions, measuring how often revision fixes a wrong answer, breaks a right one, or changes nothing.
  • Shows that answer confidence and the expected value of revision are separate routing signals: a policy needs both to decide whether to answer, revise, or abstain.
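
The comparison in the first bullet can be illustrated with a simple paired tally. This is a sketch under stated assumptions, not the manuscript's evaluation code: the function name and toy data are hypothetical, and "changes nothing" is approximated here by correctness staying the same, without checking whether the answer text changed.

```python
from collections import Counter
from typing import Dict, Sequence


def revision_outcomes(
    direct_correct: Sequence[bool],
    revised_correct: Sequence[bool],
) -> Dict[str, float]:
    """Tally how revision changes correctness on the same questions.

    Returns rates (fractions of all questions) for four outcomes plus
    net_gain = fixed - broken, which equals the accuracy change from
    always revising.
    """
    if len(direct_correct) != len(revised_correct) or not direct_correct:
        raise ValueError("need two non-empty, equally long correctness lists")
    counts: Counter = Counter()
    for direct, revised in zip(direct_correct, revised_correct):
        if not direct and revised:
            counts["fixed"] += 1
        elif direct and not revised:
            counts["broken"] += 1
        elif direct and revised:
            counts["stayed_right"] += 1
        else:
            counts["stayed_wrong"] += 1
    n = len(direct_correct)
    rates = {k: counts[k] / n for k in ("fixed", "broken", "stayed_right", "stayed_wrong")}
    rates["net_gain"] = rates["fixed"] - rates["broken"]
    return rates


# Toy per-question correctness flags (illustrative only).
direct = [True, True, False, False, True, False, True, False]
revised = [True, False, True, False, True, True, True, False]

rates = revision_outcomes(direct, revised)
print(rates)
```

Aggregate accuracy before and after revision only exposes the net gain; keeping the fixed and broken rates separate is what shows how much revision costs on questions the draft already got right.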

Why it matters

Evaluated on NQ-Open, TriviaQA, and PopQA — 25K+ held-out examples; manuscript in preparation.

  • selective prediction
  • adaptive QA
  • retrieval-augmented generation
  • abstention

ISVC 2022 / 2022

Learning When to Say "I Don't Know"

N. Kashani Motlagh, J. Davis, T. Anderson, J. Gwinnup

Per-class reject thresholds estimated from validation statistics, improving selective accuracy and coverage over global thresholding.

  • Best Paper at ISVC 2022; later extended in the MVA 2025 journal version.
  • Reported +0.4% selective-accuracy and +1.3% coverage gains on ImageNet over global thresholding.

Why it matters

Best Paper

  • vision
  • reject option
  • selective accuracy
  • ImageNet

Runnable code

Machine Vision and Applications / 2025

Naturally Constrained Reject Option Classification

N. Kashani Motlagh, J. Davis, T. Anderson, J. Gwinnup

Journal extension of ISVC 2022 Best Paper, evaluating per-class binomial reject thresholds on ImageNet and remote-sensing datasets.

  • Invited journal extension of the ISVC 2022 Best Paper; adds long-tailed wildlife and remote-sensing splits with class-conditional threshold analysis.
  • Per-class binomial thresholds outperform global thresholding on ImageNet and remote-sensing splits.

Why it matters

+1.3% coverage

  • vision
  • reject option
  • calibration
  • ImageNet
  • remote sensing

Runnable code

WMT 2024 / 2024

Assessing the Role of Imagery in Multimodal Machine Translation

N. Kashani Motlagh, J. Davis, T. Anderson, J. Gwinnup, G. Erdmann

Contrastive evaluation of WMT 2024 multimodal MT systems shows measurable dependence on paired visual context.

  • Introduced imagery-aware contrastive probes for testing whether translations change under mismatched visual context.
  • Benchmarked nine multimodal MT systems under matched and mismatched visual context, with wide variance in how much each system relies on the image.
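
One way to picture a contrastive probe of this kind: score the reference translation once with the matched image and once with a mismatched image, then compare. The sketch below assumes those log-probabilities are already available; the function name, the two summary statistics, and the toy scores are hypothetical and are not the paper's metric definitions.

```python
from statistics import mean
from typing import Dict, Sequence


def image_dependence(
    logp_matched: Sequence[float],
    logp_mismatched: Sequence[float],
) -> Dict[str, float]:
    """Summarize how much the matched image helps the reference translation.

    preference_rate: fraction of examples where the reference is more likely
    under the matched image than under the mismatched one.
    mean_gap: average log-probability difference (matched minus mismatched).
    """
    if len(logp_matched) != len(logp_mismatched) or not logp_matched:
        raise ValueError("need two non-empty, equally long score lists")
    gaps = [m - x for m, x in zip(logp_matched, logp_mismatched)]
    return {
        "preference_rate": sum(g > 0 for g in gaps) / len(gaps),
        "mean_gap": mean(gaps),
    }


# Toy log-probabilities of the reference translation (illustrative only).
logp_matched = [-4.0, -6.5, -3.0, -8.0]
logp_mismatched = [-5.0, -6.5, -2.5, -10.0]

summary = image_dependence(logp_matched, logp_mismatched)
print(summary)
```

A system that ignores the image would show a mean gap near zero, however strong its aggregate translation scores are.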

Why it matters

+7% image-grounding score

  • multimodal MT
  • vision-language models
  • evaluation
  • WMT

Runnable code

Experience

Roles.

Research, teaching, internships, and applied evaluation work.

Role

Technical Analyst II — DCS Corp (sponsored by Air Force Research Laboratory)

Dayton, OH / May 2025 — Present

  • Train and evaluate abstention-augmented LLM policies for retrieval-augmented QA, measuring when evidence-based revision fixes a draft answer and when it breaks one.
  • Build evaluation harnesses and calibration dashboards for comparing LLM policy variants across coverage, utility, and out-of-distribution behavior.

Role

Graduate Research Associate — Computer Vision Lab

Ohio State University · Columbus, OH / Aug 2021 — Present

  • Build selective-prediction systems for vision, multimodal, and language tasks, advised by Prof. Jim Davis.
  • Designed imagery-aware contrastive metrics for multimodal machine translation (WMT 2024), measuring whether translations depend on paired visual context.

Role

Graduate Teaching Associate — Machine Learning & NLP

Ohio State University · Columbus, OH / Aug 2023 — Present

  • Support 80+ students per offering in machine learning, computer vision, and natural language processing courses.
  • Lead recitations and office hours; grade assignments and maintain auto-graded labs across ML, computer vision, NLP, and LLM topics.

Role

Graduate Research Intern — Air Force Research Laboratory (U.S. CUI)

Dayton, OH / Summers 2022–2024

  • Summer 2024: Adapted and trained JEPA and MAE transformers in a distributed Slurm/Singularity setup for multimodal EO/SAR representation learning, improving low-data downstream performance over supervised baselines.
  • Summer 2023: Developed Reject Option Beam Search to improve machine translation quality at large beam widths.
About

How the research has developed.

From selective prediction in vision to adaptive question answering with LLMs, with stops along the way in multimodal translation and remote sensing.

My research is on selective prediction, calibration, and abstention: decision rules that determine when a model should return a prediction, request more evidence, or withhold an answer. I am interested in settings where accuracy alone is not enough — where the system must maintain a calibrated policy over coverage, routing, and failure modes.

I am a PhD candidate at Ohio State, graduating August 2026, advised by Jim Davis. I am first author on all four of my published papers, with a fifth manuscript in preparation. The technical thread starts with per-class reject thresholds for image classifiers (ISVC 2022 Best Paper, MVA 2025 journal extension), extends to contrastive evaluation for multimodal machine translation (WMT 2024), and now focuses on adaptive LLM question answering, where uncertainty depends on both the question and the capabilities of the model-retriever-corpus stack.

My current ARR submission studies retrieval-augmented selective QA; the title is withheld to preserve anonymous review. The paper asks when revising a draft answer with retrieved evidence makes it better and when it makes it worse. Confidence estimates whether the draft is already correct; a second, separate signal estimates whether revision will improve it — and a routing policy needs both.
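
To make the two-signal idea concrete, here is a generic expected-utility sketch. It is not the policy from the manuscript, whose details are withheld: the function name, the utility values (+1 for a correct answer, a penalty for a wrong one, 0 for abstaining), and the example numbers are assumptions chosen for illustration.

```python
def route(confidence: float, revision_gain: float, wrong_penalty: float = 1.0) -> str:
    """Choose 'answer', 'revise', or 'abstain' by expected utility.

    confidence: estimated probability that the draft answer is correct.
    revision_gain: estimated change in that probability if the draft is
        revised with retrieved evidence (chance of a fix minus chance of a break).
    wrong_penalty: cost of returning a wrong answer; abstaining scores 0.
    """

    def utility(p_correct: float) -> float:
        return p_correct - wrong_penalty * (1.0 - p_correct)

    p_revised = min(1.0, max(0.0, confidence + revision_gain))
    options = [
        ("answer", utility(confidence)),  # listed first, so ties favor the cheaper action
        ("revise", utility(p_revised)),
        ("abstain", 0.0),
    ]
    return max(options, key=lambda option: option[1])[0]


# Same low confidence, different expected benefit from revision.
print(route(confidence=0.4, revision_gain=0.30))   # revision is likely to help
print(route(confidence=0.4, revision_gain=0.05))   # revision barely helps
print(route(confidence=0.9, revision_gain=-0.20))  # revision risks breaking a good draft
```

The first two calls share the same confidence yet call for different actions, which is why a confidence score alone cannot drive the routing decision.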

I write training code in Python/PyTorch, build evaluation harnesses and calibration dashboards, and run distributed experiments on Slurm/Singularity clusters.

I am looking for Research Scientist, Applied Scientist, and ML Engineer roles starting August 2026, after my defense. I am strongest on teams working on LLM evaluation, calibration, retrieval-augmented systems, selective prediction, or safety/reliability infrastructure. I am based in Columbus, OH, and open to relocation or remote work. U.S. citizen with five summers of cleared AFRL experience; cleared and federal roles are welcome.

  1. 2021–24

    Selective prediction for vision

    Class-conditional reject thresholds for image classifiers, estimated from validation statistics and evaluated with coverage/selective-accuracy tradeoffs.

    ISVC 2022 Best Paper; MVA 2025 journal extension.

  2. 2024

    Multimodal machine translation

    Contrastive evaluation for measuring whether multimodal MT systems use paired image evidence rather than benefiting only from image-conditioned training.

    WMT 2024.

  3. 2025–26

    Selective QA with retrieval

    Evaluates when evidence-based revision improves a draft answer and when it degrades one, driving answer / revise / abstain decisions for retrieval-augmented QA.

    ARR submission package in preparation.

Work with me

Research Scientist, Applied Scientist, and ML Engineer roles.

I expect to complete my PhD in August 2026 and am open to coordinating start timing where needed. Best fit: teams working on LLM evaluation, calibration, selective prediction, retrieval-augmented QA, or reliability infrastructure. I am comfortable with experiment design, PyTorch training code, distributed cluster runs, evaluation harnesses, and metrics/reporting layers. U.S. citizen with five summers of cleared AFRL experience; cleared and federal roles welcome.

At a glance

  • PhD, The Ohio State University — defending August 2026
  • Research Scientist / Applied Scientist / ML Engineer · selective prediction, calibration, LLM evaluation
  • First author on 4 published papers · 1 manuscript in preparation · ISVC 2022 Best Paper
  • Current ARR manuscript on retrieval-augmented selective QA · title withheld for review
  • Python · PyTorch · HuggingFace · Slurm/Singularity · RAG evaluation
  • U.S. citizen · five summers cleared work · Columbus OH, open to relocation / remote

Recent roles

  1. Technical Analyst II — DCS Corp (sponsored by Air Force Research Laboratory)

    Dayton, OH / May 2025 — Present
  2. Graduate Research Associate — Computer Vision Lab

    Ohio State University · Columbus, OH / Aug 2021 — Present
  3. Graduate Teaching Associate — Machine Learning & NLP

    Ohio State University · Columbus, OH / Aug 2023 — Present
  4. Graduate Research Intern — Air Force Research Laboratory (U.S. CUI)

    Dayton, OH / Summers 2022–2024