PhD candidate · Ohio State Computer Vision Lab

Nick Kashani Motlagh

I build machine learning systems that know when to answer, when to retrieve, and when to refuse.

First author on all five of my publications: selective prediction for vision (ISVC 2022 Best Paper, MVA 2025 journal extension), multimodal machine translation (WMT 2024), and adaptive question answering with LLMs (EMNLP 2026 submission). Five summers at AFRL on CUI work. Python, PyTorch, transformers, retrieval-augmented generation, evaluation harnesses, distributed training on HPC.

Graduating August 2026. Available fall 2026 for applied ML and ML engineering roles. Columbus, OH — open to relocation and remote. U.S. citizen.

First-author pubs
5/5
Best Paper
ISVC 2022
Graduating
Aug 2026
Available
Fall 2026

EMNLP 2026 under review · adaptive QA · LLM abstention

Reject or Refine? When LLM agents should retrieve, reason, or refuse.

When an LLM-based QA system shouldn't answer directly, is the signal that says "retrieve" the same signal that says "abstain"? In a fixed model–retriever–corpus stack, the answer is no. Over 41k eval instances, a small class-weighted question-only controller reaches Recoverability AUC .678 ± .005 with Reject Recall .487 ± .059; cheap logprob baselines top out at AUC ≈ .55 and never reject (Reject Recall = 0). At matched coverage, the controller's Refine Recall advantage over cumulative-logprob is +.252 [.233, .272].

Under review at EMNLP 2026.

Dispatches

Latest news

View archive
  1. Submitted "Reject or Refine?" to EMNLP 2026 — on separating retrievable from unrecoverable uncertainty in adaptive QA.
  2. Joined DCS Corp (AFRL) as Technical Analyst II leading LLM reject-option training and evaluation.
  3. Completed LLM reject-option training and evaluation work at DCS Corp / AFRL.
  4. Released MVA 2025 journal extension on reject-option calibration.

Notebook

Latest writing

Browse posts
  1. Why LLMs Need Reject Options

    Large language models confidently fabricate answers. Structured abstention, borrowed from vision, offers a principled fix.

Built

Projects and open-source.

Training code, evaluation harnesses, and datasets I wrote and maintain on my GitHub.

Repo

learning-idk

PyTorch toolkit for per-class reject-option training with binomial threshold search, dashboards, and CLI.

  • Python
  • PyTorch
  • CLI
  • selective prediction
  • Tune per-class thresholds on ImageNet, iNat, or custom datasets with one command.
  • Export coverage/accuracy curves and selective accuracy plots for reports.
  • Integrate abstention policies into existing Torch models via lightweight hooks.
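A minimal sketch of the core idea, not the learning-idk implementation itself: per class, choose the lowest confidence threshold whose accepted predictions still clear a target accuracy under a binomial (Wilson) lower confidence bound. All function and parameter names here are illustrative.

```python
import math

def wilson_lower(successes, n, z=1.96):
    """Wilson score lower bound on a binomial proportion (95% by default)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom

def per_class_threshold(confidences, correct, target_acc=0.95):
    """For one class's validation set, return the lowest confidence threshold
    whose accepted prefix meets target_acc under the binomial lower bound.
    Returns None if no threshold qualifies (the class always abstains)."""
    pairs = sorted(zip(confidences, correct), reverse=True)
    best = None
    hits = n = 0
    for conf, ok in pairs:  # sweep from most to least confident
        n += 1
        hits += int(ok)
        if wilson_lower(hits, n) >= target_acc:
            best = conf  # accepting down to this confidence is still safe
    return best
```

Repeating the sweep independently per class is what distinguishes this from a single global threshold: easy classes keep high coverage while hard classes abstain more aggressively.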

Repo

calibration

Evaluation harness for multimodal MT, selective LLM routing, and visual-text calibration experiments.

  • Python
  • PyTorch
  • transformers
  • evaluation
  • Run imagery-aware contrastive probes against WMT-style checkpoints.
  • Log structured reports (HTML/Markdown) for rapid model comparisons.
  • Used to benchmark multimodal MT models for AFRL and academic deployments.

Repo

Construction-Site-Satellite-Imagery

Semi-automatic satellite data ingestion plus labeling UI for monitoring changing regions.

  • Python
  • OpenStreetMap
  • remote sensing
  • labeling UI
  • Generate OpenStreetMap-guided scrape manifests for temporal imagery.
  • Label construction phases with the included lightweight annotation app.
  • Export train/val/test splits for change-detection baselines.
All repos

Selected work

Publications.

Each paper links out to the PDF, code, or data where available.

EMNLP 2026 (under review) / 2026

Reject or Refine? Separating Retrievable from Unrecoverable Uncertainty in Adaptive QA

N. Kashani Motlagh et al.

A small question-only controller reaches .68 Recoverability AUC with .49 Reject Recall over 41k QA instances. Logprob baselines top out at .55 AUC and never reject.

  • Question-only controller: Recoverability AUC .678 ± .005, Reject Recall .487 ± .059 over 41,145 eval instances.
  • Logprob baselines reach only .518–.553 AUC and have Reject Recall = 0 (they never reject).
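For readers who want the two metrics pinned down, here is a minimal sketch of how numbers like these could be computed. The label scheme (1 = recoverable via retrieval, 0 = unrecoverable) and the function names are my assumptions for illustration, not the paper's code.

```python
def roc_auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) statistic.
    labels: 1 = recoverable, 0 = unrecoverable; scores: controller output."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def reject_recall(decisions, labels):
    """Fraction of truly unrecoverable instances (label 0) that were rejected."""
    should_reject = [d for d, y in zip(decisions, labels) if y == 0]
    return sum(d == "reject" for d in should_reject) / len(should_reject)
```

Under these definitions, a baseline that never emits "reject" has Reject Recall exactly 0 regardless of its AUC, which is the failure mode the bullets describe.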

Why it matters

Recoverability and answer confidence are genuinely different signals; routing systems should learn them separately.

  • selective prediction
  • adaptive QA
  • retrieval-augmented generation
  • abstention

ISVC 2022 / 2022

Learning When to Say "I Don't Know"

N. Kashani Motlagh, J. Davis, T. Anderson, J. Gwinnup

Binomial modeling of per-class reject thresholds that boost selective accuracy while keeping abstentions calibrated. Extended in MVA 2025 journal version.

  • Won Best Paper at ISVC 2022; seeded follow-on work for the MVA 2025 journal extension.
  • Delivered +0.4% selective accuracy and +1.3% coverage gains on ImageNet over global thresholding.
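To pin down the two quantities in these bullets: coverage is the fraction of inputs the model accepts (does not reject), and selective accuracy is accuracy measured only over the accepted subset. A small illustrative sketch, with names of my choosing:

```python
def coverage_and_selective_accuracy(confidences, correct, threshold):
    """Coverage = fraction of inputs accepted at the given threshold;
    selective accuracy = accuracy over accepted inputs only.
    Returns (coverage, selective_accuracy); the latter is None when
    everything is rejected."""
    accepted = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(accepted) / len(confidences)
    sel_acc = sum(accepted) / len(accepted) if accepted else None
    return coverage, sel_acc
```

Raising the threshold trades coverage for selective accuracy; the gains reported above mean the per-class thresholds shifted that trade-off curve outward relative to a single global threshold.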

Why it matters

Best Paper

  • vision
  • reject option
  • selective accuracy
  • ImageNet

Runnable artifacts

Machine Vision and Applications / 2025

Naturally Constrained Reject Option Classification

N. Kashani Motlagh, J. Davis, T. Anderson, J. Gwinnup

Invited journal extension of ISVC 2022 Best Paper. Per-class binomial thresholds scale to ImageNet, remote sensing, and long-tailed splits with stronger selective accuracy.

  • Invited journal extension of the ISVC Best Paper that introduces deployment-minded calibration tooling.
  • Per-class binomial thresholds outperform global temperature scaling on ImageNet and remote sensing splits.

Why it matters

+1.3% coverage

  • vision
  • reject option
  • calibration
  • ImageNet
  • remote sensing

Runnable artifacts

WMT 2024 / 2024

Assessing the Role of Imagery in Multimodal Machine Translation

N. Kashani Motlagh, J. Davis, T. Anderson, J. Gwinnup, G. Erdmann

Contrastive evaluation shows SOTA multimodal MT models leverage pixels beyond a regularization effect.

  • Introduced imagery-aware contrastive probes that isolate whether translations truly reference the paired visual context.
  • Benchmarked nine multimodal MT systems and showed genuine visual grounding across high-variance evaluation splits.

Why it matters

+7% image-grounding score

  • multimodal MT
  • vision-language models
  • evaluation
  • WMT

Runnable artifacts

All publications

Experience

Roles.

Research, teaching, and internships.

Role

Technical Analyst II — DCS Corp (sponsored by Air Force Research Laboratory)

Dayton, OH / May 2025 — Present

  • Trained abstention-augmented LLMs optimized for downstream utility, achieving an 8× improvement over competing approaches in out-of-distribution settings.
  • Build evaluation harnesses and calibration dashboards that connect LLM abstention policies to existing command-and-control tooling.

Role

Graduate Research Associate — Computer Vision Lab

Ohio State University · Columbus, OH / Aug 2021 — Present

  • Lead the lab's uncertainty-aware multimodal modeling portfolio under Prof. Jim Davis.
  • Designed imagery-aware contrastive metrics for multimodal machine translation (WMT 2024), showing that state-of-the-art models depend on visual evidence rather than treating images as regularizers.

Role

Graduate Teaching Associate — Machine Learning & NLP

Ohio State University · Columbus, OH / Aug 2023 — Present

  • Support 80+ students per offering across machine learning, computer vision, and natural language processing courses.
  • Run recitations and office hours, build targeted study plans, and maintain auto-graded labs (including introductory LLM labs) with an emphasis on calibration, safety, and responsible deployment.

Role

Graduate Research Intern — Air Force Research Laboratory (U.S. CUI)

Dayton, OH / Summers 2022–2024

  • Summer 2024: Adapted and trained JEPA and MAE transformers in a distributed Slurm/Singularity setup for multimodal EO/SAR representation learning, outperforming supervised baselines in low-data regimes.
  • Summer 2023: Developed Reject Option Beam Search to improve machine translation quality at large beam widths.

About

How the research has developed.

From selective prediction in vision to adaptive question answering with LLMs, with stops along the way in multimodal translation and remote sensing.

I build machine learning systems that know when to stop guessing. In production — medical imaging, defense, content moderation, agentic LLM stacks — a model that answers confidently and wrong is worse than a model that says “not this one, route somewhere else.” Most ML systems have no mechanism for that. My work adds one.

I am a PhD candidate at Ohio State, graduating August 2026, advised by Jim Davis. I am first author on all five of my publications. The technical through-line is selective prediction and abstention: I started with per-class reject thresholds for image classifiers (ISVC 2022 Best Paper, MVA 2025 journal extension), carried the idea into multimodal machine translation to test whether MT systems actually use visual evidence or just treat images as a regularizer (WMT 2024), and moved into LLM-based QA, where what counts as “uncertain” depends on context, phrasing, and whether retrieval can rescue the question.

My EMNLP 2026 submission is about that retrieve-versus-abstain boundary. Over 41k QA instances in a fixed model–retriever–corpus stack, a small class-weighted question-only controller reaches .68 Recoverability AUC with .49 Reject Recall. Cheap logprob baselines top out at .55 AUC and never reject. The paper’s point is not a new model — it is that recoverability and answer confidence are genuinely different signals, and routing systems should learn them separately.

Day to day I write Python and PyTorch, run experiments on HPC clusters, build evaluation harnesses, and ship the tooling that holds research together. I have spent a lot of the last year on agentic LLM workflows and retrieval-augmented systems, which turn out to be a natural fit for the routing and abstention problems I already think about.

I am looking for applied ML or ML engineering roles starting fall 2026, after an August defense. Research-adjacent roles welcome. I am based in Columbus, OH, and open to relocation and remote work. I am a U.S. citizen with five summers of AFRL CUI experience, and I am comfortable in DoD environments.

  1. 2021–24

    Selective prediction for vision

    Per-class reject thresholds that let image classifiers back out of ambiguous regions instead of forcing a guess.

    ISVC 2022 Best Paper; MVA 2025 journal extension.

  2. 2024

    Multimodal machine translation

    Measured whether state-of-the-art multimodal MT systems actually use visual evidence, or treat images as a regularizer.

    WMT 2024.

  3. 2025–26

    Reject or refine for LLM QA

    Once the direct answer looks unreliable, can retrieval rescue the question or should the model abstain? A class-weighted controller can separate the two; confidence alone cannot.

    EMNLP 2026 submission, under review.

Work with me

Available fall 2026 — open to applied ML and ML engineering roles.

Interested in teams shipping LLM systems where uncertainty, calibration, and evaluation are first-class concerns. Happy to talk about abstention for agents, RAG routing, or selective prediction. Research-adjacent roles welcome.

At a glance

  • PhD, The Ohio State University — graduating August 2026
  • Available fall 2026 · applied ML / ML engineering
  • First author on 5/5 publications · ISVC 2022 Best Paper
  • Python, PyTorch, transformers, RAG, distributed training
  • Five summers at AFRL · U.S. citizen · Columbus, OH (open to relocation / remote)

Recent roles

  1. Technical Analyst II — DCS Corp (sponsored by Air Force Research Laboratory)

    Dayton, OH / May 2025 — Present
  2. Graduate Research Associate — Computer Vision Lab

    Ohio State University · Columbus, OH / Aug 2021 — Present
  3. Graduate Teaching Associate — Machine Learning & NLP

    Ohio State University · Columbus, OH / Aug 2023 — Present
  4. Graduate Research Intern — Air Force Research Laboratory (U.S. CUI)

    Dayton, OH / Summers 2022–2024