Hi, I'm Nick 👋
I train AI systems to know when they don't know
Language models make things up, and most of the time they have no idea they're doing it. My research at Ohio State trains models to either resolve their uncertainty or back out entirely, as fast as possible, instead of confidently guessing.
On the market: Research Scientist & ML Engineering roles · PhD May 2026 · Columbus, OH · Will relocate
- Preparing EMNLP 2026 submission on structured LLM abstention and failure diagnosis.
- Joined DCS Corp (AFRL) as Technical Analyst II, leading LLM reject-option training and evaluation.
- Delivered AFRL LLM reject-option training with an 8× utility improvement on OOD tasks.
Recent work
A few projects I'm proud of — each with paper, code, and data.
Invited journal extension of ISVC 2022 Best Paper. Per-class binomial thresholds scale to ImageNet, remote sensing, and long-tailed splits with stronger selective accuracy.
+1.3% coverage
Contrastive evaluation shows SOTA multimodal MT models leverage pixels beyond a regularization effect.
+7% image-grounding score
Experience
Research & industry
Technical Analyst II — DCS Corp (sponsored by Air Force Research Laboratory)
Dayton, OH
- Train and evaluate instruction-tuned LLMs with reject-option heads for analyst workflows, improving out-of-distribution utility by **8×** over competing approaches.
- Build evaluation harnesses and calibration dashboards that connect LLM policies to existing command-and-control tooling.
Graduate Research Associate — Computer Vision Lab
Ohio State University · Columbus, OH
- Lead the lab’s uncertainty-aware multimodal modeling portfolio under Prof. Jim Davis.
- Designed imagery-aware contrastive metrics for **multimodal machine translation** (WMT 2024), showing that state-of-the-art models depend on visual evidence rather than treating images as regularizers.
- Developed binomial per-class **reject-option training** for ImageNet, remote sensing, and long-tailed datasets (ISVC 2022 Best Paper; MVA 2025 extension), improving selective accuracy of vision transformers by **+0.4%** and coverage by **+1.3%**.
- Integrated these methods into open-source toolkits and analyst-facing evaluation pipelines.
Graduate Teaching Associate — Machine Learning & NLP
Ohio State University · Columbus, OH
- Support **80+ students** per offering in machine learning, computer vision, and natural language processing courses.
- Run recitations, office hours, and targeted study plans, and maintain auto-graded labs (including introductory LLM labs) with an emphasis on calibration, safety, and responsible deployment.
Graduate Research Intern — Air Force Research Laboratory (U.S. CUI)
Dayton, OH
- Summer 2024: Adapted and trained **JEPA and MAE transformers** in a distributed Slurm/Singularity setup for multimodal EO/SAR representation learning, outperforming supervised baselines in low-data regimes.
- Summer 2023: Developed **Reject Option Beam Search** to improve machine translation quality at large beam widths.
- Summer 2022: Pioneered an end-to-end training algorithm for Naturally Constrained Reject Option Classification.
Undergraduate Research Intern — Air Force Research Laboratory (U.S. CUI)
Dayton, OH
- Summer 2021: Devised an **ensemble distillation** method to improve model performance on ambiguous instances.
- Summer 2020: Constructed a semi-automated system for **temporal satellite imagery collection** (ICCV 2021 workshop), later released as the Construction-Site-Satellite-Imagery dataset.
Undergraduate Research Associate — Computer Vision Lab
Ohio State University · Columbus, OH
- Engineered semi-automatic labeling workflows for remote sensing change detection, creating Python tooling that bootstrapped datasets for uncertainty-aware modeling.
Summer Research Intern — Sii Canada / Concordia University
Montreal, QC
- Built anomaly detection dashboards that translated large-scale behavioral telemetry into prioritized experiments, highlighting early lessons on uncertainty estimation.
Undergraduate Teaching Associate — Discrete Structures & Algorithms
Ohio State University · Columbus, OH
- Mentored discrete structures and algorithms cohorts through recitations, office hours, and targeted study plans emphasizing analytical rigor.
Skills
Core Research
Models & Architectures
Tools & Infrastructure
Domains
Service
Publications
All with runnable code
AFRL Technical Report · 2025
Selective LLM Training with Reject Options
Instruction-tuned LLMs equipped with abstention heads deliver 8× utility on out-of-distribution tasks.
Machine Vision and Applications · 2025
Naturally Constrained Reject Option Classification
Invited journal extension of ISVC 2022 Best Paper. Per-class binomial thresholds scale to ImageNet, remote sensing, and long-tailed splits with stronger selective accuracy.
WMT 2024 · 2024
Assessing the Role of Imagery in Multimodal Machine Translation
Contrastive evaluation shows SOTA multimodal MT models leverage pixels beyond a regularization effect.
ISVC 2022 · 2022
Best Paper
Learning When to Say "I Don't Know"
Binomial modeling of per-class reject thresholds that boost selective accuracy while keeping abstentions calibrated. Extended in MVA 2025 journal version.
ICCV 2021 Workshop on LUAI · 2021
A Framework for Semi-automatic Collection of Temporal Satellite Imagery for Analysis of Dynamic Regions
Semi-automated scraping plus OpenStreetMap cues to assemble temporal satellite datasets that feed downstream change-detection.
Open Source
Tools & datasets
Evaluation harness for multimodal MT, selective LLM routing, and visual-text calibration experiments.
Semi-automatic satellite data ingestion plus labeling UI for monitoring changing regions.
PyTorch toolkit for per-class reject-option training with binomial threshold search, dashboards, and CLI.
About
I am a PhD candidate at Ohio State, finishing in May 2026, advised by Jim Davis. My research is about making machine learning systems that know when to stop guessing. The question first came up in 2021, when Jim and I were staring at t-SNE plots of image classifiers. The models were making confused, unreliable predictions in regions where class clusters overlapped — but scattered among the noise were pockets of clean, well-separated examples where the model was consistently right. That pattern stuck with me. I wanted to know whether we could learn which parts of the decision space are actually trustworthy, and build systems that act accordingly.
The simplest way I can explain my work is this: I teach AI to stop making things up. Everyone who has used ChatGPT has seen it confidently produce something false. In a casual conversation that is annoying. In production — medical imaging, defense systems, content moderation — a model that guesses wrong with high confidence is worse than one that gives no answer at all. The core problem is that most ML systems are trained to always produce output, with no mechanism to say “I am not sure about this.” My research gives them that mechanism.
The technical framing is selective prediction and abstention. I study how models can recognize when they have landed in a dirty region of the decision space — where the data is ambiguous, overlapping, or out of distribution — and either find the action that gets them to a clean state or back out and abstain as fast as possible. When a model abstains, the question gets routed to a human or a more capable system. A model that says “I don’t know” and defers is more reliable than one that forces an answer it cannot support.
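The core idea can be sketched in a few lines. This is a toy illustration of selective prediction, not code from any of the papers above: a classifier answers only when its top class probability clears a confidence threshold, and otherwise abstains so the input can be routed to a human or a stronger model. The threshold value here is illustrative.

```python
def selective_predict(probs, threshold=0.9):
    """Return the predicted class index, or None to abstain.

    probs: list of class probabilities for one input.
    A prediction is accepted only when the model's top probability
    clears the confidence threshold; otherwise the input is
    deferred (e.g. to a human or a more capable system).
    """
    top = max(range(len(probs)), key=lambda i: probs[i])
    if probs[top] >= threshold:
        return top
    return None  # abstain: route this input onward

# A confident prediction is accepted...
print(selective_predict([0.02, 0.95, 0.03]))  # -> 1
# ...while an ambiguous one is deferred.
print(selective_predict([0.40, 0.35, 0.25]))  # -> None
```

Real systems replace the raw softmax threshold with learned, calibrated per-class criteria, but the accept-or-defer contract is the same.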
The problem becomes more interesting in large language models. In image classification, a dirty region is relatively static — class boundaries overlap and that is that. In language, what counts as dirty depends on context, phrasing, and the specific knowledge required. A question that is unanswerable given one prompt can become straightforward with a small amount of additional reasoning or retrieval. My current work trains models to distinguish between these cases: questions they could resolve with more computation, questions that need external context, and questions that are genuinely beyond reach. The goal is structured uncertainty — not a single “I don’t know” reflex, but a diagnosis of why the model is uncertain and a routing decision for what should happen next.
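The three-way diagnosis above can be sketched as a toy routing policy. All of the names, score inputs, and thresholds here are hypothetical, invented for illustration; in practice the scores would come from a trained model and the thresholds would be calibrated.

```python
from enum import Enum

class Route(Enum):
    ANSWER = "answer"          # confident enough to respond directly
    THINK_MORE = "think_more"  # uncertainty may resolve with more computation
    RETRIEVE = "retrieve"      # needs external context or retrieval
    ABSTAIN = "abstain"        # genuinely beyond reach; defer to a human

def route(confidence, gain_from_compute, gain_from_retrieval,
          answer_thresh=0.85, gain_thresh=0.2):
    """Toy structured-abstention policy over scores in [0, 1]."""
    if confidence >= answer_thresh:
        return Route.ANSWER
    if gain_from_compute >= gain_thresh:
        return Route.THINK_MORE
    if gain_from_retrieval >= gain_thresh:
        return Route.RETRIEVE
    return Route.ABSTAIN

# Low confidence, but retrieval looks promising:
print(route(0.4, 0.05, 0.6))  # -> Route.RETRIEVE
```

The point of the structure is that "I don't know" stops being a terminal state and becomes a dispatch decision.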
On a given day I write Python and PyTorch, run experiments on HPC clusters, and build the tooling that holds research together. I have recently been spending a lot of time on agentic AI workflows, which turn out to be a natural fit for the routing and abstention problems I already think about. I am looking for research scientist or applied ML engineering roles after graduation. I am a U.S. citizen and comfortable working with CUI and DoD requirements.