I am currently a PhD student at the College of William and Mary, advised by Dr. Antonio Mastropaolo. I am also fortunate to be supervised by Anh Totti Nguyen, Hy Truong Son, and Thiago Serra. My research focuses on multimodal AI and trustworthy AI. I am especially interested in (1) quantifying and understanding the limitations and biases of LLMs/VLMs and (2) making LLM systems more interpretable in high-stakes domains such as medicine and healthcare.
My work has been accepted at premier venues including ACL, NAACL, and Interspeech. My most recent project, VLMs are Biased, was featured on Hacker News and garnered attention from Meta Superintelligence Labs and Google DeepMind.
I also interned on the Machine Learning Research team at CodaMetrix in Summer 2024 and Summer 2025, where I developed LLM agents that (1) extract medical entities from EHR notes and (2) evaluate and correct entities extracted by human experts and other LLMs.
Selected Publications
♠ denotes equal contribution
Vision-Language Models are Biased
An Vo♠, Khai-Nguyen Nguyen♠, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, Daeyoung Kim
AI for Math Workshop @ ICML 2025, Submitted to NeurIPS 2025
We demonstrate that state-of-the-art VLMs are strongly biased toward well-known patterns and propose VLMBias, a VQA benchmark for evaluating visual biases in VLMs.
Sentiment Reasoning for Healthcare
Khai-Nguyen Nguyen♠, Khai Le-Duc♠, Bach Phan Tat, Duy Le, Long Vo-Dang, Truong-Son Hy
ACL 2025, Industry Track (Oral)
We demonstrate that chain-of-thought distillation improves LLM performance on sentiment analysis and enables LLMs to produce human-like explanations.
Medical Spoken Named Entity Recognition
Khai Le-Duc, David Thulke, Hung-Phong Tran, Long Vo-Dang, Khai-Nguyen Nguyen, Truong-Son Hy, Ralf Schlüter
NAACL 2025, Industry Track (Oral)
We introduce a multilingual dataset for medical spoken named entity recognition.
Resource-Efficient & Effective Code Summarization
Saima Afrin, Joseph Call, Khai-Nguyen Nguyen, Oscar Chaparro, Antonio Mastropaolo
FORGE 2025
We show that Code LLMs fine-tuned with QLoRA/LoRA achieve performance comparable to their fully fine-tuned counterparts on code summarization.
Real-time Speech Summarization for Medical Conversations
Khai Le-Duc♠, Khai-Nguyen Nguyen♠, Long Vo-Dang, Truong-Son Hy
Interspeech 2024 (Oral)
We improve cascaded LLM-based speech summarization systems for medical conversations using high-quality synthetic data.
Getting away with more network pruning: From sparsity to geometry and linear regions
Jeffrey Cai♠, Khai-Nguyen Nguyen♠, Nishant Shrestha, Aidan Good, Ruisen Tu, Xin Yu, Shandian Zhe, Thiago Serra
Workshop on Sparsity in Neural Networks @ ICLR 2023, CPAIOR 2023
We prove a theorem on the geometric properties of neural networks and apply it to model pruning.