I am currently a PhD student at the College of William and Mary, advised by Dr. Antonio Mastropaolo. I am also fortunate to be supervised by Anh Totti Nguyen, Hy Truong Son, and Thiago Serra. My research focuses on multimodal AI and trustworthy AI. I am especially interested in (1) evaluating and understanding LLMs/MLLMs and (2) making LLM systems more interpretable in high-stakes domains such as medicine and healthcare.

My work has been accepted at premier venues such as ACL, NAACL, and Interspeech. My most recent project, VLMs Are Biased, was featured on Hacker News and drew attention from Meta's Superintelligence Lab, the Google Gemini team, and Gary Marcus.

I was also an intern on the Machine Learning Research team at CodaMetrix in Summer 2024 and Summer 2025, supervised by Cheng Li. There, I developed LLM agents that (1) extract medical entities from EHR notes and (2) evaluate and correct entities extracted by human experts and other LLMs.


Selected Publications

♠ denotes equal contribution

S-Chain: Structured Visual Chain-of-Thought for Medicine
Khai Le-Duc, Phuong T.H. Trinh, Duy Minh Ho Nguyen, Tien-Phat Nguyen, Nghiem Tuong Diep, An Ngo, Tung Vu, Trinh Vuong, Anh-Tien Nguyen, Nguyen Dinh Mau, Van Trung Hoang, Khai-Nguyen Nguyen, Hy Nguyen, Chris Ngo, Anji Liu, Nhat Ho, Anne-Christin Hauschild, Khanh Xuan Nguyen, Thanh Nguyen-Tang, Pengtao Xie, Daniel Sonntag, James Zou, Mathias Niepert, Anh Totti Nguyen
Under Review
We present S-Chain, a large dataset of 12,000 expert-labeled medical images with step-by-step visual reasoning. Training models with S-Chain improves their accuracy (+8.16 percentage points on average) and explainability.
Vision-Language Models are Biased
An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, Daeyoung Kim
AI for Math Workshop @ ICML 2025, Under Review
Using our proposed benchmark VLMBias, a VQA benchmark of 1.4k counterfactual images for evaluating visual biases in VLMs, we demonstrate that state-of-the-art VLMs are strongly biased toward well-known patterns.
Sentiment Reasoning for Healthcare
Khai-Nguyen Nguyen, Khai Le-Duc, Bach Phan Tat, Duy Le, Long Vo-Dang, Truong-Son Hy
ACL 2025, Industry Track (Oral)
We demonstrate that chain-of-thought distillation improves LLM performance in sentiment analysis and enables LLMs to produce human-like explanations.
Resource-Efficient & Effective Code Summarization
Saima Afrin, Joseph Call, Khai-Nguyen Nguyen, Oscar Chaparro, Antonio Mastropaolo
FORGE 2025
We show that Code LLMs fine-tuned with QLoRA achieve performance comparable to their full-parameter fine-tuned counterparts on code summarization.
Real-time Speech Summarization for Medical Conversations
Khai Le-Duc, Khai-Nguyen Nguyen, Long Vo-Dang, Truong-Son Hy
Interspeech 2024 (Oral)
We improve cascaded LLM-based systems for real-time medical speech summarization using high-quality synthetic data.
Getting away with more network pruning: From sparsity to geometry and linear regions
Jeffrey Cai, Khai-Nguyen Nguyen, Nishant Shrestha, Aidan Good, Ruisen Tu, Xin Yu, Shandian Zhe, Thiago Serra
Workshop on Sparsity in Neural Networks @ ICLR 2023, CPAIOR 2023
We prove an upper bound on the expressiveness of a neural network based on its geometric properties and apply it to model pruning.