Dublin Student Presents AI Deception Research At Conference

EHS junior Mrinal Agarwal presented a paper on a new method for AI deception detection at the NeurIPS 2025 conference.


DUBLIN, CA — An Emerald High School junior recently presented artificial intelligence research to researchers and academics at NeurIPS 2025, one of the world’s leading machine learning conferences.

Mrinal Agarwal served as the lead author on the paper “WOLF: Werewolf-based Observations for LLM Deception and Falsehoods,” which introduces new benchmarks for studying misinformation in large language models like ChatGPT. His framework is built on the social deduction party game “Werewolf,” in which some players are secretly designated “werewolves” who must eliminate other players. The players debate and then vote to eliminate suspected werewolves.

Agarwal’s paper describes how to use a similar setup with LLMs to find out whether information presented is correct. “Rather than asking a model once whether a statement is true or false, Werewolf creates a system in which models are forced to have long drawn out conversations where they have incentives to mislead, withhold information, redirect suspicion, while appearing honest,” he explained.

“Every statement is logged and analyzed: the speaker records whether they were being deceptive and why, while other agents judge whether they believe the statement and how suspicious it seems. This lets us pinpoint when deception happens, what kind it is (lying outright vs. omission or misdirection).”
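
For readers curious what such a log might look like in practice, here is a minimal Python sketch. It is not taken from the paper: the class names, fields, the 0-to-1 suspicion score, and the majority-belief rule are all assumptions made for illustration. Each record pairs the speaker's own report of whether a statement was deceptive with the other agents' judgments of it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PeerJudgment:
    judge: str        # agent evaluating the statement (hypothetical field)
    believes: bool    # whether the judge accepts the statement as true
    suspicion: float  # assumed 0.0-1.0 suspicion score


@dataclass
class StatementRecord:
    speaker: str
    text: str
    deceptive: bool                        # speaker's own self-report
    deception_type: Optional[str] = None   # e.g. "lie", "omission", "misdirection"
    reason: str = ""                       # speaker's stated motive
    judgments: List[PeerJudgment] = field(default_factory=list)


def undetected_rate(log: List[StatementRecord]) -> float:
    """Share of self-reported deceptive statements that most judges believed."""
    deceptive = [r for r in log if r.deceptive and r.judgments]
    if not deceptive:
        return 0.0
    fooled = sum(
        1
        for r in deceptive
        if sum(j.believes for j in r.judgments) > len(r.judgments) / 2
    )
    return fooled / len(deceptive)


# Illustrative, made-up game log (not data from the paper).
log = [
    StatementRecord(
        "A", "I was asleep all night.", True, "lie", "hide werewolf role",
        [PeerJudgment("B", True, 0.2), PeerJudgment("C", True, 0.4)],
    ),
    StatementRecord(
        "B", "I saw nothing unusual.", False, None, "",
        [PeerJudgment("A", True, 0.1), PeerJudgment("C", False, 0.6)],
    ),
    StatementRecord(
        "A", "C has been very quiet.", True, "misdirection", "shift suspicion",
        [PeerJudgment("B", False, 0.7), PeerJudgment("C", False, 0.9)],
    ),
]

print(undetected_rate(log))  # 0.5: one of the two deceptive statements was believed
```

Keeping the speaker's self-report next to the listeners' judgments is what would allow a per-statement comparison of when deception occurred and whether anyone caught it.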

According to Agarwal, research has shown that AI can be adept at misleading consumers but much weaker at detecting deception. In the majority of his experiments, deceptive agents avoided being identified by other models.

“As AI systems are increasingly used in multi-agent settings, like automated negotiations, moderation systems, or decision support, the ability to detect strategic dishonesty is lagging behind the ability to produce it. Werewolf gives us a controlled way to study that imbalance over time before implementing these systems into the real world, rather than treating deception as a basic one-off classification problem,” he said.

WOLF is currently just a benchmark, though Agarwal is working on a website to showcase the project and allow users to run models through it. He is also working on a separate paper focused on LLM security, and has developed a training-free method that monitors how a model reacts to inputs in order to detect attempts to manipulate a model’s behavior.

Agarwal is also president of the EHS math club, a competitive debater, and has qualified for the American Invitational Mathematics Examination.

