Evaluating large language model agents for automation of atomic force microscopy
Indrajeet Mandal,
Jitendra Soni,
Mohd Zaki,
Morten M. Smedskjaer,
Katrin Wondraczek,
Lothar Wondraczek,
Nitya Nand Gosvami and
N. M. Anoop Krishnan
Additional contact information
Indrajeet Mandal: Indian Institute of Technology Delhi
Jitendra Soni: Indian Institute of Technology Delhi
Mohd Zaki: Indian Institute of Technology Delhi
Morten M. Smedskjaer: Aalborg University
Katrin Wondraczek: Leibniz Institute of Photonic Technology
Lothar Wondraczek: University of Jena
Nitya Nand Gosvami: Indian Institute of Technology Delhi
N. M. Anoop Krishnan: Indian Institute of Technology Delhi
Nature Communications, 2025, vol. 16, issue 1, 1-15
Abstract:
Large language models (LLMs) are transforming laboratory automation by enabling self-driving laboratories (SDLs) that could accelerate materials research. However, current SDL implementations rely on rigid protocols that fail to capture the adaptability and intuition of expert scientists in dynamic experimental settings. Here, we show that LLM agents can automate atomic force microscopy (AFM) through our Artificially Intelligent Lab Assistant (AILA) framework. Further, we develop AFMBench, a comprehensive evaluation suite challenging LLM agents across the complete scientific workflow from experimental design to results analysis. We find that state-of-the-art LLMs struggle with basic tasks and coordination scenarios. Notably, models excelling at materials science question-answering perform poorly in laboratory settings, showing that domain knowledge does not translate to experimental capabilities. Additionally, we observe that LLM agents can deviate from instructions, a phenomenon referred to as sleepwalking, raising safety alignment concerns for SDL applications. Our ablations reveal that multi-agent frameworks significantly outperform single-agent approaches, though both remain sensitive to minor changes in instruction formatting or prompting. Finally, we evaluate AILA's effectiveness in increasingly advanced experiments: AFM calibration, feature detection, mechanical property measurement, graphene layer counting, and indenter detection. These findings establish the necessity for benchmarking and robust safety protocols before deploying LLM agents as autonomous laboratory assistants across scientific disciplines.
Date: 2025
Downloads: https://www.nature.com/articles/s41467-025-64105-7 (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-64105-7
DOI: 10.1038/s41467-025-64105-7