Registration: Please refer to the NAACL 2025 registration website.
Just as coding assistants have dramatically increased productivity on programming tasks over the last two years, researchers in the NLP community have begun to explore methods for creating scientific assistants that support the process of scientific discovery and increase the pace at which novel discoveries are made.
Historically, major results in AI for scientific discovery have come from problem-specific methods, such as DeepMind's AlphaFold and RosettaFold, systems designed specifically for protein folding, or multistage discovery pipelines built to identify novel materials. Over the last year, language models have begun to be used to create problem-general scientific discovery assistants that are not restricted to narrow problem domains or formulations. Such systems hold promise for assisting researchers across broad domains, and for scientific reasoning more generally. Beyond providing assistance, a growing body of work has begun to focus on the prospect of creating largely autonomous scientific discovery agents that can make novel discoveries with minimal human intervention.
We have also observed rising interest in evaluating systems that assist with or perform scientific discovery. Difficulties with evaluation persist: because novel scientific discoveries have, by definition, not yet been made, it is hard to formulate a benchmark that measures whether machine-generated contributions genuinely advance the discovery process. The community has begun to address this with benchmarks focused on facets of the discovery process, such as data-driven discovery, experiment replication, reviewing, and idea generation, or with proxy tasks such as end-to-end discovery in virtual environments, all of which begin to address the challenge of evaluating progress toward novel discoveries.
These recent developments highlight the possibility of rapidly accelerating the pace of scientific discovery in the near term. Given the influx of researchers into this quickly expanding subfield, this workshop aims to bring together a diverse set of perspectives, disseminate the latest results, standardize evaluation, foster collaboration between groups, and provide a forum for discussing aspirational goals for 2025 and beyond.
Speakers
Call for Papers
We welcome submissions on all topics related to AI and Scientific Discovery including but not limited to:
- Literature-based Discovery
- Agent-centered Approaches
- Automated Experiment Execution
- Automated Replication
- Data-driven Discovery
- Discovery in Virtual Environments
- Discovery with Humans in the Loop
- Assistants for Scientific Writing
Organizers

Chief Scientific Officer (Microsoft)

Director of Semantic Scholar (Ai2)

Asst. Professor (HUJI)
Research Scientist (Ai2)

Lead Research Scientist (Ai2)

Lead Research Scientist (Ai2)

Research Scientist (Ai2)

Stony Brook University

Assoc. Prof. (University of Arizona)
Submission Guidelines
We welcome three types of papers: archival workshop papers, non-archival papers, and non-archival cross-submissions. Only archival workshop papers will be included in the workshop proceedings. Regular workshop submissions (both archival and non-archival) should be in PDF format and made through the OpenReview site set up for this workshop (link). In line with the ACL main conference policy, camera-ready versions of regular workshop papers will be given one additional page of content. Non-archival cross-submissions should be made through the form [link].
- Archival regular workshop papers: Authors should submit a paper of up to 8 pages (both short and long papers are welcome), with unlimited pages for references, following the ACL author guidelines. The reported research should be substantially original. All submissions will be reviewed in a single track, regardless of length. Accepted papers will be presented as posters by default, and best papers may be given the opportunity for a brief talk to introduce their work. Reviewing will be double-blind; no author information should be included in the papers, and self-references that identify the authors should be avoided or anonymised. Accepted papers will appear in the workshop proceedings. Preference for oral presentation slots in the workshop will be given to archival papers.
- Non-archival regular workshop papers: This is the same as the option above, but these papers will not appear in the proceedings and will typically only receive poster presentation slots. Non-archival submissions in this category will still undergo the review process. This is appropriate for nearly finished work that is intended for submission to another venue at a later date.
- Non-archival cross-submissions: We also solicit cross-submissions, i.e., papers on relevant topics that have already appeared in other venues (e.g., workshop or conference papers at NLP, ML, or cognitive science venues, among others). Accepted papers will be presented at the workshop, with an indication of original venue, but will not be included in the workshop proceedings. Cross-submissions are ideal for related work which would benefit from exposure to the audience working on Scientific Discovery. Papers in this category do not need to follow the ACL format, and the submission length is determined by the original venue. The paper selection will be solely determined by the organizing committee in a non-blind fashion. These papers will typically receive poster presentation slots.
In addition, we welcome papers on relevant topics that are under review or to be submitted to other venues (including the ACL 2025 main conference). These papers must follow the regular workshop paper format and will not be included in the workshop proceedings. Papers in this category will be reviewed by workshop reviewers.
Note to authors: For archival and non-archival regular workshop submissions, when you submit your paper through OpenReview (link), please select the appropriate "Track" based on the guidelines above. For cross-submissions, please fill out this form ([link]) and do NOT submit through OpenReview.
For questions about the submission guidelines, please contact workshop organizers via aisd-organizers@googlegroups.com.
Important Dates
Paper Submission Deadline | Feb 6, 2025 |
Decision Notifications | Feb 27, 2025 |
Camera-Ready Paper Deadline | Mar 10, 2025 |
Workshop Date | May 3, 2025 |
All deadlines are 11:59 PM, Anywhere on Earth (AoE).
Schedule (Tentative)
08:55 AM | Opening Remarks |
09:00 AM |
Keynote Talk 1: Marinka Zitnik TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools [Abstract] [Speaker Bio]
Abstract: Precision therapeutics require multimodal adaptive models that generate personalized treatment recommendations. We introduce TxAgent, an AI agent that leverages multi-step reasoning and real-time biomedical knowledge retrieval across a toolbox of 211 tools to analyze drug interactions, contraindications, and patient-specific treatment strategies. TxAgent evaluates how drugs interact at molecular, pharmacokinetic, and clinical levels, identifies contraindications based on patient comorbidities and concurrent medications, and tailors treatment strategies to individual patient characteristics. It retrieves and synthesizes evidence from multiple biomedical sources, assesses interactions between drugs and patient conditions, and refines treatment recommendations through iterative reasoning. It selects tools based on task objectives and executes structured function calls to solve therapeutic tasks that require clinical reasoning and cross-source validation. The ToolUniverse consolidates 211 tools from trusted sources, including all US FDA-approved drugs since 1939 and validated clinical insights from Open Targets. TxAgent outperforms leading LLMs, tool-use models, and reasoning agents across five new benchmarks: DrugPC, BrandPC, GenericPC, TreatmentPC, and DescriptionPC, covering 3,168 drug reasoning tasks and 456 personalized treatment scenarios. It achieves 92.1% accuracy in open-ended drug reasoning tasks, surpassing GPT-4o and outperforming DeepSeek-R1 (671B) in structured multi-step reasoning. TxAgent generalizes across drug name variants and descriptions. By integrating multi-step inference, real-time knowledge grounding, and tool-assisted decision-making, TxAgent ensures that treatment recommendations align with established clinical guidelines and real-world evidence, reducing the risk of adverse events and improving therapeutic decision-making.
Bio: Marinka Zitnik (https://zitniklab.hms.harvard.edu) is an Associate Professor of Biomedical Informatics at Harvard Medical School, at Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University and at Broad Institute of MIT and Harvard. Zitnik investigates foundations of AI that contribute to the scientific understanding of medicine and therapeutic design, eventually enabling AI to learn and innovate on its own. Her research won best paper and research awards, including the Kavli Fellowship of the National Academy of Sciences, Kaneb Fellowship award at Harvard Medical School, NSF CAREER Award, awards from the International Society for Computational Biology, International Conference in Machine Learning, Bayer Early Excellence in Science, Amazon Faculty Research, Google Faculty Research, Roche Alliance with Distinguished Scientists, and Sanofi iDEA-iTECH Award. Zitnik founded Therapeutics Data Commons, a global open-science initiative to access and evaluate AI across stages of development and therapeutic modalities, and she served as the faculty lead of the AI4Science initiative.
|
09:45 AM |
Keynote Talk 2: Kexin Huang AI Agents for Accelerating Scientific Discovery: From Hypothesis Generation to Experimental Design [Abstract] [Speaker Bio]
Abstract: Scientific research is inherently slow, complex, and highly specialized, relying on an iterative cycle of hypothesis generation and validation. Can we create an AI-powered biologist to automate these processes at scale? In this talk, I will outline the key ingredients for building such a system. I will first introduce Biomni, a generalist AI agent that integrates large language models with a vast ecosystem of specialized biological tools, software, and databases. By linking AI-driven reasoning with real-world execution, Biomni enables a broad range of biomedical applications. Building on this foundation, I will introduce BioDiscoveryAgent, an AI system for automated hypothesis generation that optimizes experimental design by identifying the most promising perturbations for screening. By leveraging literature-based reasoning, it significantly outperforms traditional Bayesian optimization approaches. Finally, beyond hypothesis generation, rigorous validation is essential. I will present POPPER, an AI-driven sequential falsification framework that automates hypothesis validation with statistical guarantees, ensuring scientific robustness at scale.
Bio: Kexin Huang is a fourth-year PhD student in Computer Science at Stanford University, advised by Prof. Jure Leskovec. His research focuses on leveraging AI to drive novel, deployable, and interpretable biomedical discoveries, while also tackling fundamental AI challenges such as multi-modal modeling, uncertainty quantification, and agentic reasoning. His work has been published in Nature Medicine, Nature Biotechnology, Nature Chemical Biology, Nature Biomedical Engineering, Nature, and machine learning conferences including NeurIPS, ICML, ICLR, and UAI. He has received numerous best paper awards at NeurIPS/ICML workshops, ISMB, and ASHG, with cover article in Nature Biotechnology and Cell Patterns. His research has been featured in major media outlets such as Forbes, WIRED, and MIT Technology Review. He has also contributed to machine learning research at leading companies and institutions, including Genentech, GSK, Pfizer, IQVIA, Flatiron Health, Dana-Farber Cancer Institute, and Rockefeller University.
|
10:30 AM | Break 1 |
11:00 AM |
Keynote Talk 3: Heng Ji AI Plays Medicinal Chemist and Material Scientist [Abstract] [Speaker Bio]
Abstract: There exist approximately 166 billion small molecules, with 970 million deemed druglike. Similarly, there is a vast pool of candidate molecules for new materials. The scale of this search space underscores the urgent need for innovative approaches, calling upon the NLP community to contribute significantly to medicine and materials science. However, the challenges are manifold. Existing large language models (LLMs) alone are insufficient due to their tendency to generate erroneous claims confidently. Moreover, traditional knowledge bases do not adequately address the issue. This gap persists because chemistry language diverges significantly from natural language, demanding specialized domain knowledge, joint molecule and language modeling, and critical thinking. Using drug discovery, personalized drug synergy, and material discovery as case studies, I will present our approaches to tackling these challenges and turning an AI agent into a Medicinal Chemist or Material Scientist. I will share preliminary results from animal testing conducted on drug variants, and newly discovered material variants for efficient Organic Photovoltaic Devices proposed by AI algorithms.
Bio: Heng Ji is a Tenured Full Professor and Associate Director for Research of Siebel School of Computing and Data Science, and a faculty member affiliated with Electrical and Computer Engineering Department, Coordinated Science Laboratory, and Carl R. Woese Institute for Genomic Biology of University of Illinois Urbana-Champaign. She is an Amazon Scholar. She is the Founding Director of Amazon-Illinois Center on AI for Interactive Conversational Experiences (AICE). She received Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially on Multimedia Multilingual Information Extraction, Knowledge-enhanced Large Language Models and Vision-Language Models, and AI for Science. The awards she received include Outstanding Paper Award at ACL2024, two Outstanding Paper Awards at NAACL2024, "Young Scientist" by the World Laureates Association in 2023 and 2024, "Young Scientist" and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017, "Women Leaders of Conversational AI" (Class of 2023) by Project Voice, "AI's 10 to Watch" Award by IEEE Intelligent Systems in 2013, NSF CAREER award in 2009, PACLIC2012 Best paper runner-up, "Best of ICDM2013" paper award, "Best of SDM2013" paper award, ACL2018 Best Demo paper nomination, ACL2020 Best Demo Paper Award, NAACL2021 Best Demo Paper Award, Google Research Award in 2009 and 2014, IBM Watson Faculty Award in 2012 and 2014 and Bosch Research Award in 2014-2018. She served as the associate editor for IEEE/ACM Transaction on Audio, Speech, and Language Processing, and the Program Committee Co-Chair of many conferences including NAACL-HLT2018 and AACL-IJCNLP2022. She was elected as the North American Chapter of the Association for Computational Linguistics (NAACL) secretary 2020-2023.
|
11:45 AM |
Oral Presentation 1: LLM-Assisted Translation of Legacy FORTRAN Codes to C++: A Cross-Platform Study [Abstract]
Abstract: TBD
|
12:00 PM |
Oral Presentation 2: Scideator: Iterative Human-LLM Scientific Idea Generation and Novelty Evaluation Grounded in Research-Paper Facet Recombination [Abstract]
Abstract: TBD
|
12:15 PM |
Oral Presentation 3: What Can Large Language Models Do for Sustainable Food? [Abstract]
Abstract: TBD
|
12:30 PM | Lunch Break |
02:00 PM |
Keynote Talk 4: Peter Clark Towards (Semi-)Autonomous Scientific Discovery [Abstract] [Speaker Bio]
Abstract: Automated/assisted scientific discovery is one of the most exciting emerging areas of AI, fueled by the promise - or at least hope - that language models can overcome some of the show-stopping obstacles of the past. At Ai2 we are pursuing this topic, developing a sophisticated research assistant for humans and prototyping several (semi-)autonomous scientific discovery systems. First, I'll describe our research assistant with an extended example dialog, showing how the user can use tools to search the literature, identify relevant data, run software experiments, and analyze the results, illustrating the vision we are pursuing. Following this, I'll describe our (largely) autonomous prototypes, showing how they iteratively design and perform experiments on their own (for select computer science tasks: probing language models, building software agents, and improving transformer architectures). In particular, I'll highlight where they succeed and what their fundamental limitations are. Finally, I'll speculate on what it would take to go further, from today's systems that typically search a somewhat bounded and delineated space, to "big science" where research is conducted over many iterations and months, where the search spaces themselves are regularly redefined, and where predictive theories about the world gradually evolve.
Bio: Dr. Peter Clark is a Senior Research Director and founding member of the Allen Institute for AI (AI2), and also served as Interim CEO from 2022-2023. He leads AI2's Aristo Project, a team of 15 people developing AI agents that can systematically reason, explain, and continually improve over time, in particular in the context of scientific discovery. He received his Ph.D. in 1991 and has worked in AI for over 30 years. He has published over 300 papers, and has received several awards, including five Best Paper awards (AAAI, EMNLPx3, AKBC), a Boeing Associate Technical Fellowship (2004), and Senior Membership of AAAI. |
02:45 PM |
Oral Presentation 4: Language Modeling by Language Models with Genesys [Abstract]
Abstract: TBD
|
03:00 PM |
Oral Presentation 5: Towards AI-assisted Academic Writing [Abstract]
Abstract: TBD
|
03:15 PM |
Oral Presentation 6: Large Language Models Reflect Human Citation Patterns with a Heightened Citation Bias [Abstract]
Abstract: TBD
|
03:30 PM | Break 2 |
04:00 PM | In-Person Poster Session |
Accepted Papers (Archival)
- Variable Extraction for Model Recovery in Scientific Literature [Archival]
- How Well Do Large Language Models Extract Keywords? A Systematic Evaluation on Scientific Corpora [Archival]
- A Human-LLM Note-Taking System with Case-Based Reasoning as Framework for Scientific Discovery [Archival]
- Towards AI-assisted Academic Writing [Archival]
- Evaluating and Enhancing Large Language Models for Novelty Assessment in Scholarly Publications [Archival]
- LLM-Assisted Translation of Legacy FORTRAN Codes to C++: A Cross-Platform Study [Archival]
- FlavorDiffusion: Predicting Food Pairings and Chemical Interactions Using Diffusion Models [Archival]
Accepted Papers (Non-Archival Previously Published Papers)
Note: These papers have been previously accepted at other venues, and are highly relevant to AI & Scientific Discovery.
- ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery [Non-Archival Published]
- Large Language Models Reflect Human Citation Patterns with a Heightened Citation Bias (NAACL 2025) [Non-Archival Published]
- MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses (ICLR 2025) [Non-Archival Published]
- Efficient Evolutionary Search Over Chemical Space with Large Language Models (ICLR 2025) [Non-Archival Published]
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models (ICLR 2025) [Non-Archival Published]
- DiscoveryWorld: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents (NeurIPS 2024 Spotlight) [Non-Archival Published]
- Hypothesis Generation with Large Language Models (EMNLP 2024) [Non-Archival Published]
Accepted Papers (Non-Archival)
Note: These papers are non-archival, and not included in the official proceedings.
- VISION: A Modular AI Assistant for Natural Human-Instrument Interaction at Scientific User Facilities [Non-Archival]
- Automatic Evaluation Metrics for Artificially Generated Scientific Research [Non-Archival]
- WithdrarXiv: A Large-Scale Dataset for Retraction Study [Non-Archival]
- FARM: Functional Group-Aware Representations for Small Molecules [Non-Archival]
- Scideator: Iterative Human-LLM Scientific Idea Generation and Novelty Evaluation Grounded in Research-Paper Facet Recombination [Non-Archival]
- What Can Large Language Models Do for Sustainable Food? [Non-Archival]
- Learning to Generate Research Idea with Dynamic Control [Non-Archival]
- Map2Text: New Content Generation from Low-Dimensional Visualizations [Non-Archival]
- Data Driven Design as a Challenge Task for Few- and Zero-Shot Information Extraction [Non-Archival]
- Language Modeling by Language Models with Genesys [Non-Archival]