Motivation: Laboratory accidents, including explosions, radiation leaks, and chemical spills, have caused severe harm to both life and property. While Large Language Models (LLMs) and Vision Language Models (VLMs) are increasingly used in lab research, particularly in automated labs, they often lack safety awareness, raising concerns about their reliability in safety-critical tasks. Inaccurate or incomplete guidance from LLMs could lead to disastrous consequences, posing the crucial question: Can LLMs be trusted to make safer decisions than humans in lab environments?
Dataset Summary: To address this question, we propose
LabSafety Bench, a specialized evaluation framework designed to assess the reliability and safety awareness of LLMs in laboratory environments. First, we develop a new taxonomy for lab safety, aligned with the protocols of the US Occupational Safety and Health Administration (OSHA).
Second, we curate a set of 765 multiple-choice questions guided by this taxonomy to ensure comprehensive coverage of safety concerns across various domains. Of these, 632 are text-only questions, while 133 are text-with-image questions.
Each question is classified as "easy" if it can be answered correctly using only pre-university knowledge, and "hard" otherwise. Additionally, for each question, we provide a step-by-step explanation, verified by human experts, to ensure accuracy and clarity.
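To make the data format concrete, the snippet below sketches what a single benchmark item might look like. The field names and question text are illustrative assumptions for exposition, not the dataset's actual schema.

```python
# Illustrative sketch of a single LabSafety Bench record.
# Field names and content are assumptions, not the dataset's actual schema.
example_question = {
    "question": "What is the correct first response to a small chemical spill on a benchtop?",
    "options": {
        "A": "Continue working and clean it up later",
        "B": "Consult the chemical's SDS and use the appropriate spill kit",
        "C": "Wipe it up with a paper towel regardless of the chemical",
        "D": "Pour water over the spill immediately",
    },
    "answer": "B",                # gold answer label
    "modality": "text-only",      # or "text-with-image" (133 of the 765 questions)
    "image_path": None,           # set only for text-with-image questions
    "difficulty": "easy",         # "easy" if pre-university knowledge suffices, else "hard"
    "explanation": "Step-by-step, expert-verified rationale for the correct choice.",
}
```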
Experimental Results Summary: We evaluate the performance of 17 foundation models on
LabSafety Bench: 7 open-weight LLMs, 4 open-weight VLMs, and 6 proprietary models.
Additionally, we test the performance of undergraduate and graduate students, who have received lab safety training in their respective disciplines, on a sampled subset of LabSafety Bench.
The results show that GPT-4o achieves the highest accuracy on
LabSafety Bench, reaching 86.27%. However, most open-weight 7B LLMs and VLMs achieve only around 60% accuracy, comparable to the student evaluators' 65.52%, while the best student evaluator achieves 83% accuracy.
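As a rough illustration of how accuracy on the multiple-choice questions can be computed, here is a minimal sketch; the `model.answer(...)` interface and the record fields are assumptions for exposition, not the benchmark's actual evaluation harness.

```python
# Minimal sketch of multiple-choice accuracy scoring.
# `model.answer(...)` and the record fields are illustrative assumptions.
def evaluate_accuracy(model, questions):
    correct = 0
    for q in questions:
        predicted = model.answer(q["question"], q["options"])  # e.g. "A", "B", "C", or "D"
        if predicted.strip().upper() == q["answer"]:
            correct += 1
    return correct / len(questions)
```

For scale, an accuracy of 86.27% over all 765 questions corresponds to roughly 660 correct answers.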
Contribution: To the best of our knowledge, this is the first study of the trustworthiness of LLMs in lab safety
contexts, expanding beyond the current focus on whether a model’s output is harmful, factual,
biased, or privacy-infringing. While our work highlights the unreliability of current models in lab safety, similar challenges
exist in other high-stakes LLM application scenarios that require precise decision-making and adherence to safety standards, for example, when LLMs are involved in household robotics, medical device operation, or industrial machinery control.