LabSafety Bench

Benchmarking LLMs on Safety Issues in Scientific Labs

University of Notre Dame,
IBM Research
Lab Safety Taxonomy

Our proposed taxonomy for lab safety, serving as a framework to guide the data curation process for LabSafety Bench.

Introduction

Motivation: Laboratory accidents, including explosions, radiation leaks, and chemical spills, have caused severe harm to both life and property. While Large Language Models (LLMs) and Vision Language Models (VLMs) are increasingly used in lab research, particularly in automated labs, they often lack safety awareness, raising concerns about their reliability in safety-critical tasks. Inaccurate or incomplete guidance from LLMs could lead to disastrous consequences, posing the crucial question: Can LLMs be trusted to make safer decisions than humans in lab environments?

Dataset Summary: To address this question, we propose LabSafety Bench, a specialized evaluation framework designed to assess the reliability and safety awareness of LLMs in laboratory environments. First, we propose a new taxonomy for lab safety, aligned with US Occupational Safety and Health Administration (OSHA) protocols. Second, we curate a set of 765 multiple-choice questions guided by this taxonomy to ensure comprehensive coverage of safety concerns across various domains. Of these, 632 are text-only questions, while 133 are text-with-image questions. Each question is classified as either "easy" or "hard", depending on whether it can be answered correctly using only pre-university knowledge. Additionally, for each question, we provide a step-by-step explanation that has been verified by human experts to ensure accuracy and clarity.

Experimental Results Summary: We evaluate the performance of 17 foundation models on LabSafety Bench: 7 open-weight LLMs, 4 open-weight VLMs, and 6 proprietary models. Additionally, we test the performance of undergraduate and graduate students who have received lab safety training in their respective disciplines on a subset sampled from LabSafety Bench. The results show that GPT-4o achieves the highest accuracy on LabSafety Bench, reaching 86.27%. However, most open-weight 7B-scale LLMs and VLMs only achieve around 60% accuracy, which is comparable to the student evaluators' performance of 65.52%, while the best student evaluator achieves 83% accuracy.

Contribution: To the best of our knowledge, this is the first study of the trustworthiness of LLMs in lab safety contexts, expanding beyond the current focus on whether a model's output is harmful, factual, biased, or privacy-infringing. While our work highlights the unreliability of current models in lab safety, similar challenges exist in other high-stakes LLM application scenarios that require precise decision-making and adherence to safety standards, for example when LLMs are involved in household robotics, medical device operations, or industrial machinery control.

LabSafety Bench Dataset

Overview

LabSafety Bench is a specialized evaluation framework designed to assess the reliability and safety awareness of LLMs/VLMs in laboratory environments. First, we propose a new taxonomy for lab safety, aligned with US Occupational Safety and Health Administration (OSHA) protocols. We categorize lab safety issues into four main groups: Hazardous Substances, Emergency Response, Responsibility and Compliance, and Equipment and Material Handling. Some of these categories are further divided into subcategories based on the specific discipline or area of focus. Next, we gather an extensive corpus focused on lab safety based on the taxonomy. Human experts then identify key knowledge points within these materials, which are used to generate questions and options with the assistance of GPT-4o. Since the initial set of questions may include overly simplistic incorrect options, we prompt GPT-4o to refine these options, making them more challenging and less obvious. Finally, human experts review the questions to ensure each one has exactly one correct answer, resulting in the final version of the questions. Through this process, we curate a set of 765 multiple-choice questions that comprehensively cover safety concerns across various domains. Of these, 632 are text-only questions, while 133 are text-with-image questions. Each question is classified as either "easy" or "hard", depending on whether it can be answered correctly using only pre-university knowledge. Additionally, for each question, we provide a step-by-step explanation that has been verified by human experts to ensure accuracy and clarity. The dataset is available for download on Hugging Face Datasets.
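
For reference, below is a minimal sketch of pulling the benchmark from the Hugging Face Hub with the `datasets` library. The repository ID, split name, and field names used here are assumptions for illustration only; the dataset card on Hugging Face documents the actual schema.

```python
# Minimal sketch: load LabSafety Bench and inspect one question.
# Repository ID, split name, and column names are assumptions for
# illustration; check the dataset card for the actual schema.
from datasets import load_dataset

data = load_dataset("yujunzhou/LabSafety_Bench", split="train")  # hypothetical ID/split

example = data[0]
print(example["Question"])        # question stem plus lettered options (assumed field)
print(example["Correct Answer"])  # gold letter (assumed field)
print(example["Level"])           # "easy" or "hard" (assumed field)
print(example["Explanation"])     # expert-verified step-by-step explanation (assumed field)
```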

To conduct human evaluation, we construct four questionnaires covering four subjects: Chemistry, Biology, Physics, and General Lab Safety. Each questionnaire contains the categories related to its subject. We sample 25 instances from each subject, yielding 100 samples in total for human evaluation; we refer to this subset as the sampled LabSafety Bench.
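
A minimal sketch of drawing such a stratified sample is shown below; the subject field name and the random seed are assumptions for illustration, and `data` refers to the dataset object loaded in the previous snippet.

```python
import random

# Hypothetical subject tags; each questionnaire covers one of these.
SUBJECTS = ["Chemistry", "Biology", "Physics", "General Lab Safety"]
SAMPLES_PER_SUBJECT = 25

random.seed(0)  # arbitrary seed so the example is reproducible
sampled = []
for subject in SUBJECTS:
    pool = [q for q in data if q["Subject"] == subject]  # assumed field name
    sampled.extend(random.sample(pool, SAMPLES_PER_SUBJECT))

assert len(sampled) == 100  # 25 questions per subject, four subjects
```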

Examples

Statistics

Notable statistics of LabSafety Bench.

Experiment Results

Model List

Model Category Version Creator Source Link
Llama3-8B Open-weight LLM - Meta HuggingFace
Llama3-70B Open-weight LLM - Meta HuggingFace
Vicuna-7B Open-weight LLM v1.5 LMSYS HuggingFace
Vicuna-13B Open-weight LLM v1.5 LMSYS HuggingFace
Mistral-7B Open-weight LLM v0.3 Mistral AI HuggingFace
Mixtral-8x7B Open-weight LLM v0.1 Mistral AI HuggingFace
Galactica-6.7B Open-weight LLM - Meta HuggingFace
InstructBlip-7B Open-weight VLM - Salesforce HuggingFace
Qwen-VL-Chat Open-weight VLM - Alibaba HuggingFace
InternVL2-8B Open-weight VLM - OpenGVLab HuggingFace
Llama3.2-11B Open-weight VLM - Meta HuggingFace
Gemini-1.5-Flash Proprietary model - Google DeepMind Google API
Gemini-1.5-Pro Proprietary model - Google DeepMind Google API
Claude-3-Haiku Proprietary model 20240307 Anthropic Anthropic API
Claude-3.5-Sonnet Proprietary model 20240620 Anthropic Anthropic API
GPT-4o-mini Proprietary model 2024-07-18 OpenAI OpenAI API
GPT-4o Proprietary model 2024-08-06 OpenAI OpenAI API
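
To give a concrete sense of how a model is queried on a benchmark question, the sketch below poses one multiple-choice question to GPT-4o through the OpenAI API in a zero-shot setting and extracts the predicted letter. The prompt wording and the answer-parsing rule are illustrative assumptions, not the exact evaluation protocol used for the leaderboard.

```python
import re
from openai import OpenAI  # requires the openai package and an OPENAI_API_KEY

client = OpenAI()

def ask_multiple_choice(question: str, options: list[str]) -> str:
    """Pose one multiple-choice question and return the predicted letter."""
    letters = "ABCD"
    formatted = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    prompt = (
        "Answer the following lab safety question with a single letter "
        f"(A, B, C, or D).\n\n{question}\n{formatted}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # version string from the model list above
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content or ""
    match = re.search(r"\b([ABCD])\b", text)  # naive answer extraction
    return match.group(1) if match else ""
```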

Leaderboard on LabSafety Bench

Leaderboard: LLMs on Text-only Questions

Accuracy scores on the Text-only subset (632 samples) of LabSafety Bench across different categories.

# Model ALL BH CH RH PH RS EWM EU ES PPE ER
1 Llama3-70B 78.32 76.92 78.75 71.73 87.30 79.75 73.33 76.50 70.00 87.95 78.64
2 Llama3-8B 65.19 69.23 65.93 68.18 63.49 68.35 69.33 60.11 75.00 60.24 69.90
3 Vicuna-7B 36.08 41.35 36.63 32.95 26.98 34.18 44.00 39.47 20.00 39.76 32.04
4 Vicuna-13B 46.52 50.96 45.42 43.94 34.82 46.84 52.00 45.36 60.00 59.04 46.60
5 Mistral-7B 58.39 59.62 61.17 56.82 47.62 63.29 62.67 55.19 55.00 60.24 59.22
6 Mixtral-8x7B 62.82 65.38 65.93 53.41 60.32 60.76 58.47 55.19 55.00 61.45 67.96
7 Galactica-6.7B 33.54 42.31 30.77 34.09 25.40 32.91 33.33 31.15 35.00 37.35 33.01

Leaderboard: VLMs on Text-with-image Questions

Accuracy scores on the Text-with-image subset (133 samples) of LabSafety Bench across different categories.

# Model ALL BH CH RH PH RS EWM EU ES PPE ER
1 InstructBlip-7B 26.32 37.50 20.55 31.25 26.67 27.91 37.50 31.58 0.00 40.00 17.65
2 Qwen-VL-Chat 64.66 62.50 65.75 75.00 60.00 65.12 62.50 55.26 0.00 60.00 76.47
3 InternVL2-8B 72.93 62.50 79.45 75.00 46.67 74.42 66.47 60.53 100.0 88.00 76.47
4 Llama3.2-11B 72.18 75.00 71.23 75.00 53.33 76.74 62.50 63.16 100.0 80.00 82.35
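
For text-with-image questions, the question text and the corresponding image are supplied together. The sketch below shows one way a text-with-image question could be sent to a multimodal chat API (here GPT-4o, whose combined-dataset results appear in the next table); the open-weight VLMs listed above would instead be run locally with their own inference code. The image URL and prompt handling are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()

def ask_with_image(question: str, image_url: str) -> str:
    """Send a text-with-image multiple-choice question to a multimodal model."""
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},  # stem plus lettered options
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content or ""
```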

Leaderboard: Models on Both Types of Questions

Accuracy scores on the full LabSafety Bench dataset across different categories.

# Model ALL BH CH RH PH RS EWM EU ES PPE ER
1 Gemini-1.5-Flash 74.32 75.89 76.01 71.15 70.51 77.87 79.52 72.85 80.95 75.00 73.33
2 Gemini-1.5-Pro 79.91 79.46 81.93 75.00 76.19 83.61 81.82 81.58 90.00 82.41 84.17
3 Claude-3-Haiku 77.65 81.25 78.95 69.23 77.78 81.25 83.13 72.85 85.19 80.56 78.74
4 Claude-3.5-Sonnet 83.04 86.61 86.93 83.33 81.48 83.61 87.04 83.68 87.04 85.00 88.05
5 GPT-4o-mini 79.74 79.74 79.79 71.79 79.51 81.87 90.74 80.71 83.67 78.41 87.50
6 GPT-4o 84.96 84.92 82.86 83.33 84.31 84.36 90.56 85.74 100.0 87.04 90.86
7 Top-3 Human* 75.67 78.26 86.67 74.07 71.11 84.62 82.67 41.67 75.00 66.67 88.00
Top-3 Human*: Since some human participants are junior researchers who may not fully represent the true capabilities of experienced experts, we selected the top-3 scorers and calculated their accuracy across each subcategory.
Hazardous Substances: BH: Biological Hazards, CH: Chemical Hazards, RH: Radiation Hazards, PH: Physical Hazards.
Responsibility and Compliance: RS: Responsibility for Safety, EWM: Environmental and Waste Management.
Equipment and Material Handling: EU: Equipment Usage, ES: Electricity Safety, PPE: Personal Protective Equipment.
Emergency Response: ER: Emergency Response.
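
The per-category numbers above are plain accuracies computed within each subcategory, with ALL covering every question in the corresponding subset. A minimal bookkeeping sketch is shown below, assuming each record carries its subcategory tag(s), the model's predicted letter, and the gold letter (hypothetical field names).

```python
from collections import defaultdict

def category_accuracy(records):
    """Return overall (ALL) and per-subcategory accuracy in percent.

    Assumes hypothetical keys: "Categories" (one or more subcategory tags,
    e.g. "CH" or "PPE"), "Prediction", and "Answer".
    """
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        hit = rec["Prediction"] == rec["Answer"]
        for tag in ["ALL", *rec["Categories"]]:
            total[tag] += 1
            correct[tag] += int(hit)
    return {tag: 100.0 * correct[tag] / total[tag] for tag in total}
```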

Results on Existing Foundation Models

GPT-4o Error Analysis

BibTeX

@misc{zhou2024labsafetybenchbenchmarkingllms,
      title={LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs}, 
      author={Yujun Zhou and Jingdong Yang and Kehan Guo and Pin-Yu Chen and Tian Gao and Werner Geyer and Nuno Moniz and Nitesh V Chawla and Xiangliang Zhang},
      year={2024},
      eprint={2410.14182},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.14182}, 
}