$\mu$ -Bench: Vision-Language Benchmark for Microscopy Understanding

Alejandro Lozano *
Department of Biomedical Data Science
Stanford University
Stanford, CA 94305

Jeffrey Nirschl *
Department of Pathology
Stanford University
Stanford, CA 94305

James Burgess
ICME
Stanford University
Stanford, CA 94305

Sanket Rajan Gupte
Department of Computer Science
Stanford University
Stanford, CA 94305

Yuhui Zhang
Department of Computer Science
Stanford University
Stanford, CA 94305

Alyssa Unell
Department of Computer Science
Stanford University
Stanford, CA 94305

Serena Yeung-Levy
Department of Biomedical Data Science
Stanford University
Stanford, CA 94305

Abstract

Recent advances in microscopy have enabled the rapid generation of terabytes of image data in cell biology and biomedical research. Vision-language models (VLMs) offer a promising solution for large-scale biological image analysis, enhancing researchers’ efficiency, identifying new image biomarkers, and accelerating hypothesis generation and scientific discovery. However, there is a lack of standardized, diverse, and large-scale vision-language benchmarks to evaluate VLMs’ perception and cognition capabilities in biological image understanding. To address this gap, we introduce $\mu$ -Bench, an expert-curated benchmark encompassing 22 biomedical tasks across various scientific disciplines (biology, pathology), microscopy modalities (electron, fluorescence, light), scales (subcellular, cellular, tissue), and organisms in both normal and abnormal states. We evaluate state-of-the-art biomedical, pathology, and general VLMs on $\mu$ -Bench and find that: i) current models struggle on all categories, even for basic tasks such as distinguishing microscopy modalities; ii) current specialist models fine-tuned on biomedical data often perform worse than generalist models; iii) fine-tuning in specific microscopy domains can cause catastrophic forgetting, eroding prior biomedical knowledge encoded in their base model; and iv) weight interpolation between fine-tuned and pre-trained models offers one solution to forgetting and improves general performance across biomedical tasks. We release $\mu$ -Bench under a permissive license to accelerate the research and development of microscopy foundation models.

1 Introduction

Figure 1: Data samples from $\mu$ -Bench, covering perception (left) and cognition (right) tasks at the subcellular, cellular, and tissue levels across electron, fluorescence, and light microscopy.

Microscopy is a cornerstone of biomedical research [1], enabling detailed study of biological structures at multiple scales [2]. Advances in cryo-electron microscopy, high-throughput fluorescence microscopy, and whole-slide imaging allow the rapid generation of terabytes of image data, which are essential for fields such as cell biology, biomedical research, and pathology [3]. These data span multiple length scales, allowing researchers to examine atomic/molecular, subcellular/cellular, and cell/tissue-level structures with high precision [4]. A crucial first step in microscopy analysis is interpreting and reasoning about the significance of image findings [5]. However, this requires domain expertise and comprehensive knowledge of biology, normal/abnormal states, and the capabilities and limitations of microscopy techniques. The increased volume and velocity of microscopy data compound the challenge of manual microscopy image interpretation.

Text is an intuitive interface for interactive analysis, and thus, vision-language models (VLMs) are one promising approach to assist with image interpretation. Auto-regressive VLMs can follow instructions and respond to questions using free text, which is intuitive for users without a computational background. A biomedical VLM that has distilled knowledge from diverse microscopy images with input from multiple domain experts could also democratize image interpretation for specialized techniques. In addition, such a model could relate image findings to recent literature, relevant genes, small molecule therapeutics, and diseases, and potentially accelerate hypothesis generation and discovery [6, 7, 8, 9, 10, 11, 12, 13, 14].

Foundation VLMs trained on large corpora of image-text data have revolutionized many areas of artificial intelligence (AI), excelling in a variety of natural language processing (NLP) [15, 16], computer-vision (CV) [17, 18], and vision-language [19, 20, 21] tasks. It is well known that biomedical data are significantly under-represented in internet-based image datasets, and a critical first question is how well generalist VLMs trained on natural image-text data perform on out-of-domain biomedical vision-language tasks.

However, there are no diverse, large-scale vision-language benchmarks to evaluate generalist or specialist VLMs in microscopy image interpretation. Whereas segmentation benchmarks are common [22], and recent efforts have curated impressive image-drug phenotype multimodal datasets [23], the current state of microscopy vision-language benchmarks is limited. Existing microscopy vision-language benchmarks often focus on single-domain diagnostic capabilities (e.g., histopathology [24, 25]) rather than on description and understanding [24]. Although these are significant contributions, they lack the diversity of images and content needed to evaluate the performance of VLMs across tasks, scales, microscopy modalities, organisms, and biological processes. Lastly, in contrast to the general vision-language domain, current biomedical evaluations lack comprehensive characterization across both perception (recognizing specific objects, counting, localization, and color) and cognition abilities (integrating knowledge and perceptual information to deduce more complex answers) [26]. This gap hinders progress in developing robust VLMs tailored for biomedical research.

To address the need for robust VLMs tailored for biomedical research, we present two contributions:

1. $\mu$ -Bench: A high-quality vision-language benchmark of 17,235 microscopy images from a diverse collection of unpublished and published datasets with newly added expert annotations and a permissive license. $\mu$ -Bench spans 22 perception and cognition tasks across light, fluorescence, and electron microscopy. It includes closed VQA, captioning, object detection, and segmentation tasks across 8 microscopy sub-modalities and 24 staining techniques, representative of 12 scientific domains.

2. Characterization: We leverage $\mu$ -Bench to characterize state-of-the-art generalist and domain-specific biomedical VLMs. We show several findings: even the best-performing VLMs have high error rates across all microscopy tasks (i.e., they do not generalize well); specialist biomedical VLMs often underperform general VLMs; specialist fine-tuning in specific domains can cause catastrophic forgetting of biological knowledge that existed in the base model; and a simple weight ensembling strategy can mitigate the forgetting problem.

2 Related Work

Biomedical Vision-Language Benchmarks. While previous works have developed various biomedical vision-language benchmarks that have been instrumental in advancing diagnostic capabilities, they present three main problems: 1) Task simplicity: Most biomedical computer vision benchmarks predominantly focus on supervised tasks such as classification and segmentation [27, 28, 29] rather than vision-language tasks. As shown in Table 8, biomedical vision-language benchmarks are less common. 2) Lack of diversity: existing vision-language datasets are usually limited to diagnostic imaging such as radiology or pathology [30, 31]; there is a lack of benchmarks for basic research microscopy. 3) Limited accessibility: PMC-15M is a large image-text dataset used to train BiomedCLIP [3], which includes diverse microscopy images, domains, and organisms but is not publicly accessible for training or evaluation.

Vision-Language Models in Biomedicine. Vision-language models (VLMs) can be generally categorized into two types: 1) Contrastive models, such as CLIP [19] and ALIGN [32], which use contrastive learning to create shared image-text embeddings, facilitating tasks like zero-shot classification and text-image retrieval; and 2) Auto-regressive models, such as Flamingo [20] and GPT-4 [33], which integrate image embeddings with large language models (LLMs) to perform zero-shot tasks, follow instructions, and reason about content. These VLMs have significant potential to advance biomedicine [34, 5, 6, 7, 35].

However, these models are primarily trained on general datasets with limited biomedical coverage, leading to suboptimal performance on biomedical tasks [36, 37]. To address this gap, specialized VLMs have been developed by fine-tuning generalist models on biomedical data. Notable examples include BiomedCLIP [3], trained on images from PubMed, and histopathology vision-language models such as PLIP [38] and CONCH [39], which were trained on Twitter or various pathology sources. Despite the increase in vision-language foundation models for microscopy, there are no diverse benchmarks to evaluate image-based perception and reasoning about microscopy images across diverse scales and modalities. Our work on $\mu$ -Bench addresses this issue by providing a comprehensive benchmark that includes a variety of important and diverse biological processes, organisms, microscopy modalities, domains, and tasks to support the evaluation of microscopy foundation models.

3 Dataset collection methodology

Recognizing the need for an expert-level benchmark in microscopy for comprehensive biological and biomedical understanding, we developed a benchmark to assess the perception and cognition capabilities of VLMs in microscopy image analysis following the methodology shown in Figure 2. At a high level, the pipeline consists of two main components: (i) A biomedical expert categorized potential tasks and collected diverse microscopy datasets across multiple scientific domains, focusing on evaluating perception capabilities. (ii) We then complement $\mu$ -Bench by crowdsourcing questions from a larger group of microscopists using a web application.

3.1 Perception Dataset Curation

Dataset Review and Selection Open data repositories, including Zenodo, Dataverse, Dryad, and BBBC, among others, were searched for biomedical microscopy image datasets. Data with permissive licenses (CC BY 4.0) allowing derivatives and redistribution were prioritized. A cell biologist and pathologist reviewed the images to ensure high quality (e.g., absence of artifacts or distortion). Diverse datasets were selected to include important biological processes (e.g., cell cycle), organelles (mitochondria, nucleus), and cell/tissue types. Efforts were made to include diverse biological structures, microscopy modalities, and fields of study; however, the field of basic microscopy research is broad, and future efforts are intended to fill gaps in coverage.

Standardization The original datasets had different organizational structures and file formats and often very little metadata. Information regarding the scientific discipline (domain), microscopy method, staining, pixel calibration, and the organism was determined by expert review or by consulting the original publication. The base experimental metadata was supplemented with manual annotation of multiple bio-ontology identifiers (SNOMED, BTO, FMA, LOINC, UBERON, etc.) so that the image data can later be connected to knowledge graphs of biological concepts and relationships. All images were converted into lossless PNG files at their original resolution with metadata in a paired JSON file. An MD5 checksum was computed for the image data, and each image was assigned a 128-bit unique identifier. The image-JSON pairs were converted into an Apache Arrow file for public distribution and ease of use through Hugging Face datasets [40].
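As a rough illustration of the checksum and identifier step, the following Python sketch computes an MD5 digest of the image bytes, assigns a 128-bit identifier, and serializes the paired record as JSON. The field names (`image_id`, `md5`, `metadata`) and the use of a random UUID4 are assumptions made for this example; the text does not specify how identifiers were generated or how the JSON file is laid out.

```python
import hashlib
import json
import uuid


def build_record(image_bytes: bytes, metadata: dict) -> dict:
    """Pair an image payload with a checksum, a 128-bit ID, and its metadata."""
    return {
        # uuid4 is a random 128-bit identifier (assumed scheme, not from the paper).
        "image_id": str(uuid.uuid4()),
        # MD5 digest of the raw image bytes, useful for detecting corruption or duplicates.
        "md5": hashlib.md5(image_bytes).hexdigest(),
        "metadata": metadata,
    }


# Placeholder bytes stand in for the contents of a PNG file.
record = build_record(b"fake-png-bytes", {"modality": "fluorescence", "staining": "DAPI"})
sidecar = json.dumps(record, indent=2, sort_keys=True)
```

Hashing the bytes rather than the file name means that two copies of the same image stored under different names produce the same digest.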

VQA task generation We used the standardized metadata to create closed VQA questions that test capabilities at different levels: easier coarse-grained perception and challenging fine-grained perception (examples are shown in Figure 1).

The coarse-grained perception split tests basic image properties: the broad category of scientific discipline, the type of microscope, or the stain/contrast agent. These groups are visually distinct (e.g. fluorescence vs. electron microscopy) and relatively straightforward even for non-biologists, but provide important context to biology image interpretation. Although easier, these tasks are important to assess whether VLMs have commonsense knowledge of biology and microscopy.

The fine-grained perception split is more challenging. Within each category of scientific discipline or microscopy modality, there are image classes or features that need to be recognised to perform image interpretation. Dataset-specific tasks include the identification of cell type, subcellular organelles, cell cycle phase, and other biological processes that are visually distinct and important for reasoning about biological images. Solving fine-grained perception relies on finer-grained visual features, and is more challenging for humans.

We formulate both coarse-grained and fine-grained perception as closed visual question answering (VQA). We chose this over open VQA because it is simpler to analyze and does not rely on LLMs for automatic evaluation. To generate VQA options in coarse-grained perception, we designed a tree encompassing microscopy modalities, scientific domains, and staining techniques, which enables sampling fine-grained options within concepts (e.g., selecting IHC(DAB) and IHC(RED) as likely stain options for a question regarding light microscopy).
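The tree-based option sampling can be sketched as follows. This is a minimal Python example under stated assumptions: the two-level `STAIN_TREE`, the function name `sample_options`, and every label other than IHC(DAB) and IHC(RED) are hypothetical placeholders, not the taxonomy used to build the benchmark.

```python
import random

# Hypothetical two-level taxonomy: modality -> stains. Only the IHC entries come
# from the text; the other labels are illustrative placeholders.
STAIN_TREE = {
    "light microscopy": ["H&E", "IHC(DAB)", "IHC(RED)", "Giemsa"],
    "fluorescence microscopy": ["DAPI", "GFP", "Phalloidin"],
    "electron microscopy": ["Uranyl acetate", "Osmium tetroxide"],
}


def sample_options(tree, parent, answer, n_options, rng):
    """Return a shuffled option list: the answer plus sibling distractors."""
    siblings = [leaf for leaf in tree[parent] if leaf != answer]
    # Draw distractors from the same branch so they are plausible for this modality.
    distractors = rng.sample(siblings, min(n_options - 1, len(siblings)))
    options = distractors + [answer]
    rng.shuffle(options)
    return options


rng = random.Random(0)
options = sample_options(STAIN_TREE, "light microscopy", "H&E", 3, rng)
```

Sampling distractors from sibling nodes keeps the wrong options plausible, so a model cannot answer by ruling out stains that belong to a different modality.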

Localization task generation We also generate a spatial localization benchmark split, which requires predicting the bounding box or segmentation mask for a cell, nucleus or organelle (examples are in Tables 1). Understanding position and layout enables modeling spatial relationships and context, and is fundamental to image understanding. Datasets with segmentation were converted to allow instance segmentation, semantic segmentation, and object detection (bounding box and centroid).
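The conversion from a segmentation mask to object-detection targets can be illustrated with a short Python sketch. It assumes a single-object binary mask stored as nested lists and inclusive pixel coordinates; both are choices made for this example, and the function name is hypothetical.

```python
def mask_to_box_and_centroid(mask):
    """Convert a binary mask (list of rows of 0/1) to a box and a centroid.

    Returns ((x_min, y_min, x_max, y_max), (x_center, y_center)) in pixel
    coordinates, or None if the mask has no foreground pixels.
    """
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    # The box is the tightest axis-aligned rectangle around the foreground pixels.
    box = (min(xs), min(ys), max(xs), max(ys))
    # The centroid is the mean position of the foreground pixels.
    centroid = (sum(xs) / len(xs), sum(ys) / len(ys))
    return box, centroid


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
box, centroid = mask_to_box_and_centroid(mask)  # (1, 1, 2, 2) and (1.5, 1.5)
```

For instance segmentation, the same function would be applied once per instance label.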

Quality control Throughout all processing, we validate the schema of each data instance to ensure a consistent format and catch/fix errors before adding them to $\mu$ -Bench. The schema includes: modality: identification of the microscopy modality (BF, EF, or EM); submodality: identification of the microscopy sub-modality (e.g., confocal, phase contrast, or scanning electron microscopy); domain: determination of the field of study (e.g., cell biology, histology, or pathology); subdomain: identification of the sub-field (e.g., cancer research, neurobiology, or infectious diseases); staining: recognition of the staining technique (e.g., H&E, DAPI, or IHC).
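A schema check of this kind could look like the Python sketch below. The five field names and the three modality codes come from the description above; the validation rules themselves (non-empty strings, a closed set of modality codes) and the function name are assumptions for illustration, not the authors' validator.

```python
REQUIRED_FIELDS = ("modality", "submodality", "domain", "subdomain", "staining")
# Modality codes as listed in the text.
ALLOWED_MODALITIES = {"BF", "EF", "EM"}


def validate_instance(instance: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = instance.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    modality = instance.get("modality")
    # Only report an unknown code when a modality string was actually provided.
    if isinstance(modality, str) and modality.strip() and modality not in ALLOWED_MODALITIES:
        problems.append(f"unknown modality: {modality}")
    return problems


example = {
    "modality": "EM",
    "submodality": "scanning electron microscopy",
    "domain": "cell biology",
    "subdomain": "neurobiology",
    "staining": "none",
}
issues = validate_instance(example)  # an empty list for this instance
```

Returning a list of problems instead of raising on the first error makes it easier to fix all issues in an instance in one pass.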

3.2 Cognitive Dataset Curation

While perception datasets evaluate the fundamental capabilities of VLMs for microscopy image analysis, they fall short in assessing their ability to use perception to reason about objects. We curated a cognitive dataset to evaluate more advanced aspects like knowledge and reasoning. The cognitive dataset includes questions related to gene pathways, metabolic pathways, cell signaling and signal transduction, cell physiology and function, protein-protein interactions, cell-cell interactions, unique properties of the cell of origin or cell type in the image, cytoskeleton and cell structure or morphology, and drug mechanisms of action. These categories cover fundamental biological concepts and cellular processes to more deeply evaluate VLMs’ knowledge in understanding microscopy images.

Cognition Dataset Collection

We began by providing detailed guidelines for question creation to experts (see section C.5), which ensured consistency and quality across the dataset. Using an internal chat-like web interface, we asked domain experts to submit questions reflecting their daily research activities. We encouraged a focus on questions that required challenging image-based reasoning, domain expertise, interpretation of experimental results, or hypothesis generation.

In addition to submitting questions, experts provided crucial context regarding experimental details, image acquisition, organisms, treatments, and image descriptions. With this comprehensive information, GPT-4V generated answers to the submitted questions. These answers were subsequently reviewed by experts, who evaluated the accuracy and interpretation of the responses.

Multiple-Choice Question Transformation The collected tuples (image, question, GPT-4V answer, feedback) were transformed into multiple-choice questions using GPT-4. This transformation was guided by a carefully designed prompt (Figure 6), verified by a cell biologist and a pathologist, to ensure the questions are challenging and reflective of real-world scenarios faced by biomedical researchers. Each transformed question includes an image, a question, and six candidate choices. One choice is correct, while the other five are distractors generated by GPT-4, where one choice is “None of the above.” Domain experts verified the validity of the generated questions and manually corrected a small number of questions. Finally, we ensured that correct answers were uniformly distributed among answer choices A to F and formalized the entire dataset.
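One way to balance the position of the correct answer across choices A to F is sketched below in Python. This is an assumed procedure for illustration only; the text states the goal (a uniform distribution of correct answers) but not how it was achieved, and the data layout and function name are hypothetical.

```python
import random
from collections import Counter

LETTERS = "ABCDEF"


def balance_answers(questions, rng):
    """Place each correct answer so that letters A-F are used equally often.

    Each question is a dict with "correct" (str) and "distractors" (five strs).
    """
    # A shuffled repeating cycle of target positions gives near-uniform letter counts.
    targets = [i % len(LETTERS) for i in range(len(questions))]
    rng.shuffle(targets)
    balanced = []
    for question, target in zip(questions, targets):
        choices = list(question["distractors"])
        rng.shuffle(choices)
        # Insert the correct answer at its assigned position among the distractors.
        choices.insert(target, question["correct"])
        balanced.append({"choices": choices, "answer": LETTERS[target]})
    return balanced


rng = random.Random(0)
questions = [
    {"correct": f"right-{i}", "distractors": [f"wrong-{i}-{j}" for j in range(5)]}
    for i in range(12)
]
balanced = balance_answers(questions, rng)
counts = Counter(item["answer"] for item in balanced)
```

With 12 toy questions, each letter is the correct answer exactly twice, which removes any positional bias a model could exploit.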

4 Dataset Description

Perception Dataset Statistics For our perception benchmark, we collected a total of 17,235 microscopy images from 24 distinct public datasets (see the dataset sources table) with permissive licensing, prioritizing open CC-BY licenses. To the best of our knowledge, $\mu$ -Bench Perception is the most diverse microscopy vision-language benchmark, spanning light (LM), fluorescence (FM), and electron microscopy (EM), covering 8 microscopy sub-modalities (see Figure 3), 91 unique cells, tissues, and structures over 24 unique staining techniques (see Figure 12). The perception benchmark subset spans this diversity through closed VQA, object detection, and segmentation.

Cognition Dataset Statistics For our cognition benchmark, we collected 54 microscopy images and 121 questions from experts in the field. Entries were received from 6 users across 5 different institutions. The $\mu$ -Bench Cognition dataset encompasses 3 modalities (fluorescence, electron, light) with 12 sub-modalities, 2 domains (pathology and biology) with 14 sub-domains, and 3 scales (nano, micro, macro), covering a diverse range of topics such as pathology, immunology, and virology. Distributions are shown in Appendix Table 14.

5 VLM benchmarking and results

5.1 Benchmarking approach

Data artifacts like $\mu$ -Bench enable studying model behavior within specialist domains. Since our benchmark covers a wide range of biomedical tasks, we can, for the first time, compare biomedical perception and cognition capabilities across microscopy imaging modalities. In this section, we show the utility of $\mu$ -Bench by reporting empirical findings on a range of VLMs.

Table 1: Macro-average accuracy (with bootstrap confidence interval) for coarse-grained and fine-grained perception and cognition (reasoning) in $\mu$ -Bench.

| Perception (Coarse-Grained): Model | Accuracy ($\pm$ CI) | Perception (Fine-Grained): Model | Accuracy ($\pm$ CI) | Cognition (Reasoning): Model | Accuracy ($\pm$ CI) |
|---|---|---|---|---|---|
| GPT-4o | 62.68 ($\pm$ 0.35) | GPT-4o | 51.73 ($\pm$ 0.82) | GPT-4o | 62.00 ($\pm$ 9.00) |
| CogVLM | 52.05 ($\pm$ 0.35) | BiomedCLIP | 34.65 ($\pm$ 0.75) | QwenVLM | 41.00 ($\pm$ 10.00) |
| QwenVLM | 49.85 ($\pm$ 0.35) | CONCH | 33.64 ($\pm$ 0.72) | CogVLM | 41.00 ($\pm$ 10.00) |
| BiomedCLIP | 47.57 ($\pm$ 0.34) | ALIGN | 31.9 ($\pm$ 0.72) | OpenCLIP | 38.33 ($\pm$ 8.33) |
| ALIGN | 40.7 ($\pm$ 0.34) | CLIP | 30.09 ($\pm$ 0.71) | ALIGN | 31.00 ($\pm$ 9.00) |
| OpenCLIP | 36.34 ($\pm$ 0.33) | OpenCLIP | 29.36 ($\pm$ 0.69) | CLIP | 28.00 ($\pm$ 9.00) |
| PaliGemma | 36.29 ($\pm$ 0.33) | CogVLM | 28.18 ($\pm$ 0.70) | PaliGemma | 25.00 ($\pm$ 8.00) |
| CLIP | 35.41 ($\pm$ 0.34) | QuiltNet | 27.85 ($\pm$ 0.69) | BiomedCLIP | 25.00 ($\pm$ 8.00) |
| PLIP | 31.11 ($\pm$ 0.32) | QwenVLM | 27.81 ($\pm$ 0.70) | CONCH | 18.00 ($\pm$ 7.00) |
| CONCH | 27.84 ($\pm$ 0.31) | PLIP | 25.49 ($\pm$ 0.68) | Random | 17.00 ($\pm$ 7.00) |
| QuiltNet | 26.58 ($\pm$ 0.31) | PaliGemma | 21.29 ($\pm$ 0.64) | PLIP | 17.00 ($\pm$ 7.00) |
| Random | 18.34 ($\pm$ 0.27) | Random | 19.13 ($\pm$ 0.60) | QuiltNet | 13.00 ($\pm$ 6.00) |

Model categories: general autoregressive VLMs, general contrastive VLMs, pathology contrastive VLMs, and biomedical contrastive VLMs.

First, we categorized VLMs into two groups: generalist models trained on natural images and language, and ‘specialist’ models, fine-tuned on biomedical data. Within generalist models, we also distinguish between contrastive and auto-regressive models.

Generalist Contrastive (GC) VLMs

We evaluate ALIGN [32], OpenCLIP [41], and CLIP [19] as the canonical contrastive models for natural images. Notably, OpenCLIP and CLIP serve as foundational models for numerous specialist biomedical VLMs.

Generalist autoregressive (GA) VLMs

We evaluate GPT-4o [33], a state-of-the-art enterprise VLM. For open-source models, we test CogVLM [42], QwenVLM [21], and PaliGemma [43] for their strong performance on general domain tasks, instruction-following capabilities, and, for QwenVLM and PaliGemma only, their ability to perform object detection.

Specialist contrastive (SC) VLMs

Our specialist model selection had two constraints: choosing the best-performing models and preferring minimal architectural changes from their base generalist versions, allowing performance analysis based on variations in training mixtures. We selected BiomedCLIP [3] since it is a strong model that covers all biomedical imaging modalities in our benchmark (trained on 15 million image-text pairs collected from PubMed Central). Additionally, we included three pathology VLMs: PLIP (CLIP trained on H&E) [38], QuiltNet (CLIP trained on H&E and IHC) [44], and CONCH (CoCa trained on H&E and IHC) [45], with training dataset sizes of 208k, 1 million, and 1.2 million, respectively. While BiomedCLIP and CONCH are based on OpenCLIP and CoCa [39], respectively, they modify the architecture or training strategy.

Evaluation

The Closed VQA component of $\mu$ -Bench was evaluated with accuracy, generating confidence intervals (CI) via bootstrap [46] (section F.2). Object detection was evaluated for models with object detection capabilities (PaliGemma and QwenVLM) in open VQA format using the GRIT localization metric [47] as adopted by prior works [21].
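To make the bootstrap procedure concrete, the Python sketch below computes a percentile-bootstrap confidence interval for accuracy on a single task. The number of resamples, the 95% level, and the percentile method are assumptions for this example; the exact settings used for the benchmark are described in section F.2 and are not reproduced here.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for accuracy from a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    accuracies = []
    for _ in range(n_resamples):
        # Resample examples with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        accuracies.append(sum(sample) / n)
    accuracies.sort()
    # Read the lower and upper percentiles off the sorted bootstrap distribution.
    lower = accuracies[int((alpha / 2) * n_resamples)]
    upper = accuracies[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper


outcomes = [1] * 62 + [0] * 38  # toy data: 62 of 100 answers correct
accuracy, lower, upper = bootstrap_accuracy_ci(outcomes)
```

A macro-average would apply this per task and then average the per-task accuracies, so that large tasks do not dominate the score. The interval narrows as the number of evaluated examples grows, which is consistent with the much wider intervals reported for the small cognition split in Table 1.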

5.2 Results

All models have high error rates

Table 1 shows accuracy across perception and cognition tasks. Even the top-performing model (GPT-4o) has high error rates, with an accuracy of 62.7% on coarse-grained perception, 51.7% on fine-grained perception, and 62.0% on cognition tasks. On average, GPT-4o also outperforms BiomedCLIP (the best biomedical SC VLM) by a minimum of 15% in all evaluation dimensions and CONCH (the best pathology SC VLM) in pathology-specific perception tasks, showing a difference of 39.37% on coarse-grained and 19.40% on fine-grained tasks (as illustrated in Table 1). However, finer subgroup analysis (Figure 8) shows that GPT-4o does not excel in all perception tasks, including domain identification (coarse-grained) and single-molecule imaging, normal vs abnormal classification, and non-neoplastic histopathology interpretation (fine-grained). The model architecture and training data for GPT-4 are closed source, making it challenging to draw conclusions from these results. However, GPT-4o’s high error rates, its substantial gap compared to SC models, and performance variation across task subgroups highlight that $\mu$ -Bench is challenging and is not saturated by state-of-the-art general, biomedical, and pathology models.

Specialist biomedical models are often worse than non-specialist models While specialist models are explicitly developed for the biomedical domain, they often underperform non-specialized open-source models. For example, in both coarse-grained perception and cognition tasks (Table 1), GA models (CogVLM and QwenVLM) outperform the best SC model (BiomedCLIP) by margins of 4.4% and 16.0%, respectively. While GA models have a different training objective, larger training mixture, and more model parameters, a similar trend is observed with GC models (ALIGN, OpenCLIP, and CLIP), as they outperform all pathology VLMs in the same tasks by at least 9.5% (PLIP vs. ALIGN) and 20.3% (CONCH vs. OpenCLIP), respectively. This ranking is reversed in fine-grained perception tasks, where BiomedCLIP and CONCH perform best. Indeed, fine-grained perception closely resembles the data mixture used to fine-tune contrastive specialist models [25]. This characterization reveals a weakness in current microscopy biomedical model development, which we investigate next.

Specialist training can cause catastrophic forgetting Generalist contrastive models (OpenCLIP and CLIP) surprisingly outperform their fine-tuned counterparts (PLIP and QuiltNet) in coarse-grained perception and cognition (Table 1). Specifically, PLIP and QuiltNet are fine-tuned directly from OpenCLIP and CLIP using only pathology data, which most closely resembles the $\mu$ -Bench fine-grained perception tasks. Although this fine-tuning improves performance on pathology-specific fine-grained tasks (Figure 4), it degrades performance on all other tasks (Table 1).

$\mu$ -Bench characterization drives robust model development To address catastrophic forgetting identified in our multi-level evaluation, we ensemble base model weights (OpenCLIP / CLIP) with fine-tuned model weights (PLIP/QuiltNet) to create merged models (PLIP+OpenCLIP / QuiltNet+CLIP), as suggested in [48]. As shown in Figure 5, when comparing merged models to their fine-tuned counterparts, perception performance increases across all of $\mu$ -Bench (y-axis), including pathology-specific tasks (x-axis).
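The weight-merging idea can be illustrated with a minimal Python sketch that linearly interpolates two sets of parameters with matching names. The mixing coefficient `alpha`, its default of 0.5, and the representation of parameters as flat lists of floats are assumptions for this example; the text does not state the coefficient used, and a real implementation would operate on framework tensors.

```python
def interpolate_weights(base, finetuned, alpha=0.5):
    """Linearly interpolate two models' parameters, name by name.

    alpha = 0 returns the base weights, alpha = 1 the fine-tuned weights.
    Parameters are represented as flat lists of floats for simplicity.
    """
    if base.keys() != finetuned.keys():
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name, base_values in base.items():
        tuned_values = finetuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * t for b, t in zip(base_values, tuned_values)
        ]
    return merged


# Toy parameters standing in for a base model and its fine-tuned counterpart.
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
finetuned = {"layer.weight": [4.0, 2.0], "layer.bias": [3.0]}
merged = interpolate_weights(base, finetuned, alpha=0.5)
```

Interpolation of this kind only makes sense when the fine-tuned model shares its architecture and initialization with the base model, which is why the merged pairs are formed from each specialist and its own base.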

$\mu$ -Bench supports probing design decisions for biomedical VLMs We have shown that $\mu$ -Bench offers valuable insights into microscopy biomedical VLMs and hope it encourages further evaluations of design choices. Data diversity is one factor: Table 1 illustrates that BiomedCLIP, trained across all microscopy modalities in $\mu$ -Bench, surpasses specialist models, albeit with a smaller margin for fine-grained tasks compared to CONCH, which uses pathology data. Regarding model architecture and training strategy, generalist autoregressive models (CogVLM and QwenVLM) outperform contrastive models (ALIGN and CLIP) in coarse-grained perception, but the opposite is true for fine-grained perception. For object localization, PaliGemma outperformed QwenVLM (Table 5) on $\mu$ -Bench, though both performed poorly, and no specialist models support detection. Future research could explore prompting strategies, data curation, and new methods to mitigate catastrophic forgetting.

6 Conclusion

Benchmarks drive advancements in machine learning by providing a standard to measure progress and allowing researchers to identify weaknesses in current approaches. Thus, the lack of biomedical vision-language benchmarks limits the ability to develop and evaluate specialist VLMs. We address this gap in microscopy by introducing the most extensive collection of vision-language tasks spanning perception and cognition. We use $\mu$ -Bench to establish, for the first time, the performance of some of the most capable VLMs available and find high error rates (above 30% even for the best-performing model), highlighting room for improvement. We demonstrate how $\mu$ -Bench can be leveraged to generate new insights. Lastly, we share $\mu$ -Bench to enable researchers to measure progress in microscopy foundation models.

References

[1] Arno P Merkle and Jeff Gelb. The ascent of 3d x-ray microscopy in the laboratory. Microscopy Today, 21(2):10–15, 2013.
[2] Michael Weber and Jan Huisken. Multidisciplinarity is critical to unlock the full potential of modern light microscopy. Frontiers in Cell and Developmental Biology, 9:739015, 2021.
[3] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023.
[4] Laurence Foss. The end of modern medicine: Biomedical science under a microscope. SUNY Press, 2002.
[5] Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai. NEJM AI, 1(3):AIoa2300138, 2024.
[6] Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. Empowering biomedical discovery with ai agents. arXiv preprint arXiv:2404.02831, 2024.
[7] Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023.
[8] Morgan Schwartz, Uriah Israel, Xuefei Wang, Emily Laubscher, Changhua Yu, Rohit Dilip, Qilin Li, Joud Mari, Johnathon Soro, Kevin Yu, et al. Scaling biological discovery at the interface of deep learning and cellular imaging. Nature Methods, 20(7):956–957, 2023.
[9] Leonel Malacrida. Phasor plots and the future of spectral and lifetime imaging. Nature Methods, 20(7):965–967, 2023.
[10] Damian Dalle Nogare, Matthew Hartley, Joran Deschamps, Jan Ellenberg, and Florian Jug. Using ai in bioimage analysis to elevate the rate of scientific discovery as a community. Nature Methods, 20(7):973–975, 2023.
[11] Anne E Carpenter, Beth A Cimini, and Kevin W Eliceiri. Smart microscopes of the future. Nature Methods, 20(7):962–964, 2023.
[12] Xinyang Li, Yuanlong Zhang, Jiamin Wu, and Qionghai Dai. Challenges and opportunities in bioimage analysis. Nature Methods, 20(7):958–961, 2023.
[13] Talley Lambert and Jennifer Waters. Towards effective adoption of novel image analysis methods. Nature Methods, 20(7):971–972, 2023.
[14] Marco Y Hein, Duo Peng, Verina Todorova, Frank McCarthy, Kibeom Kim, Chad Liu, Laura Savy, Camille Januel, Rodrigo Baltazar-Nunez, Sophie Bax, et al. Global organelle profiling reveals subcellular localization and remodeling at proteome scale. bioRxiv, pages 2023–12, 2023.
[15] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[17] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[18] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
[19] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[20] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
[21] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[22] Jun Ma, Ronald Xie, Shamini Ayyadhury, Cheng Ge, Anubha Gupta, Ritu Gupta, Song Gu, Yao Zhang, Gihun Lee, Joonkee Kim, et al. The multimodality cell segmentation challenge: toward universal solutions. Nature methods, pages 1–11, 2024.
[23] Srinivas Niranj Chandrasekaran, Beth A Cimini, Amy Goodale, Lisa Miller, Maria Kost-Alimova, Nasim Jamali, John G Doench, Briana Fritchman, Adam Skepner, Michelle Melanson, et al. Three million images and morphological profiles of cells treated with matched chemical and genetic perturbations. Nature Methods, pages 1–8, 2024.
[24] Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020.
[25] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical twitter. Nature medicine, 29(9):2307–2316, 2023.
[26] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[27] Juan C Caicedo, Allen Goodman, Kyle W Karhohs, Beth A Cimini, Jeanelle Ackerman, Marzieh Haghighi, CherKeng Heng, Tim Becker, Minh Doan, Claire McQuin, et al. Nucleus segmentation across imaging experiments: the 2018 data science bowl. Nature methods, 16(12):1247–1253, 2019.
[28] Joel Saltz, Rajarsi Gupta, Le Hou, Tahsin Kurc, Pankaj Singh, Vu Nguyen, Dimitris Samaras, Kenneth R Shroyer, Tianhao Zhao, Rebecca Batiste, et al. Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell reports, 23(1):181–193, 2018.
[29] Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M Summers, et al. The medical segmentation decathlon. Nature communications, 13(1):4128, 2022.
[30] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
[31] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. nature, 542(7639):115–118, 2017.
[32] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
[33] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[34] Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259–265, 2023.
[35] Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools. Nature Machine Intelligence, pages 1–11, 2024.
[36] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[37] Zihao Zhao, Yuxiao Liu, Han Wu, Yonghao Li, Sheng Wang, Lin Teng, Disheng Liu, Xiang Li, Zhiming Cui, Qian Wang, et al. Clip in medical imaging: A comprehensive survey. arXiv preprint arXiv:2312.07353, 2023.
[38] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical twitter. Nature medicine, 29(9):2307–2316, 2023.
[39] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
[40] Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
[41] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
[42] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
[43] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
[44] Wisdom Ikezogwo, Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, and Linda Shapiro. Quilt-1m: One million image-text pairs for histopathology. Advances in Neural Information Processing Systems, 36, 2024.
[45] Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Andrew Zhang, Long Phi Le, et al. Towards a visual-language foundation model for computational pathology. arXiv preprint arXiv:2307.12914, 2023.
[46] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. Chapman and Hall/CRC, 1994.
[47] Tanmay Gupta, Ryan Marten, Aniruddha Kembhavi, and Derek Hoiem. Grit: General robust image task benchmark. arXiv preprint arXiv:2204.13653, 2022.
[48] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7959–7971, 2022.
[49] Andrea Acevedo, Anna Merino, Santiago Alférez, Ángel Molina, Laura Boldú, and José Rodellar. A dataset of microscopic peripheral blood cell images for development of automatic recognition systems. Data Brief, 30(105474):105474, June 2020.
[50] James Burgess, Jeffrey J Nirschl, Maria-Clara Zanellati, Alejandro Lozano, Sarah Cohen, and Serena Yeung-Levy. Orientation-invariant autoencoders learn robust representations for shape profiling of cells and organelles. Nat. Commun., 15(1), February 2024.
[51] Yong Wu, Mansoureh Eghbali, Jimmy Ou, Rong Lu, Ligia Toro, and Enrico Stefani. Quantitative determination of spatial protein-protein correlations in fluorescence confocal microscopy. Biophys. J., 98(3):493–504, February 2010.
[52] Andrii Iudin, Paul K Korir, Sriram Somasundharam, Simone Weyand, Cesare Cattavitello, Neli Fonseca, Osman Salih, Gerard J Kleywegt, and Ardan Patwardhan. Empiar: the electron microscopy public image archive. Nucleic Acids Research, 51(D1):D1503–D1511, 2023.
[53] Philipp Eulenberg, Niklas Köhler, Thomas Blasi, Andrew Filby, Anne E Carpenter, Paul Rees, Fabian J Theis, and F Alexander Wolf. Reconstructing cell cycle and disease progression using deep learning. Nature communications, 8(1):463, 2017.
[54] Michael Held, Michael H A Schmitz, Bernd Fischer, Thomas Walter, Beate Neumann, Michael H Olma, Matthias Peter, Jan Ellenberg, and Daniel W Gerlich. CellCognition: time-resolved phenotype annotation in high-throughput live cell imaging. Nat. Methods, 7(9):747–754, September 2010.
[55] Elima Hussain, Lipi B Mahanta, Himakshi Borah, and Chandana Ray Das. Liquid based-cytology pap smear dataset for automated multi-class diagnosis of pre-cancerous and cervical cancer lesions. Data Brief, 30(105589):105589, June 2020.
[56] Sebastiano Battiato, Alessandro Ortis, Francesca Trenta, Lorenzo Ascari, Mara Politi, and Consolata Siniscalco. Pollen13k: A large scale microscope pollen grain image dataset. 2020 IEEE International Conference on Image Processing (ICIP), pages 2456–2460, 2020.
[57] Changhun Jung, Mohammed Abuhamad, David Mohaisen, Kyungja Han, and DaeHun Nyang. Wbc image classification and generative models based on convolutional neural network. BMC Medical Imaging, 22(1):94, 2022.
[58] Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Zöllner. Multi-class texture analysis in colorectal cancer histology. Sci. Rep., 6:27988, June 2016.
[59] Jakob Nikolas Kather, Johannes Krisam, Pornpimol Charoentong, Tom Luedde, Esther Herpel, Cleo-Aron Weis, Timo Gaiser, Alexander Marx, Nektarios A Valous, Dyke Ferber, Lina Jansen, Constantino Carlos Reyes-Aldasoro, Inka Zörnig, Dirk Jäger, Hermann Brenner, Jenny Chang-Claude, Michael Hoffmeister, and Niels Halama. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS Med., 16(1):e1002730, January 2019.
[60] Jeffrey J Nirschl, Andrew Janowczyk, Eliot G Peyster, Renee Frank, Kenneth B Margulies, Michael D Feldman, and Anant Madabhushi. A deep-learning classifier identifies patients with clinical heart failure using whole-slide images of h&e tissue. PLoS One, 13(4):e0192726, April 2018.
[61] Nathan H Cho, Keith C Cheveralls, Andreas-David Brunner, Kibeom Kim, André C Michaelis, Preethi Raghavan, Hirofumi Kobayashi, Laura Savy, Jason Y Li, Hera Canaj, James Y S Kim, Edna M Stewart, Christian Gnann, Frank McCarthy, Joana P Cabrera, Rachel M Brunetti, Bryant B Chhun, Greg Dingle, Marco Y Hein, Bo Huang, Shalin B Mehta, Jonathan S Weissman, Rafael Gómez-Sjöberg, Daniel N Itzhak, Loïc A Royer, Matthias Mann, and Manuel D Leonetti. OpenCell: Endogenous tagging for the cartography of human cellular organization. Science, 375(6585):eabi6983, March 2022.
[62] Korsuk Sirinukunwattana, Josien P. W. Pluim, Hao Chen, Xiaojuan Qi, Pheng-Ann Heng, Yun Bo Guo, Li Yang Wang, Bogdan J. Matuszewski, Elia Bruni, Urko Sanchez, Anton Böhm, Olaf Ronneberger, Bassem Ben Cheikh, Daniel Racoceanu, Philipp Kainz, Michael Pfeiffer, Martin Urschler, David R. J. Snead, and Nasir M. Rajpoot. Gland segmentation in colon histology images: The glas challenge contest, 2016.
[63] Ziqi Tang, Kangway V Chuang, Charles DeCarli, Lee-Way Jin, Laurel Beckett, Michael J Keiser, and Brittany N Dugger. Interpretable classification of alzheimer’s disease pathologies with a convolutional neural network pipeline. Nat. Commun., 10(1):2173, May 2019.
[64] Daniel R Wong, Ziqi Tang, Nicholas C Mew, Sakshi Das, Justin Athey, Kirsty E McAleese, Julia K Kofler, Margaret E Flanagan, Ewa Borys, Charles L White, 3rd, Atul J Butte, Brittany N Dugger, and Michael J Keiser. Deep learning from multiple experts improves identification of amyloid neuropathologies. Acta Neuropathol. Commun., 10(1):66, April 2022.
[65] Gong-Her Wu, Charlene Smith-Geater, Jesús G Galaz-Montoya, Yingli Gu, Sanket R Gupte, Ranen Aviner, Patrick G Mitchell, Joy Hsu, Ricardo Miramontes, Keona Q Wang, Nicolette R Geller, Cathy Hou, Cristina Danita, Lydia-Marie Joubert, Michael F Schmid, Serena Yeung, Judith Frydman, William Mobley, Chengbiao Wu, Leslie M Thompson, and Wah Chiu. CryoET reveals organelle phenotypes in huntington disease patient iPSC-derived and mouse primary neurons. Nat. Commun., 14(1):692, February 2023.
[66] Sadid A Hasan, Yuan Ling, Oladimeji Farri, Joey Liu, Henning Müller, and Matthew P Lungren. Overview of imageclef 2018 medical domain visual question answering task. In CLEF (Working Notes), 2018.
[67] Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Joey Liu, Dina Demner-Fushman, and Henning Müller. Vqa-med: Overview of the medical visual question answering task at imageclef 2019. CLEF (working notes), 2(6), 2019.
[68] Asma Ben Abacha, Mourad Sarrouti, Dina Demner-Fushman, Sadid A Hasan, and Henning Müller. Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. In Proceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September 2021, 2021.
[69] Noor Mohamed Sheerin Sitara and Kavitha Srinivasan. Ssn mlrg at vqa-med 2021: An approach for vqa to solve abnormality related queries using improved datasets. In CLEF (working notes), pages 1329–1335, 2021.
[70] Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018.
[71] Olga Kovaleva, Chaitanya Shivade, Satyananda Kashyap, Karina Kanjaria, Joy Wu, Deddeh Ballah, Adam Coy, Alexandros Karargyris, Yufan Guo, David Beymer, et al. Towards visual dialog for radiology. In Proceedings of the 19th SIGBioMed workshop on biomedical language processing, pages 60–69, 2020.
[72] Yefan Huang, Xiaoli Wang, Feiyan Liu, and Guofeng Huang. Ovqa: A clinically generated visual question answering dataset. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2924–2938, 2022.
[73] Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE, 2021.
[74] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023.
[75] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
[76] Souradip Chakraborty, Ekaba Bisong, Shweta Bhatt, Thomas Wagner, Riley Elliott, and Francesco Mosconi. Biomedbert: A pre-trained biomedical language model for qa and ir. In Proceedings of the 28th international conference on computational linguistics, pages 669–679, 2020.

7 Appendix

Appendix A Limitations

$\mu$ -Bench is a diverse benchmark for evaluating vision-language models on microscopy image and text data. The dataset is intended for testing purposes only, not for training. The moderate size will allow many academic researchers to assess the performance of models trained on natural images or other biology datasets.

While $\mu$ -Bench covers various biological length scales, microscopy modalities, scientific disciplines, and organisms, not all domains and modalities are equally represented. This partially reflects the usage patterns of the field, with human samples (cell lines or tissues) being more common in biomedical research. Brightfield and fluorescence microscopy images are also more prevalent in $\mu$ -Bench than electron microscopy images. As a result, findings on $\mu$ -Bench may not apply equally well to all model organisms or to electron microscopy. VLM performance may be lower in these areas because such data are rare in natural image datasets and uncommon in biomedical datasets.

We strive for a high-quality dataset and involve cell biologists and pathologists during dataset creation, quality control, and interpretation of results. However, the field of biology is diverse, and no individual or small group can reliably stay up-to-date with all aspects of biological/biomedical research. For example, we included a botany dataset (ICPR 2020 Pollen) to show a commitment to including diverse scientific disciplines, but we currently do not have expert plant biologist contributors.

$\mu$ -Bench represents the first vision-language benchmark for microscopy covering all major biology length scales and modalities. We see $\mu$ -Bench as a living benchmark we intend to grow, although we acknowledge the aforementioned limitations. Future versions will prioritize incorporating new data from diverse organisms and microscopy modalities to improve representation across all length scales, microscopy modalities, and organisms. We will also benefit from community engagement by involving domain experts from diverse fields and obtaining data to balance currently under-represented areas.

Appendix B Ethical Compliance and Acknowledgements

B.1 Ethics Statement

Safe and ethical use of biomedical data: We prioritize safe and ethical research practices while creating $\mu$ -Bench . All public datasets with patient-derived histopathology images had already been de-identified by the dataset’s original authors in compliance with applicable privacy laws and institutional guidelines. The public histopathology image data and metadata were reviewed, and it was determined that it was not possible to identify any individual from the de-identified data. The Stanford Institutional Review Board guidelines were reviewed and discussed, and the use of the images was determined not to be human subjects research.

The $\mu$ -Bench cognition images and questions were voluntarily submitted by a small (<10) number of biology/pathology users alpha-testing a free web chat application. There was no intervention, experiment vs. control group, or research question during the alpha testing. There was no greater risk of using the app than other internet apps. At registration, users agreed to the service terms, which included releasing image and text data under CC-BY-SA-4.0.

Consent and data usage: We thank and respect the original dataset authors and use data according to the original copyright and license. $\mu$ -Bench was developed with both academic and commercial research in mind. Many datasets use a version of CC-BY-4.0, which allows both academic and commercial usage. However, some data restrict commercial applications via non-commercial clauses or CC-BY-NC-4.0-related licenses. While creating $\mu$ -Bench, we significantly improved the data by performing expert review, quality control/standardization, expert labeling with biomedical ontology codes, and creating multiple-choice VQA questions and captions for each image. When the original dataset license is CC-BY-4.0, we release our $\mu$ -Bench versions under a permissive CC-BY-SA-4.0 to foster a transparent and collaborative benchmark. For the subset with CC-BY-NC-SA-4.0, we respect the original license and release these data under CC-BY-NC-SA-4.0.

Bias and data diversity: We recognize that AI models, including VLMs, can perpetuate or exacerbate biases in the training data, and incorporating diverse data may mitigate these biases. Diversity is key to understanding biological processes and how they vary across biological sex, ethnicity, or other factors. When possible, we annotate cell lines with age, sex, ethnicity, and other metadata from Cellosaurus or the Cell Line Ontology. We consciously include diverse microscopy images from multiple institutions across various organisms, modalities, and biological states to ensure $\mu$ -Bench provides an accurate and fair performance assessment. However, images of human samples and brightfield/fluorescence microscopy are more common in the field and thus over-represented in $\mu$ -Bench .

Potential negative societal impacts: AI models trained on biomedical data have the potential for far-reaching impacts on society, both positive and negative. Potential negative impacts include biased performance across different demographic groups and reinforcing existing disparities in research and healthcare. We are committed to mitigating these risks by ensuring the dataset’s diversity and continuously reviewing the ethical implications of our work. Additionally, we will engage with the broader research community to identify and address any emerging ethical concerns.

We are committed to ongoing improvements of $\mu$ -Bench to prioritize diverse and representative microscopy data. We will review and update $\mu$ -Bench in response to evolving ethical standards and technological advancements.

B.2 Conflicts of Interest Disclosures

The authors have no conflicts of interest to disclose.

B.3 Funding/Support

This research was supported by NIH grants (NIH#P30AG066515 to JJN), the Chan Zuckerberg Initiative Neurodegeneration Challenge Pairs Pilot Project to SYL (2020-221724, 5022), the Wu Tsai Knight Initiative Innovation grant (#KIG 102) to SYL, Arc Institute to AL, Quad Fellowship to JB, and NSF Graduate Research Fellowship (#DGE-2146755) and Stanford Graduate Fellowship to AU. SYL is a Chan Zuckerberg Biohub – San Francisco Investigator.

B.4 Acknowledgments

We thank all the domain experts for testing the web app and submitting questions that were eventually used in $\mu$ -Bench cognition. We highlight the contributions from Pedro Guedes-Dias, Andrew Moore, and Julian Perez. We appreciate feedback from Josiah Aklilu and Orr Zohar on earlier manuscript versions.

Appendix C Benchmark Details

C.1 Instructions for downloading the benchmark

The benchmark can be downloaded via HuggingFace Datasets at the following url.

C.2 $\mu$ -Bench overview

$\mu$ -Bench is organized into four main categories:

1. $\mu$ -Bench Perception (Coarse-grained): Basic questions about the biomedical field of study (domain), subdomain, microscopy modality, submodality, and stain.
2. $\mu$ -Bench Perception (Fine-grained): Identification of, or questions regarding, a biological cell, cellular process, subcellular or tissue structures, biological state (normal/abnormal), etc.
3. $\mu$ -Bench Cognition (Reasoning): Expert-generated questions that typically require visual-based reasoning or integrating knowledge about the micrograph’s composition and subject to deduce an answer.
4. $\mu$ -Bench Object Detection (Localization): Bounding box detection of common biological objects, with an easy and hard data split.

C.3 $\mu$ -Bench statistics

Table 2 shows descriptive statistics for $\mu$ -Bench Perception, while Table 4 shows statistics for $\mu$ -Bench Cognition (Reasoning).

C.4 $\mu$ -Bench Perception

$\mu$ -Bench Perception contains 17,235 microscopy images collected from 25 unique datasets of 96 different cell and tissue types across light, fluorescence, and electron microscopy. If defined, only images from each dataset’s test set are selected; otherwise, a random 15 percent split is created. Each cell/tissue class is sub-sampled to a maximum of 200 samples per class (if a class has fewer samples than the maximum, the full set is used). Each cell/tissue type is densely annotated by an expert, and the annotation is then propagated to each instance of the same type, as shown in Figure 18.

Table 10 provides summary statistics of domain coverage. Overall, the benchmark covers 8,637 biology images and 8,678 pathology images across 12 subdomains. Similarly, Table 11 shows summary statistics of microscopy modalities covered by $\mu$ -Bench perception, including 10,864 images for light microscopy, 5,618 for fluorescence microscopy, and 833 images for electron microscopy across 8 microscopy imaging submodalities.

$\mu$ -Bench Perception (Coarse-grained): Hierarchical metadata for each of the 17,235 perception images and task-specific templates (shown in Table 17) are used to create 5 coarse-grained questions and captions regarding microscopy modality, submodality, domain, subdomain, and staining technique. The use of hierarchical metadata enables the generation of options within each hierarchical level. For example, for microscopy submodality, we leveraged microscopy modality metadata (shown in Figure 13) to randomly sample submodality options within a microscopy modality (e.g., differential interference contrast microscopy within light microscopy). A total of 86,175 (17,235 x 5) coarse-grained questions are generated using this approach, leveraging domain metadata (Figure 14) as well as staining metadata (Figure 15). Section H provides 10 random examples of coarse-grained data points (two per type).
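The option-sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' generation code: the `SUBMODALITIES` hierarchy, the question wording, and the function name `make_submodality_question` are all hypothetical stand-ins for the benchmark's actual metadata and templates.

```python
import random

# Hypothetical hierarchy (modality -> submodalities); values are illustrative only
# and do not reproduce the benchmark's actual metadata.
SUBMODALITIES = {
    "light microscopy": [
        "brightfield",
        "phase contrast",
        "differential interference contrast",
        "darkfield",
    ],
    "fluorescence microscopy": ["confocal", "epifluorescence"],
}


def make_submodality_question(modality, true_submodality, n_options=4, rng=None):
    """Build one multiple-choice question whose distractors share the image's modality."""
    rng = rng or random.Random(0)
    # Distractors are drawn only from siblings at the same hierarchical level.
    siblings = [s for s in SUBMODALITIES[modality] if s != true_submodality]
    distractors = rng.sample(siblings, min(n_options - 1, len(siblings)))
    options = distractors + [true_submodality]
    rng.shuffle(options)
    return {
        # Hypothetical wording; the real templates are given in the paper's tables.
        "question": "What is the microscopy submodality of this image?",
        "options": options,
        "answer_index": options.index(true_submodality),
    }


q = make_submodality_question("light microscopy", "differential interference contrast")
```

Sampling distractors from within the same parent modality keeps the options plausible, so a model cannot answer by recognizing the coarse modality alone.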

Table 2: $\mu$ -Bench Perception dataset statistics summary

Aspect	Count
Datasets	25
Domains	2
Subdomains	12
Modalities	3
Submodalities	8
Coarse-grained perception tasks	5
Fine-grained perception tasks	18
Classification tasks	13
Segmentation and Object Detection tasks	5

$\mu$ -Bench Perception (Fine-grained): Metadata for each of the 25 unique datasets and custom prompts (Table 2) are used to generate 13 unique fine-grained tasks, formulated as closed VQA-style questions (17,235 unique question-image pairs).

Perception dataset: source details: Table 3 lists the dataset names, licenses, and corresponding DOIs or URLs for each dataset used. The majority of these datasets were sourced from open data repositories, such as Zenodo, Dataverse, Dryad, BBBC, EMPIAR, Kaggle, and various project websites.

Perception dataset: forming questions for evaluation: Each sample has a question and a set of candidate answers. We describe how to evaluate this VQA task in Appendix D.

Table 3: Provenance of the datasets for the $\mu$ -Bench Perception dataset. The dataset name is provided along with the original license and URL to the dataset (or DOI where available).

Dataset	License	Link
Acevedo et al 2020 [49]	CC-BY-4.0	DOI
PCST-Contour [50]	CC-BY-4.0	DOI
PCST-Eccentricity [50]	CC-BY-4.0	DOI
PCST-Texture [50]	CC-BY-4.0	DOI
Colocalization benchmark [51]	CC-BY-NC-SA-3.0	DOI
EMPIAR SBF-SEM [52]	CC0	URL, URL, URL, URL, URL
BBBC048 (Brightfield) [53]	CC-BY-NC-SA-3.0	URL
BBBC048 (Darkfield) [53]	CC-BY-NC-SA-3.0	URL
BBBC048 (Epifluorescence) [53]	CC-BY-NC-SA-3.0	URL
CellCognition (Golgi) [54]	Attribution	URL
CellCognition (H2B) [54]	Attribution	URL
CellCognition (Mt) [54]	Attribution	URL
Pap Smear 2019 [55]	CC-BY-4.0	DOI
ICPR2020 Pollen [56]	Non-commercial	DOI
Jung et al 2022 [57]	CC-BY-NC-4.0	URL
Kather et al 2016 [58]	CC-BY-4.0	DOI
Kather et al 2018 [59]	CC-BY-4.0	DOI
Kather et al 2018 Val7K [59]	CC-BY-4.0	DOI
Nirschl et al 2018 [60]	CC-BY-4.0	DOI
OpenCell [61]	CC-BY-4.0	URL
GlaS Challenge 2015 [62]	Non-commercial	URL
Tang et al 2019 [63]	CC-BY-4.0	DOI
Wong et al 2022 [64]	CC-BY-4.0	DOI
Wu et al 2023 [65]	CC-BY-SA-4.0	DOI
Fluorescence Cells & Structures	CC-BY-4.0	New Data

C.5 $\mu$ -Bench Cognition

Cognition dataset: generation details

The cognition dataset was collected during alpha-testing of a web application chat interface. A group of 6 biology and pathology experts were invited to interact with a web application as they wished during their daily routines (the invitation is shown in Figure 20). There was no specific research question, intervention, or experimental vs. control group; this was a free service. User registration was voluntary and required reviewing and accepting the terms of service. The terms of service explained that this was not a research study and that the risk of harm was insignificant and similar to that of any VLM chatbot. The terms of service also stated that, by uploading images, users confirmed that they held the copyright or had permission, that the images did not contain offensive content, and that the images and text could be released with a CC-BY-SA-4.0 license.

Figure 7 shows a screenshot of the web application interface. A typical usage involved uploading a microscopy image, providing context about the image (experiment details), and asking a question. The web app processed the submission using GPT-4V and provided an answer in real time. Users were encouraged to provide feedback on the answer and a correct answer (if known). Users were encouraged to ask questions that required complex visual reasoning, advanced biomedical knowledge, or could be considered challenging for humans. Random samples of cognition questions are shown in Section J.

Table 4: $\mu$ -Bench Cognition dataset statistics summary

Aspect	Count
Submitters	6
Institutions	5
Domains	2
Subdomains	10
Modalities	3
Submodalities	12

C.6 $\mu$ -Bench Object detection

For datasets with segmentation annotations, we copy the segmentation annotations and also convert them to object detection annotations, giving 3641 images [50, 54, 61, 65, 62].
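As an illustration of the conversion from segmentation to object detection annotations, the sketch below derives one bounding box per instance from an instance-label mask. The mask format (a 2D grid of integer instance ids with 0 as background), the inclusive `(x_min, y_min, x_max, y_max)` box convention, and the function name `masks_to_boxes` are assumptions made for this example; the paper does not specify them.

```python
def masks_to_boxes(label_mask):
    """Return {instance_id: (x_min, y_min, x_max, y_max)} for a 2D instance-label mask.

    Assumes 0 marks background and every other integer marks one object instance.
    Coordinates are inclusive pixel indices.
    """
    boxes = {}
    for y, row in enumerate(label_mask):
        for x, label in enumerate(row):
            if label == 0:
                continue  # skip background pixels
            if label not in boxes:
                boxes[label] = [x, y, x, y]
            else:
                box = boxes[label]
                box[0] = min(box[0], x)
                box[1] = min(box[1], y)
                box[2] = max(box[2], x)
                box[3] = max(box[3], y)
    return {label: tuple(box) for label, box in boxes.items()}


# Toy mask with two instances (ids 1 and 2).
mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]
boxes = masks_to_boxes(mask)
```

Taking the tightest axis-aligned box around each instance is the simplest choice; datasets that store semantic masks rather than instance masks would first need connected-component labeling.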

We define an easy and a hard split. The easy split contains the ‘cell’ class from the Burgess et al. [50] dataset and the ‘nucleus’ class from Held et al. [54]. Here the image contains only the target object, meaning that simple foreground/background segmentation would work well. The hard split contains the ‘nucleus’ class from Burgess et al. [50], the ‘nucleus’ class from OpenCell [61], the ‘mitochondria’ class from Wu et al. [65], and the ‘gland’ class from Sirinukunwattana et al. [62]. Here, the target object must be separated from surrounding visual information, which makes the task more challenging.

C.7 Comparison with a supervised linear model trained on DINOv2 Features

Although some datasets have been used in the biomedical computer vision community [58, 59], others do not have well-established performance baselines. Thus, we evaluate a supervised linear model’s baseline classification performance for the fine-grained and coarse-grained perception tasks. This establishes that the questions in our benchmark are solvable using image features.

The images in $\mu$ -Bench represent the test subset of the original data. The original data were split into train, validation, and test subsets according to the published splits. If no published splits were available, which was the case for many previously unused datasets, we created train/val/test splits (0.7/0.1/0.2), stratifying samples by the class label. We use DINOv2 S14 features (384 dims) because they are robust and yield strong linear models on many natural image classification datasets. We extract DINOv2 S14 features from the train/val subsets and train a logistic regression classifier to determine the performance of a linear classifier for these tasks. We apply PCA dimensionality reduction (retaining 95% of the variance) and whitening to the features, with Optuna for hyperparameter tuning. Baseline performance was determined using the best dummy classifier among several strategies (e.g., random sampling from the class prior probabilities or always predicting the most common class).
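A stratified 0.7/0.1/0.2 split of the kind described above can be sketched in plain Python. This is a hedged illustration only: the function name `stratified_split`, the rounding rule, and the fixed seed are assumptions, and the authors' actual splitting code may differ.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split sample indices into train/val/test while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Assumed rounding rule: round train and val sizes, give the remainder to test.
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


# Toy labels: 10 samples of class "a" and 20 of class "b".
labels = ["a"] * 10 + ["b"] * 20
train, val, test = stratified_split(labels)
```

Splitting each class separately guarantees that rare classes appear in every subset whenever they have enough samples, which matters for the class-imbalanced datasets common in microscopy.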

Figure 6 illustrates that most datasets provide sufficient signal to train a robust linear classifier on DINOv2 feature representations, which are learned primarily from natural images. The weighted average accuracy across all classification tasks is 86.71%. The lowest performance was observed in the classification of mitosis stages in darkfield microscopy (50.96%). In contrast, the highest performance was seen in the classification of synthetic white blood cells in brightfield microscopy and of organisms and structures in electron microscopy, each with a weighted average accuracy of 100%. Lastly, 58.33% of the tasks achieved a balanced accuracy greater than 80%. Together, these results indicate the lower and upper range of performance we could expect from a VLM.

Appendix D Evaluation Details

D.1 VQA evaluation details

Three out of four tasks are formulated as multiple-choice visual question answering: coarse-grained perception, fine-grained perception, and cognition. We describe their evaluation here, while the evaluation of the fourth task, object detection, is described separately in section D.2. For each sample, $i$, we have an image, $\textbf{x}_{i}$, a question string, $q$, and a set of $k$ candidate answer strings $\{a_{ij}\}_{j=1}^{k}$.

For the autoregressive models, $f_{A}$, we first generate a query string, $t$, from the question and candidate answer strings, $t(q,\{a_{ij}\}_{j=1}^{k})$, using the template in Figure 19. The prompt text instructs the model to start its response with the letter of one of the multiple-choice answers (‘A’, ‘B’, $\dots$). We pass the prompt string and image to the model and decode the output, $y=f_{A}(\textbf{x}_{i},t)$, where $y$ is the output string. We use two strategies for matching the response string, $y$, to the candidate answers $\{a_{ij}\}_{j=1}^{k}$. First, for each $j$, we check whether the answer string is contained in the output, $a_{ij}\in y$, using lowercase string matching. This is important because models do not always follow the instruction to output the multiple-choice letter and instead return the answer text. If there are no matches, we extract the first character of $y$. If it is one of (‘A’, ‘B’, $\dots$), as instructed in the prompt, it is assigned to the corresponding answer, $a_{ij}$. Otherwise, we mark the answer as incorrect.
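The two matching strategies can be sketched as follows. This is a minimal illustration and not the released evaluation code: the function name `match_response` is hypothetical, and the handling of responses that contain more than one candidate answer (falling back to the leading letter) is an assumption, since the text does not specify that case.

```python
import string


def match_response(response, candidates):
    """Map a model response to the index of a candidate answer, or None."""
    text = response.strip().lower()

    # Strategy 1: lowercase substring match of each candidate answer.
    hits = [j for j, answer in enumerate(candidates) if answer.lower() in text]
    if len(hits) == 1:
        return hits[0]

    # Strategy 2: interpret the first character as a multiple-choice letter.
    # Assumption: also used when several candidates match (ambiguous case).
    if text:
        letter_index = string.ascii_uppercase.find(text[0].upper())
        if 0 <= letter_index < len(candidates):
            return letter_index

    # No usable answer: the sample is scored as incorrect.
    return None


candidates = ["confocal microscopy", "brightfield microscopy"]
by_letter = match_response("B", candidates)
by_text = match_response("The image shows confocal microscopy.", candidates)
unmatched = match_response("I am not sure", candidates)
```

A sample is counted as correct only when the returned index equals the index of the ground-truth answer.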

The contrastive models have a vision encoder $E_{x}$ and a text caption encoder $E_{c}$. We first compute the image embedding, $z_{x_{i}}=E_{x}(\textbf{x}_{i})$. Then, for each candidate answer, $a_{ij}$, we form a caption, $c_{ij}=c(a_{ij})$, using a text template that is suitable for CLIP-like models. The templates are in tables 17 and 18 for the fine-grained perception task, and in table 16 for the coarse-grained perception task. For the cognition task, each caption is the concatenation of the question string, $q$, and the candidate answer string, $a_{ij}$. We compute the embedding of each caption, $z_{c_{ij}}=E_{c}(c_{ij})$ for $j\in[1,k]$, and then the cosine similarity score for each caption, $s_{ij}=z_{c_{ij}}\cdot z_{x_{i}}$ for $j\in[1,k]$ (the dot product equals the cosine similarity when the embeddings are L2-normalized). The $j$ with the largest $s_{ij}$ is the final prediction: if $\arg\max_{j}s_{ij}$ is the index of the correct answer, the question is marked as correct, and as incorrect otherwise.
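A minimal sketch of this zero-shot prediction step is shown below. The caption template is the modality template from the coarse-grained perception table; the three-dimensional embeddings are toy values standing in for the outputs of $E_{x}$ and $E_{c}$, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predict(image_embedding, caption_embeddings):
    """Return the index of the caption most similar to the image, and all scores."""
    scores = [cosine_similarity(c, image_embedding) for c in caption_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


template = "A microscopy image obtained through {modality}."
candidates = ["light microscopy", "fluorescence microscopy", "electron microscopy"]
captions = [template.format(modality=answer) for answer in candidates]

# Toy embeddings (assumed values); a real model would encode image and captions.
z_x = [0.1, 0.9, 0.2]
z_c = [[0.8, 0.1, 0.1], [0.2, 0.9, 0.1], [0.1, 0.2, 0.9]]

prediction, scores = predict(z_x, z_c)
is_correct = prediction == candidates.index("fluorescence microscopy")
```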

Code for evaluation is made public through our repository: eVLLM.

D.2 Object detection evaluation details

We evaluate the two models that support localization, QwenVL [21] and PaliGemma [43], and follow their user guides for prompting. For QwenVL the prompt is ‘‘Detect {class_name}’’. For PaliGemma the prompt is ‘‘Detect {class_name}; {class_name}’’, where the repeated class name indicates that multiple instances may be predicted. Our early experiments found that PaliGemma would sometimes fail to localize any instances with detection prompting, but would localize them with segmentation prompting. Therefore, if PaliGemma returns zero instances, we prompt for segmentation with ‘‘Segment {class_name}; {class_name}’’ and extract the bounding box from the predicted segmentation. The Burgess et al. dataset has two classes, so we prompt the model for one class at a time. Both models output detection predictions as a string with a standardized structure, which we parse using regular expressions.

We use the GRIT localization metric [47] because it is well-motivated and has previously been used in VLM evaluations by QwenVL [21]. The score is:

$$\sum_{i=1}^{M}\frac{IoU_{i}}{P+G_{missed}}\tag{1}$$

There are $M$ ground truth boxes and $P$ predicted boxes, which are matched using the Hungarian algorithm on the IoU metric. $G_{missed}$ is the number of predicted boxes not matched to a ground truth box. Intuitively, this metric measures the average IoU for matched boxes, while penalizing too many predictions through $G_{missed}$ (similar to the precision metric). Note that we cannot use the more typical mAP score from object detection because it depends on a threshold for controlling the false-positive rate, which these VLMs do not support.
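The IoU term and the one-to-one matching used in the score above can be sketched as follows. Two points are assumptions made for this illustration: boxes are given as (x_min, y_min, x_max, y_max), and an exhaustive search over permutations stands in for the Hungarian algorithm, which gives the same optimal matching but is only practical for a handful of boxes.

```python
from itertools import permutations


def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matched_ious(gt_boxes, pred_boxes):
    """IoUs of the one-to-one matching that maximizes total IoU (brute force)."""
    small, large = sorted((gt_boxes, pred_boxes), key=len)
    best = None
    for perm in permutations(range(len(large)), len(small)):
        ious = [iou(small[i], large[j]) for i, j in enumerate(perm)]
        if best is None or sum(ious) > sum(best):
            best = ious
    return best or []


gt = [(0, 0, 2, 2), (10, 10, 12, 12)]
pred = [(10, 10, 12, 12), (0, 0, 2, 2), (50, 50, 51, 51)]
total_iou = sum(matched_ious(gt, pred))
```

The summed IoU of the matched pairs is the numerator of the score; the denominator then penalizes unmatched boxes as described above.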

Appendix E Additional Benchmarking Results

E.1 Object Detection Results

Table 5 summarizes object detection results for the datasets that have object localization annotations.

Table 5: Object detection results for all datasets with detection annotations and for all models that support object detection.

Dataset	Class	PaliGemma	QwenVLM
Easy
PCST-Contour	cell	76.5	82.1
PCST-Eccentricity	cell	76.6	84.7
PCST-Texture	cell	78.0	85.9
CellCognition (Golgi)	nucleus	72.4	46.7
CellCognition (H2B)	nucleus	80.4	68.1
CellCognition (Mt)	nucleus	72.6	51.6
Hard
PCST-Contour	nucleus	31.7	6.6
PCST-Eccentricity	nucleus	30.9	6.2
PCST-Texture	nucleus	30.2	6.4
OpenCell	nucleus	0.0	0.2
Wu et al 2023	mitochondria	22.8	30.0
GlaS Challenge	gland	2.7	5.8

Overall, the localization scores are very poor, which is expected since both models are generalist and object localization has received relatively less attention in autoregressive VLMs. Looking at the splits:

• Easy split. Although both models have higher scores for the ‘cell’ class in Burgess et al., they still fall below 80, and the task is extremely easy. The story is similar for ‘nucleus’ in Held et al., but for QwenVLM the scores are even lower.

• Hard split. All models perform poorly on the hard split. For ‘nucleus’ in Burgess et al., PaliGemma scores around 30; however, qualitative inspection shows that in most cases the predicted bounding box covers the entire cell, which encapsulates the nucleus. Similarly, in Wu et al., the mitochondria class scores more than 20 for both models, but qualitative inspection shows that the prediction is usually a box around the entire image. We find the same pattern for ‘gland’ detection in Sirinukunwattana et al.

E.2 Weight ensembling details

In the results section 5.2, we consider PLIP, which was fine-tuned from OpenCLIP, and QuiltNet, which was fine-tuned from CLIP, both using pathology data. Since we have benchmark results for all these models, we can assess the impact of fine-tuning on pathology data (moving from OpenCLIP/CLIP to PLIP/QuiltNet). We showed that pathology fine-tuning can improve performance on the pathology subset of $\mu$-Bench (fine-grained perception), where we filter for all samples from the histopathology or H&E imaging modality. However, the fine-tuned models have worse overall performance on $\mu$-Bench (which includes the pathology subset).

We proposed to create ‘merged models’ by combining the base model (OpenCLIP or CLIP) with its fine-tuned model (PLIP or QuiltNet) through weight merging. Specifically, following [48], for a base model with weights $\theta_{B}$ and a tuned model with weights $\theta_{T}$ (which have the same architecture), the merged model weights $\theta_{M}$ are:

$$\theta_{M}=\alpha\cdot\theta_{B}+(1-\alpha)\cdot\theta_{T}$$

That is, we are linearly interpolating each model weight independently, with a single fixed constant $\alpha$ . We arbitrarily set $\alpha=0.5$ in our experiments, but tuning that constant could lead to better overall results.
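The interpolation can be sketched on toy parameter dictionaries as follows. The function name `merge_weights` and the flat-list representation of each parameter tensor are assumptions for this illustration; real checkpoints would store tensors under the same parameter names in both models.

```python
def merge_weights(base, tuned, alpha=0.5):
    """Linearly interpolate two parameter dictionaries with identical keys."""
    if base.keys() != tuned.keys():
        raise ValueError("base and tuned models must share the same architecture")
    return {
        # theta_M = alpha * theta_B + (1 - alpha) * theta_T, applied element-wise
        name: [alpha * b + (1 - alpha) * t for b, t in zip(base[name], tuned[name])]
        for name in base
    }


# Toy parameters (hypothetical names and values).
theta_b = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
theta_t = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}
theta_m = merge_weights(theta_b, theta_t, alpha=0.5)
```

Setting `alpha=1.0` recovers the base model and `alpha=0.0` recovers the fine-tuned model, so intermediate values trade off general and domain-specific behavior.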

We now show more comprehensive results in table 6, which extends the main results table with our merged models, M-PLIP and M-QuiltNet; table 7 is the same but with the pathology-only split. In all cases (M-PLIP and M-QuiltNet), the merged models outperform the tuned models (PLIP and QuiltNet), while also outperforming the base models (OpenCLIP and CLIP) in almost all cases. For fine-grained perception (the task that is most in distribution for PLIP and QuiltNet training), the merged models are among the strongest-performing models overall, and on pathology-specific fine-grained perception, the merged models outperform BiomedCLIP.

Table 6: Macro-average accuracy (with bootstrap confidence interval) for coarse-grained and fine-grained perception and cognition (reasoning) in $\mu$-Bench. Robust models (byproduct of merging fine-tuned models with their respective base models) are also included.

Perception (Coarse-Grained)		Perception (Fine-Grained)		Cognition (Reasoning)	
Model	Accuracy ($\pm$ CI)	Model	Accuracy ($\pm$ CI)	Model	Accuracy ($\pm$ CI)
GPT-4o	62.68 ($\pm$ 0.35)	GPT-4o	51.73 ($\pm$ 0.82)	GPT-4o	62.00 ($\pm$ 9.00)
CogVLM	52.05 ($\pm$ 0.35)	BiomedCLIP	34.65 ($\pm$ 0.75)	QwenVLM	41.00 ($\pm$ 10.00)
QwenVLM	49.85 ($\pm$ 0.35)	CONCH	33.64 ($\pm$ 0.72)	CogVLM	41.00 ($\pm$ 10.00)
BiomedCLIP	47.57 ($\pm$ 0.34)	M-PLIP*	32.99 ($\pm$ 0.73)	OpenCLIP	38.33 ($\pm$ 8.33)
M-PLIP*	43.25 ($\pm$ 0.34)	M-QuiltNet*	32.42 ($\pm$ 0.71)	M-PLIP*	34.17 ($\pm$ 8.33)
ALIGN	40.7 ($\pm$ 0.34)	ALIGN	31.9 ($\pm$ 0.72)	ALIGN	31.00 ($\pm$ 9.00)
OpenCLIP	36.34 ($\pm$ 0.33)	CLIP	30.09 ($\pm$ 0.71)	CLIP	28.00 ($\pm$ 9.00)
PaliGemma	36.29 ($\pm$ 0.33)	OpenCLIP	29.36 ($\pm$ 0.69)	M-QuiltNet*	25.83 ($\pm$ 7.52)
CLIP	35.41 ($\pm$ 0.34)	CogVLM	28.18 ($\pm$ 0.70)	BiomedCLIP	25.00 ($\pm$ 8.00)
M-QuiltNet*	31.26 ($\pm$ 0.32)	QuiltNet	27.85 ($\pm$ 0.69)	PaliGemma	25.00 ($\pm$ 8.00)
PLIP	31.11 ($\pm$ 0.32)	QwenVLM	27.81 ($\pm$ 0.70)	CONCH	18.00 ($\pm$ 7.00)
CONCH	27.84 ($\pm$ 0.31)	PLIP	25.49 ($\pm$ 0.68)	Random	17.00 ($\pm$ 7.00)
QuiltNet	26.58 ($\pm$ 0.31)	PaliGemma	21.29 ($\pm$ 0.64)	PLIP	17.00 ($\pm$ 7.00)
Random	18.34 ($\pm$ 0.27)	Random	19.13 ($\pm$ 0.60)	QuiltNet	13.00 ($\pm$ 6.00)

Model categories: general autoregressive VLMs, general contrastive VLMs, pathology contrastive VLMs, and biomedical contrastive VLMs.

E.3 Model Performance on Pathology Specific Tasks

While prior evaluations show that general contrastive VLMs have some biology and pathology knowledge (a finding also reported in [25] and [44]), most specialist models analyzed in this work were fine-tuned on pathology data. We therefore analyzed performance on pathology-only tasks and found similar rankings.

Table 7: Macro-average accuracy (with bootstrap confidence interval) for coarse-grained and fine-grained perception (pathology-only tasks) and cognition (reasoning) in $\mu$-Bench. Robust models (byproduct of merging fine-tuned models with their respective base models) are also included.

Perception (Coarse-Grained)		Perception (Fine-Grained)		Cognition (Reasoning)	
Model	Accuracy ($\pm$ CI)	Model	Accuracy ($\pm$ CI)	Model	Accuracy ($\pm$ CI)
GPT-4o	71.29 ($\pm$ 0.45)	GPT-4o	61.88 ($\pm$ 1.13)	GPT-4o	62.00 ($\pm$ 9.00)
QwenVLM	67.89 ($\pm$ 0.47)	CONCH	42.44 ($\pm$ 1.13)	QwenVLM	41.00 ($\pm$ 10.00)
CogVLM	59 ($\pm$ 0.51)	M-PLIP*	39 ($\pm$ 1.12)	CogVLM	41.00 ($\pm$ 10.00)
ALIGN	46.32 ($\pm$ 0.50)	M-QuiltNet*	36.95 ($\pm$ 1.22)	OpenCLIP	38.33 ($\pm$ 8.33)
BiomedCLIP	45.5 ($\pm$ 0.51)	BiomedCLIP	35.29 ($\pm$ 1.09)	M-PLIP*	34.17 ($\pm$ 8.33)
PaliGemma	43.34 ($\pm$ 0.51)	QuiltNet	33.28 ($\pm$ 1.08)	ALIGN	31.00 ($\pm$ 9.00)
M-PLIP*	42.27 ($\pm$ 0.51)	OpenCLIP	32.35 ($\pm$ 1.09)	CLIP	28.00 ($\pm$ 9.00)
M-QuiltNet*	37.68 ($\pm$ 0.50)	PLIP	32.02 ($\pm$ 1.06)	M-QuiltNet*	25.83 ($\pm$ 7.52)
OpenCLIP	37.59 ($\pm$ 0.50)	CLIP	28.63 ($\pm$ 1.01)	BiomedCLIP	25.00 ($\pm$ 8.00)
CLIP	31.92 ($\pm$ 0.47)	ALIGN	27.51 ($\pm$ 0.99)	PaliGemma	25.00 ($\pm$ 8.00)
CONCH	31.11 ($\pm$ 0.32)	QwenVLM	24.62 ($\pm$ 0.97)	CONCH	18.00 ($\pm$ 7.00)
QuiltNet	29.18 ($\pm$ 0.45)	CogVLM	23.8 ($\pm$ 0.98)	Random	17.00 ($\pm$ 7.00)
PLIP	22.72 ($\pm$ 0.42)	PaliGemma	22.77 ($\pm$ 0.97)	PLIP	17.00 ($\pm$ 7.00)
Random	18.15 ($\pm$ 0.39)	Random	17.74 ($\pm$ 0.86)	QuiltNet	13.00 ($\pm$ 6.00)

Model categories: general autoregressive VLMs, general contrastive VLMs, pathology contrastive VLMs, and biomedical contrastive VLMs.

Appendix F Additional Benchmarking details

F.1 Model Details

Table 9 provides a breakdown of model parameters and training data (with dataset size); for specialist models, the base model is also listed.

F.2 Computing confidence intervals

Error bars represent 95% confidence intervals (CI) computed via nonparametric bootstrap using the SciPy stats.bootstrap function with 1000 resamplings and default settings. No data were excluded from the analyses.
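To make the resampling procedure concrete, the sketch below computes a plain percentile bootstrap interval for accuracy using only the standard library. It is an illustration under stated assumptions, not the SciPy routine itself: the interval method SciPy applies by default may differ from the percentile method used here, and the function name, seed, and toy data are hypothetical.

```python
import random


def bootstrap_ci(correct, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-sample 0/1 correctness."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample with replacement and record the accuracy of each resample.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


# Toy example: 62 correct answers out of 100 questions.
correct = [1] * 62 + [0] * 38
accuracy = sum(correct) / len(correct)
low, high = bootstrap_ci(correct)
```

With few samples, as in the cognition task, the interval is wide, which is consistent with the larger CIs reported for that task.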

F.3 Zero-shot results broken down by task

Figure 9 presents a breakdown of perception coarse-grained results by task. It reveals that autoregressive generalist models perform well across all tasks, as indicated by the overall averages. Notably, while GPT-4o dominates most tasks, PaliGemma excels in domain identification, achieving the best performance in that specific area.

Figure 10 shows a breakdown of biology-specific perception fine-grained results by task. While GPT-4o dominates in some tasks, other models excel in specific tasks. For instance, ALIGN outperforms all models in molecular colocalization, while BiomedCLIP has the best performance in mitochondrial morphology classification.

Figure 11 shows a breakdown of pathology-specific perception fine-grained results by task. This breakdown reveals that while GPT-4o has the best performance in three tasks, specialist models still outperform it in amyloid morphology and Pap smear grading.

Appendix G Computing resources

One benefit of $\mu$-Bench is that it is amenable to use by academic labs of all sizes, even those with limited resources. All evaluation tasks could be run on one NVIDIA A6000 (48GB VRAM), except for QwenVL, where we used one A100 (80GB VRAM). Inference required approximately 3.5 for all of $\mu$-Bench running one dataset at a time (requiring one GPU at a time). The computations were performed in an on-premises university compute environment with 1024 AMD EPYC 9334 CPU cores (2.70GHz).

Table 8: Comparison of $\mu$-Bench to existing composite medical and biomedical datasets. Only publicly available datasets were considered. 5+*: Mostly Radiology, Pathology, Microscopy, Signals and Generic biomedical illustrations.

Dataset	Domains	Source	Images	QA Pairs	Creation	VQA Answer Type	VQA / SEG / CAP / CLS / OD
Radiology
VQA-Med-2018 [66]	1	PubMed Central®	2,866	6,413	Automated	Open/Close	✓
VQA-Med-2019 [67]	1	MedPix®	4,200	15,292	Automated	Open/Close	✓
VQA-Med-2020 [68]	1	MedPix®	5,000	5,000	Automated	Open/Close	✓
VQA-Med-2021 [69]	1	MedPix®	5000	5000	Automated	Open/Close	✓
VQA-RAD [70]	1	MedPix®	315	3,515	Manual	Open/Close	✓
RadVisDial (S) [71]	1	MIMIC-CXR	91,060	455,300	Automated	Close	✓
RadVisDial (G) [70]	1	MIMIC-CXR	100	500	Manual	Close	✓
OVQA [72]	1	EMRs	2,001	19,020	Automated	Open/Close	✓
SLAKE [73]	1	Decathlon, NIH Chest X-ray, CHAOS	642	15,00	Manual	Open/Close	✓ ✓
Pathology
PathVQA [24]	1	PEIR Digital Library	4,998	32,799	Automated	Open	✓
OpenPath [38]	1	Twitter (Now X)	208,414	208,414	Automated		✓
Biomedical
PMC-VQA [74]	5+*	PubMed Central®	149,000	227,000	Automated	Close	✓
Biology
Multimodality Cell Segmentation Challenge [22]	1	20 Laboratories	1,500	-			✓
$\mu$-Bench (ours)	3*	Curated Datasets	17,356		Expert guided	Open/Close	✓ ✓ ✓ ✓ ✓

Table 9: VLM Model Breakdown

Model	Total Params	Base VLM	Vision Encoder	VE Params	Text Encoder	TE Params	Training Data	Size
Contrastive VLMs
CLIP	151.2M	-	ViT-B/32	86M			DataComp-1B	13B
ALIGN [32]	172.1M	-	EfficientNet	62.1M	BERTbase	109.4M	Internet	1.8B
CoCa [39]	383M	-	ViT-B/16	86M		297M	Internet	4.8B
OpenCLIP	223.7M	-	ViT-B/32	86M			DataComp-1B	13B
BLIP [75]	223.7M	-	ViT-B/16	86M	BERTbase	137.2M	Internet	14M
PLIP [38]	151.2M	CLIP*	ViT-B/32	86M			OpenPath (X)	208.4K
QuiltNet [44]	151.2M	CLIP*	ViT-B/16	86M	GPT2 (77CL)		Quilt	1M
BiomedCLIP	195.M	OpenCLIP	ViT-B/16	86M	BioMedBERT [76]	110M	PMC-15M	15M
CONCH [45]	395.2M	CoCa	ViT-B/16	86M				1.17M
Auto-regressive VLM
CogVLM [42]	17.6B	-	EVA2-CLIP-E		Vicuna-1.5-7B	7B	Multiple	1.5B
Qwen-VL [21]	9.6B	-	ViT-bigG	1.9B	QwenLM	7.7B	Multiple	1.4B

Table 10: $\mu$-Bench Perception composition by imaging domain and subdomain

Domains	Images
Biology	8,637
Pathology	8,678
Subdomains	Images
Cell Biology	4979
Gastrointestinal and Liver Pathology	4194
Hemato-pathology	2600
Molecular Biology	2229
Neuro-pathology	1291
Botany	726
Cardiovascular Pathology	400
Neurobiology	256
Gynecologic Cytology	193
Parasitology	192
Developmental Biology	159
Biophysics	96

Table 11: $\mu$-Bench Perception composition by imaging modality and submodality

Modalities	Images
Light Microscopy	10864
Fluorescence Microscopy	5618
Electron Microscopy	833
Submodalities	Images
Brightfield Microscopy	10121
Epifluorescence Microscopy	2217
Synthetic	1800
Confocal Microscopy	1201
Darkfield Microscopy	743
Serial Blockface Scanning Electron Microscopy	577
Total Internal Reflection Fluorescence Microscopy	400
Cryo-electron Tomography	256

Table 12: $\mu$-Bench Perception Fine-grained tasks

Classification Tasks	Images
Cell cycle phase	3169
Normal and abnormal tissues in colorectal adenocarcinoma or normal gastrointestinal tissue	3114
White blood cell	2600
Cell phenotypes in synthetic images	1800
Amyloid beta pathology	1291
Subcellular structures	1105
Texture in colorectal cancer	1000
Organisms and structures in fluorescence microscopy images	934
Organisms and structures in electron microscopy images	833
Normal and abnormal pollen grains	700
Heart failure using cardiac histopathology images	400
Pre-cancerous and cervical cancer lesions in liquid-based cytology Pap smear images	193
Colocalization patterns	96
Segmentation and Object Detection Tasks	Images
Segmentation of white blood cells	1600
Segmentation of subcellular structures	1105
Cell segmentation in synthetic images	600
Mitochondria segmentation in CryoET images	256
Gland segmentation in benign and malignant colon histology images	80

Table 13: $\mu$-Bench Perception composition by imaging stain

Stains	Images
H&E	4594
Giemsa	2600
Synthetic	1916
No Stain	1742
DAPI	1105
H2B-mCherry	783
Propidium Iodide	743
Basic Fuchsin	700
IHC(DAB)	610
Uranyl Acetate	577
IHC(HDab)	491
AlexaFluor-tubulin	400
Papanicolaou	193
IHC(Red)	190
Hoechst 33342	183
GalT–EGFP	157
CellMask	125
Tetraspeck Beads	51
Lysotracker	46
Fluorescent Beads	26
Phal	25
GalT-GFP	24
Soluble GFP	21
Uniform Test Slide	11
LifeAct	2

Table 14: $\mu$-Bench Cognition composition by imaging domain and subdomain

Domains	Question
Biology	113
Pathology	8
Subdomains	Question
Cell Biology	45
Cell and molecular biology	28
Neurobiology	17
Developmental biology	8
Immunology	6
Neuropathology	6
Gastrointestinal pathology	2
Virology	2
Botany	2
Genetics	2

Table 15: $\mu$-Bench Cognition composition by imaging modality and submodality

Modalities	Questions
Fluorescence Microscopy	76
Light Microscopy	26
Electron Microscopy	17
Submodalities	Questions
Epifluorescence microscopy	37
Confocal microscopy	36
Brightfield microscopy	21
Cryo-electron tomography	8
Scanning electron microscopy	4
Transmission electron microscopy	4
Differential interference contrast microscopy	4
Mixed	3
Transmission electron microscopy (TEM)	1
Lattice light sheet	1
Synthetic	1
Lattice light-sheet microscopy	1

Table 16: $\mu$-Bench Perception Coarse-Grained Question and Caption templates

Type	Question	Caption
Modality	What is the most likely microscopy modality used to acquire this image?	A microscopy image obtained through {modality}.
Submodality	What is the most likely microscopy submodality used to acquire this image?	A microscopy image obtained through {submodality}.
Domain	What is the most likely field of study this micrograph would be used for?	A microscopy image frequently studied in {domain}.
Subdomain	What is the most likely subfield of study this micrograph would be used for?	A microscopy image frequently studied in {subdomain}.
Stain	What is the most likely technique used to stain this micrograph?	A microscopy image stained with {stain}.

Table 17: $\mu$-Bench Perception Fine-grained: Question and caption templates for biology tasks

Dataset	Question	Caption
PCST-Contour	A synthetic fluorescence micrograph is displayed. What is the most likely description for the cells contour irregularities?	A synthetic fluorescence microscopy image of a cell with {class} contours.
PCST-Eccentricity	A synthetic fluorescence micrograph displayed. What is the most likely description for the cells eccentricity phenotype?	A synthetic fluorescence microscopy image of a cell with {class} eccentricity.
PCST-Texture	A synthetic fluorescence micrograph displayed. What is the most likely description for the cells cytoplasm texture?	A synthetic fluorescence microscopy image of the cell cytoplasm with {class} texture.
Colocalization benchmark	A synthetic confocal fluorescence micrograph of small points in two different channels with different levels of colocalization. Given the image provided, what is the most accurate description for the colocalization patterns?	A synthetic confocal fluorescence microscopy image of small points in two different channels with different levels of colocalization. The image displays {class} colocalization patterns.
EMPIAR SBF-SEM	An electron micrograph is shown. Based on the image, what is the most likely structure on the field of view?	A Serial blockface scanning electron microscopy image shows {class}
BBBC048 (Brightfield)	A brightfield micrograph of jurkat cells acquired using flow cytometry (single cell). Based on the micrograph, what is the most likely cell phase?	Brightfield microscopy imaging flow cytometry is used to visualize single Jurkat cells at different cell cycle phases. The image displays a cell in {class} stage of the cell cycle.
BBBC048 (Darkfield)	A darkfield micrograph of jurkat cells acquired using flow cytometry (single cell). Based on the micrograph, what is the most likely cell phase?	A darkfield microscopy imaging flow cytometry is used to visualize single jurkat cells at different cell cycle phases. The image displays a cell in {class}.
BBBC048 (Epifluorescence)	A propidium iodide stained fluorescence micrograph of Jurkat cells acquired using flow cytometry (single cell). Based on the micrograph, what is the most likely cell phase?	Epifluorescence microscopy imaging (flow cytometry) shows single Jurkat cells stained with propidium iodide at different cell cycle phases. The image displays a cell in {class}.
CellCognition (Golgi)	A fluorescence micrograph of Hela Kyoto cells stably expressing GalT-eGFP to label the Golgi apparatus. Based on the image what is the most likely the Golgi apparatus morphology?	Fluorescence microscopy image of human Hela Kyoto cells stably expressing galactosyltransferase (GalT-eGFP) to label the Golgi apparatus showing {class} morphology.
CellCognition (H2B)	A fluorescence microscopy image of human Hela Kyoto cells with stable chromatin marker expression. Based on the image, what is the most likely cell cycle stage?	Fluorescence microscopy image of human Hela Kyoto cells with stable chromatin marker expression. The micrograph displays a cell in {class} stage of the cell cycle.
CellCognition (Mt)	A fluorescence micrograph of Hela Kyoto cells stably expressing eGFP-labeled tubulin to label microtubules is shown. Based on the micrograph what is the most likely microtubule morphology?	Fluorescence microscopy image of human Hela Kyoto cells with stable chromatin marker expression (eGFP) displays microtubules showing {class} morphology.
ICPR2020 Pollen	Basic fuchsin stained light micrograph of pollen grains. Based on the image, what is the most likely pollen class?	A brightfield microscopy image of pollen grains shows {class} structures.
Wu et al 2023	A cryo-electron tomography of mitochondria in neurons cultured in vitro is shown. Based on the image what is the most likely mitochondrial morphology?	A cryo-electron tomography of mitochondria in neurons cultured in vitro shows {class}.
Fluorescence Cells & Structures	A fluorescence micrograph is shown. Based on the image, what is the most likely structure?	A photomicrograph shows a fluorescence microscopy {class}.

Table 18: $\mu$-Bench Perception Fine-grained: Question and caption templates for pathology tasks

Dataset	Question	Caption
Acevedo et al 2020	A Giemsa-stained light micrograph displaying human peripheral blood cells. As a blood cell recognition system, identify the correct cell type:	A brightfield microscopy image of a peripheral blood smear stained with giemsa displaying {article} {class}.
Jung et al 2022	Synthetically generated Giemsa-stained light micrograph of human peripheral blood cell. As a blood cell recognition system, identify the correct cell type:	A synthetic microscopy image of a peripheral blood smear stained with giemsa, displays {article} {class}.
Kather et al 2016	H&E stained light micrograph of human colorectal tissue. Based on the image, what is the most likely texture class?	H&E stained light microscopy image of human colorectal tissue with {class}.
Kather et al 2018	H&E stained light micrograph of human colorectal tissue. Based on the image, what is the most likely texture class?	H&E stained light microscopy image of human colorectal tissue with {class}.
Kather et al 2018 Val7K	H&E stained light micrograph of human colorectal tissue. Based on the image, what is the most likely texture class?	H&E stained light microscopy image of human colorectal tissue with {class}.
Nirschl et al 2018	H&E stained light micrograph of human cardiac tissue. Based on the image, what is the most likely clinical chronic heart diagnosis?	H&E stained light microscopy image of human cardiac tissue with {class} texture.
Tang et al 2019	IHC stained light micrograph of extracellular amyloid-beta deposition in the human brain tissue. Based on the image, what is the most likely amyloid beta morphology pattern?	Human brain tissue is stained with immunohistochemistry for amyloid-beta and imaged using brightfield microscopy. The micrograph displays {class} morphology.
Wong et al 2022	IHC stained light micrograph of extracellular amyloid-beta deposition in the human brain tissue. Based on the image, what is the most likely amyloid beta morphology pattern?	Human brain tissue is stained with immunohistochemistry for amyloid-beta and imaged using brightfield microscopy. The micrograph displays {class} morphology.


light microscopy:
    - brightfield microscopy
    - phase contrast microscopy
    - differential interference contrast microscopy
    - darkfield microscopy
    - polarized light microscopy
    - mixed
    - synthetic
fluorescence microscopy:
    - confocal microscopy
    - epifluorescence microscopy
    - single-molecule localization microscopy (SMLM)
    - stimulated emission depletion microscopy (STED)
    - total internal reflection fluorescence microscopy (TIRF)
    - fluorescence recovery after photobleaching (FRAP)
    - fluorescence resonance energy transfer (FRET)
    - fluorescence in situ hybridization (FISH)
    - fluorescence correlation spectroscopy (FCS)
    - mixed
    - synthetic
electron microscopy:
    - atomic force microscopy
    - scanning electron microscopy
    - serial blockface scanning electron microscopy
    - cryo-electron microscopy
    - cryo-electron tomography
    - immuno-electron microscopy
    - mixed
    - synthetic

Figure 13: Modality with submodality YAML file.


biology:
    - anatomy
    - biochemistry
    - biophysics
    - biotechnology
    - botany
    - cell and molecular biology
    - cell biology
    - cell cycle
    - conservation biology
    - developmental biology
    - ecology
    - evolutionary biology
    - genetics
    - immunology
    - marine biology
    - microbiology
    - neurobiology
    - parasitology
    - pharmacology
    - physiology
    - structural biology
    - systems biology
    - virology
    - zoology
dermatology:
    - infectious dermatology
    - medical dermatology
    - neoplastic dermatology
ophthalmology:
    - cornea and external eye ophthalmology
    - retinal surgery
    - diabetic retinopathy
    - neuro-ophthalmology
    - pediatric ophthalmology
    - neoplastic ophthalmology
    - medical ophthalmology
cytology:
    - gynecologic cytology
    - non-gynecologic cytology
    - fine needle aspiration cytology
pathology:
    - autopsy pathology
    - blood banking and transfusion medicine
    - bone and soft tissue pathology
    - breast pathology
    - cardiovascular pathology
    - clinical pathology
    - dermatopathology
    - endocrine pathology
    - forensic pathology
    - gastrointestinal pathology
    - genitourinary pathology
    - gynecologic pathology
    - head and neck pathology
    - hematopathology
    - hepatobiliary pathology
    - infectious disease pathology
    - molecular pathology
    - nephropathology
    - neuropathology
    - oral pathology
    - ophthalmic pathology
    - pancreatic pathology
    - pediatric pathology
    - pulmonary and pleural pathology
    - renal and medical kidney pathology
    - surgical pathology
radiology:
    - abdominal radiology
    - breast imaging
    - cardiothoracic radiology
    - emergency radiology
    - gastrointestinal radiology
    - genitourinary radiology
    - head and neck radiology
    - interventional radiology
    - musculoskeletal radiology
    - neuroradiology
    - nuclear radiology
    - pediatric radiology
    - vascular and interventional radiology

Figure 14: Domain with subdomain YAML file.


light microscopy:
    - H&E
    - IHC(HDab)
    - IHC(Red)
    - Giemsa
    - PAS
    - Papanicolaou
    - Masson Trichrome
    - Toluidine Blue
    - Wright-Giemsa
    - Ziehl-Neelsen
    - Gram
    - Congo Red
    - Alcian Blue
    - Basic fuchsin
    - None
    - synthetic
fluorescence microscopy:
    - DAPI
    - Hoechst
    - propidium iodide
    - SYTOX
    - Alexa Fluor 350 # blue
    - Alexa Fluor 405
    - GFP
    - FITC
    - Cy2
    - Alexa Fluor 488
    - RFP
    - Cy3
    - H2B-mCherry
    - Texas Red
    - Alexa Fluor 555
    - Alexa Fluor 568
    - Cy5
    - Alexa Fluor 647
    - Alexa Fluor 660
    - AlexaFluor-tubulin
    - synthetic
    - GalTEGFP d
electron microscopy:
    - uranyl acetate
    - osmium tetroxide
    - lead citrate
    - phosphotungstic acid
    - tannic acid
    - sodium silicotungstate
    - sodium phosphotungstate
    - sodium metaperiodate
    - synthetic
    - None

Figure 15: Modality with stain YAML file.


I’m creating a dataset to evaluate VLM understanding on biomedical images.
Could you convert this user input question into a multi-choice question
with 6 answer choices? One choice should be "None of the above",
and this choice should have a 1/6 chance of being correct.
Output a JSON format:
{"question": str, "choices": list, "answer": int (start from 0)}.


Context: {question["CONTEXT"]}
Input Question: {question["INPUT"]}
Correct Answer: {answer}

Figure 16: Prompt used to convert open VQA to closed VQA


Given this question, can you help me annotate the following fields?

## Modality and Submodality
{Modalities YAML}

## Domain and Subdomain
{Domains YAML}

## Scale (nano/subcellular, micro/cellular, macro/tissues)
{Scales Table}

## Content
+ Gene pathways
+ Metabolic pathways
+ Cell signalling and signal transduction
+ Cell physiology/function
+ Protein-protein interactions
+ Cell-cell interaction
+ Unique properties of the cell of origin/cell type in the image
+ Cytoskeleton and cell structure/morphology
+ Drug or small molecule mechanism of action
+ Other

## Relevant biological keywords
For example, brain, HeLa, mitochondria, GFP, etc

Output a JSON with {"modality": str,
                    "submodality": str,
                    "domain": str,
                    "subdomain": str,
                    "scale": str,
                    "content": str,
                    "keywords": list[str]}

Figure 17: Prompt used to classify questions post-hoc


metadata:
  height: 250
  width: 250
  name: 01145_6de79663_33375_split-test_chronic-heart-failure.png
  format: .png
  createdAt: '2024-05-27T17:39:08.866Z'
  updatedAt: '2024-05-27T17:39:08.866Z'
comments: []
custom_metadata:
  age: 47.0
  classes_to_idx:
    not_chronic_heart_failure: 0
    chronic_heart_failure: 1
  cvdo_id:
  - CVDO_0000569
  cvdo_name:
  - cardiomyopathy
  dataset_name: nirschl_et_al_2018
  dataset_slug: nirschl_et_al_2018
  domain: pathology
  ethnicity: Caucasian
  filename: 01145_6de79663_33375_split-test_chronic-heart-failure.png
  file_size: 141080
  institution:
  - upenn
  image_id: 6de79663-0392-4f20-b2ea-16ddf4e3b4e4
  image_md5: 97ee369881fcbfd18fb4fb8dcfe2ca17
  label: 1
  label_name: chronic_heart_failure
  label_subname: cardiomyopathy
  label_task: classification of heart failure using cardiac histopathology images
  last_updated: '2024-05-27T17:39:08.868Z'
  license: CC-BY-4.0
  microns_per_pixel: 2.0
  modality: light microscopy
  ncbitaxon_id:
  - NCBITaxon_9606
  ncbitaxon_name:
  - Homo sapiens
  ncit_id:
  - NCIT_C50577
  normal_or_abnormal: abnormal
  original_filename: 33375_0_fal_20_0.png
  patient_id: '33375'
  pato_id:
  - PATO_0000384
  sex: male
  snomedct_id:
  - SCTID_48447003
  split: test
  stain: H&E
  subdomain: cardiovascular pathology
  submodality: brightfield microscopy
  supported_tasks:
  - multi_class
  uberon_id:
  - UBERON_0000948
  uberon_name:
  - heart
  questions: null
tags:
- CVDO_0000569
- H&E
- Homo sapiens
- NCBITaxon_9606
- NCIT_C50577
- PATO_0000384
- SCTID_48447003
- UBERON_0000948
- brightfield microscopy
- cardiomyopathy
- cardiovascular pathology
- heart
- light microscopy
- pathology

Figure 18: $\mu$ -Bench Perception: Example of densely annotated metadata for a single data point. Metadata is collected and reviewed by an expert.
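To illustrate how a metadata record like the one in Figure 18 might be turned into a closed-VQA item, the Python sketch below derives lettered answer options from `classes_to_idx` and `label`. The helper `build_options`, the seeded shuffle, replacing underscores with spaces, and always appending "none of the above" last are assumptions modeled on the data samples in Appendix H, not a description of the released tooling.

```python
import random

# Subset of the custom_metadata fields shown in Figure 18.
record = {
    "label": 1,
    "label_name": "chronic_heart_failure",
    "classes_to_idx": {
        "not_chronic_heart_failure": 0,
        "chronic_heart_failure": 1,
    },
}


def build_options(meta: dict, seed: int = 0) -> tuple[list[str], int]:
    """Return shuffled answer options and the index of the correct one."""
    idx_to_class = {idx: name for name, idx in meta["classes_to_idx"].items()}
    correct_name = idx_to_class[meta["label"]]
    # Sanity check: the integer label and label_name should agree.
    if correct_name != meta["label_name"]:
        raise ValueError("label and label_name disagree")
    options = [name.replace("_", " ") for name in meta["classes_to_idx"]]
    # Assumption: class order is shuffled with a fixed seed for reproducibility.
    random.Random(seed).shuffle(options)
    options.append("none of the above")
    return options, options.index(correct_name.replace("_", " "))


options, answer_idx = build_options(record)
```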

Answer with a single letter, no extra details.
Question: {question}
{options}

Figure 19: Prompts used to run inference with auto-regressive models
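One way to fill the Figure 19 template and map a model's reply back to an option index is sketched below in Python. The names `format_prompt` and `parse_letter` are hypothetical, and both the one-option-per-line layout and the parsing rule (a single leading letter, optionally in parentheses) are assumptions; the paper's exact answer-extraction logic is not shown in this section.

```python
import re
import string


def format_prompt(question: str, options: list[str]) -> str:
    """Fill the Figure 19 template with lettered options (A., B., ...)."""
    lettered = "\n".join(
        f"{letter}. {option}"
        for letter, option in zip(string.ascii_uppercase, options)
    )
    return (
        "Answer with a single letter, no extra details.\n"
        f"Question: {question}\n"
        f"{lettered}"
    )


def parse_letter(reply: str, n_options: int):
    """Map a reply such as 'B.' or '(c)' to a zero-based index, else None."""
    # Accept only a standalone leading letter, not the first letter of a word.
    match = re.match(r"\s*\(?([A-Za-z])(?![A-Za-z])", reply)
    if match is None:
        return None
    index = ord(match.group(1).upper()) - ord("A")
    return index if index < n_options else None
```

Returning `None` for replies that do not start with a valid letter lets the caller count them as incorrect rather than guessing an answer.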

You are invited to an alpha-testing phase of a vision-language chat app
for biologists. The application is free of charge, poses no risks that
would not be present with general internet usage, and you may stop use
at any time. You may use it for your daily research or however you wish.

Website:
    Create an account at: ###
    Acknowledge terms of service
    Registration and use of the app is consent to use the submitted
    image/text for model training and testing purposes. The user’s field
    of study and training level will be recorded. However, all personal
    information will remain confidential. The users will retain the
    copyright and ownership of the input image data. However, the terms
    of service allow permission to use and redistribute the image under
    a CC-BY-SA 4.0 license. The raw and/or curated image-text data may be
    used to create a public benchmark of real-world biology user-AI
    assistant instruction tuning dataset to benefit the biomedical
    computer vision community.

Main interface
1. Upload a biology or biomedical image.
    Describe the image as context for the model. For example:
    "Actin (orange) and mitochondria (cyan) in a micropatterned HeLa cell.
    This is a still image from time-lapse live cell imaging. Wild-type
    genotype and no drug treatment."

2. Prompt/question:
    For example: How would you describe the pattern of the
    organelle in cyan?
    What is the most likely organelle? What antibody marker or dye
    specifically labels this organelle for cell biology experiments?
    Feel free to challenge the model with difficult questions requiring
    complex reasoning. There is no need to limit questions to simple
    perceptual tasks such as classification ("what is in the image?")
    or visual question-answering. Have it reason and interpret
    images in a way that would challenge a new biology graduate student.
    Ask the model to do complex image-based reasoning about biological
    processes/pathways:

    "Given the gene knockout cell image provided, what biological pathway
    is most likely disrupted, if any?"

    "Are there any small molecules that reverse this phenotype?"

    "What diseases are associated with gene <my favorite gene>?"

    Determine whether the VLM can understand true biological signals vs.
    artifacts that confound analysis. See how the VLM handles diverse
    modalities and experiments (EM, fluorescence, brightfield, CLEM, etc.)
    as well as different cell types and tissues.
    Ask the model to generate new hypotheses based on an image or connect
    to relevant literature.

    After you have a response to your question, we encourage you to
    provide feedback on the VLM answer. Please give details on why the
    answer is correct/helpful or incorrect/not helpful. As needed,
    provide additional details as to whether the VLM
      1) understood the question
      2) identified the biological feature(s) in the image
      3) provided a correct biological interpretation/answer.

Figure 20: Email invitation: Invitation for collaboration sent to submitters.

Appendix H $\mu$ -Bench Perception (Coarse-Grained) Closed VQA Data Samples

QUESTION TYPE: Modality
Question: What is the most likely microscopy modality used to acquire this image?
Options:
A. electron microscopy B. fluorescence microscopy C. light microscopy D. none of the above

QUESTION TYPE: Modality
Question: What is the most likely microscopy modality used to acquire this image?
Options:
A. light microscopy B. electron microscopy C. fluorescence microscopy D. none of the above

QUESTION TYPE: Submodality
Question: What is the most likely microscopy submodality used to acquire this image?
Options:
A. fluorescence resonance energy transfer (FRET) B. total internal reflection fluorescence microscopy (TIRF) C. mixed D. stimulated emission depletion microscopy (STED) E. confocal microscopy F. none of the above

QUESTION TYPE: Submodality
Question: What is the most likely microscopy submodality used to acquire this image?
Options:
A. serial blockface scanning electron microscopy B. atomic force microscopy C. synthetic D. cryo-electron microscopy E. mixed F. none of the above

QUESTION TYPE: Domain
Question: What is the most likely field of study this micrograph would be used for?
Options:
A. ophthalmology B. biology C. radiology D. pathology E. cytology F. none of the above

QUESTION TYPE: Subdomain
Question: What is the most likely subfield of study this micrograph would be used for?
Options:
A. fine needle aspiration cytology B. gynecologic cytology C. non-gynecologic cytology D. none of the above

QUESTION TYPE: Stain
Question: What is the most likely technique used to stain this micrograph?
Options:
A. Papanicolaou B. H&E C. PAS D. IHC(HDab) E. Giemsa F. none of the above

TASK: White blood cell classification
Prompt: A Giemsa-stained light micrograph displaying human peripheral blood cells. As a blood cell recognition system, identify the correct cell type:
Options:
A. eosinophil B. neutrophil C. immature granulocyte D. platelet E. erythroblast F. none of the above

TASK: Classification of cell contour irregularity phenotypes in synthetic images
Prompt: A synthetic fluorescence micrograph is displayed. What is the most likely description for the cell’s contour irregularities?
Options:
A. irregular B. intermediate C. smooth D. none of the above

TASK: Classification of cell contour irregularity phenotypes in synthetic images
Prompt: An electron micrograph is shown. Based on the image, what is the most likely structure on the field of view?
Options:
A. a Zebrafish retina B. Leishmania haptomonad C. a HeLa cell in metaphase D. Cardiac muscle E. a Tobacco leaf chloroplast F. none of the above

TASK: Classification of cell cycle phase
Prompt: A brightfield micrograph of Jurkat cells acquired using flow cytometry (single cell). Based on the micrograph, what is the most likely cell phase?
Options:
A. interphase (G2) B. interphase (G1) phase C. Telophase D. Anaphase E. Synthesis F. none of the above

TASK: Classification of cell cycle stages in live cell imaging data
Prompt: A fluorescence micrograph of Hela Kyoto cells stably expressing GalT-eGFP to label the Golgi apparatus. Based on the image what is the most likely the Golgi apparatus morphology?
Options:
A. Anaphase B. Diffuse C. Interphase D. Golgi twin E. Partly disassembled F. none of the above

TASK: Classification of pre-cancerous and cervical cancer lesions in liquid-based cytology Pap smear images
Prompt: Liquid-based cytology pap smear of human pre-cancerous or cervical cancer lesions. Based on the cytogram, what is the most likely finding?
Options:
A. Squamous cell carcinoma (SCC) B. Low-grade (LSIL) C. Negative (NILM) D. High-grade (HSIL) E. none of the above

TASK: Classification of normal and abnormal pollen grains
Prompt: Basic fuchsin stained light micrograph of pollen grains. Based on the image, what is the most likely pollen class?
Options:
A. abnormal Corylus avellana B. Non-pollen C. normal Alnus D. normal Corylus avellana E. none of the above

TASK: White blood cell classification
Prompt: Synthetically generated Giemsa-stained light micrograph of human peripheral blood cell. As a blood cell recognition system, identify the correct cell type:
Options:
A. Neutrophil B. Basophil C. Lymphocyte D. Eosinophil E. Monocyte F. none of the above

TASK: Texture classification in colorectal cancer
Prompt: H&E stained light micrograph of human colorectal tissue. Based on the image, what is the most likely texture class?
Options:
A. chronic heart failure B. not chronic heart failure C. none of the above

TASK: Classification of heart failure using cardiac histopathology images
Prompt: H&E stained light micrograph of human cardiac tissue. Based on the image, what is the most likely clinical chronic heart diagnosis?
Options:
A. chronic heart failure B. not chronic heart failure C. none of the above

TASK: Amyloid beta pathology classification
Prompt: IHC stained light micrograph of extracellular amyloid-beta deposition in the human brain tissue. Based on the image, what is the most likely amyloid beta morphology pattern?
Options:
A. Caa B. Cored C. Negative D. Diffuse E. none of the above

TASK: Classification of mitochondrial morphology in CryoET images
Prompt: A cryo-electron tomography of mitochondria in neurons cultured in vitro is shown. Based on the image what is the most likely mitochondrial morphology?
Options:
A. abnormal mitochondrial morphology B. normal mitochondrial morphology C. none of the above

Cognition Question Example
Question: What type of cells are labelled green?
Options:
A. Cholinergic neurons B. Glial cells C. Muscle fibers D. Epithelial cells E. Fibroblasts F. None of the above

Cognition Question Example
Question: In the provided fluorescence microscopy image of a section of a human pancreas stained with DAPI (dark blue), anti-insulin (light blue), anti-glucagon (red), and anti-somatostatin (green), are there more green-labeled or red-labeled features visible?
Options:
A. There are more green-labeled features (anti-somatostatin). B. There are more red-labeled features (anti-glucagon). C. The number of green-labeled features (anti-somatostatin) and red-labeled features (anti-glucagon) are equal. D. There are no green-labeled features (anti-somatostatin) visible. E. There are no red-labeled features (anti-glucagon) visible. F. None of the above

Cognition Question Example
Question: What are the dark structures in the image? How does it relate to the biological process demonstrated in the image?
Options:
A. The dark structures are mitochondria, involved in energy production during cell metabolism. B. The dark structures are lysosomes, which digest cellular waste as the cell divides. C. The dark structures are chloroplasts, which harvest light energy during photosynthesis in plant cells. D. The dark structures are vesicles, which transport materials to the cleavage furrow during cytokinesis. E. The dark structures are homologous chromosomes, separated into each daughter cell during mitosis. F. None of the above

Cognition Question Example
Question: What is the most likely structure labeled by the bright dots along the neurons?
Options:
A. Synapse parts B. N-cadherin complexes C. V-glut 1 and 2 transporters D. Postsynaptic NMDA receptors E. Excitatory synapses F. None of the above

Cognition Question Example
Question: What type of imaging is this?
Options:
A. This is an image of HT55 cancer cells and T cells. B. This is an image of neuronal cells. C. This is an image of bacterial colonies. D. This is an image of red blood cells. E. This is an image of muscle tissue. F. None of the above.

Cognition Question Example
Question: What percentage of this tumor is composed of calcium?
Options:
A. 5% B. 10% C. 20% D. 15% E. 1% F. None of the above

Cognition Question Example
Question: What are the vesicular structures seen at the left and bottom borders? What is their function?
Options:
A. Early endosomes which are involved in sorting of endocytosed material. B. Electron opaque cross-sectioned kinetodesmal fibers for structural support. C. Granulo-fibrillar material involved in cellular structure. D. Mitochondria which provide energy to the cell. E. Axosome which gives rise to microtubules. F. None of the above.

Cognition Question Example
Question: Are synapses visible in this image?
Options:
A. Yes, synapses are clearly visible as distinct structures. B. No, synapses are not visible as the image lacks specific synaptic markers. C. Yes, synapses are visible where dendrites and axons come into close proximity. D. No, synapses cannot be identified due to the resolution limitation of fluorescence microscopy. E. Yes, the orange and gray colored areas clearly show synapses. F. None of the above.

Cognition Question Example
Question: Describe the spatial relationship between organelles in the cell.
Options:
A. Lysosomes are in two populations, one at the cell center and one at the periphery. B. The Golgi apparatus surrounds the mitochondria while the ER is dispersed in the cytosol. C. ER and lipid droplets cluster together at the cell periphery. D. Mitochondria are positioned mainly in the cell periphery, while lysosomes are located centrally. E. Lysosomes are located close to the nucleus while peroxisomes are scattered throughout the cytoplasm. F. None of the above.

Cognition Question Example
Question: Count the number of red cells in this image two days and four hours after initial co-culture in a 96 well plate.
Options:
A. 7500 cells B. None of the above C. 8500 cells D. 9000 cells E. 9500 cells F. 8000 cells