\title

Multimodal Data Integration for Precision Oncology: \\ Challenges and Future Directions \authorHuajun Zhou, \IEEEmembershipMember, IEEE, Fengtao Zhou, Chenyu Zhao, Yingxue Xu, Luyang Luo, \IEEEmembershipMember, IEEE, Hao Chen*, \IEEEmembershipSenior Member, IEEE. \thanksThis work was supported by the Hong Kong Innovation and Technology Fund (Project No. MHP/002/22) and Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. R6003-22 and C4024-22GF). (Corresponding author: Hao Chen.) \thanksHuajun Zhou, Fengtao Zhou, Chenyu Zhao, Yingxue Xu, and Luyang Luo are with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China. \thanksHao Chen is with the Department of Computer Science and Engineering, Department of Chemical and Biological Engineering and Division of Life Science, Hong Kong University of Science and Technology, Hong Kong, China. (e-mail: [email protected]). \maketitle

Abstract

The essence of precision oncology lies in tailoring targeted treatments and care measures to each patient based on the individual characteristics of the tumor. The inherent heterogeneity of tumors necessitates gathering information from diverse data sources to provide valuable insights from various perspectives, fostering a holistic comprehension of the tumor. Over the past decade, multimodal data integration technology for precision oncology has made significant strides in understanding the intricate details within heterogeneous data modalities. These strides have exhibited tremendous potential for improving clinical decision-making and model interpretation, contributing to the advancement of cancer care and treatment. Given the rapid progress that has been achieved, we provide a comprehensive overview of about 300 papers detailing cutting-edge multimodal data integration techniques in precision oncology. In addition, we summarize the primary clinical applications that have reaped significant benefits, including early assessment, diagnosis, prognosis, and biomarker discovery. Finally, based on the findings of this survey, we present an in-depth analysis that explores the pivotal challenges and reveals essential pathways for future research in the field of multimodal data integration for precision oncology.

{IEEEkeywords}

Multimodal Data Integration, Precision Oncology, Medical Imaging Analysis, Cancer.

1 Introduction

According to the estimation provided by the International Agency for Research on Cancer (IARC) of the World Health Organization (WHO), we witnessed 20 million new cases of cancer and, unfortunately, 9.7 million cancer-related deaths in 2022 [WHOdata]. Cancer patients usually have a high mortality rate within five years of cancer diagnosis and endure significant mental, financial, and physical burdens. In addition to being an important barrier to increasing life expectancy, cancer is associated with substantial societal and macroeconomic costs that vary in degree across cancer types, geography, and gender [chen2023estimates]. Precision oncology represents a pivotal paradigm in cancer treatment, aiming to tailor therapeutic approaches based on the distinctive characteristics of patients’ tumors. By customizing treatment plans to maximize efficacy while mitigating adverse effects, precision oncology holds immense promise in improving treatment outcomes and advancing the landscape of cancer care. Nevertheless, the intricacies inherent in the micro- and macro-environment of tumors, coupled with the diverse characteristics exhibited by different cancers, present a significant challenge in comprehending the complex nature of tumors and devising more effective therapies. Clinicians have long relied on medical imaging [bar2003clinical, makaju2018lung, morrow2011mri, tang2009computer] or lab test results [chernecky2012laboratory, sturgeon2008national] to gain critical insights into patients’ health conditions, enabling accurate diagnoses and informed treatment decisions. In recent years, the precision oncology community has witnessed a surge [zhuang20223d, zhou2023cross, xu2023multimodal, xu2021mufasa] due to the successful integration of a variety of heterogeneous data like medical imaging, clinical records, and omics data by leveraging multimodal data integration techniques. 
Specifically, medical imaging provides detailed visualizations of the internal structures and abnormalities to enable the characterization of tumors, and assessment of their size, location, and spread. Moreover, clinical records provide comprehensive insights into the patient’s past and present health status, diagnostic findings, treatment approaches, and disease progression. Furthermore, omics data provides a deeper understanding of the molecular alterations associated with cancer, including the identification of genetic mutations, gene expression patterns, protein modifications, etc. These heterogeneous data modalities provide valuable yet distinct insights into tumor characteristics, risk assessment, cancer progression, and treatment response. Effectively integrating these multimodal data offers promising opportunities for building a holistic understanding of tumors and advancing healthcare research, diagnostics, and personalized medicine, as shown in Fig. 1. However, constructing Artificial Intelligence (AI) models for effectively integrating multimodal data is a non-trivial task, requiring multi-faceted considerations on various critical aspects. These include understanding multimodal data characteristics, devising effective model architectures, formulating robust fusion strategies, and addressing potential challenges. Specifically, for samples with complete modalities, the primary objective is to effectively integrate heterogeneous knowledge in different modalities to improve the model’s performance. In the realm of multimodal data integration, there exist diverse fusion strategies possessing distinct advantages and drawbacks, calling for a thoughtful evaluation of the specific data modalities and clinical tasks to determine the most suitable fusion strategy. Moreover, for samples with incomplete modalities, the focus shifts toward learning robust representations to minimize performance degradation. 
Imputation-based methods focus on compensating the missing modalities using information from the observed modalities, while imputation-free methods directly leverage the observed modalities to perform multimodal fusion without imputing the missing modalities. As the former method may introduce additional noise by imputing missing modalities, and the latter method overlooks the correlations between modalities, striking a balance between noise reduction and capturing inter-modality relationships becomes crucial.

\includegraphics

[width=1 ]image/framework.pdf

Figure \thefigure: Overview of multimodal data integration for advancing precision oncology.
\includegraphics

[width=0.47 ]image/histogram.pdf

Figure \thefigure: Histogram of the reviewed papers on multimodal data integration for precision oncology in the past decade.

In this paper, we surveyed about 300 publications in the field of multimodal data integration for precision oncology over the past 10 years (2014 to April 2024), as summarized in Fig. 2. This review stands out from existing literature [kline2022multimodal, panayides2020ai, qiu2023large, tong2023integrating, kang2022roadmap, steyaert2023multimodal, baltruvsaitis2018multimodal, acosta2022multimodal, huang2020fusion, lipkova2022artificial, lahat2015multimodal, zhang2020advances, stahlschmidt2022multimodal, jacobs2023artificial, muhammad2021comprehensive, zhao2024review, kho2022saline, soomro2022image, boehm2022harnessing] on the specified research topic due to its extensive analysis of the strengths and limitations of methodologies utilized in clinical applications of precision oncology. Specifically, we first categorize the reviewed methods into two main groups according to whether they deal with complete or incomplete data. For samples with complete modalities, we further categorize existing methods into early, intermediate, late, and multi-level fusion, and subsequently conduct an in-depth analysis of their properties regarding architectural complexity, multimodal interconnection modeling, and the potential risk of modality collapse. These critical aspects are instrumental in ensuring optimal effectiveness and efficiency in leveraging multimodal data integration for precision oncology applications. For samples with incomplete modalities, we divide imputation-based methods into three subcategories: data generation, feature generation, and sample retrieval. Likewise, we divide imputation-free methods into three subcategories: robustness enhancement, multi-task learning, and knowledge distillation. This refined categorization allows for a more comprehensive understanding and exploration of the various approaches employed to address the challenges posed by incomplete modalities. 
Furthermore, our investigation delves into the clinical applications of multimodal data integration within the context of precision oncology. By exploring the practical use cases, we discuss the challenges that impede the advancement of multimodal data integration in the realm of precision oncology. By identifying these challenges, we explore potential future directions for further advancements in integrating diverse data modalities to enhance precision oncology approaches and improve patient outcomes. The remainder of this work is structured as follows: In Section 2, we describe the data modalities and the corresponding representation extraction techniques. Next, in Section 3, we review existing multimodal data integration techniques from two perspectives: complete and incomplete data. Subsequently, we investigate the clinical applications of multimodal data integration in Section 4. Based on the above investigation, we summarize several challenges and potential future directions in Section 5. Finally, we conclude our survey in Section 6.

2 Data Modality

Imaging data provides valuable visual information that helps clinicians in diagnosing cancers, assessing the extent and progression of conditions, planning treatments, and monitoring treatment responses. Endoscopic and dermoscopic images are captured using seamlessly integrated cameras within their respective instruments, namely endoscopes and dermatoscopes. These cameras utilize diverse imaging techniques, such as white-light and narrowband imaging, that are considered distinct modalities. Given the similarity to natural images, existing encoders pre-trained on natural images, such as CNN [ronneberger2015u, çiçek20163d] or Transformer [dosovitskiy2020image, wang2021transbts] models, can be employed to extract deep features from each image directly. Radiology imaging technologies aim to show the structure or function of tissues and organs and are widely used for diagnosing and treating cancers. There are two main types of imaging: structural imaging, which creates images of the anatomy and morphology of body parts, and functional imaging, which captures the functioning of tissues and organs [histed2012review]. Structural imaging techniques include computed tomography (CT), magnetic resonance imaging (MRI), ultrasound (US), and mammography scans. CT imaging uses X-rays to create detailed cross-sectional images of the body, while MRI imaging employs a strong magnetic field and radio waves for detailed cross-sectional images. US imaging uses sound waves to generate real-time images of internal organs, and mammography uses low-dose X-rays for detailed images of breast tissue, making it the standard for breast cancer screening [luo2024deep]. Functional imaging techniques like positron emission tomography (PET), single-photon emission computed tomography (SPECT), and optical imaging reveal the functioning of tissues and organs. 
SPECT and PET use small amounts of radioactive tracers to produce concentrated images of body parts, while optical imaging uses digital cameras to detect molecular emissions from electromagnetic waves. Molecular imaging targets specific biomolecules involved in cellular processes underlying disease states. Despite the inherent differences between radiology images and natural images, existing encoders are commonly utilized for feature extraction. Pathology image diagnosis represents the gold standard in tumor diagnosis, offering a meticulous examination of tissue structures and cellular characteristics, unrivaled by radiology scans. However, the high-resolution nature of pathology images poses a challenge for AI models to extract discriminative features while disregarding non-informative regions. To leverage the rich information within pathology images, Multiple Instance Learning (MIL) [ilse2018attention] has emerged as a prominent approach. Specifically, each pathology image is split into numerous image patches, while patch features are aggregated to form a holistic representation. MIL strategy enables AI models to select informative patches and extract lower-dimension yet discriminative representations from pathology images. Clinical data encompass a wealth of medical records from cancer patients, including medical history, medications, demographics, laboratory test values, diagnostic reports, etc. Structured data in clinical records refers to information organized in a predefined format, which may be continuous (e.g., age and tumor size) or discrete (e.g., race and metastasis status) variables. To integrate them into a joint representation, various techniques are employed for continuous and discrete variables, respectively. Continuous variables can undergo normalization to ensure comparability across scales. Meanwhile, discrete variables with limited categories can be transformed using one-hot encoding, where each category is converted into a binary feature. 
By aggregating all encoded features, structured data can be transformed into a cohesive representation, facilitating subsequent multimodal integration. On the other hand, unstructured data in clinical records refers to information that is not organized in a predetermined format, such as free-text clinical notes and diagnostic reports. They often require natural language processing (NLP) techniques to extract relevant information for subsequent analysis and decision-making. It is noteworthy that structured data can also be formulated as sentences, allowing for a more comprehensive and nuanced understanding of clinical records. Recent approaches leverage the Large Language Models (LLMs) [thirunavukarasu2023large, xu2021mufasa] to capture complex semantic information in textual data, facilitating advanced comprehension of clinical reports. Omics data refers to large-scale biological data generated from high-throughput technologies that capture information about various biological molecules, such as genes, proteins, and metabolites, which are considered different modalities [zheng2023multi, han2022multimodal]. It finds extensive application in systems biology and functional genomics, enabling the exploration of molecular interactions and their impact on the overall functionality of cells, tissues, and organisms. Omic data, characterized by complexity, high dimensionality, and noise, necessitates the utilization of specialized computational methods and tools for its analysis and interpretation. To this end, researchers employ advanced techniques such as self-normalizing neural networks (SNN) [klambauer2017self] to enable a deeper understanding of the underlying biological mechanisms, facilitating personalized medicine and targeted therapies.
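Two of the representation-extraction steps described in this section can be illustrated with a minimal sketch: encoding structured clinical variables (normalization plus one-hot encoding) and aggregating patch features with attention-based MIL pooling. This is not the implementation of any reviewed method. The function names, the choice of z-score normalization, and the parameters `V` and `w` (which would normally be learned) are assumptions made for illustration.

```python
import numpy as np


def encode_structured(continuous, categorical, categories):
    """Normalize continuous columns, one-hot encode discrete ones, and concatenate.

    continuous:  (n_patients, n_continuous) numeric array-like
    categorical: list of per-patient tuples of discrete values
    categories:  list giving the allowed levels of each discrete column
    """
    continuous = np.asarray(continuous, dtype=float)
    mean = continuous.mean(axis=0)
    std = continuous.std(axis=0)
    std[std == 0] = 1.0  # constant columns are centered but left unscaled
    normalized = (continuous - mean) / std  # z-score (an assumed choice)

    blocks = [normalized]
    for col, levels in enumerate(categories):
        position = {level: i for i, level in enumerate(levels)}
        block = np.zeros((len(categorical), len(levels)))
        for row, record in enumerate(categorical):
            block[row, position[record[col]]] = 1.0  # one binary feature per level
        blocks.append(block)
    return np.concatenate(blocks, axis=1)


def attention_mil_pool(patch_features, V, w):
    """Attention-based MIL pooling: score each patch, softmax, weighted sum.

    patch_features: (n_patches, d); V: (d, hidden); w: (hidden,)
    V and w are hypothetical attention parameters, passed in rather than learned.
    """
    scores = np.tanh(patch_features @ V) @ w         # one score per patch
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over patches
    slide_feature = weights @ patch_features         # (d,) slide-level representation
    return slide_feature, weights
```

The attention weights returned by `attention_mil_pool` indicate which patches contribute most to the slide-level representation, which is one way to read the statement that MIL lets a model select informative patches.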

\includegraphics

[width=0.99 ]image/fusion1.pdf

Figure \thefigure: Fusion strategies for complete data, including (a) early fusion, (b) intermediate fusion, (c) late fusion, and (d-e) multi-level fusion.

3 Methods of Multimodal Data Integration

Multimodal data integration in precision oncology aims at leveraging heterogeneous information from multiple data sources to build a holistic understanding of tumors. When constructing AI models for multimodal data integration, two scenarios arise: samples for which all modalities are available, and samples for which some modalities are missing. Each scenario entails specific objectives for model construction, requiring careful consideration and adaptation based on data availability. In the case of complete data, researchers strive to maximize the performance of downstream tasks by effectively integrating multimodal data. Conversely, in the case of incomplete data, robust methods are necessary to handle the missing modalities and minimize potential performance degradation. Both scenarios offer unique opportunities to unveil patterns, enhance predictive accuracy, and facilitate informed decision-making in precision oncology.

\thesubsection Integration of Complete Data

To integrate multimodal data, the reviewed papers adopt four fusion strategies, summarized in Fig. 3: early, intermediate, late, and multi-level fusion.

\thesubsubsection Early Fusion

Early fusion refers to the integration of multimodal information at the input level, which could be raw data, hand-crafted features, or pre-processed deep features. Concatenation is the most straightforward operation to obtain a joint representation [xu2016multimodal, chai2021integrating, saeed2022tmss, fang2021self, gu2023segcofusion], as it is capable of accommodating any format of representation. Moreover, element-wise operations such as addition, multiplication, concatenation, or pooling can be adopted for modalities of the same shape, especially aligned multimodal imaging data. For example, pixel-wise concatenation of different MRI sequences has been widely adopted in recent works [tang2020deep, cheng2022fully, razzak2018efficient, yang2023flexible, hou2023mfd, qian2021prospective]. Deep models designed for early fusion generally exhibit a relatively lower architectural complexity compared to other fusion strategies that involve processing multiple inputs simultaneously. For instance, simple U-shape networks [luo2020hdc, cui2020unified] can effectively extract joint representation from the concatenated multimodal inputs, as they operate on a single input stream. The low architectural complexity of early fusion facilitates model design, parameter tuning, and interpretation, making them more accessible and convenient for clinicians. While early attempts [pereira2016brain, schulte2021integration, anagnostou2020multimodal] widely embrace it, the approach of early fusion has gradually faced some criticism and been overshadowed by more intricate fusion strategies in the latest works. The first issue is the limitation on bridging explicit and intricate interconnections between multiple modalities. Firstly, modalities often have different data types, structures, and scales [toney2014neural]. Directly concatenating them into a unified input may make it difficult to effectively bridge intricate yet meaningful interconnections. 
The multimodal heterogeneity necessitates careful consideration of pre-processing steps and feature engineering techniques to appropriately integrate multimodal data. Secondly, when concatenating multiple modalities, the dimensionality of the input substantially increases [cui2020unified, tang2020deep], leading to the accumulation of information redundancy across all modalities. It presents a formidable challenge when dealing with high-dimensional inputs, as it demands a substantial amount of data to mitigate overfitting and effectively learn intricate patterns. The scarcity of available data, combined with the soaring dimensionality, gives rise to a sparsity predicament, impeding the ability to achieve good generalization on unseen samples. Thirdly, the interaction between different modalities can manifest in intricate and non-linear dynamics, introducing a layer of complexity [stahlschmidt2022multimodal] that may not be well captured by early fusion. Certain multimodal interconnections may necessitate the utilization of specific attention mechanisms, gating mechanisms, or fusion techniques. Overlooking these interconnections will limit the model’s capacity to fully leverage the heterogeneous information in multimodal data. Another potential issue is modality collapse, wherein the learned representation excessively relies on a single modality while underutilizing information from other modalities [wang2020auto]. Specifically, the primary goal of multimodal data integration is to effectively integrate information from multiple modalities, leveraging their complementary nature. However, modality collapse poses a significant challenge to this objective by limiting the contribution of certain modalities, leading to an imbalanced or biased representation. Consequently, the utilization of multimodal information is compromised, impeding the attainment of a comprehensive and accurate understanding of the data. 
Within the early fusion approach, modality collapse can occur when one modality dominates the fusion process due to stronger predictive signals or when there exists a significant dimensionality gap between the modalities. This imbalance can hinder the model’s ability to capture the synergistic effects and complementary nature of different modalities, resulting in the underutilization of available multimodal information and suboptimal performance [javaloy2022mitigating, nazabal2020handling].
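The input-level concatenation described above can be sketched as follows. The sketch assumes co-registered image sequences of identical shape and uses hypothetical function names; it only illustrates how a single joint input is formed before any encoder is applied.

```python
import numpy as np


def early_fuse_images(sequences):
    """Stack spatially aligned images (e.g., MRI sequences) along a new channel axis."""
    shapes = {s.shape for s in sequences}
    if len(shapes) != 1:
        raise ValueError("channel-wise early fusion needs aligned inputs of equal shape")
    return np.stack(sequences, axis=0)  # (n_modalities, *spatial_dims)


def early_fuse_vectors(*features):
    """Flatten and concatenate arbitrary feature arrays into one joint input vector."""
    return np.concatenate([np.ravel(f) for f in features])
```

The second function also makes the dimensionality concern concrete: concatenating, say, a 1024-dimensional image feature with a 5-dimensional clinical vector yields an input in which one modality accounts for almost all dimensions, which is one of the conditions under which modality collapse is discussed above.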

\thesubsubsection Intermediate Fusion

Intermediate fusion involves the fusion of multimodal information at the feature level, culminating in the extraction of an abstract joint representation for decision-making. Given the diverse nature of feature modeling across different modalities, the fusion operations used in intermediate fusion exhibit significant variability. In addition to the concatenation operations employed in early fusion, intermediate fusion offers a wide range of additional operations that can be utilized, such as Graph Neural Networks (GNNs) [liu2024muse, mo2020multimodal], Transformers [nakhli2023sparse, zhou2023cross, 10155265], and attention mechanisms [li2023survival, zhang2017tandemnet]. These techniques provide more flexibility in capturing the complex multimodal interconnections and enhancing multimodal representation. Due to its inherent flexibility, intermediate fusion has garnered growing attention in recent works within the field of precision oncology. Intermediate fusion generally has a moderate architectural complexity. In the intermediate fusion approach, deep models showcase a sophisticated architecture that incorporates modality-specific sub-networks to capture the distinctive characteristics of each modality, along with fusion modules that model the interconnections between different modalities. This additional processing step increases architectural complexity and computational requirements compared to early fusion, posing challenges like scalability concerns, computational efficiency, and the need for efficient training schemes. The intricate interplay between sub-networks and fusion modules enables the capture of explicit and intricate interconnections between multiple modalities. Existing approaches leverage multimodal data to enhance model performance from various perspectives. 
The first approach is to align the feature representations of different modalities [nakhli2023sparse, chen2019robust, ding2023pathology, ning2021multi] to emphasize consensus and improve the confidence of the predictions. By mapping the features into a shared representation space, different modalities can be effectively inter-connected, enabling the exchange of information and enhancing consensus [meng2024nama, lara2020multimodal, chen2021multimodal]. Another approach involves leveraging the strengths of each modality and combining their complementary information [zhang2017tandemnet, 8911262, fang2021self, gu2023segcofusion, zhang2024prototypical]. Rather than consensus enhancement, these methods seek to capture the unique contributions [fu2021multimodal] of different modalities, enhancing the overall understanding and decision-making process. Furthermore, some researchers aim to model the dependencies and interactions between modalities explicitly [liu2022multimodal, liu2024multimodal, meng2022msmfn]. Graph-based representations, for example, enable the creation of a structure that reflects the interconnections between modalities [mo2020multimodal, shi2024mif], facilitating the propagation of information and capturing complex interactions. The above approaches highlight the distinct merits of multimodal learning, focusing on common, unique, and synergistic knowledge between modalities, respectively. To avoid favoring one type of knowledge over others and potentially overlooking valuable knowledge, comprehensive knowledge decomposition [10155265, 9478224] has gained increasing attention. This approach involves decomposing multimodal knowledge into distinct components, allowing for a comprehensive analysis of each knowledge component’s contributions. By incorporating all knowledge components and dynamically adapting their contributions, a holistic and nuanced comprehension can be attained, consequently yielding remarkable performance enhancements. 
Besides, a quantification analysis [liang2023quantifying] of knowledge components is crucial for evaluating the significance of each knowledge component, identifying potential biases or imbalances, and fine-tuning the model to ensure a fair and effective integration of all knowledge types. However, this direction is still relatively underexplored, highlighting the necessity for further research and development. Modality collapse presents a notable concern within intermediate fusion, as it occurs when the fusion process inadequately harnesses the information from all modalities, consequently leading to suboptimal performance. This issue can manifest in different ways, highlighting the need for careful consideration in multimodal fusion operations. One common manifestation is the dominance of some modalities in the fusion process, overshadowing the contributions of other modalities. This scenario arises when one modality is more informative or easier to converge, causing the fusion model to heavily rely on that modality while neglecting the valuable information from other modalities. Another manifestation occurs when the fusion process fails to effectively capture the complementary information in different modalities. Consequently, redundant or irrelevant information may be preserved in the fusion process, limiting the potential benefits of multimodal fusion. To mitigate modality collapse, several techniques [huang2022modality, wu2022scaling] have been explored to ensure a balanced integration of modalities, promote the equitable utilization of information, and capture the synergies between different modalities.
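As an illustration of feature-level fusion with an attention mechanism, the sketch below lets the tokens of each modality attend to the tokens of the other before pooling. It is a simplified stand-in and not the architecture of any cited work: the learned query, key, and value projections of a real Transformer block are omitted, both modalities are assumed to share the same feature dimension, and all names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention of one modality's tokens over the other's."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)  # (n_q, n_kv)
    return attn @ keys_values, attn


def intermediate_fuse(feat_a, feat_b):
    """Fuse two sets of modality-specific features into one joint representation.

    feat_a: (n_a, d) tokens from modality A (e.g., pathology patches)
    feat_b: (n_b, d) tokens from modality B (e.g., genomic groups)
    """
    a_ctx, _ = cross_attention(feat_a, feat_b)  # A enriched with context from B
    b_ctx, _ = cross_attention(feat_b, feat_a)  # B enriched with context from A
    pooled_a = (feat_a + a_ctx).mean(axis=0)    # residual connection, then mean pooling
    pooled_b = (feat_b + b_ctx).mean(axis=0)
    return np.concatenate([pooled_a, pooled_b])  # (2 * d,) joint representation
```

Inspecting the attention matrices returned by `cross_attention` is one simple way to check whether one modality is being ignored, which relates to the modality collapse concern discussed above.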

\thesubsubsection Late Fusion

Late fusion aggregates modality-specific decisions into a more accurate joint decision, leveraging the decisive information from different modalities. The aggregation operations in late fusion [yala2019deep] may be the same as the concatenation operations used in early fusion. In some cases, alternative aggregation operations, such as weighting [fang2022weighted], feature selection [shao2019integrative, boehm2022multimodal], rule-based aggregation [yan2024combiner], Bayesian-based fusion [liu2023m] or learnable modules [zheng2023multi], can further enhance the fusion process. These options provide a certain level of flexibility to customize the fusion process according to the specific characteristics of multimodal data. In the reviewed papers, a considerable number of studies adopted this strategy. In deep models, late fusion approaches typically exhibit a higher architectural complexity when compared to early and intermediate fusion [zhao2024review, zhuang20223d]. Specifically, the overall architecture of late fusion typically comprises multiple parallel branches, with each branch dedicated to a specific modality. This architecture allows for the incorporation of separate model structures that are tailored to capture the unique and nuanced characteristics of each modality. However, the architectural complexity arises from the need for processing multiple branches and ensuring proper aggregation of the modality-specific decisions. Overall, while late fusion approaches offer the advantage of capturing distinctive information from each modality, their added complexity may require additional computational resources and model parameters. The utilization of modality-specific branches also introduces challenges in effectively leveraging the synergistic effects between different modalities. 
Specifically, when processing each modality independently, these individual branches may encounter limitations in capturing the complex interactions that arise when multiple modalities are combined [zheng2023multi, fang2022weighted]. These interactions play a crucial role in understanding the underlying interconnections between modalities and making more accurate joint decisions. Therefore, the absence of multimodal interactions hampers the ability to capture the complete knowledge and exploit the synergistic benefits derived from the combination of multiple modalities. Late fusion approaches may report limited performance in scenarios where the interactions between modalities significantly contribute to overall performance. Indeed, training multiple branches to produce modality-specific decisions is highly advantageous in mitigating the modality collapse issue commonly encountered in early and intermediate fusion approaches. Specifically, late fusion encourages the preservation of unique information inherent to each modality [holste2021end, yala2019deep] by making independent decisions based on modality-specific representations. By producing modality-specific decisions, it promotes a more balanced fusion of modalities, avoiding the dominance of a single modality and enhancing the utilization of the complementary information provided by different modalities. Thus, the modality collapse issue can be alleviated by ensuring that the valuable knowledge in each modality is appropriately captured and integrated during the fusion process. Overall, multiple branches in late fusion can capture the distinct characteristics of each modality more effectively, leveraging the unique information present in each modality and building a nuanced understanding of multimodal data.
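Decision-level aggregation by weighting can be sketched in a few lines. The function below is an illustrative assumption rather than the method of any cited paper: it takes the class probabilities produced independently by each modality-specific branch and returns their weighted average.

```python
import numpy as np


def late_fuse(probabilities, weights=None):
    """Weighted average of modality-specific class probabilities.

    probabilities: (n_modalities, n_classes), one row per modality-specific branch
    weights:       optional non-negative weight per modality (defaults to uniform)
    """
    probs = np.asarray(probabilities, dtype=float)
    if weights is None:
        weights = np.ones(len(probs))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the output stays a distribution
    return weights @ probs
```

Because each branch is trained and evaluated on its own modality, no branch can suppress another during feature learning. The trade-off is also visible in the code: the modalities only meet in the final weighted sum, so no cross-modal interaction is modeled.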

\thesubsubsection Multi-level Fusion

The different fusion strategies for multimodal data in precision oncology research each offer benefits and limitations. Researchers have therefore been combining early, intermediate, and late fusion strategies to leverage the advantages of each while minimizing their drawbacks. For example, Zhuang et al. [zhuang20223d] conducted a study using multi-sequence MRI images, dividing them into distinct T1-T1ce and T2-FLAIR groups. They concatenated the multimodal data within each group at an early level and used separate encoders to extract multimodal representations for each group. These representations were then integrated using a cross-modal interaction module, i.e., intermediate fusion. This early-intermediate fusion strategy is particularly suitable for the coexistence of heterogeneous and homogeneous data modalities. In addition to early-intermediate fusion, a combination of intermediate and late fusion strategies has also gained attention in recent studies [he2020feasibility, li2022adaptive], enabling a sophisticated integration of multimodal decisions or underlying features simultaneously. It involves modeling intricate multimodal interconnections at the intermediate level, resulting in a multimodal decision, which can then be aggregated with modality-specific decisions to generate the final decision. Furthermore, the modality collapse issue can be effectively addressed by incorporating modality-specific decision modules, ensuring that the unique information from each modality is appropriately captured and preventing the overshadowing of any modality. The multi-level fusion strategy enables researchers to effectively capitalize on the strengths of different fusion strategies and facilitate a deeper understanding of multimodal data and informed decision-making.
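One way to write the intermediate-late combination compactly is given below. The notation is an assumption introduced here for illustration and is not taken from the reviewed papers: $h_m$ denotes the feature of modality $m$, $\phi$ a fusion module operating on all $M$ modality features, $g$ the joint decision head, $g_m$ the modality-specific decision heads, and $\alpha_m$ aggregation weights.

\begin{equation}
\hat{y} = \alpha_{0}\, g\big(\phi(h_1, \dots, h_M)\big) + \sum_{m=1}^{M} \alpha_{m}\, g_m(h_m),
\qquad \alpha_m \ge 0, \quad \sum_{m=0}^{M} \alpha_m = 1 .
\end{equation}

Under this reading, setting $\alpha_0 = 1$ recovers pure intermediate fusion and setting $\alpha_0 = 0$ recovers pure late fusion, while intermediate values keep the modality-specific decisions in the final prediction.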

\thesubsubsection Comparative Analysis

In the field of multimodal data integration for precision oncology, four fusion strategies have emerged: early, intermediate, late, and multi-level fusion. Specifically, early fusion captures the interactions between modalities from the beginning and requires only a simple architecture to integrate multimodal data. However, it can model only plain multimodal interconnections and is prone to the modality collapse issue. Intermediate fusion processes modalities separately and allows more flexible integration of multimodal data, but it comes with increased computational complexity and still risks modality collapse. Late fusion employs modality-specific branches to process each modality independently, incorporates modality-specific decisions, and mitigates the modality collapse issue. Nonetheless, it introduces higher architectural complexity, increased computational demands, and difficulties in modeling intricate multimodal interconnections. Finally, multi-level fusion, which integrates multimodal knowledge at different levels, holds promise for leveraging the advantages of the various fusion strategies. Although many recent works [liu2023m, zhuang20223d, zheng2023multi, zhang2017convolutional] have compared these fusion strategies, the underlying datasets are typically small-scale and less representative, leading to varying conclusions. Therefore, a comprehensive and reliable comparison between these fusion strategies is still lacking, particularly for multimodal data in the context of precision oncology. In conclusion, the field of multimodal data integration in precision oncology offers different fusion strategies, each with its unique strengths and challenges, necessitating careful consideration of the specific conditions and requirements to achieve optimal results.

\includegraphics

[width=0.99 ]image/missing2.pdf

Figure \thefigure: Fusion strategies for incomplete data, including imputation-based methods and imputation-free methods.

\thesubsection Integration of Incomplete Data

Multimodal learning has emerged as a highly promising approach for acquiring comprehensive information about cancer patients by harnessing the power of diverse data sources. However, in the realm of clinical applications, the assumption of complete access to all modalities for fusion is often unattainable. It is a common occurrence to encounter missing data in one or more modalities, stemming from various factors including limitations in data collection, privacy concerns surrounding data sharing, and technical challenges associated with data acquisition. The presence of incomplete multimodal data poses a significant challenge to the performance of multimodal fusion models, and it can even lead to the failure of previously established fusion methods that heavily rely on complete modality data. Consequently, researchers have dedicated substantial efforts to address this critical issue within the domain of precision oncology. While previous studies and surveys such as [zhou2023literature, shah2023survey] have provided valuable insights into the integration of incomplete multimodal data, their focus has been primarily confined to methods applicable to homogeneous multimodal data, such as multi-sequence MRI data. These studies fail to offer a comprehensive overview of the existing techniques that can be applied to both homogeneous and heterogeneous multimodal data. Hence, in this section, we aim to bridge this gap by providing a comprehensive review of the existing methods employed for integrating incomplete multimodal data within the context of precision oncology, as shown in Fig. 3. Our investigation will shed light on the strengths, limitations, and potential applications of these techniques, thereby contributing to the advancement of multimodal fusion research in precision oncology.

\thesubsubsection Imputation-based Methods

Intuitively, the most straightforward solution to the issue of missing modalities is imputation, which fills in the missing modalities using information from the observed modalities. Imputation-based methods can be further categorized into three subcategories: i) imputation via data generation, ii) imputation via feature generation, and iii) imputation via sample retrieval. In the case of imputation via data generation, researchers commonly employ generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) to synthesize the missing modalities from the information present in the observed modalities. Subsequently, the generated modalities are combined with the observed ones through multimodal fusion techniques, enabling downstream tasks to be performed. Typically, all available modalities are integrated to learn a shared modality-invariant latent representation, which effectively captures the underlying data distribution. This latent representation is then utilized as a generation condition for the generative model to synthesize the missing modalities [chartsias2017multimodal, sharma2019missing, zhou2020hi, wu2023collaborative, chen2024modality]. Essentially, the information missing from incomplete multimodal data consists of the modality-specific components that are not shared across modalities. Consequently, the objective of the generative model is to capture the modality-specific information embedded within the missing modalities. To enhance the quality of the generated modalities, certain studies explicitly incorporate a feature disentanglement scheme, which decomposes the available modalities into modality-invariant and modality-specific components [shen2020multi, peng2021multi], thereby exploring and completing the possible modality-specific information contained within the missing modalities. 
To handle the various possible missing modality cases, these early studies often need to train multiple generative networks. To improve the efficiency of the generation process and reduce the required computational resources, some studies have proposed training a unified generative network that can handle multiple missing modality cases [hamghalam2021modality, dalmaz2022resvit, yang2023learning, yuan2023rethinking]. Notably, the diffusion model, renowned for its success in image generation and inpainting tasks, has also motivated the development of diffusion-based imputation methods for multimodal data [meng2024multi]. Nonetheless, it is worth highlighting that the majority of methods in this category primarily focus on multi-sequence MRI data, which represents a typical example of homogeneous multimodal data. When confronted with highly heterogeneous data, the generative model may encounter challenges in effectively learning the complex data distribution, thereby resulting in suboptimal imputation performance. To utilize generative models to synthesize missing modalities in highly heterogeneous multimodal data, some researchers have proposed imputation via feature generation methods [huang2021aw3m, hou2023hybrid, wang2023multi, ting2023multimodal, jiao2023gmrlnet]. These methods extract features from the observed modalities and leverage them to impute the missing modalities by training feature-level generative models. Unlike the aforementioned imputation techniques that operate at the data level, imputation via feature generation methods specifically target the generation of low-dimensional feature representations for the missing modalities. By extracting relevant features from the available modalities, these methods aim to estimate the latent representation of the missing modalities. This feature-level generation approach offers several advantages in handling the complexities posed by highly heterogeneous multimodal data. 
One key advantage is that the generative model no longer needs to capture the high-dimensional data distribution, which can be computationally demanding and challenging in such heterogeneous settings. Instead, these methods prioritize the capture and estimation of low-dimensional feature representations, which often prove to be more feasible and effective in addressing the missing modality problem. It is important to highlight that imputation via feature generation methods are applicable to both homogeneous and heterogeneous multimodal data, presenting a versatile solution for various scenarios. However, a notable challenge arises from the abstract nature of the generated features: directly interpreting these features is non-trivial, and evaluating and controlling their quality poses additional difficulties. The aforementioned methods generally rely on a generative model to synthesize the missing modalities, which may be computationally expensive, sensitive to the scale of the training set, and prone to mode collapse. To avoid these drawbacks, imputation via sample retrieval methods address missing modalities by aggregating compensation information from similar samples in the training set [chen2020hgmf, zhang2022m3care]. Specifically, given the incomplete data, sample retrieval methods leverage the observed information to calculate the similarity between the inference sample and the samples in the training set. Subsequently, the retrieved samples, which possess the desired modalities, are used to fill in the missing modalities. To enhance the efficiency and effectiveness of the retrieval process, some studies have proposed the utilization of learnable prototypes or prompts as representations for the samples in the training set [chen2023towards, wang2024mgiml]. 
Then, attention mechanisms are employed to aggregate the compensation information from the prototypes, which are learned from the entire training set. Nevertheless, the performance of sample retrieval methods heavily depends on the quality of the retrieved samples, which may not always accurately represent the true underlying data distribution and can introduce bias into the fusion model. Achieving precise retrieval of similar samples often necessitates predefined metric learning or similarity measurement, which can be impractical in certain scenarios. Furthermore, due to the limited number of samples in the training set, the modalities imputed by sample retrieval methods may lack the necessary diversity to fully capture the underlying data distribution. These limitations need to be addressed in order to enhance the effectiveness and reliability of sample retrieval methods. Overall, imputation-based methods offer valuable approaches to address the challenges associated with missing modalities in multimodal data. The choice of method depends on factors such as the characteristics of the data and the specific requirements of the application. Imputation via data generation is well-suited for homogeneous multimodal data, as it excels in synthesizing missing modalities. Imputation via feature generation, on the other hand, provides a versatile solution that can be applied to both homogeneous and heterogeneous multimodal data by estimating low-dimensional feature representations for the missing modalities. Lastly, imputation via sample retrieval offers an efficient alternative by aggregating compensation information from similar samples in the training set. Despite their strengths, these methods also face several challenges, such as the scalability of the generative model, the interpretability of the generated features, and the diversity of the retrieved samples. 
Future research should aim to enhance the performance, interpretability, and scalability of imputation-based methods in addressing missing modalities, particularly in highly heterogeneous multimodal data settings.
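To make imputation via sample retrieval more concrete, the following is a minimal sketch under simplifying assumptions: features are assumed to be pre-extracted vectors, similarity is assumed to be cosine similarity on the observed modality, and the missing modality is filled with a softmax-weighted average over the `k` most similar training samples. The function name `retrieve_and_impute` and all parameters are hypothetical and do not correspond to the implementation of any cited method.

```python
import numpy as np

def retrieve_and_impute(query_obs, train_obs, train_missing, k=3):
    """Impute the missing-modality feature of one sample from similar training samples.

    query_obs:     (d_obs,) observed-modality feature of the incomplete sample.
    train_obs:     (n, d_obs) observed-modality features of complete training samples.
    train_missing: (n, d_mis) features of the modality that the query lacks.
    """
    eps = 1e-8
    # Cosine similarity between the query and every training sample.
    q = query_obs / (np.linalg.norm(query_obs) + eps)
    t = train_obs / (np.linalg.norm(train_obs, axis=1, keepdims=True) + eps)
    sims = t @ q
    # Keep the k most similar training samples.
    top = np.argsort(-sims)[:k]
    # Softmax over the retrieved similarities gives the aggregation weights.
    w = np.exp(sims[top] - sims[top].max())
    w = w / w.sum()
    # Weighted average of the retrieved missing-modality features.
    return w @ train_missing[top]

train_obs = np.array([[1.0, 0.0], [0.0, 1.0]])
train_missing = np.array([[10.0, 20.0], [30.0, 40.0]])
imputed = retrieve_and_impute(np.array([1.0, 0.0]), train_obs, train_missing, k=1)
```

With `k=1` the query simply copies the feature of its nearest neighbor, which makes the dependence on retrieval quality discussed above easy to see: a poor match is propagated directly into the fused representation.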

\thesubsubsection Imputation-free Methods

In contrast to imputation-based methods, imputation-free methods provide a more efficient and flexible solution for handling missing modalities in multimodal data. These methods directly leverage the observed modalities to perform multimodal fusion without imputing the missing modalities using sample information. Imputation-free methods can be further categorized into three subcategories: i) robustness enhancement, ii) multi-task learning, and iii) knowledge distillation. Robustness enhancement methods aim to improve the robustness of multimodal fusion models to missing modalities, allowing the models to adapt to their presence. These methods generally utilize sophisticated fusion strategies and modules that are designed to be less sensitive to missing modalities, effectively enhancing the robustness of multimodal fusion models [havaei2016hemis, ning2021relation, ding2021rfnet, liu2022moddrop, wu2023multimodal, liu2023sfusion, 10288381]. Some of these models focus on capturing modality-invariant features shared across modalities, while others aim to directly integrate all available information from the observed modalities. In the domain of deep learning, specialized frameworks such as transformers and graph neural networks exhibit insensitivity to the dimensionality of input data, making them successful tools for fusing incomplete multimodal data [zhao2022modality, zhang2022mmformer, shi2023m, qiu2023modal, zhang2024tmformer]. Undoubtedly, robustness enhancement methods offer flexibility and efficiency in handling diverse cases of missing modalities. However, it is important to note that these methods may overlook the modality-specific information present in the available modalities, which can limit their performance in scenarios where modality-specific information is crucial for downstream tasks. 
To ensure that robustness enhancement models capture the modality-specific information of each observed modality, researchers have introduced multi-task learning strategies that employ auxiliary tasks to encourage the model to learn complementary information from the observed modalities. One widely used auxiliary task is the reconstruction task, which aims to reconstruct the input data or features from the fused representation, effectively enforcing the model to retain essential information from all observed modalities [van2018learning, dorent2019hetero, zhou2020brain, zhou2021latent, cui2022survival, liu2023learning, zhou2023feature, liu2023m3ae]. Compared to robustness enhancement methods, multi-task learning methods implicitly capture modality-specific information from the observed modalities, further improving the performance of robustness enhancement models in handling missing data. However, these methods may still struggle to model the modality-specific information contained in the missing modalities. To effectively capture the modality-specific information of missing modalities, knowledge distillation methods offer a promising solution. These methods transfer knowledge from a teacher model trained on complete multimodal data to a student model trained on incomplete data. Knowledge distillation can be performed at different levels, including feature-level distillation, relation-level distillation, and response-level distillation. The distinction lies in the granularity of the distilled knowledge, with feature-level distillation focusing on low-level features, relation-level distillation capturing high-level relations between features, and response-level distillation targeting the final prediction responses. 
Existing studies either adopt one of these distillation strategies [ning2022mutual, konwer2023enhancing, wang2023learnable, qiu2023scratch] or combine multiple distillation strategies [hu2020knowledge, wang2021acn, vadacchino2021had, yang2022d, azad2022smu, karimijafarbigloo2024mmcformer]. By leveraging the distilled knowledge, the student model can achieve performance comparable to that of the teacher model, even in the presence of missing modalities. In summary, imputation-free methods offer a more efficient and flexible solution for handling missing modalities in multimodal data, as they perform multimodal fusion directly on the observed modalities without imputing the missing ones. Robustness enhancement methods focus on improving the robustness of multimodal fusion models to missing modalities, while multi-task learning strategies encourage the model to learn complementary information from the observed modalities. Knowledge distillation methods transfer knowledge from a teacher model trained on complete multimodal data to a student model trained on incomplete data. However, imputation-free methods lack the ability to explicitly model the modality-specific information contained in the missing modalities, which may limit their performance in certain scenarios.
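As an illustration of response-level distillation, the sketch below computes a Kullback-Leibler divergence between temperature-softened teacher and student predictions. This is a common generic formulation rather than the loss of any specific cited study; the function name `response_distillation_loss`, the `temperature` parameter, and the customary scaling by the squared temperature are assumptions made for this example.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def response_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened predictions, averaged over the batch.

    teacher_logits: outputs of a teacher that sees all modalities.
    student_logits: outputs of a student that sees only the observed modalities.
    """
    log_p_t = log_softmax(np.asarray(teacher_logits, dtype=float) / temperature)
    log_p_s = log_softmax(np.asarray(student_logits, dtype=float) / temperature)
    kl = (np.exp(log_p_t) * (log_p_t - log_p_s)).sum(axis=-1)
    # Scaling by temperature^2 is a common convention (an assumption here).
    return float(temperature ** 2 * kl.mean())

teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[0.1, 0.2, 0.3]])
loss_mismatch = response_distillation_loss(student, teacher)
loss_match = response_distillation_loss(teacher, teacher)
```

The loss vanishes when the student reproduces the teacher's responses and grows as the predictions diverge. Feature-level and relation-level distillation would replace the logits with intermediate features or pairwise feature relations, respectively.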

\thesubsubsection Comparative Analysis

In summary, both imputation-based and imputation-free methods offer valuable approaches to address the issue of missing modalities in multimodal fusion. Imputation-based methods focus on compensating for the missing modalities using information from the observed modalities, while imputation-free methods directly leverage the observed modalities to perform multimodal fusion without imputing the missing modalities. The former can complete the missing modalities to some extent, but may suffer from mode collapse and from the difficulty of capturing the complex data distribution. The latter provides a more efficient and flexible solution for handling missing modalities, but may struggle to generalize well to incomplete data if the shared representation or distilled knowledge fails to capture the essential information of the missing modalities. It is important to note that there is no one-size-fits-all solution for addressing the challenges posed by missing modalities in multimodal fusion. Future research should aim to comprehensively analyze the strengths and limitations of both imputation-based and imputation-free methods. Additionally, exploring novel techniques that combine the advantages of these approaches can help enhance the robustness and generalization capabilities of multimodal fusion models when dealing with incomplete data. By considering the strengths and weaknesses of each method and developing novel techniques that leverage their respective advantages, researchers can advance the field of multimodal fusion and effectively address the challenges associated with missing modalities in diverse real-world applications.

4 Applications of Multimodal Data Integration

Multimodal data integration has facilitated various applications of precision oncology, including early assessment, diagnosis, prognosis, and biomarker discovery.

\thesubsection Early Assessment

\thesubsubsection Risk Stratification

The goal of risk stratification is to evaluate an individual’s risk of developing cancer in the near future based on various factors, including personal medical history, lifestyle choices, and genetic predisposition. Traditional risk stratification tools [RATool] simply estimate the risk using personalized information, such as race, age, diet, and medical history. Recent research [yala2019deep] found that risk stratification for breast cancer leveraging multi-view mammography images is significantly more accurate than the traditional Tyrer-Cuzick (version 8) model [tyrer2004breast], while combining the two achieves the best accuracy. This indicates that multimodal data integration is a promising tool for breast cancer risk stratification and may be feasible for other cancers.

\thesubsubsection Screening

Cancer screening helps detect cancer early, improving the chances of successful treatment by identifying abnormal tissue before symptoms manifest and allowing for more effective intervention. It relies on specific examinations to detect cancer in individuals, such as physical exams, laboratory tests, or imaging procedures. The various data modalities produced in these examinations imply great potential for multimodal data integration to improve screening accuracy. For instance, Wu et al. [8861376] presented a method that consolidates multiple mammography images, achieving accuracy on par with experienced radiologists. Several studies have drawn similar conclusions regarding the screening of patients with diverse cancer types, such as prostate [rossi2020multi], ovarian [xiang2024development], breast [liao2019automatic], and upper gastrointestinal (UGI) [ding2020scnet] cancer. In summary, these studies highlight that integrating multimodal information in cancer screening produces more valuable insights than relying solely on a single modality.

\thesubsubsection Detection

Lesion detection aims at identifying the presence and location of lesions within the human body. It typically involves the use of various imaging techniques [alyafeai2020fully], such as MRI, CT, or PET, to identify abnormal growths or masses that may indicate the presence of a tumor. For example, Kumar et al. [kumar2019co] proposed a co-learning method to detect and segment tumors of lung cancer simultaneously. To reduce the annotation burdens, researchers [wang2018automated, yang2017co] utilized weakly-supervised methods to localize prostate tumors using sample-level labels. They demonstrated the feasibility of automatically and accurately locating lesions with a small amount of annotation effort.

\thesubsection Diagnosis

\thesubsubsection Segmentation

Lesion segmentation refers to the process of outlining the boundaries of a tumor on medical imaging scans, e.g., CT, MRI, or PET images. It empowers clinicians with more valuable information about the tumor’s location, size, shape, and relationship with surrounding structures and organs, facilitating personalized treatment planning and decision-making. Segmentation has garnered significant attention in recent years, largely attributed to the availability of high-quality datasets. For example, the brain tumor segmentation (BraTS) challenge series [menze2014multimodal] established a good benchmark consisting of multi-sequence MRI images and pixel-level annotations of the enhancing tumor (ET), the tumor core (TC), and the whole tumor (WT). Extensive efforts [yue2023adaptive, lin2023ckd, ma2018concatenated, ding2021mvfusfra, zhu2023brain, nie20183] have been dedicated to improving segmentation accuracy on this benchmark through the development of effective multimodal data integration techniques. Besides, some works focus on segmenting lung [zhou2023coco, xiang2022modality, podobnik2023multimodal], head & neck [shi2023h], prostate [zhang2021cross], colon [lin2023lesion], liver [mo2020multimodal] tumors in PET-CT image pairs or multi-sequence MRI, as well as brain lesions [zhang2024robust, chen2018voxresnet, zhuang2021aprnet] on other datasets [mendrik2015mrbrains]. In conclusion, multimodal data integration has a significant impact on improving segmentation accuracy, as evidenced by numerous successful efforts.

\thesubsubsection Subtyping

Cancer subtyping refers to the process of categorizing tumors into distinct subgroups based on their molecular, genetic, histological, or clinical characteristics. These subtypes represent different manifestations of cancer, each with unique biological features, behavior, and response to treatments. Subtyping is therefore important for treatment selection, outcome prediction, patient care guidance, and patient support. Recently, several works [kim2023heterogeneous, fang2024dynamic, alwazzan2024foaa, wang2021mogonet] leveraged advanced multimodal data integration techniques to build deep models for predicting breast cancer subtypes using multi-omics data. Zheng et al. [zheng2023multi] incorporated both feature- and label-level confidence learning for cancer subtyping. Moreover, Han et al. [han2022multimodal] integrated multimodal data based on the estimated modality-specific informativeness scores. The achievements of the aforementioned works underscore the importance of multimodal data integration for cancer subtyping.

\thesubsubsection Grading

Tumor grading is a process that assesses the cellular characteristics of cancer cells and determines the degree of abnormality or aggressiveness of a tumor. It involves examining tumor tissue samples under a microscope and assigning a grade based on specific histological features. It is essential for treatment decision-making, prognosis estimation, cancer monitoring, and effective communication with patients. Dozens of grading methods have been proposed for breast [li2023msa], glioma [9478224], prostate [lara2020multimodal], bladder [8911262], and hepatocellular carcinoma [li2022adaptive] cancers. For example, Zhang et al. [8911262] predicted the grades of bladder cancer patients using pathology images and corresponding reports from clinicians. They found that text information can improve grading performance by providing valuable clinical knowledge. Cancer grading models, harnessing the power of multimodal data integration, hold profound implications for the advancement of precision oncology.

\thesubsubsection Metastasis Prediction and Detection

Metastasis refers to the spread of cancer from its primary site to distant organs or tissues in the body, which impacts cancer staging and the choice of therapeutic interventions [hou2023deep, hou2021integration]. Metastasis prediction [zheng2020deep] aims to estimate the likelihood of metastasis occurring in cancer patients, while metastasis detection methods [10155265] are designed to identify the presence of metastasis in the given input data. Hu et al. [hu2023multi] leveraged graph models to explore the relations between different features to detect lymph node metastasis (LNM). Qiao et al. [qiao2022breast] integrated MRI and US images through an explicit knowledge decomposition to jointly predict LNM, histological grade, and Ki-67 protein expression levels. They also found that the modality-shared features precisely focus on tumor regions, extracting more tumor-related characteristics and improving the model’s interpretability. This suggests that multimodal data integration provides more precise information for metastasis prediction and detection, facilitating cancer staging and treatment planning.

\thesubsubsection Staging

Staging refers to the process of determining the extent and spread of cancer within a patient’s body. Existing staging systems, e.g., TNM system [denoix1946enquete], combine the extent of the tumor (T), the extent of spread to the lymph nodes (N), and the presence of metastasis (M) to classify cancers. The specific staging criteria may vary for different types of cancer, and clinicians rely on established guidelines and staging systems specific to each cancer type for accurate staging. Multiple multimodal data integration-based staging models have been proposed recently. For example, Toney et al. [toney2014neural] attempted to predict the nodal stage for non-small cell lung cancer by integrating CT and PET images. Recently, Zhou et al. [zhou2023rfia] leveraged endoscopic and pathology images to classify the stage of oesophageal cancers. Multimodal data integration has shown promising potential for improving staging performance for a variety of cancers.

\thesubsection Prognosis

\thesubsubsection Treatment Response Prediction

Treatment response prediction refers to the estimation of how a patient will respond to a specific treatment or intervention. It involves using various factors, such as tumor characteristics, biomarkers, and imaging data, to assess the likelihood of a favorable response to a particular treatment. Treatment response prediction is valuable for clinicians as it helps guide treatment decision-making, optimize therapy selection, and improve patient outcomes. For example, integrating CT, pathology, and genomics [vanguri2022multimodal, boehm2022multimodal] for predicting treatment response in non-small cell lung and ovarian cancer patients can provide a quantitative rationale for clinicians. Further works [anagnostou2020multimodal, **2021predicting] on treatment response prediction also indicate the significance of multimodal data integration.

\thesubsubsection Survival Analysis

Survival analysis is a statistical method used to analyze data related to the time until the occurrence of an event, which is typically death in this context. It provides valuable insights into the prognosis and outcomes of patients, allowing clinicians to make informed decisions about treatment options, monitor disease progression, and evaluate the effectiveness of therapeutic interventions. The Cancer Genome Atlas (TCGA) program [tcga] provides a wealth of clinical, imaging, and omics data, as well as follow-up records for patients with different cancers, significantly contributing to the development of multimodal data integration for survival analysis. In recent years, a large number of multimodal models [zheng2022multi, li2022hfbsurv, jaume2023modeling, chen2020pathomic, nie20163d, qayyum20223d, tan2022multi, wu2023camr, zhang2022deep] have been proposed for survival analysis using pathology and omics data. These models hold promise in advancing our understanding of patient prognosis, guiding personalized treatment decisions, and ultimately improving treatment outcomes for individuals with cancer.
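For readers less familiar with time-to-event data, the following minimal sketch computes a Kaplan-Meier survival curve from follow-up times and event indicators, where an indicator of 0 marks a censored patient (one whose death was not observed during follow-up). The Kaplan-Meier estimator is used here only as a standard illustration of time-to-event analysis; it is not the method of any model cited above, and the function name `kaplan_meier` and the toy data are assumptions.

```python
import numpy as np

def kaplan_meier(times, events):
    """Return the distinct event times and the estimated survival probability after each.

    times:  follow-up time of each patient.
    events: 1 if the event (e.g., death) was observed, 0 if the patient was censored.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)               # patients still under observation at t
        deaths = np.sum((times == t) & events)     # events occurring exactly at t
        s *= 1.0 - deaths / at_risk                # product-limit update
        survival.append(s)
    return event_times, np.array(survival)

# Toy follow-up data: the second patient is censored at time 2.
km_times, km_surv = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

The censored patient contributes to the at-risk count up to time 2 but never counts as an event, which is the key difference between survival analysis and ordinary regression on observed survival times.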

\thesubsubsection Recurrence Prediction

Recurrence prediction involves estimating the probability of cancer returning in patients who have received cancer treatment. It involves analyzing various factors, such as tumor characteristics and treatment response, to assess the probability of the cancer returning after an initial remission. Recurrence prediction is valuable for clinicians as it helps guide surveillance strategies, inform treatment decisions, and optimize long-term management of patients [nguyen2022attentive]. Tang et al. [tang2024new] identified high-risk recurrence after hepatic resection of colorectal cancer liver metastases using multi-sequence MRI images. Gui et al. [gui2023multimodal] developed a novel model integrating clinical, genomic, and histopathological data to improve the predictive accuracy for localized renal cell carcinoma recurrence. These studies highlight the effectiveness of integrating multimodal data for predicting cancer recurrence.

\thesubsubsection Tumor Growth Prediction

Quantitatively characterizing the tumor’s spatial-temporal progression is valuable in staging tumors and designing optimal treatment strategies. Tumor growth not only relies on the properties of cancer cells but also depends on dynamic interactions between cancer cells and their constantly changing microenvironment. The complexity of the cancer system motivates the study of tumor growth using multimodal data. For instance, Liu et al. [liu2014patient] and Zhang et al. [zhang2017convolutional] presented patient-specific tumor growth prediction models based on longitudinal dual-phase CT and PET imaging data. These studies have provided valuable insights for tumor staging and treatment planning by analyzing tumor growth patterns and cell interactions using multimodal data.

\thesubsection Biomarker Discovery

Biomarkers are measurable indicators that can be used to detect the presence of cancer, predict its prognosis, monitor its progression, or evaluate the response to treatment. Many works [nabbi2023multimodal, carneiro2015weakly, wei2023multi, shi2023novel, braman2021deep] have attempted to analyze biomarkers related to cancer diagnosis and prognosis based on multimodal data integration. For imaging data, the Grad-CAM technique [selvaraju2017grad] provides a powerful tool for analyzing which image regions contribute most strongly to the model’s decisions [chen2021multimodal, chen2020pathomic, chen2022pan, zhou2023texture]. Moreover, genome-wide association studies (GWASs) [claussnitzer2020brief, tam2019benefits, wang2005genome] aim to identify genomic variants that are statistically associated with cancer susceptibility [amos2008genome, farashi2019post, wu2018transcriptome]. Furthermore, SHapley Additive exPlanations (SHAP) [lundberg2017unified] is widely used on clinical records to understand the contribution of each input feature to model predictions. In addition, the analysis of cross-modal attention has emerged as a valuable technique for deciphering the intricate interconnections between different modalities, enabling a deeper understanding of how information from diverse modalities interacts. Rather than focusing on the opacity of AI models, some researchers argue that it is crucial to emphasize rigorous validation through randomized clinical trials [ghassemi2021false]. Prospective trials make it possible to thoroughly assess AI models under real-world conditions, compare their performance to standard-of-care practices, evaluate how clinicians interact with the AI tool, and determine the most effective way to integrate the models into the clinical workflow without causing disruption [lipkova2022artificial].
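The idea behind Shapley-based feature attribution mentioned above can be sketched by computing exact Shapley values for a toy model with a handful of features. This brute-force enumeration is for illustration only and is not the SHAP library's algorithm, which relies on approximations to scale to real models; the function name `shapley_values`, the `value_fn` interface, and the toy additive risk score are assumptions.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley value of each feature.

    value_fn: maps a set of feature indices to the model output obtained when
              only those features are available (a hypothetical interface).
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for r in range(len(others) + 1):
            # Weight of every coalition of size r that excludes feature i.
            weight = factorial(r) * factorial(n_features - r - 1) / factorial(n_features)
            for subset in combinations(others, r):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Toy additive risk score: each available clinical feature adds a fixed amount.
contributions = [1.0, 2.0, 3.0]
toy_model = lambda subset: sum(contributions[j] for j in subset)
phi = shapley_values(toy_model, 3)
```

For an additive model, each feature's Shapley value equals its own contribution, and the values always sum to the difference between the full-model output and the empty-set output, which is what makes them convenient for ranking candidate biomarkers by their influence on a prediction.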

5 Challenges and Future Directions

\thesubsection Robust Learning with Imperfect Data

Training robust multimodal AI models requires large-scale, high-quality datasets. However, collecting such datasets poses a significant challenge in the medical field, particularly in precision oncology, where the examinations required for cancer diagnosis and prognosis vary with each patient’s health condition. In addition, collecting diagnostic and prognostic labels from patients is laborious; for example, survival status usually takes months or even years to collect. As a result, the collected data are prone to imperfections, such as missing modalities or labels, which can significantly affect the effectiveness and generalizability of AI models. In cases where samples have missing modalities, imputation-based methods [chartsias2017multimodal, sharma2019missing, zhou2020hi, wu2023collaborative, chen2024modality] can be employed to complete the missing modalities using either generative models or retrieval-based methods. Generative models such as Generative Adversarial Networks (GANs) and diffusion models have garnered significant attention and have proven successful in various precision oncology applications. Meanwhile, retrieval-based methods can also address missing modalities by leveraging similarities between samples to retrieve relevant modalities. Another viable option is imputation-free methods [havaei2016hemis, ning2021relation, ding2021rfnet, liu2022moddrop, wu2023multimodal, liu2023sfusion, 10288381], which aim to make the most of the available modalities in order to limit performance degradation and enhance the robustness of AI models when modalities are missing. Techniques such as representation learning, multi-task learning, and knowledge distillation can be employed to enhance the model’s robustness in the presence of incomplete modalities, without the need for imputation.
Both of the above approaches show promise for addressing missing modalities in clinical scenarios, but which one is better remains under-explored. In cases where labels are unavailable for some samples, label-efficient learning techniques such as weakly- or semi-supervised learning mitigate the dependence on fully labeled data. These techniques allow AI models to leverage samples with weak or even no labels, so that the learning process benefits from a broader range of available information while reducing the need for extensive labeling efforts. Furthermore, when labels are unavailable for most samples, the labeled training set may be small. In this case, few-shot learning is a valuable approach: models are trained to recognize patterns and extract relevant features from a limited number of labeled samples, enabling efficient adaptation and generalization to unseen samples. In addition, federated learning can be adopted to train models collaboratively across multiple decentralized sites without the need to share raw data. This approach increases the effective scale of the training set while maintaining data privacy, thereby enhancing the robustness of the models. Overall, the choice among the above techniques depends on the specific requirements and constraints of the clinical scenario.
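
As an illustration of how federated learning aggregates knowledge without sharing raw data, one widely used aggregation rule is federated averaging. The text above does not commit to a particular algorithm, so the rule below is given only as a representative sketch; the symbols $K$, $n_k$, and $w_k$ are notation assumed here. Suppose $K$ centers participate, center $k$ holds $n_k$ local samples, and $w_k^{(t+1)}$ denotes the model parameters obtained at center $k$ after local training initialized from the shared model $w^{(t)}$. The server then updates the shared model as
\[
w^{(t+1)} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_k^{(t+1)}, \qquad n = \sum_{k=1}^{K} n_k .
\]
Only model parameters are exchanged, and centers with more samples contribute proportionally more to the shared model.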

\thesubsection Effective Integration of Heterogeneous Modalities

Modality heterogeneity, that is, the variation in how information is presented across different modalities, can manifest in terms of data formats, scales, resolutions, or even semantics, posing challenges for effective multimodal integration and analysis. The presence of heterogeneity not only complicates the fusion process but also introduces the risk of information loss or mismatch between modalities. To address this issue, researchers have explored various techniques and methodologies, such as cross-modality representation learning [xiang2022modality, yue2023adaptive, liu2023m] to enhance modality representations based on multimodal interconnections, semantic alignment methods [zhou2023cross, wu2023camr, lara2020multimodal] to mitigate the semantic gaps between modalities, or knowledge decomposition [10155265, qiao2022breast] to explicitly model distinct knowledge components for a comprehensive integration. Moreover, knowledge quantification tools [liang2023quantifying] are essential for accurately quantifying the nature (knowledge type) and extent (knowledge amount) of interactions between modalities, providing valuable insights into the underlying patterns, correlations, and dynamics within the multimodal data. These tools can inform the design of more effective fusion strategies that leverage the full potential of multimodal data, and they provide a solid foundation for evaluating and comparing different multimodal data integration models. Furthermore, foundation models for multimodal data integration can provide general and discriminative representations by learning from large-scale datasets, which makes them another promising direction. Overall, the above directions can unlock the potential of multimodal data, informing the design of better fusion strategies, enabling more accurate analysis and decision-making, and driving progress in precision oncology.
Information redundancy makes it difficult for AI models to distinguish task-relevant from task-irrelevant information. When multiple modalities provide a huge amount of information, it becomes hard for models to determine which pieces are truly informative for the given task. Various techniques have been adopted to reduce information redundancy in multimodal data integration, such as cross-modal feature selection [xu2023multimodal], task-oriented dimensionality reduction [xu2024label], metric learning [shao2023fam3l], and information bottleneck [zhang2024prototypical, fang2024dynamic] methods. These techniques have proven effective in eliminating redundant information, leading to improved performance and efficiency in multimodal data integration. The emergence of information theory-based approaches [zhang2024prototypical] presents a promising direction by providing a solid theoretical foundation and sophisticated tools for handling redundancy, allowing researchers to advance state-of-the-art performance. In summary, further investigations of information theory hold the potential to advance our comprehension of cancer and significantly improve cancer diagnosis, prognosis, and treatment decision-making processes. In addition, healthcare professionals possess a wide range of valuable knowledge about cancers, which can enhance the multimodal representations and improve the model’s performance in clinical applications. By tailoring this expert knowledge to individual characteristics such as patient demographics or genetic profiles, we can enhance its applicability in personalized cancer care and treatment. To utilize expert knowledge, researchers have developed expert-driven modules [lin2023ckd, zhang2017tandemnet] that integrate clinical guidelines and best practices with multimodal data analysis.
Furthermore, collaboration between clinicians and data scientists allows for the identification of relevant clinical factors, contextual information, and domain-specific knowledge to enhance the representation extracted by AI models [hu2023multi, li2023lvit, shao2023characterizing]. The collaboration of researchers and clinicians can bridge the gap between data-driven models and clinical practice, leading to accurate diagnoses, tailored treatment plans, and improved patient outcomes.
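
To clarify the information bottleneck principle mentioned above in connection with redundancy reduction, its classical objective can be stated as follows. The cited works may use variational or prototype-based variants, so this should be read as the generic form rather than the exact loss of any specific method; the symbols $X$, $Y$, $Z$, and $\beta$ are notation assumed here. Let $X=(X_1,\dots,X_M)$ collect the $M$ input modalities, let $Y$ be the task label, and let $Z$ be the fused representation produced by a stochastic encoder $p(z \mid x)$. The encoder is chosen to solve
\[
\min_{p(z \mid x)} \; I(X;Z) - \beta\, I(Z;Y),
\]
where $I(\cdot\,;\cdot)$ denotes mutual information and $\beta > 0$ controls the trade-off. The first term penalizes information retained from the inputs, which discourages redundant or task-irrelevant content, while the second term rewards information that is predictive of the label.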

\thesubsection Explainable and Trustworthy AI Models

Gaining trust is crucial for clinicians and patients to accept diagnoses and treatment recommendations provided by multimodal AI models. To achieve this goal, multimodal AI models must be transparent both in their decision-making and in how they relate different modalities. This involves extracting meaningful insights [schulte2021integration], identifying the contributing factors, and providing transparent explanations for the decisions made by the model. To facilitate decision interpretation, researchers have proposed various techniques and methodologies. These include visualization methods [chen2020pathomic, zhang2017mdnet] that provide intuitive representations of the multimodal data and decision-making process, feature importance analysis [li2022adaptive] to identify the features most influential to the decision, and rule extraction techniques [yan2024combiner] that extract interpretable rules from the integrated data. Moreover, by analyzing the learned cross-modal attention [xu2023multimodal, chen2022pan], we can reveal the relationships and dependencies between different modalities, providing valuable insights into the interactions and complementary nature of multimodal data. Besides gaining trust, improving the interpretability of AI models can also support clinical development. For example, it can highlight the role of modifiable risk factors, such as a Mediterranean lifestyle and physical activity, in cancer susceptibility. Moreover, model interpretability can guide clinicians to discover new biomarkers that provide valuable insights into the presence and characteristics of cancer, enabling personalized and targeted approaches to diagnosis and treatment. Furthermore, exploiting cross-modal interconnections can enable the discovery of non-invasive alternatives, thereby reducing the need for extensive examinations and minimizing patient discomfort and pain.
Overall, enhancing the interpretability of AI models brings numerous benefits to both clinicians and patients in precision oncology.

\thesubsection Efficient Processing with Limited Resources

Efficient processing with limited resources poses a significant challenge in the context of multimodal data integration for precision oncology. The integration of diverse data modalities typically requires sophisticated fusion strategies and computational frameworks that can effectively extract and fuse information from multiple sources. As the volume and complexity of multimodal data continue to grow, there is a need to handle multimodal data efficiently, particularly under resource constraints such as limited computational power or storage capacity. Future directions in this field involve developing resource-efficient modules [shi2023h] and leveraging techniques such as dimensionality reduction and parallel processing to optimize the processing and analysis of multimodal data. Additionally, advancements in hardware-friendly modules [tang2023ac2as] can also contribute to more efficient processing of multimodal data. Moreover, the development of knowledge distillation [hu2020knowledge] can help focus computational resources on the most informative features of multimodal data, ensuring efficient processing without compromising accuracy and precision in clinical applications [nie20163d, pereira2016brain, zhang2017convolutional]. Overall, addressing the challenge of efficient processing with limited resources is crucial for translating multimodal data integration models into practical and scalable solutions for precision oncology applications.
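
As a concrete reference point for knowledge distillation, the commonly used soft-target objective is sketched below. It is not necessarily the exact loss used in the works cited above, and the symbols $z_t$, $z_s$, $T$, and $\alpha$ are notation assumed here. A compact student network with logits $z_s$ is trained to match a larger teacher network with logits $z_t$ while still fitting the ground-truth label $y$:
\[
\mathcal{L}_{\mathrm{KD}} = \alpha\, \mathrm{CE}\bigl(y, \sigma(z_s)\bigr) + (1-\alpha)\, T^{2}\, \mathrm{KL}\bigl(\sigma(z_t/T) \,\|\, \sigma(z_s/T)\bigr),
\]
where $\sigma$ is the softmax function, $\mathrm{CE}$ is the cross-entropy, $\mathrm{KL}$ is the Kullback-Leibler divergence, $T \geq 1$ is a temperature that softens both distributions, and $\alpha \in [0,1]$ balances the two terms. The factor $T^{2}$ keeps the gradient magnitude of the distillation term comparable across temperatures. At inference time only the student is deployed, which is what reduces the computational cost.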

\thesubsection Cross-center Adaptation and Evaluation

Cross-center adaptation and evaluation present an important concern in the realm of multimodal data integration for precision oncology. It is crucial for multimodal data integration models to extract robust and general features so that they can be applied across diverse patient populations and healthcare settings. However, differences in data acquisition protocols, imaging equipment, and clinical practices among centers pose challenges in harmonizing clinical data. Future directions in this area may focus on addressing these challenges by developing standardized protocols and guidelines for data acquisition, annotation, and representation to ensure compatibility and interoperability across centers. This includes the establishment of data-sharing collaborations that promote the exchange of multimodal data while maintaining patient privacy and data security. Additionally, the development of transfer learning [wang2022continual] and domain adaptation [chen2022contrastive, nagabandi2018learning] techniques can enable models trained on data from one center to be adapted to another, bridging the gap between different centers and facilitating the integration of multimodal data. Furthermore, the establishment of robust evaluation benchmark datasets [menze2014multimodal, zhou2024benchmarking, ma2024multimodality] that encompass diverse patient cohorts and center-specific characteristics is crucial for assessing the performance and generalizability of multimodal models across centers. Overall, addressing the challenge of cross-center adaptation and evaluation is essential for advancing the field of multimodal data integration in precision oncology and ensuring the translation of research findings into clinical practice.

6 Conclusion

The rapid development of multimodal data integration in cancer research has driven unprecedented discoveries and advances in precision oncology practice. In this paper, we review about 300 papers on multimodal data integration for precision oncology over the past decade. Specifically, integrating multimodal data at intermediate or late levels has gained substantial attention, while the emergence of the multi-level fusion strategy offers a more effective way to unveil intricate multimodal interconnections. In tackling samples with missing modalities, both imputation-based and imputation-free methods have demonstrated their effectiveness in various clinical applications. However, determining which type of method is superior remains an unsettled question, as the conclusions of these studies are heavily influenced by the specific data utilized. Through the discussion of existing challenges, we provide insights on future directions for advancing multimodal data integration in precision oncology. As precision oncology continues to evolve, multimodal data integration is expected to shape the future of cancer care, offering enormous potential for personalized medicine and for improving the lives of patients worldwide.