R&B - Rhythm and Brain: Cross-subject Decoding of Music from Human Brain Activity

Matteo Ferrante
Department of Biomedicine and Prevention
University of Rome Tor Vergata
Matteo Ciferri*
Department of Biomedicine and Prevention
University of Rome Tor Vergata
Nicola Toschi
Department of Biomedicine and Prevention
University of Rome Tor Vergata
A.A. Martinos Center for Biomedical Imaging
Harvard Medical School/MGH, Boston (US)
These authors contributed equally to this work

Abstract

Music is a universal phenomenon that profoundly influences human experiences across cultures. This study investigates whether music can be decoded from human brain activity measured with functional MRI (fMRI) during its perception. Leveraging recent advancements in extensive datasets and pre-trained computational models, we construct mappings between neural data and latent representations of musical stimuli. Our approach integrates functional and anatomical alignment techniques to facilitate cross-subject decoding, addressing the challenges posed by the low temporal resolution and signal-to-noise ratio (SNR) in fMRI data. Starting from the GTZan fMRI dataset, where five participants listened to 540 musical stimuli from 10 different genres while their brain activity was recorded, we used the CLAP (Contrastive Language-Audio Pretraining) model to extract latent representations of the musical stimuli and developed voxel-wise encoding models to identify brain regions responsive to these stimuli. By applying a threshold to the association between predicted and actual brain activity, we identified specific regions of interest (ROIs) which can be interpreted as key players in music processing. Our decoding pipeline, primarily retrieval-based, employs a linear map to project brain activity to the corresponding CLAP features. This enables us to predict and retrieve the musical stimuli most similar to those that originated the fMRI data. Our results demonstrate state-of-the-art identification accuracy, with our methods significantly outperforming existing approaches. Our findings suggest that neural-based music retrieval systems could enable personalized recommendations and therapeutic applications. Future work could use higher temporal resolution neuroimaging and generative models to improve decoding accuracy and explore the neural underpinnings of music perception and emotion.

1 Introduction

Music universally permeates cultures, exerting a profound influence on the lives of those who perceive its harmonies and rhythms. Despite its pervasive role, the intricacies of how music impacts the human brain remain enigmatic. Music engages complex neurological pathways, triggering diverse emotional responses, evoking vivid episodic memories, and even interacting with various neurological disorders. These interactions suggest a deep and multifaceted relationship between music and brain function, warranting extensive scientific exploration (Margulis et al., 2019). This study investigates the extent to which music can be decoded from human brain activity measured with functional MRI (fMRI).

Historically, the study of how the brain interprets and processes music has been a topic of classical inquiry within neuroscience (Raglio et al., 2019). However, recent advancements have revolutionized this field, making it practicable to use AI to explore and decode brain patterns related to a wide set of stimuli (Oota et al., 2023). In this context, the emergence of extensive datasets coupled with robust, pre-trained computational models presents an unprecedented opportunity. These tools enable us to construct detailed mappings between neural data and the latent, compact representations of external stimuli, such as images (Ferrante et al., 2023c, a; Ozcelik and VanRullen, 2023; Chen et al., 2022; Scotti et al., 2023), videos (Chen et al., 2023), language (Antonello et al., 2023; Défossez et al., 2023; Tang et al., ), and notably, music (Denk et al., 2023). These works propose several retrieval as well as generative pipelines to create a map between neural data and latent representations of external stimuli. The neural data are primarily measured via functional magnetic resonance imaging (fMRI), magnetoencephalography (MEG), or electroencephalography (EEG), and the latent representations are commonly obtained from large pretrained models. The estimated latent representations are further used for stimulus retrieval or for conditioning a generative model, e.g. to generate images in vision decoding. Typically, these pipelines involve linear mappings between these two spaces (brain and latent representations of stimuli) and require subject-specific models, although some approaches to multisubject brain representations or alignment and nonlinear mappings exist (Ferrante et al., 2023b; Benchetrit et al., 2023; Scotti et al., 2024).

Figure 1: Overview of our pipeline. Top pane: In the GTZan fMRI experiment, five participants were exposed to auditory stimuli that included multiple musical tracks while their brain activity was monitored via functional MRI. This setup captures the direct neural response to complex auditory inputs. Middle pane: our encoding pipeline. Starting from the music stimulus, we first obtain its latent representation using the CLAP model. Subsequently, we develop voxel-wise encoding models that map this latent representation to the brain’s response to the stimulus. A threshold is then applied to the voxel-wise correlation between real and predicted brain activities to identify brain regions whose activity allows the best decoding of musical stimuli. These regions are taken as the music-responsive regions of interest (ROIs). Bottom pane: our decoding pipeline, which is primarily retrieval-based. We train a model that takes brain activity from the previously identified ROIs as input and predicts the corresponding CLAP features. Using these features, we then search within the CLAP latent space for the closest musical stimuli, selecting the k nearest stimuli (k=5) as our retrieved samples.

Understanding these complex relationships is both fascinating and informative, potentially offering insights into fundamental brain functions. For example, understanding the connection between music perception and neural responses could unlock novel avenues for diagnosing and treating neurological disorders. Moreover, it could enhance music therapy approaches, potentially leading to innovative treatments that harness the therapeutic properties of music (Kamioka et al., 2014; de Witte et al., 2022).

In this work, we aim to decode music from brain activity, a process that involves translating the neural signals evoked by music into a comprehensible format. This objective challenges us to retrieve complex auditory information encoded within the brain’s activity. In the case of fMRI, the primary challenge lies in decoding a stimulus whose frequency content is inherently much higher than that of the measured neural signal, a problem further confounded by local variations of the haemodynamic response function (HRF) across the brain. Additional limitations include the constraints posed by small datasets, which typically comprise few subjects with intrinsic between-subject anatomical and functional differences.

To address these challenges, we first constructed encoding models to identify brain regions responsive to musical stimuli. We then aggregated brain activity across subjects to facilitate a cross-subject decoding approach. This included aligning functional brain data and mapping the identified regions’ activity to the latent representations of music stimuli. These representations were derived using an open-source, multimodal pre-trained foundation model known as Contrastive Language-Audio Pretraining (CLAP) (Elizalde et al., 2022). In the final stages of our study, we compared the representations of music estimated from brain data with their true counterparts, employing a selection criterion that identified the five closest matching representations as potential candidates for accurate decoding.

The studies most closely related to our research include Bellier et al. (2023) and Denk et al. (2023). Bellier et al. (2023) demonstrate that time-frequency decompositions can be effective representations for this type of task, and that they can be performed using both linear and nonlinear approaches to decode the auditory experience using invasive iEEG data.

Another pivotal study, Denk et al. (2023), shares similarities with our approach in that it addresses the challenges of retrieval-based as well as generative music decoding using the same fMRI dataset we employ here. However, unlike our methodology, Denk et al. (2023) use subject-specific decoding pipelines based on anatomical atlases and proprietary models like MuLan and MusicLM (Agostinelli et al., 2023; Huang et al., 2022).

In this paper, we advance the state of the art by designing a streamlined pipeline that leverages open-source models. Our approach begins by identifying brain regions whose activity can be reliably modelled using latent representations of audio stimuli. Subsequently, we use the brain activity from these regions to construct cross-subject decoding pipelines. Figure 1 depicts our pipeline. In our work, we aspire to refine our understanding of how music is processed within the brain and to lay the groundwork for future explorations into the therapeutic potential of music in neurological settings.

2 Material and Methods

In this section, we describe the proposed method and the data we used. The data are publicly available at https://openneuro.org/datasets/ds003720/versions/1.0.1. All models were trained and all experiments were run on a server equipped with four NVIDIA A100 GPUs (80 GB of memory each, connected through NVLink) and 2 TB of system RAM. Throughout this paper, we use the terms "fMRI data", "brain activity", "neural activity" and "neural representations" interchangeably. These terms all stand for the fMRI signal averaged over the time points related to a specific stimulus, i.e. a 3D map. Additionally, the terms "musical features" and "musical representations" always refer to the embedding of musical stimuli generated by the CLAP model. Code is available at this repository: https://github.com/neoayanami/fmri-music-retrieve.

2.1 Data

The GTZan fMRI dataset (Nakai et al., 2023) comprises functional magnetic resonance imaging (fMRI) data collected from five subjects ("sub-001" to "sub-005") while they listened to music stimuli drawn from 10 distinct genres. The experimental protocol included 18 fMRI acquisitions (i.e. "runs") per subject, consisting of 12 training runs and 6 test runs. Each run is also associated with detailed information about each stimulus, including onset time, genre type, track name, and start and end times of excerpts from the original music stimuli. All stimuli have a duration of 15 seconds, including 2 seconds of fade-in and fade-out (a total of 4 seconds). The data are provided in intensity normalized form, i.e. after root mean square (RMS) normalization. In the test run ensemble, each musical stimulus was administered four times and the brain activity averaged across identical stimuli. Data averaging improves the signal-to-noise ratio (SNR) and enhances the detection of consistent neural responses associated with the stimulus under investigation.

After motion correction, we co-registered the fMRI data to the Montreal Neurological Institute (MNI) standard space using a T1w anatomical image as reference for each subject, and applied detrending and standardization at the run level. The final step involved "delaying" the brain activity by 3 Repetition Times (TR) (i.e. 4.5 s) in order to account for the peak of the hemodynamic response, and averaging the following 15 seconds to obtain a neural representation for each musical stimulus. Our final dataset is therefore composed of a total of 540 stimulus-fMRI pairs for each subject, divided into 480/60 train/test, as defined by the authors of the dataset. We used FSL (Jenkinson et al., 2012) for co-registration and the Nilearn Python library (Abraham et al., 2014) to perform all other preprocessing steps.
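
The delay-and-average step can be illustrated with a minimal NumPy sketch. It is not taken from the released code: the function name, the toy run and the onset times are hypothetical, and the TR of 1.5 s is inferred from the statement that 3 TR correspond to 4.5 s.

```python
import numpy as np

TR = 1.5            # seconds per volume, inferred from "3 TR = 4.5 s"
DELAY_TR = 3        # delay accounting for the hemodynamic peak, in volumes
STIM_DURATION = 15  # seconds of music per stimulus

def stimulus_responses(run, onsets_s):
    """Average the delayed fMRI volumes belonging to each stimulus.

    run      : array (n_volumes, n_voxels), already detrended and standardized
    onsets_s : stimulus onset times in seconds (hypothetical input format)
    returns  : array (n_stimuli, n_voxels), one neural representation per stimulus
    """
    n_vols = int(round(STIM_DURATION / TR))  # 10 volumes per stimulus
    out = []
    for onset in onsets_s:
        start = int(round(onset / TR)) + DELAY_TR
        out.append(run[start:start + n_vols].mean(axis=0))
    return np.stack(out)

# Toy run: 40 volumes, 4 voxels, each voxel value equals the volume index.
run = np.tile(np.arange(40, dtype=float)[:, None], (1, 4))
responses = stimulus_responses(run, onsets_s=[0.0, 15.0])
```

In this toy run the first stimulus averages volumes 3 to 12 and the second averages volumes 13 to 22, so the window of one stimulus ends where the window of the next begins.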

2.2 Functional Alignment

To address the inherent variability in brain structure/function across different individuals, we explored three distinct methodologies for aggregating cross-subject data. These techniques aim to enhance the robustness and accuracy of decoding models by aligning and integrating neural data from multiple subjects. Each method offers a unique approach to the challenge of intersubject variability, a common hurdle in neuroimaging studies.

The first method we implemented was anatomical alignment, which uses standard brain atlases to align brain imaging data from different subjects based on their anatomical landmarks. By mapping each subject’s data to a common anatomical space, we can directly compare and combine data across individuals, despite differences in brain size, shape, or orientation. This method is widely used in neuroimaging as it facilitates the direct comparison of localized brain activity across subjects.

Moving beyond mere anatomical correspondence, our second method, functional alignment, aligns brain activity based on functional data. This technique involves matching brain regions that exhibit similar activity patterns during specific tasks or stimuli across different subjects. Unlike anatomical alignment, functional alignment accounts for individual variations in brain function topology that may not align with variations in physical brain structures, making it particularly advantageous for studies where functional responses to complex stimuli are the primary focus. To this end, we leveraged the "hyperalignment" strategy proposed by Haxby et al. (2011) based on Procrustes analysis.
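
The core operation behind Procrustes-based alignment can be sketched as follows. This is only the pairwise orthogonal Procrustes step, assuming that one subject is aligned to a reference subject; the full hyperalignment procedure of Haxby et al. (2011) builds a common template across subjects, which is not reproduced here. All array sizes and names are hypothetical.

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal matrix R minimizing the Frobenius norm of source @ R - target."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)
# Toy data: 480 stimuli x 20 voxels for a reference subject.
reference = rng.standard_normal((480, 20))
# A second subject with the same responses expressed in a rotated voxel basis.
true_rotation, _ = np.linalg.qr(rng.standard_normal((20, 20)))
subject = reference @ true_rotation.T

R = procrustes_rotation(subject, reference)
aligned = subject @ R  # subject data expressed in the reference space
```

Because the transformation is constrained to be orthogonal, it preserves distances between response patterns within a subject, which is one reason this family of methods is attractive for alignment.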

Lastly, given recent literature (Ferrante et al., 2023b; Défossez et al., 2023; Benchetrit et al., 2023) which demonstrated that linear layers are a useful tool to align neural representations into a common space, we employed ridge regression to aggregate cross-subject brain data. This approach applies regularization to address multicollinearity in high-dimensional datasets, which is typical of fMRI data. By introducing a penalty term, ridge regression combines voxel-wise data from different subjects into a unified model while enhancing the stability and generalizability of our predictions. Each of these methods was tested for its potential to improve the accuracy of our decoding models, with the goal of establishing a reliable approach to interpreting complex brain data in a multi-subject context.
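
A minimal sketch of a ridge-based alignment is given below. It assumes that each subject's voxel responses are mapped onto those of a reference subject; the text does not specify the target space, so this choice, the regularization strength and the toy data are illustrative assumptions only.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights W minimizing ||X W - Y||^2 + alpha * ||W||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

rng = np.random.default_rng(0)
# Toy data: 480 training stimuli x 30 voxels for the reference subject.
reference = rng.standard_normal((480, 30))
# A second subject whose voxels are a noisy linear mixture of the reference voxels.
mixing = np.eye(30) + 0.05 * rng.standard_normal((30, 30))
subject = reference @ mixing + 0.01 * rng.standard_normal((480, 30))

W = ridge_fit(subject, reference, alpha=1.0)  # alpha chosen arbitrarily here
aligned = subject @ W
corr = np.corrcoef(aligned.ravel(), reference.ravel())[0, 1]
```

Unlike the Procrustes solution, the ridge map is not constrained to be orthogonal, so it can rescale and mix voxels freely; the penalty term keeps the solution stable when voxels are strongly correlated.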

2.3 Music Feature Extraction

Our brain engages with music in intricate, non-linear ways, forming representations that support our cognitive processes. This complexity suggests that a multimodal pre-trained model like CLAP (Elizalde et al., 2022) may mimic some aspects of how our brains process music. Under this hypothesis, CLAP can transform musical stimuli into a vector representation that may be topologically similar to the brain’s representations, allowing the identification of a simple mapping between the latent representations generated by CLAP and those generated by the human brain.

CLAP is a multimodal neural network designed for contrastive learning in the realm of audio and text processing. It is trained on a diverse set of audio and text pairs, learning to align text and audio latent representations. The model employs the Swin Transformer (Liu et al., 2021) to extract audio features from log-Mel representations and the RoBERTa model (Liu et al., 2019) to extract text representations, both projected into a shared latent space of identical dimensionality. The similarity between audio and text features is measured using cosine similarity.
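
For reference, the cosine similarity between two sets of embeddings in a shared space can be computed as in the sketch below. The two-dimensional vectors are toy values chosen for readability and are not CLAP outputs.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy "audio" and "text" embeddings in a shared 2D space.
audio = np.array([[1.0, 0.0], [0.0, 2.0]])
text = np.array([[3.0, 0.0], [1.0, 1.0]])
sim = cosine_similarity_matrix(audio, text)
```

Since each vector is normalized first, the result depends only on the direction of the embeddings and not on their length.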

Figure 2 shows the results of using t-Distributed Stochastic Neighbor Embedding (t-SNE, van der Maaten and Hinton (2008)) to create a 2D visualization of the true music features overlaid with genre labels, offering a qualitative understanding of how the CLAP model’s representations are able to separate different genres.

2.4 Encoding Models

The primary goal of this part of our study was to identify brain regions responsive to musical stimuli by constructing voxel-wise encoding models. These models map the latent representations of musical stimuli onto voxel-wise brain activity. To assess the efficacy of each voxel’s model, we employed a cross-validation scheme, wherein the correlation between the predicted and real brain activities of each voxel was measured.

Model training incorporated a hyperparameter search for the regularization parameter $\alpha$, with values explored on a logarithmic scale from $10^{-2}$ to $10^{3}$. After training, we retained the voxels whose correlation between predicted and real activity exceeded 0.1. This threshold was chosen empirically during preliminary explorations and was used to generate a mask of the brain regions showing higher responsiveness to musical stimuli.
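
The voxel selection procedure can be sketched as follows, under several assumptions that are not stated in the text: a ridge penalty as the regularizer, six grid points for $\alpha$, five cross-validation folds, and selection of the best $\alpha$ per voxel directly on the cross-validated score (which is slightly optimistic). The toy data contain five simulated responsive voxels followed by five pure-noise voxels.

```python
import numpy as np

ALPHAS = np.logspace(-2, 3, 6)  # 10^-2 ... 10^3 as in the text; 6 points is an assumption
THRESHOLD = 0.1                 # correlation threshold from the text

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights for all voxels at once."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

def columnwise_corr(a, b):
    """Pearson correlation between matching columns of a and b."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))

def encoding_scores(features, voxels, n_folds=5):
    """Cross-validated correlation per voxel, keeping the best alpha for each voxel."""
    indices = np.arange(len(features))
    scores = np.zeros((len(ALPHAS), voxels.shape[1]))
    for test_idx in np.array_split(indices, n_folds):
        train_idx = np.setdiff1d(indices, test_idx)
        for a, alpha in enumerate(ALPHAS):
            W = ridge_fit(features[train_idx], voxels[train_idx], alpha)
            predicted = features[test_idx] @ W
            scores[a] += columnwise_corr(predicted, voxels[test_idx]) / n_folds
    return scores.max(axis=0)

rng = np.random.default_rng(0)
features = rng.standard_normal((400, 8))  # stand-in for stimulus embeddings
responsive = features @ rng.standard_normal((8, 5)) + 0.5 * rng.standard_normal((400, 5))
unresponsive = rng.standard_normal((400, 5))
voxels = np.hstack([responsive, unresponsive])

scores = encoding_scores(features, voxels)
mask = scores > THRESHOLD  # boolean mask of "music-responsive" voxels
```

With few samples, a noise voxel can occasionally cross a low threshold by chance, so a mask obtained this way is best read as an empirical selection rather than a statistical test.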

2.5 Decoding Model

Following the identification of brain regions responsive to music, our next objective was to construct a common model that maps the brain activity from these regions to the latent representations of musical features. In such a model, the brain’s response serves as a proxy for the music itself, which also illustrates a direct link between neural activity and musical perception. To this end, we trained a Ridge regression with hyperparameter optimization that maps the aligned brain activity of all subjects in "music-responsive" brain regions to the corresponding CLAP features. We then focused on the retrieval process within the test set. For each predicted musical feature, we selected the top-k closest elements based on the lowest L2 (Euclidean) distance between predicted and true musical features in CLAP space. This approach forms the basis of a straightforward retrieval pipeline, where the model searches for and retrieves the most similar musical stimuli from the latent space, based on the neural activity they elicited.
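
The retrieval step can be sketched in a few lines. The embeddings below are random stand-ins with the dimensionality of the CLAP features (512) and the size of the test set (60); the "predicted" features are simulated as noisy copies of the true ones rather than produced by a decoder.

```python
import numpy as np

def retrieve_top_k(predicted, candidates, k=5):
    """Indices of the k candidates closest (L2 distance) to each predicted embedding."""
    diffs = predicted[:, None, :] - candidates[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)      # shape (n_predicted, n_candidates)
    return np.argsort(dists, axis=1)[:, :k]

rng = np.random.default_rng(0)
true_features = rng.standard_normal((60, 512))                   # stand-in for true embeddings
predicted = true_features + 0.1 * rng.standard_normal((60, 512)) # simulated decoder output

top5 = retrieve_top_k(predicted, true_features, k=5)
hit_rate = np.mean([i in top5[i] for i in range(len(top5))])
```

Here `hit_rate` is the fraction of test items whose own stimulus appears among the five retrieved candidates, a simple summary that complements the identification accuracy.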

2.6 Evaluation

In our study, we measured the identification accuracy as described in the Brain2Music framework (Denk et al., 2023). Identification accuracy quantifies how accurately the predicted $d$-dimensional features correspond to the target features by computing the Pearson correlation coefficient between each pair of predicted and target features. In our case, the features are the estimated and true CLAP features (last layer, dimensionality 512). For each prediction, the accuracy is the proportion of other targets whose correlation with the prediction is lower than that of the correct target. In detail, the metric is calculated as follows: first, construct a correlation matrix between the predicted and true embeddings. Each element of this matrix, $C_{i,j}$, represents the Pearson correlation coefficient between the $i$-th predicted embedding and the $j$-th target embedding. For each predicted embedding, compare the correlation with its corresponding target (diagonal element $C_{i,i}$) against the correlation with each of the other targets (non-diagonal elements $C_{i,j}$ for $j\neq i$). The identification accuracy for each prediction is then calculated using an indicator function:

$$\text{id\_acc}_{i}=\frac{1}{n-1}\sum_{j=1}^{n}1\left[C_{i,i}>C_{i,j}\right]$$

where $1[\cdot]$ is the indicator function that returns 1 if the condition is true and 0 otherwise. The term with $j=i$ always contributes 0, because $C_{i,i}>C_{i,i}$ is false, so the self-comparison is effectively excluded and the sum is normalized by the $n-1$ remaining comparisons. The overall identification accuracy is the average across all predictions:

$$\text{id\_acc}=\frac{1}{n}\sum_{i=1}^{n}\text{id\_acc}_{i}$$

Identification accuracy is especially useful in scenarios where the data may lead to ambiguous interpretations, requiring robust model performance to correctly identify the underlying condition or stimulus. An intuitive explanation, adapted to our case from Denk et al. (2023), is the following. Consider a model that achieves an identification accuracy of 90%. This implies that, on average, 10% of the comparisons are lost, i.e. another candidate (not the correct "target") yields a higher correlation coefficient than the correct candidate. In a dataset containing 60 examples, roughly six of the 59 other music stimuli (10% of 59 is about 5.9) would therefore be mistakenly rated as more likely candidates than the correct one, which would be ranked around seventh on average.
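
The metric defined above translates directly into a few lines of NumPy. The sketch below follows the two formulas term by term; the function name and the toy embeddings are illustrative and are not taken from the released code.

```python
import numpy as np

def identification_accuracy(predicted, target):
    """Mean over predictions of the fraction of other targets that the correct target beats."""
    n = len(predicted)
    # C[i, j] = Pearson correlation between the i-th prediction and the j-th target.
    C = np.corrcoef(predicted, target)[:n, n:]
    # The comparison with j = i is always False, so it adds 0 to each row sum.
    wins = (np.diag(C)[:, None] > C).sum(axis=1)
    return (wins / (n - 1)).mean()

rng = np.random.default_rng(0)
target = rng.standard_normal((6, 512))  # toy stand-ins for true embeddings

perfect = identification_accuracy(target, target)    # every prediction matches its target
inverted = identification_accuracy(-target, target)  # every prediction is anti-correlated
```

The two toy cases bracket the range of the metric: predictions identical to their targets give the maximum value of 1, while sign-flipped predictions give the minimum value of 0. Uninformative predictions would fall around 0.5.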

For demonstration purposes, we provide qualitative examples of decoded music. These examples can be accessed at the provided URL https://mind2music.my.canva.site/decoding-music-from-brain-activity-exploring-the-neural-correlates-of-music-perception, where listeners can directly experience the output of our decoding process, offering an auditory validation of the model’s performance.

3 Results

This study examined the effectiveness of various embedding models and functional alignment strategies in identifying and classifying musical genres based on brain activity data. The results highlight significant advancements in genre classification accuracy and provide insights into the spatial distribution of musically responsive brain regions.

3.1 Encoding Models and Delineation of brain areas responsive to music

By setting a threshold of 0.1 (see methods), the encoding models identified 833 voxels in total. This threshold was empirically determined to optimize the balance between sensitivity and specificity in our voxel selection procedure. Figure 3 shows the distribution of the relevant voxels within anatomical brain space, which appear to co-localize within lateral and temporal regions.

3.2 Identification Accuracy

As shown in Table 1, our proposed methods with functional alignment techniques, denoted linear and hyperalign, demonstrated superior performance with identification accuracies of 0.9012 ± 0.01573 and 0.8805 ± 0.0231, respectively, outperforming other baselines and the anatomical alignment method. The linear alignment method, in particular, shows the highest performance, underscoring the efficacy of our linear modelling approach to achieve cross-subject music decoding from brain activity. This is in accordance with our previous observation in vision decoding (Ferrante et al., 2023b).

Table 1: Comparison of Test Identification Accuracy

Embedding	Test Identification Accuracy
SoundStream-avg	$0.674\pm 0.016$
w2v-BERT-avg	$0.837\pm 0.005$
MuLan_text	$0.817\pm 0.014$
MuLan_music	$0.876\pm 0.015$
Ours - anatomical	$0.7746\pm 0.01551$
Ours - hyperalign	$\mathbf{0.8805\pm 0.0231}$
Ours - linear	$\mathbf{0.9012\pm 0.01573}$

3.3 Genre Decoding

The confusion matrix shown in Figure 4 illustrates the model’s capability to classify musical genres based on brain activity, with a notable concentration of correct predictions along the diagonal. Classical and jazz genres showed high accuracy with minimal confusion, suggesting that they correspond to distinct neural representations. However, genres like metal and disco exhibited more confusion, potentially indicating less separability in the CLAP space. For example, the confusion between disco and metal may arise from similar rhythmic patterns or instrumentation that blur genre-specific boundaries in neural encoding. Figure 5 shows the similarity between the retrieved music and the original stimulus, using time-frequency representations as visual aids. The exact stimulus is very often found within the retrieved set, which underlines the effectiveness of the pipeline. Because of feature overlap, the retrieved set commonly contains stimuli from genres other than that of the original stimulus, although always from genres that share acoustic patterns with it.

3.4 Impact of Functional Alignment techniques

Functional alignment significantly enhanced the identification accuracy compared to baselines that did not make use of alignment. This improvement indicates that aligning functional brain data across subjects, while preserving individual differences in brain anatomy, allows for more accurate generalization in decoding music genres from brain activity than single-subject modelling. The technique effectively harnesses shared information across different subjects, thereby boosting the overall model’s performance (Denk et al., 2023).

Compared to existing studies, such as those using MuLan or SoundStream embeddings (Huang et al., 2022; Denk et al., 2023), our method achieves high performance in music track retrieval and genre classification. Previous studies often did not account as effectively for individual variations in brain anatomy and function, which our hyperalignment and linear methods address directly.

The results from this study not only reinforce the utility of advanced machine learning techniques in neuroscience but also pave the way for more personalized and accurate interpretations of brain activity in response to complex stimuli like music. Future work could explore deeper neural network architectures or alternative machine learning models that might further refine the accuracy of musical genre classification from brain imaging data.

3.5 Decoding in Time

In our main experiment, we averaged the 15 seconds of fMRI data for each musical stimulus. A further research question is at which point after stimulus onset music decoding performance peaks. To address this question, we evaluated the neural responses contained in each fMRI volume. This analysis relies on the same procedures as described above; however, instead of using brain activity averaged over 15 s as input for the decoding model, instantaneous (i.e. sample-wise) brain activity is used, resulting in a decoding-in-time representation (Figure 6). By identifying the samples/time delays at which the identification accuracy is highest, this approach illustrates the specific temporal dynamics underlying music perception within the brain.
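
One way to organize such a sample-wise analysis is sketched below, assuming that a separate ridge decoder is fitted for each volume after stimulus onset. The simulated data, the response profile `gain`, the array sizes and the regularization strength are all hypothetical and serve only to show the structure of the loop; they do not reproduce the results in Figure 6.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights mapping X to Y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

def identification_accuracy(predicted, target):
    """Identification accuracy as defined in the Evaluation section."""
    n = len(predicted)
    C = np.corrcoef(predicted, target)[:n, n:]
    return ((np.diag(C)[:, None] > C).sum(axis=1) / (n - 1)).mean()

def accuracy_per_volume(train_vols, train_feats, test_vols, test_feats, alpha=1.0):
    """Fit one decoder per volume index and return its test identification accuracy.

    train_vols, test_vols: arrays of shape (n_stimuli, n_volumes, n_voxels)
    """
    accs = []
    for lag in range(train_vols.shape[1]):
        W = ridge_fit(train_vols[:, lag], train_feats, alpha)
        accs.append(identification_accuracy(test_vols[:, lag] @ W, test_feats))
    return np.array(accs)

# Simulated data: the stimulus-related signal is strongest at volume index 3.
rng = np.random.default_rng(0)
n_train, n_test, n_vox, n_dim = 200, 40, 30, 16
gain = np.array([0.0, 0.2, 0.6, 1.0, 0.6, 0.2, 0.0, 0.0])  # hypothetical response profile
feats = rng.standard_normal((n_train + n_test, n_dim))
signal = feats @ rng.standard_normal((n_dim, n_vox))
noise = 2.0 * rng.standard_normal((n_train + n_test, len(gain), n_vox))
vols = gain[None, :, None] * signal[:, None, :] + noise

accs = accuracy_per_volume(vols[:n_train], feats[:n_train], vols[n_train:], feats[n_train:])
best_lag = int(accs.argmax())
```

In this simulation the accuracy stays near chance at volumes without signal and rises around the simulated peak, which is the kind of curve a decoding-in-time analysis is meant to expose.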

4 Discussion

The findings of this study provide compelling evidence that decoding music from cross-subject neural activity is not only feasible, but also remarkably accurate when appropriate computational approaches and neural data alignment techniques are employed and adapted. This opens up numerous possibilities for understanding the cognitive processing of music and its applications, ranging from therapeutic practices to advanced brain-computer interfaces.

4.1 Implications of Music Decoding

The successful decoding of music genres from brain activity suggests profound implications for cognitive neuroscience and psychological studies. By associating specific genres with distinct patterns of brain activation, researchers can further explore how these patterns correlate with cognitive functions, emotional states, and individual preferences. This understanding could eventually lead to personalized music interventions designed to manage various psychological conditions such as anxiety, depression, and stress. Further refinement of this process could lead to neural-guided recommendation systems, allowing individuals to receive personalized music suggestions based on neural similarities with music stimuli they enjoy or those that evoke specific emotions.

4.2 Performance on Genre Decoding

Our analysis achieved results in line with Nakai et al. (2022), further showing that certain genres like classical and jazz are more distinctly encoded in the brain, possibly due to their unique structural and rhythmic complexities which might engage specific neural pathways. However, the confusion between closely related genres like rock and metal highlights the challenges of distinguishing between potentially similar auditory stimuli and suggests a need for more refined modelling techniques that can capture subtle nuances in music perception.

4.3 Identification of music-related brain regions

Our results identified key brain regions involved in music perception and processing. Specifically, we identified the superior temporal gyrus (STG) (Yoo et al., 2016), primary auditory cortex (Warren, 2008), planum temporale (Warren and Griffiths, 2003), and potentially the inferior parietal lobule (Yoo et al., 2016). These areas are essential for decoding various aspects of auditory and musical stimuli, contributing to our ability to perceive and appreciate music. The STG, which includes the primary auditory cortex, is crucial for processing auditory information such as pitch, rhythm, and timbre. The primary auditory cortex plays a fundamental role in detecting and discriminating sound frequencies, allowing us to discern different notes and rhythms in music (Warren, 2008). This region’s function is vital for understanding melodies and the basic structural components of music.

Adjacent to the primary auditory cortex is the planum temporale, a region involved in higher-order auditory processing (Warren and Griffiths, 2003). The planum temporale is asymmetrically larger in the left hemisphere, a feature associated with language dominance, but it also plays a significant role in music processing (Warren and Griffiths, 2003). This area is crucial for discerning complex auditory patterns and structures, such as harmonies and musical sequences. The ability of the planum temporale to process these intricate auditory stimuli contributes to our cognitive understanding of music and its structural components.

In addition to the STG and planum temporale, the inferior parietal lobule is implicated in the integration of sensory information from various modalities (Pando-Naude et al., 2021). This region contributes to spatial awareness of sounds, which is important for perceiving the spatial dynamics of music, such as the localization of instruments within a stereo field. The inferior parietal lobule also plays a role in attention and the processing of rhythmic elements, enhancing our ability to perceive musical tempo and timing (Pando-Naude et al., 2021). This integrative function is essential for experiencing music as a coherent and dynamic auditory event.

Together, these regions form a network that facilitates different aspects of music perception. The superior temporal gyrus and primary auditory cortex are central to decoding the basic auditory properties of music (Yoo et al., 2016; Warren, 2008), while the planum temporale supports higher-order processing and pattern recognition. The inferior parietal lobule’s involvement in sensory integration and attention further enriches our ability to experience and appreciate the spatial and temporal dimensions of music. These interconnected brain regions work in concert to provide a comprehensive and nuanced understanding of music, enabling listeners to engage fully with its emotional and aesthetic qualities.

4.4 Impact on Musical Therapy

This research has potentially significant applications in the field of music therapy. A better understanding of the neural underpinnings of how music influences emotion and cognition can aid in developing more effective therapeutic protocols. As highlighted by Raglio et al. (2016, 2019), music therapy has been shown to have beneficial effects on various patient outcomes. While still in its early stages, genre-specific neural decoding could tailor these therapies to individual needs, enhancing their effectiveness. Music therapy has been utilized in various clinical settings, demonstrating positive outcomes in patients with conditions such as Alzheimer’s disease, stroke, and depression (Kamioka et al., 2014; de Witte et al., 2022). By decoding how different genres affect brain activity, therapists could potentially customize music interventions that align more closely with the neural and emotional states of individual patients. This personalized approach could maximize therapeutic benefits by targeting specific neural circuits involved in emotional regulation and cognitive function.

Moreover, further research into the relationship between music and neural responses could contribute to the development of innovative treatment modalities. For instance, integrating neurofeedback mechanisms that respond to real-time neural data could enable dynamic adjustments in musical stimuli, optimizing therapeutic outcomes. This approach could be particularly effective in managing chronic pain, stress, and anxiety, where music’s role in altering brain states can be leveraged for long-term health benefits (Koelsch, 2011, 2014; Koelsch et al., 2006).

Understanding the specific neural mechanisms involved in music perception and emotional processing also provides insights into broader applications in cognitive neuroscience. For example, exploring how music can enhance cognitive rehabilitation in post-stroke patients or improve social communication skills in individuals with autism spectrum disorder represents a promising research avenue. The ability to decode and harness the power of music at a neural level opens up new possibilities for both clinical practice and scientific inquiry into the profound effects of music on the human brain (Nakai et al., 2021, 2022).

4.5 Deeper Investigation of Music and Emotions

Further research could benefit from exploring the intricate connections between music and emotions, a relationship well documented by Koelsch et al. (2006) and Koelsch (2011, 2014). By decoding the emotional content of music from brain activity, researchers could gain insights into emotional processing in the brain, providing a clearer picture of the emotional impact of music at a neurological level. Looking further ahead, this type of research could serve as the foundation for a neural recommendation system. Such a system could potentially offer personalized music suggestions based on a listener’s emotional and neural state, or even suggest musical stimuli that guide the listener toward new emotional experiences.

4.6 Extension to Generative Music

Looking forward, the decoding techniques used in this study could be extended to generative music systems, potentially leading to innovative applications in creating music from brain activity, including musical imagery.

At the time of writing, the primary reason we focus on retrieval rather than generation is the low temporal resolution of fMRI acquisition. This limitation constrains the possibility of generating music online from neural dynamics, although this might be achievable with other measures of neural activity such as iEEG or MEG. A particularly intriguing prospect is to replace the retrieval module with a generative stage, especially by combining music decoding with imagery. Imagine an artist entering the scanner and imagining a music track to be decoded through this process. The resulting piece could be seen as a collaborative creation between the artist’s imagination and artificial intelligence, potentially giving rise to a new art form in which learned musical priors are transformed and used by neural decoding models to produce unique artistic expressions. Such systems would not only deepen our understanding of the creative processes that underpin music generation but also open the door to innovative forms of artistic expression that are directly influenced by neural dynamics.
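
As an illustration of the retrieval stage that a generative module would replace, the following is a minimal sketch of a linear map from brain activity to stimulus embeddings followed by cosine-similarity retrieval. It is not the implementation used in this study: the data are synthetic stand-ins for fMRI responses and CLAP features, closed-form ridge regression is one assumed choice of linear map, and all names (`fit_ridge`, `retrieve`, the array sizes, `alpha`) are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


def retrieve(pred, candidates):
    """Return, for each predicted embedding, the index of the most
    cosine-similar candidate embedding."""
    pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    cand_n = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argmax(pred_n @ cand_n.T, axis=1)


# Synthetic stand-ins (assumption): random "embeddings" and noisy linear
# "brain responses" generated from them. Real data would replace these.
rng = np.random.default_rng(0)
n_train, n_test, n_voxels, n_dims = 200, 20, 50, 16
true_map = rng.normal(size=(n_dims, n_voxels))
emb_train = rng.normal(size=(n_train, n_dims))
emb_test = rng.normal(size=(n_test, n_dims))
fmri_train = emb_train @ true_map + 0.1 * rng.normal(size=(n_train, n_voxels))
fmri_test = emb_test @ true_map + 0.1 * rng.normal(size=(n_test, n_voxels))

# Fit the brain-to-embedding map on training data, then retrieve test stimuli.
W = fit_ridge(fmri_train, emb_train, alpha=1.0)
retrieved = retrieve(fmri_test @ W, emb_test)

# Top-1 identification accuracy: fraction of test trials whose retrieved
# candidate is the stimulus that actually produced the response.
accuracy = float(np.mean(retrieved == np.arange(n_test)))
print(f"top-1 identification accuracy: {accuracy:.2f}")
```

In such a sketch, swapping retrieval for generation would amount to passing the predicted embeddings `fmri_test @ W` to a conditional generative model instead of to `retrieve`.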

4.7 Limitations

Despite these advancements, several limitations remain. The neural signals used in this study are inherently noisy and are only a subsampled representation of brain activity, which limits the detail and accuracy of the music that can be reconstructed. Rhythmic elements, particularly those at fine temporal resolutions, remain challenging to decode accurately due to the limitations in the temporal resolution of fMRI technology. Moreover, the extensive scanning time required for collecting sufficient data is a practical limitation that could restrict the use of these techniques in everyday applications.

4.8 Future Work

Future research could explore the use of alternative neuroimaging methods, such as electroencephalography (EEG) or intracranial EEG (iEEG), which offer higher temporal resolution and could potentially provide more detailed insights into the neural encoding of music. Additionally, the development of more sophisticated generative models that can better handle the complexity and variability of neural data represents a promising direction for both academic research and practical applications in neuromusicology.

5 Conclusion

This study demonstrates high identification accuracy in decoding music from cross-subject neural activity using a streamlined retrieval pipeline, setting a new benchmark in neuromusicology with significant implications for therapeutic and personalized music applications.

References

Abraham et al. [2014] Alexandre Abraham, Fabian Pedregosa, Michael Eickenberg, Philippe Gervais, Andreas Mueller, Jean Kossaifi, Alexandre Gramfort, Bertrand Thirion, and Gael Varoquaux. Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 8, 2014. ISSN 1662-5196. doi: 10.3389/fninf.2014.00014. URL https://www.frontiersin.org/articles/10.3389/fninf.2014.00014.
Agostinelli et al. [2023] Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank. Musiclm: Generating music from text, 2023.
Antonello et al. [2023] Richard Antonello, Aditya Vaidya, and Alexander G. Huth. Scaling laws for language encoding models in fmri, 2023.
Bellier et al. [2023] L. Bellier, A. Llorens, D. Marciano, A. Gunduz, G. Schalk, P. Brunner, et al. Music can be reconstructed from human auditory cortex activity using nonlinear decoding models. PLoS Biology, 21(8):e3002176, 2023. doi: 10.1371/journal.pbio.3002176. URL https://doi.org/10.1371/journal.pbio.3002176.
Benchetrit et al. [2023] Yohann Benchetrit, Hubert Banville, and Jean-Rémi King. Brain decoding: toward real-time reconstruction of visual perception, 2023.
Chen et al. [2022] Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding, 2022.
Chen et al. [2023] Zijiao Chen, Jiaxin Qing, and Juan Helen Zhou. Cinematic mindscapes: High-quality video reconstruction from brain activity, 2023.
de Witte et al. [2022] Martina de Witte, Ana da Silva Pinho, Geert-Jan Stams, Xavier Moonen, Arjan E R Bos, and Susan van Hooren. Music therapy for stress reduction: a systematic review and meta-analysis. Health Psychol. Rev., 16(1):134–159, March 2022.
Denk et al. [2023] Timo I. Denk, Yu Takagi, Takuya Matsuyama, Andrea Agostinelli, Tomoya Nakai, Christian Frank, and Shinji Nishimoto. Brain2music: Reconstructing music from human brain activity, 2023.
Défossez et al. [2023] A. Défossez, C. Caucheteux, J. Rapin, et al. Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence, 5:1097–1107, 2023. doi: 10.1038/s42256-023-00714-5. URL https://doi.org/10.1038/s42256-023-00714-5.
Elizalde et al. [2022] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap: Learning audio concepts from natural language supervision, 2022.
Ferrante et al. [2023a] Matteo Ferrante, Tommaso Boccato, Furkan Ozcelik, Rufin VanRullen, and Nicola Toschi. Multimodal decoding of human brain activity into images and text. In UniReps: the First Workshop on Unifying Representations in Neural Models, 2023a. URL https://openreview.net/forum?id=rGCabZfV3d.
Ferrante et al. [2023b] Matteo Ferrante, Tommaso Boccato, and Nicola Toschi. Through their eyes: multi-subject brain decoding with simple alignment techniques, 2023b.
Ferrante et al. [2023c] Matteo Ferrante, Tommaso Boccato, and Nicola Toschi. Semantic brain decoding: from fmri to conceptually similar image reconstruction of visual stimuli, 2023c.
Haxby et al. [2011] James V Haxby, J Swaroop Guntupalli, Andrew C Connolly, Yaroslav O Halchenko, Bryan R Conroy, M Ida Gobbini, Michael Hanke, and Peter J Ramadge. A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72(2):404–416, October 2011.
Huang et al. [2022] Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, and Daniel P. W. Ellis. Mulan: A joint embedding of music audio and natural language, 2022.
Jenkinson et al. [2012] M. Jenkinson, C. F. Beckmann, T. E. J. Behrens, M. W. Woolrich, and S. M. Smith. Fsl. NeuroImage, 62(2):782–790, 2012. doi: 10.1016/j.neuroimage.2011.09.015. URL https://doi.org/10.1016/j.neuroimage.2011.09.015.
Kamioka et al. [2014] Hiroharu Kamioka, Kiichiro Tsutani, Minoru Yamada, Hyuntae Park, Hiroyasu Okuizumi, Koki Tsuruoka, Takuya Honda, Shinpei Okada, Sang-Jun Park, Jun Kitayuguchi, Takafumi Abe, Shuichi Handa, Takuya Oshio, and Yoshiteru Mutoh. Effectiveness of music therapy: a summary of systematic reviews based on randomized controlled trials of music interventions. Patient Prefer. Adherence, 8:727–754, May 2014.
Koelsch [2014] S. Koelsch. Brain correlates of music-evoked emotions. Nature Reviews Neuroscience, 15:170–180, 2014. doi: 10.1038/nrn3666. URL https://doi.org/10.1038/nrn3666.
Koelsch [2011] Stefan Koelsch. Toward a neural basis of music perception - a review and updated model. Front. Psychol., 2:110, June 2011.
Koelsch et al. [2006] Stefan Koelsch, Thomas Fritz, D. Yves V Cramon, Karsten Müller, and Angela D. Friederici. Investigating emotion with music: an fMRI study. Human Brain Mapping, 27(3):239–250, March 2006. ISSN 1065-9471. doi: 10.1002/hbm.20180.
Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021.
Margulis et al. [2019] Elizabeth Hellmuth Margulis, Patrick C. M. Wong, Rhimmon Simchy-Gross, and J. Devin McAuley. What the music said: narrative listening across cultures. Palgrave Communications, 5(1):146, Nov 2019. ISSN 2055-1045. doi: 10.1057/s41599-019-0363-1. URL https://doi.org/10.1057/s41599-019-0363-1.
Nakai et al. [2021] Tomoya Nakai, Naoko Koide-Majima, and Shinji Nishimoto. Correspondence of categorical and feature-based representations of music in the human brain. Brain and Behavior, 11(1):e01936, 2021. ISSN 2162-3279. doi: 10.1002/brb3.1936. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/brb3.1936.
Nakai et al. [2022] Tomoya Nakai, Naoko Koide-Majima, and Shinji Nishimoto. Music genre neuroimaging dataset. Data in Brief, 40:107675, 2022. ISSN 2352-3409. doi: 10.1016/j.dib.2021.107675. URL https://www.sciencedirect.com/science/article/pii/S2352340921009501.
Nakai et al. [2023] Tomoya Nakai, Naoko Koide-Majima, and Shinji Nishimoto. Music genre fmri dataset, 2023.
Oota et al. [2023] Subba Reddy Oota, Manish Gupta, Raju S. Bapi, Gael Jobard, Frederic Alexandre, and Xavier Hinaut. Deep neural networks and brain alignment: Brain encoding and decoding (survey), 2023.
Ozcelik and VanRullen [2023] Furkan Ozcelik and Rufin VanRullen. Brain-diffuser: Natural scene reconstruction from fmri signals using generative latent diffusion, 2023.
Pando-Naude et al. [2021] V. Pando-Naude, A. Patyczek, L. Bonetti, et al. An ale meta-analytic review of top-down and bottom-up processing of music in the brain. Scientific Reports, 11:20813, 2021. doi: 10.1038/s41598-021-00139-3.
Raglio et al. [2016] Alfredo Raglio, Caterina Galandra, Luisella Sibilla, Fabrizio Esposito, Francesca Gaeta, Francesco Di Salle, Luca Moro, Irene Carne, Stefano Bastianello, Maurizia Baldi, and Marcello Imbriani. Effects of active music therapy on the normal brain: fMRI based evidence. Brain Imaging and Behavior, 10(1):182–186, March 2016. ISSN 1931-7565. doi: 10.1007/s11682-015-9380-x.
Raglio et al. [2019] Alfredo Raglio, Enrico Oddone, Lara Morotti, Yasmin Khreiwesh, Chiara Zuddas, Jessica Brusinelli, Chiara Imbriani, and Marcello Imbriani. Music in the workplace: A narrative literature review of intervention studies. Journal of Complementary & Integrative Medicine, October 2019. ISSN 1553-3840. doi: 10.1515/jcim-2017-0046.
Scotti et al. [2023] Paul S. Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Ethan Cohen, Aidan J. Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth A. Norman, and Tanishq Mathew Abraham. Reconstructing the mind’s eye: fmri-to-image with contrastive learning and diffusion priors, 2023.
Scotti et al. [2024] Paul S. Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A. Norman, and Tanishq Mathew Abraham. Mindeye2: Shared-subject models enable fmri-to-image with 1 hour of data, 2024.
Tang et al. [2023] Jerry Tang, Amanda LeBel, Shailee Jain, and Alexander G. Huth. Semantic reconstruction of continuous language from non-invasive brain recordings. Nature Neuroscience, 26(5):858–866, 2023. ISSN 1546-1726. doi: 10.1038/s41593-023-01304-9. URL https://www.nature.com/articles/s41593-023-01304-9.
van der Maaten and Hinton [2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605, 2008. URL http://jmlr.org/papers/v9/vandermaaten08a.html.
Warren and Griffiths [2003] J D Warren and T D Griffiths. Distinct mechanisms for processing spatial sequences and pitch sequences in the human auditory brain. J. Neurosci., 23(13):5799–5804, July 2003.
Warren [2008] Jason Warren. How does the brain process music? Clin. Med., 8(1):32–36, February 2008.
Yoo et al. [2016] Hyun-Joon Yoo, Hyun Im Moon, and Sung-Bom Pyun. Amusia after right temporoparietal lobe infarction: A case report. Ann. Rehabil. Med., 40(5):933–937, October 2016.