Generative AI Empowered LiDAR Point Cloud Generation with Multimodal Transformer

Mohammad Farzanullah1, Han Zhang1, Akram Bin Sediq2, Ali Afana2 and Melike Erol-Kantarci1
Emails: {mfarz086, hzhan363, melike.erolkantarci}@uottawa.ca
{akram.bin.sediq, ali.afana}@ericsson.com
1 School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, Canada
2 Ericsson Inc., Ottawa, ON, Canada
Abstract

Integrated sensing and communications is a key enabler for 6G wireless communication systems. Multiple sensing modalities will allow the base station to build a more accurate representation of the environment, leading to context-aware communications. Widely deployed sensors such as cameras and RADAR sensors provide some environmental perception. However, they are not enough to generate precise environmental representations, especially in adverse weather conditions. LiDAR sensors, on the other hand, provide more accurate representations, but their widespread adoption is hindered by their high cost. This paper proposes a novel approach to enhance wireless communication systems by synthesizing LiDAR point clouds from images and RADAR data. Specifically, it uses a multimodal transformer architecture and pre-trained encoding models to enable accurate LiDAR generation. The proposed framework is evaluated on the DeepSense 6G dataset, which is a real-world dataset curated for context-aware wireless applications. Our results demonstrate the efficacy of the proposed approach in accurately generating LiDAR point clouds. We achieve a modified mean squared error of 10.3931. Visual examination of the images indicates that our model can successfully capture the majority of structures present in the LiDAR point cloud for diverse environments. This will enable the base stations to achieve more precise environmental sensing. By integrating LiDAR synthesis with existing sensing modalities, our method can enhance the performance of various wireless applications, including beam and blockage prediction.

Index Terms:
LiDAR generation, Multimodal transformers, Joint sensing and communications

I Introduction

Artificial Intelligence (AI) is a rapidly evolving field that encompasses various techniques and approaches that aim to enable machines to mimic human intelligence and cognitive abilities. At its core, AI involves the development of algorithms and models that enable machines to process data, learn from it, and make decisions or predictions accordingly. Recently, AI algorithms have been widely utilized for the optimization of the 5th Generation (5G) and 6th Generation (6G) wireless communication systems [1].

Future 6G systems will combine higher frequency bands from millimeter wave (mmWave) to THz, wider bandwidth, and denser antenna arrays to integrate signal sensing and communication, mutually enhancing their capabilities [2]. The communication systems can help to improve sensing capabilities. Simultaneously, the base stations are equipped with diverse modalities such as visual, sensing, and localization capabilities. This configuration paves the way for context-aware communications, which is beneficial for applications such as beam prediction, handover control, and reduced overhead for tracking channel state information.

Some existing works used visual input and RADAR input to perform wireless communication-related tasks [3, 4]. However, cameras can be less reliable in harsh weather conditions, such as snow and fog, and RADAR sensors only produce a sparser representation of the environment. Measurements generated by these sensors in sub-optimal sensing environments may not be solely relied upon for context-aware communications. In contrast, LiDAR sensors can provide a more accurate and stable representation of the environment. However, LiDAR sensors have not yet been widely deployed at the base stations considering their high cost [5]. Based on their exceptional achievements demonstrated on autonomous driving applications, LiDAR sensors are believed to present a more promising sensing solution for mmWave base stations. In the existing studies, the LiDAR data has proven effective for various wireless applications, such as beam prediction [6, 7] and handover efficiency [8]. Therefore, there is a need to synthesize LiDAR data with other modalities to enhance the wireless communication performance without introducing extra costs.

On the other hand, transformer-based generative models have recently attracted much attention due to their remarkable ability to generate coherent and contextually relevant text [9]. These models, built upon the transformer architecture, have been applied across various domains, including Natural Language Processing (NLP) and computer vision. The transformer architecture relies on the concept of self-attention mechanisms, which allows the model to weigh the importance of different words in a sequence. This self-attention mechanism enables transformers to capture long-range dependencies in the input data. A multimodal transformer encoder [10] refers to a variation or extension of the transformer architecture specifically designed to process multimodal data. Transformers typically utilize self-attention mechanisms, allowing each token to attend to other tokens within the same input sequence. In a multimodal setting, cross-modal attention mechanisms are introduced to allow tokens from one modality to attend to tokens from other modalities, enabling the model to learn inter-modal relationships.

Inspired by these thoughts, in this paper, we utilize a multimodal transformer architecture to generate LiDAR point clouds from images and RADAR data. The major contributions of our paper can be summarized as follows:

  • We designed a novel multimodal transformer-based LiDAR synthesis/generation algorithm, comprising an image encoder, a RADAR encoder, a depth encoder, a multimodal transformer, and a LiDAR decoder. The inputs to the deep learning algorithm are the camera image and RADAR sensor data. To the best of our knowledge, we are the first to use a multimodal transformer to learn the relationship between different input modalities to predict LiDAR. The synthesized LiDAR data can help in the decision-making process for many wireless applications such as beam prediction and blockage prediction, and for autonomous vehicles.

  • Most existing LiDAR generation works use an encoder-decoder architecture, and train the complete model. Instead, we use a pre-trained vision transformer and depth encoder during the encoding stage to learn the image and RADAR representation, and this effectively reduces the training cost.

We evaluate our model on the DeepSense 6G dataset. Our results demonstrate that we can accurately predict LiDAR point clouds for diverse environments, allowing the base station to have more accurate sensing of the environment. The visual inspection of our generated LiDAR images shows that our model was able to learn the majority of the structures captured by the actual LiDAR sensors.

The subsequent sections of the paper are organized as follows: Section II offers a review of relevant literature. Next, Section III discusses in detail the data preprocessing and our Machine Learning model. This is followed by the dataset and training details in Section IV and results in Section V. Finally, we conclude our paper in Section VI.

II Related work

A few studies have attempted to predict LiDAR point clouds from other modalities. The authors in [11] used a convolutional stacked autoencoder to predict two-dimensional (2D) LiDAR data from a series of ultrasonic images in an indoor environment. The ultrasonic images yield a sparser representation of the environment, making them insufficient as a sensing modality. An encoder-decoder architecture was used, with the aim of learning the distance for each angle. Another study aimed to learn the depth map from images and RADAR point clouds for a self-driving car dataset [5]. A two-stage architecture was developed, where quasi-dense depth was learned in the first stage, and an encoder-decoder architecture was used in the second stage to generate a depth map. The depth map was generated using a scaffolding technique on LiDAR data. The authors in [12] sought to learn the LiDAR reflective intensity based on camera and depth map using a U-Net architecture. However, [11] used a single modality at the input, while [5, 12] used an encoder-decoder architecture without benefiting from the inter-modal relationships. In contrast, to the best of our knowledge, we are the first to use a multimodal transformer-based generative AI model to predict the 3D LiDAR data from camera and RADAR data, with a pre-trained encoding stage. Furthermore, [11, 5, 12] evaluated their models on self-driving vehicle datasets, while we use the DeepSense 6G dataset, which was collected for a wireless communications scenario.

The recent popularity of generative AI algorithms inspired the research community for the densification of the LiDAR point clouds. Densification refers to the process of increasing the point density within a given area, resulting in a more detailed representation of the environment. A U-Net architecture was used for the generation and densification of LiDAR point clouds in [13]. Similarly, [14] used probabilistic diffusion models and [15] used a variation of encoder-decoder architecture for the densification of LiDAR point clouds. However, these works fall in the category of synthetic data generation or densification, as they do not use other modalities to predict the LiDAR point cloud.

III Methodology

In this section, we discuss the model architecture of our proposed multimodal transformer-based LiDAR point cloud generation framework. The inputs to the framework are the camera images and RADAR sensor information, and the expected output is the generated LiDAR image.

As discussed in Section I, the LiDAR sensor is expensive to install and might not be feasible to install at every base station. This calls for a strategy to generate LiDAR data from other modalities. In this paper, we aim to generate LiDAR depth data based on camera images and RADAR data. Mathematically, the goal is to find the optimal model parameters $W^{*}$ that minimize a given loss, which can be formulated as follows:

$$W^{*} = \operatorname*{arg\,min}_{W} L(Y, f(X; W)) \tag{1}$$

where $W$ are the parameters of our model, $X$ are the inputs (camera and RADAR sensor information), $Y$ is the actual LiDAR sensor information, and $f()$ is the model we use for prediction. $f(X;W)$ denotes the LiDAR point cloud predicted by our model. $L()$ is the loss function, which will be defined in Section IV.

Figure 1: Our model architecture. The input modalities are passed through vision transformers to produce embeddings. The embeddings are passed through the transformer encoder, followed by LiDAR decoder to generate the final LiDAR image.

Fig. 1 shows our model architecture. There are three main stages in this architecture, namely the encoding stage, the multimodal transformer stage, and the LiDAR decoder stage. It is worth noting that the encoding stage uses pre-trained models. Only the transformer encoder stage and the LiDAR decoder need to be trained.

During the encoding stage, the pre-trained vision transformer converts all the modalities into embeddings of size 768, which represents a fixed output dimensionality inherent to the vision transformer architecture. The camera image is converted to camera embeddings using the vision transformer. Simultaneously, the camera image is also passed through a pre-trained depth model to generate a relative depth image. The depth image provides a richer contextual understanding of an image, enabling more accurate scene understanding. The depth image is passed through a vision transformer to produce depth embeddings. Similarly, the RADAR range against angle matrix and range against velocity matrix are fed through the vision transformer to produce another pair of embeddings. These four embeddings, each of size 768, are passed to the transformer encoder block. The transformer encoder model produces a latent space vector. This is followed by a LiDAR decoder that produces the final LiDAR image. The details of each of these parts are discussed in the remaining part of this section.

III-A Modality Preprocessing

III-A1 RADAR preprocessing

In the dataset provided in [16], a frequency modulated continuous wave RADAR is used for the collection of data. The data consists of 3D complex I/Q RADAR measurements of the shape (number of receive antennas $\times$ samples per chirp $\times$ chirps per frame). The resolution of the RADAR is 60 cm.

First, the range Fast Fourier Transform (FFT) is obtained by performing FFT in the second dimension, which corresponds to samples per chirp in the time domain. The result of the range FFT is then subjected to an extra FFT operation along the first dimension, which indicates the number of receive antennas, in order to obtain the range against angle matrix. Lastly, by applying FFT to the range FFT output along the third dimension, which is indicative of chirps per frame, the range against velocity matrix is obtained. These two matrices are depicted in the form of an image in Fig. 1.
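The three FFT steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the frame size (4 antennas, 256 samples per chirp, 128 chirps), the use of `fftshift` to centre the angle and velocity axes, and the reduction of the unused dimension by summing magnitudes are all hypothetical choices that the text does not specify.

```python
import numpy as np

def radar_maps(iq):
    """Turn one I/Q frame of shape (rx_antennas, samples_per_chirp,
    chirps_per_frame) into range-angle and range-velocity magnitude maps."""
    # Range FFT along the second dimension (samples per chirp).
    range_fft = np.fft.fft(iq, axis=1)
    # Angle FFT along the first dimension (receive antennas).
    angle_fft = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)
    # Velocity (Doppler) FFT along the third dimension (chirps per frame).
    doppler_fft = np.fft.fftshift(np.fft.fft(range_fft, axis=2), axes=2)
    # Assumption: collapse the unused dimension by summing magnitudes.
    range_angle = np.abs(angle_fft).sum(axis=2)       # (rx_antennas, samples)
    range_velocity = np.abs(doppler_fft).sum(axis=0)  # (samples, chirps)
    return range_angle, range_velocity

# Hypothetical frame: 4 antennas, 256 samples per chirp, 128 chirps,
# holding a single static reflector that falls in range bin 10.
n = np.arange(256)
tone = np.exp(2j * np.pi * 10 * n / 256)
frame = np.broadcast_to(tone[None, :, None], (4, 256, 128)).copy()
range_angle, range_velocity = radar_maps(frame)
```

In this toy frame the reflector is static and seen identically by all antennas, so the energy concentrates in range bin 10 at the centre (zero) bin of both the angle and the velocity axes.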

III-A2 LiDAR preprocessing

The LiDAR data is provided in the form of 3D point clouds. The 3D point cloud consists of multiple points in the 3D space that reflect the light. It is not feasible to directly learn the LiDAR point cloud as it encompasses the entire 3D space. Recognizing the computational challenges posed by the vastness of LiDAR point cloud data, an alternative strategy is pursued to facilitate effective learning.

We convert each point in the LiDAR point cloud from Cartesian coordinates to the polar coordinates. This forms a 2D matrix, where the dimensions of the matrix represent the reference angle from the XY plane (termed as $\theta$), and the reference angle from the Z-axis (termed as $\phi$). This can be viewed as a 2D LiDAR image, where the dimensions of the matrix represent $\theta$ and $\phi$, and the values in the matrix represent the distance from the origin. For $\theta$, a granularity of $0.25^{\circ}$ is used for the whole range from $-180^{\circ}$ to $180^{\circ}$. Meanwhile, for $\phi$, a granularity of $0.25^{\circ}$ is used for the range from $-60^{\circ}$ to $-5^{\circ}$ and from $5^{\circ}$ to $62^{\circ}$, and a granularity of $0.015625^{\circ}$ is used for the range from $-5^{\circ}$ to $5^{\circ}$. A finer granularity is used for $-5^{\circ}$ to $5^{\circ}$ because most of the objects of interest (cars, people) lie in this range. This resulted in a target matrix of size $1440 \times 1088$. As shown in Fig. 1, the 3D LiDAR point cloud and 2D LiDAR image are reciprocal (one can be generated from the other).

In this way, our problem can be converted into predicting the distance from the origin for all the possible angles in space.
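As a rough sketch of this conversion, the NumPy snippet below builds the angular grid from the granularities stated above and projects a point cloud onto it. The grid reproduces the $1440 \times 1088$ target size (220 + 640 + 228 bins along $\phi$). Several details are assumptions rather than the paper's stated procedure: $\theta$ is treated as the azimuth in the XY plane, $\phi$ as the elevation above it, empty cells are stored as 0, and the nearest point is kept when several points share a cell.

```python
import numpy as np

# Bin edges in degrees, following the granularities given in the text.
THETA_EDGES = np.arange(-180.0, 180.0 + 0.25, 0.25)  # 1441 edges, 1440 bins
PHI_EDGES = np.concatenate([
    np.arange(-60.0, -5.0, 0.25),       # coarse bins below -5 degrees
    np.arange(-5.0, 5.0, 0.015625),     # fine bins between -5 and 5 degrees
    np.arange(5.0, 62.0 + 0.25, 0.25),  # coarse bins above 5 degrees
])                                       # 1089 edges, 1088 bins

def to_lidar_image(points):
    """Project an (N, 3) array of x, y, z points to a theta-by-phi range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    keep = r > 0                         # drop points at the origin
    x, y, z, r = x[keep], y[keep], z[keep], r[keep]
    theta = np.degrees(np.arctan2(y, x))                       # assumed azimuth
    phi = np.degrees(np.arcsin(np.clip(z / r, -1.0, 1.0)))     # assumed elevation
    i = np.searchsorted(THETA_EDGES, theta, side="right") - 1
    j = np.searchsorted(PHI_EDGES, phi, side="right") - 1
    n_theta, n_phi = len(THETA_EDGES) - 1, len(PHI_EDGES) - 1
    ok = (i >= 0) & (i < n_theta) & (j >= 0) & (j < n_phi)
    image = np.full((n_theta, n_phi), np.inf)
    # Assumption: when several points share a cell, keep the nearest one.
    np.minimum.at(image, (i[ok], j[ok]), r[ok])
    image[np.isinf(image)] = 0.0         # empty cells are stored as 0
    return image

# A single point 10 m away along the x-axis, plus a degenerate origin point.
image = to_lidar_image(np.array([[10.0, 0.0, 0.0], [0.0, 0.0, 0.0]]))
```

The inverse direction follows from the same angles: each non-zero cell gives a distance together with the $\theta$ and $\phi$ of its bin, from which a 3D point can be recovered up to the bin resolution.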

III-B Modality encoding stage

The modality encoding stage comprises pre-trained models only.

III-B1 Depth Model

To reduce the difficulty of model training, we use the pre-trained “Depth Anything” model [17] to find the relative depth of an image taken from a base station. The relative depth enriches our model with a greater contextual understanding of the surroundings. Relative depth integration improves the model’s comprehension of spatial relationships and object locations. This is an encoder-decoder model, using the DINOv2 encoder and the dense prediction transformer (DPT) decoder.

III-B2 Vision Transformer

In this framework, we use a pre-trained vision transformer [18] to convert the data of each modality into embeddings of size 768. The vision transformer first partitions each image into fixed-size patches. The patches are linearly embedded, position embeddings are added, and the resulting vector is passed through a transformer encoder to produce the final embeddings. For data of each modality, a separate vision transformer is used. The vision transformer acts as an encoder for all the modalities, lowering the dimension of each modality to 768.

III-C Multimodal Transformer Encoder

In this framework, a multimodal transformer encoder is also used to capture long-range dependencies in the multimodal input data. In the multimodal transformer, cross-modal attention mechanisms allow tokens from one modality to attend to tokens from other modalities, enabling the model to learn inter-modal relationships.

In the proposed LiDAR generation framework, the embeddings of all of the four modalities are first concatenated and then fed into the multimodal transformer encoder. The multihead attention is able to compute multiple attention mechanisms in parallel, which enables the model to focus on different parts of the input sequence simultaneously. The position-wise feedforward layer in transformers enables localized non-linear transformations at each position within the input sequence, capturing relevant patterns in the input. The output is a latent space embedding of size 1024.
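To make the attention step concrete, the following NumPy sketch applies multi-head self-attention to four modality embeddings. It is a simplified illustration, not the trained model: it assumes the four 768-dimensional embeddings are stacked as a sequence of four tokens, uses random projection matrices, and omits the residual connections, layer normalization, feedforward layer, and the mapping to the 1024-dimensional latent vector.

```python
import numpy as np

def multihead_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over tokens of shape (n, d)."""
    n, d = tokens.shape
    d_head = d // num_heads

    def split(x):
        # (n, d) -> (heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    mixed = (weights @ v).transpose(1, 0, 2).reshape(n, d)
    return mixed @ w_o, weights

rng = np.random.default_rng(0)
d_model, num_heads = 768, 12
# Hypothetical stand-ins for the camera, depth, range-angle and
# range-velocity embeddings produced by the encoding stage.
tokens = rng.standard_normal((4, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * d_model**-0.5
                      for _ in range(4))
out, attn = multihead_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads)
```

Each of the 12 heads produces a $4 \times 4$ weight matrix whose entry $(a, b)$ says how strongly modality $a$ attends to modality $b$, which is where the inter-modal relationships are captured.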

III-D LiDAR decoder

Figure 2: The LiDAR decoder consists of 5 transpose convolutional layers, that upscale the image to our LiDAR image size.

The last part of the proposed framework is the LiDAR decoder. In this stage, the embeddings produced by the multimodal transformer encoder are fed into the LiDAR decoder for 2D LiDAR image generation. The LiDAR decoder consists of a fully connected layer followed by a series of transpose convolutional layers to upscale the output at each stage. Each layer $i$ consists of a convolutional transpose layer with $l_{i}$ filters. Filters in Convolutional Neural Networks act as feature detectors, convolving over input data to extract relevant patterns, edges, and textures, enabling the network to learn hierarchical representations in an image. We use a kernel size of 4, with a stride of 2 and padding of 1. This guarantees that the image undergoes a doubling in size with each layer. Each convolutional transpose layer is followed by batch normalization and rectified linear unit (ReLU) activation function.
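The doubling property can be checked with the standard output-size formula for a transposed convolution, $(n - 1) \cdot \text{stride} - 2 \cdot \text{padding} + \text{kernel}$, which assumes no output padding and a dilation of 1. The helper names below are hypothetical; the sketch only tracks one spatial dimension.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one dimension,
    assuming no output padding and a dilation of 1."""
    return (size - 1) * stride - 2 * padding + kernel

def decoder_sizes(size, num_layers=5):
    """Sizes of one spatial dimension after each of the decoder layers."""
    sizes = [size]
    for _ in range(num_layers):
        sizes.append(conv_transpose_out(sizes[-1]))
    return sizes

sizes = decoder_sizes(45)
```

With a kernel of 4, stride of 2, and padding of 1 the formula reduces to $2n$, so five layers scale a dimension by 32. For example, a dimension of 45 grows to 90, 180, 360, 720, and finally 1440.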

IV Simulation settings

IV-A Dataset

We use Scenarios 31-34 of the DeepSense 6G dataset [16], where the data for each scenario was collected at a different location. The data for scenarios 31 and 32 were collected during the day, while the data for scenarios 33 and 34 were collected during the night. The dataset covers an outdoor environment. The base station is equipped with different sensors, so data of different modalities are provided in the dataset, including the imaging modality and the RADAR modality. The ground-truth LiDAR data is also provided for training and testing. In addition to the above-mentioned data, the 64-element receive power vector is also provided for beam prediction. The primary goal of the dataset is sensing-aided beam prediction, where the sensing data can be utilized to improve beam management. With this dataset, we generate LiDAR data from camera images and RADAR data. By comparing the synthetic LiDAR data with real LiDAR data, we can evaluate the effectiveness of our proposed LiDAR generation framework.

IV-B Implementation Details

For each scenario, we split the data into training, validation, and test sets of 60% (11199 data samples), 20% (3733 data samples), and 20% (3733 data samples), respectively. The training data comprises the first 60% of the samples of each scenario, followed by the next 20% for validation and the remaining 20% for testing, which is aimed at minimizing the correlation between the training and testing sets.
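A minimal sketch of such an ordered split is shown below. The function name and the rounding rule (integer floor on the training and validation sizes) are assumptions; the text does not state how fractional sample counts are handled.

```python
def sequential_split(samples, train_pct=60, val_pct=20):
    """Split an ordered sequence into leading train, middle validation and
    trailing test parts, without shuffling."""
    n = len(samples)
    n_train = n * train_pct // 100   # assumed rounding: floor
    n_val = n * val_pct // 100
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

train, val, test = sequential_split(list(range(10)))
```

Because no shuffling takes place, neighbouring samples, which are likely to look alike, mostly end up in the same subset rather than being spread across training and testing.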

The LiDAR image is sparse, with varying density in different regions of the image, particularly concentrated around values of $\phi$ close to 0. Training the model using the simple mean squared error results in the model over-fitting, causing it to predict close to 0 for the entire sparse LiDAR image. To counter this, we propose a modified mean squared error (MMSE) loss, where the denser areas of the LiDAR image are given 10 times more weight than the sparser regions. For an image with $M$ pixels, the MMSE loss is defined as:

$$\frac{1}{M}\sum_{i=1}^{M}\alpha_{i}(y_{i}-\hat{y}_{i})^{2}, \quad \begin{cases}\alpha_{i}=10, & \phi\in[-1.71875, 2.1875]\\ \alpha_{i}=1, & \text{otherwise}\end{cases} \tag{2}$$

where $\alpha_{i}$ refers to the weight given to pixel $i$ in the image, $y_{i}$ refers to the actual LiDAR image, and $\hat{y}_{i}$ refers to the LiDAR image predicted by our model. The values of $\alpha$ were obtained through trial-and-error.
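Equation (2) can be written directly in NumPy, as sketched below. The array layout is an assumption: images are taken to have shape ($\theta$ bins, $\phi$ bins), and `phi_deg` is a hypothetical vector holding the $\phi$ value in degrees of each column.

```python
import numpy as np

def mmse_loss(y_true, y_pred, phi_deg, dense_weight=10.0,
              phi_low=-1.71875, phi_high=2.1875):
    """Weighted mean squared error of Eq. (2).

    y_true, y_pred: arrays of shape (n_theta, n_phi).
    phi_deg: array of shape (n_phi,) with the phi value of each column.
    """
    # alpha is 10 inside the dense phi band and 1 elsewhere.
    alpha = np.where((phi_deg >= phi_low) & (phi_deg <= phi_high),
                     dense_weight, 1.0)
    # alpha broadcasts along the phi (last) axis of the images.
    return np.mean(alpha * (y_true - y_pred) ** 2)

# Toy example: 2 x 3 image, only the middle column lies in the dense band.
y_true = np.ones((2, 3))
y_pred = np.zeros((2, 3))
phi_deg = np.array([-10.0, 0.0, 10.0])
loss = mmse_loss(y_true, y_pred, phi_deg)
```

In the toy example every pixel has a squared error of 1, so the loss is the mean of the weights, $(1 + 10 + 1) \cdot 2 / 6 = 4$. An all-zero prediction is therefore penalized most heavily where the LiDAR returns are concentrated.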

TABLE I: MMSE comparison between different model complexities and all-zeroes prediction.

| $\mathcal{L}$ | Scenario 31 | Scenario 32 | Scenario 33 | Scenario 34 | Overall |
|---|---|---|---|---|---|
| {32, 16, 8, 8} | 18.8390 | 16.8927 | 17.9924 | 22.5443 | 19.2058 |
| {64, 32, 16, 16} | 12.3774 | 12.6932 | 12.8715 | 18.4655 | 13.9846 |
| {128, 64, 32, 32} | 11.0890 | 10.4354 | 10.3220 | 14.4871 | 11.6153 |
| {256, 128, 64, 64} | 10.4239 | 9.3329 | 9.2480 | 12.1860 | 10.3931 |
| {512, 256, 128, 128} | 11.1808 | 9.1056 | 9.1524 | 12.4217 | 10.6743 |
| Predict all zeroes | 48.3617 | 26.7759 | 30.1806 | 30.8811 | 38.5810 |
| {256, 128, 64, 64} - without transformer encoder | 12.2389 (+14.83%) | 9.2966 (-0.39%) | 9.6717 (+4.38%) | 11.9774 (-1.74%) | 11.1118 (+6.47%) |

We used an Adam Optimizer with a batch size of 32. The training was performed for 20 epochs, where the learning rate for the first 10 epochs was 0.001 and was reduced to 0.0001 for the last 10 epochs.

We used one transformer encoder layer with 12 heads in the multi-head attention model. The dimension of the feedforward network model was set to 2048, and the dropout was set to 0.1. The output of the transformer encoder layer had size of 1024.

For the LiDAR decoder, we first had a fully connected layer with 1350 neurons. The 1350 nodes were then reshaped to a matrix of $45 \times 30$. Next, we use 5 convolutional transpose layers, each consisting of convolutional transpose filters, batch normalization, and ReLU activation function. Let $\mathcal{L}$ denote the set consisting of convolutional filters in the first four layers, i.e., $\{l_{1}, l_{2}, l_{3}, l_{4}\}$. In the final layer, we use 1 filter as it produces the final LiDAR image. In our experiments, we varied the number of filters, and obtained the best results with 256, 128, 64, 64 filters in the first four layers. The weights for the convolutional layers were initialized using a normal distribution with a mean of 0 and a standard deviation of 0.02, while batch normalization parameters were initialized with a mean of 1 and standard deviation 0.02.

The PyTorch library was used to develop the model. The model was trained using an NVIDIA A100 Tensor Core GPU. It took around 3 hours to train for 20 epochs.

V Results and Discussions

In this section, we display our experimental results.

Figure 3: Training and validation loss against number of epochs.

Fig. 3 shows the training set and validation set loss against the number of epochs during the training phase. Before the start of the training, the training and validation loss is larger than 35. By the end of the training, the training and validation MMSE losses have reduced by 74.3% and 71.2% to 9.3 and 10.5, respectively. Moreover, it can be observed that reducing the learning rate from $0.001$ to $0.0001$ after 10 epochs slightly helps in reducing MMSE further.

In Table I, we show the performance of the model when we vary the number of filters in the convolutional transpose layers in the LiDAR decoder. Furthermore, as the LiDAR image is sparse and contains the value of 0 for more than 95% of the pixels, we develop a lower benchmark called all-zeroes prediction, where the output LiDAR image consists of all zeroes. We provide another benchmark, where we train the model using the LiDAR decoder that results in the least MMSE, without the transformer encoder block (the modality embeddings are directly passed to the LiDAR decoder).

The predict all-zeroes benchmark results in an overall MMSE of 38.5810 for the testing set. In comparison, all of our models perform better. At first, increasing the number of filters in the LiDAR decoder helps in reducing the MMSE loss. The best results are obtained with {256, 128, 64, 64} filters in the convolutional layers, resulting in an overall testing loss of 10.3931. Increasing the complexity of the LiDAR decoder further does not help in reducing the overall MMSE loss. Moreover, the benchmark without transformer encoder achieves an overall MMSE of 11.1118; the MMSE of the proposed method is 6.47% lower relative to this benchmark. Scenario 34 obtains an MMSE of 12.1860, which is higher than the other scenarios. The model without transformer encoder achieves almost the same MMSE as our proposed method for scenario 34. One potential explanation might be that the data was gathered during nighttime, leading to a decline in image clarity. Interestingly, the highest MMSE in the all-zeroes prediction is observed for scenario 31. However, our model exhibits superior performance in scenario 31, achieving a notably low MMSE of 10.4239. In comparison, the model without transformer encoder achieves an MMSE of 12.2389, so the MMSE of the proposed method is 14.83% lower relative to it.
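The percentages attached to the benchmark without transformer encoder in Table I can be reproduced from the tabulated MMSE values. The short calculation below suggests that they are normalized by the benchmark's own MMSE rather than by the MMSE of the proposed method; this reading is inferred from the numbers and is not stated in the text, and the function name is hypothetical.

```python
def percent_difference(value, reference):
    """Difference between value and reference as a percentage of reference."""
    return 100.0 * (value - reference) / reference

proposed_overall, no_encoder_overall = 10.3931, 11.1118
proposed_s31, no_encoder_s31 = 10.4239, 12.2389

# Gap normalized by the benchmark without transformer encoder.
gap_overall = -percent_difference(proposed_overall, no_encoder_overall)
gap_s31 = -percent_difference(proposed_s31, no_encoder_s31)

# The same gaps normalized by the proposed method instead.
rise_overall = percent_difference(no_encoder_overall, proposed_overall)
rise_s31 = percent_difference(no_encoder_s31, proposed_s31)
```

Normalizing by the benchmark gives roughly 6.47% overall and 14.83% for scenario 31, matching Table I, whereas normalizing by the proposed method would give roughly 6.92% and 17.41%.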

Figure 4: Comparison between actual and predicted LiDAR for a random LiDAR point cloud from each scenario: (a) Scenario 31, (b) Scenario 32, (c) Scenario 33, (d) Scenario 34.

Fig. 4 compares the actual and predicted LiDAR image. We randomly sample a single LiDAR image from each scenario, and visually compare the actual LiDAR image and the LiDAR image predicted by our best model (with {256, 128, 64, 64} filters in LiDAR decoder). For each scenario, the model is able to learn most of the structures captured by the LiDAR image. As discussed earlier, the loss is highest in Scenario 34, which is also apparent by visual inspection of Fig. 4(d). However, overall the model is able to accurately learn most of the structures captured by the LiDAR sensor for all four scenarios. This shows the robustness of our model for multiple diverse environments.

VI Conclusions and future work

The LiDAR sensor proves valuable for context-aware communications, offering optimization benefits across various wireless communication tasks like handover prediction and beam prediction. In this work, we develop a multimodal transformer encoder based LiDAR point cloud generation model. Our model allows the generation of LiDAR sensor information without installing an expensive LiDAR sensor. The inputs to our model are the camera image and RADAR sensing information. We use pre-trained models to encode input modalities. The multimodal transformer model facilitates learning the relationship between modalities. Finally, we use a convolutional neural network based LiDAR decoder to generate a LiDAR image. Our experiments show that we were able to obtain MMSE of 10.3931. The visual inspection of the images shows that our model is able to learn most of the structures within the LiDAR image for diverse scenarios.

In the future, we can utilize the synthesized LiDAR point cloud to enhance various wireless communication tasks such as beam and blockage prediction. Furthermore, our work can be useful for other industries as well, such as self-driving cars.

References

  • [1] L. Qiao, Y. Li, D. Chen, S. Serikawa, M. Guizani, and Z. Lv, “A survey on 5G/6G, AI, and Robotics,” Computers and Electrical Engineering, vol. 95, p. 107372, 2021.
  • [2] D. K. Pin Tan et al., “Integrated Sensing and Communication in 6G: Motivations, Use Cases, Requirements, Challenges and Future Directions,” in 2021 1st IEEE International Online Symposium on Joint Communications & Sensing (JC&S), 2021, pp. 1–6.
  • [3] U. Demirhan and A. Alkhateeb, “Radar aided 6G beam prediction: Deep learning algorithms and real-world demonstration,” in 2022 IEEE Wireless Communications and Networking Conference (WCNC).   IEEE, 2022, pp. 2655–2660.
  • [4] G. Charan, T. Osman, A. Hredzak, N. Thawdar, and A. Alkhateeb, “Vision-position multi-modal beam prediction using real millimeter wave datasets,” in 2022 IEEE Wireless Communications and Networking Conference (WCNC).   IEEE, 2022, pp. 2727–2731.
  • [5] A. D. Singh, Y. Ba et al., “Depth estimation from camera image and mmwave radar point cloud,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9275–9285.
  • [6] S. Jiang, G. Charan, and A. Alkhateeb, “LiDAR aided future beam prediction in real-world millimeter wave V2I communications,” IEEE Wireless Communications Letters, vol. 12, no. 2, pp. 212–216, 2022.
  • [7] Y. Tian, Q. Zhao, F. Boukhalfa, K. Wu, F. Bader et al., “Multimodal Transformers for Wireless Communications: A Case Study in Beam Prediction,” arXiv preprint arXiv:2309.11811, 2023.
  • [8] S. Wu, C. Chakrabarti, and A. Alkhateeb, “LiDAR-aided mobile blockage prediction in real-world millimeter wave systems,” in 2022 IEEE Wireless Communications and Networking Conference (WCNC).   IEEE, 2022, pp. 2631–2636.
  • [9] A. Vaswani et al., “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [10] P. Xu, X. Zhu, and D. A. Clifton, “Multimodal learning with transformers: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [11] N. Balemans, P. Hellinckx, and J. Steckel, “Predicting lidar data from sonar images,” IEEE Access, vol. 9, pp. 57 897–57 906, 2021.
  • [12] P. Vacek, O. Jašek, K. Zimmermann, and T. Svoboda, “Learning to predict lidar intensities,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 4, pp. 3556–3564, 2021.
  • [13] V. Zyrianov, X. Zhu, and S. Wang, “Learning to generate realistic lidar point clouds,” in European Conference on Computer Vision.   Springer, 2022, pp. 17–35.
  • [14] K. Nakashima and R. Kurazume, “LiDAR Data Synthesis with Denoising Diffusion Probabilistic Models,” arXiv preprint arXiv:2309.09256, 2023.
  • [15] Y. Xiong, W.-C. Ma, J. Wang, and R. Urtasun, “Learning compact representations for lidar completion and generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1074–1083.
  • [16] A. Alkhateeb, G. Charan, T. Osman, A. Hredzak, J. Morais, U. Demirhan, and N. Srinivas, “DeepSense 6G: A Large-Scale Real-World Multi-Modal Sensing and Communication Dataset,” IEEE Communications Magazine, 2023.
  • [17] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” arXiv preprint arXiv:2401.10891, 2024.
  • [18] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.