-
Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model
Authors:
Seonhee Cho,
Choonghan Kim,
Jiho Lee,
Chetan Chilkunda,
Su** Choi,
Joo Heung Yoon
Abstract:
Recent advancements in Large Multimodal Models (LMMs) have attracted interest in their generalization capability with only a few samples in the prompt. This progress is particularly relevant to the medical domain, where the quality and sensitivity of data pose unique challenges for model training and application. However, the dependency on high-quality data for effective in-context learning raises questions about the feasibility of these models when encountering the inevitable variations and errors inherent in real-world medical data. In this paper, we introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain Large Language Model (LLM) to process multimodal data via image descriptions. MID-M achieves comparable or superior performance to task-specific fine-tuned LMMs and other general-domain ones, without extensive domain-specific training or pre-training on multimodal data, and with significantly fewer parameters. This highlights the potential of leveraging general-domain LLMs for domain-specific tasks and offers a sustainable and cost-effective alternative to traditional LMM developments. Moreover, the robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications.
Submitted 29 April, 2024;
originally announced May 2024.
-
CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment
Authors:
Hyeongmin Lee,
Kyoungkook Kang,
Jungseul Ok,
Sunghyun Cho
Abstract:
Recent image tone adjustment (or enhancement) approaches have predominantly adopted supervised learning for learning human-centric perceptual assessment. However, these approaches are constrained by intrinsic challenges of supervised learning. Primarily, the requirement for expertly-curated or retouched images escalates the data acquisition expenses. Moreover, their coverage of target style is confined to stylistic variants inferred from the training data. To surmount the above challenges, we propose an unsupervised learning-based approach for text-based image tone adjustment, CLIPtone, that extends an existing image enhancement method to accommodate natural language descriptions. Specifically, we design a hyper-network to adaptively modulate the pretrained parameters of the backbone model based on a text description. To assess whether the adjusted image aligns with the text description without a ground-truth image, we utilize CLIP, which is trained on a vast set of language-image pairs and thus encompasses knowledge of human perception. The major advantages of our approach are threefold: (i) minimal data collection expenses, (ii) support for a range of adjustments, and (iii) the ability to handle novel text descriptions unseen in training. Our approach's efficacy is demonstrated through comprehensive experiments, including a user study.
Submitted 1 April, 2024;
originally announced April 2024.
-
C-Band Lithium Niobate on Silicon Carbide SAW Resonator with Figure-of-Merit of 124 at 6.5 GHz
Authors:
Tzu-Hsuan Hsu,
Joshua Campbell,
Jack Kramer,
Sinwoo Cho,
Ming-Huang Li,
Ruochen Lu
Abstract:
In this work, we demonstrate a C-band shear-horizontal surface acoustic wave (SH-SAW) resonator with high electromechanical coupling (kt2) of 22% and a quality factor (Q) of 565 based on a thin-film lithium niobate (LN) on silicon carbide (SiC) platform, featuring an excellent figure-of-merit (FoM = kt2*Q) of 124 at 6.5 GHz, the highest FoM reported in this frequency range. The resonator frequency upscaling is achieved through wavelength ($λ$) reduction and the use of thin aluminum (Al) electrodes. The LN/SiC waveguide and synchronous resonator design collectively enable effective acoustic energy confinement for a high FoM, even when the normalized thickness of LN approaches a scale of 0.5$λ$ to 1$λ$. To perform a comprehensive study, we also designed and fabricated five additional resonators, expanding the $λ$ studied to a range of 480 to 800 nm, in the same 500 nm-thick transferred Y-cut thin-film LN on SiC. The fabricated SH-SAW resonators, operating from 5 to 8 GHz, experimentally demonstrate a kt2 from 20.3% to 22.9% and a Q from 350 to 575, thereby covering the entire C-band with excellent performance.
Submitted 29 June, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
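The figure-of-merit quoted in the abstract above follows directly from FoM = kt2*Q. A minimal sketch of that arithmetic, assuming kt2 is entered in percent as the abstract states it (the function name is illustrative, not from the paper):

```python
def figure_of_merit(kt2_percent: float, q: float) -> float:
    """Return FoM = kt2 * Q, with kt2 given in percent (assumed input convention)."""
    return kt2_percent / 100.0 * q


# Values quoted in the abstract: kt2 = 22 %, Q = 565.
fom = figure_of_merit(22.0, 565.0)
print(round(fom))  # rounds to the 124 reported in the abstract
```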
-
23.8-GHz Acoustic Filter in Periodically Poled Piezoelectric Film Lithium Niobate With 1.52-dB IL and 19.4% FBW
Authors:
Sinwoo Cho,
Omar Barrera,
Jack Kramer,
Vakhtang Chulukhadze,
Tzu-Hsuan Hsu,
Joshua Campbell,
Ian Anderson,
Ruochen Lu
Abstract:
This paper reports the first piezoelectric acoustic filter in periodically poled piezoelectric film (P3F) lithium niobate (LiNbO3) at 23.8 GHz with low insertion loss (IL) of 1.52 dB and 3-dB fractional bandwidth (FBW) of 19.4%. The filter features a compact footprint of 0.64 mm2. The third-order ladder filter is implemented with electrically coupled resonators in 150 nm bi-layer P3F 128° rotated Y-cut LiNbO3 thin film, operating in second-order symmetric (S2) Lamb mode. The record-breaking performance is enabled by the P3F LiNbO3 platform, where piezoelectric thin films of alternating orientations are transferred subsequently, facilitating efficient higher-order Lamb mode operation with simultaneously high quality factor (Q) and coupling coefficient (k2) at millimeter-wave (mmWave). Also, the multi-layer P3F stack promises smaller footprints and better nonlinearity than single-layer counterparts, thanks to the higher capacitance density and lower thermal resistance. Upon further development, the reported P3F LiNbO3 platform is promising for compact filters at mmWave.
Submitted 28 June, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
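For readers less familiar with fractional bandwidth, the absolute 3-dB bandwidth implied by the abstract above can be estimated as FBW times the center frequency. A rough sketch, assuming the quoted 23.8 GHz is the center frequency the FBW is normalized to (the abstract does not state this explicitly):

```python
def bandwidth_ghz(center_ghz: float, fbw_percent: float) -> float:
    """Absolute bandwidth in GHz from center frequency and fractional bandwidth in percent."""
    return center_ghz * fbw_percent / 100.0


# Values quoted in the abstract: 23.8 GHz, FBW = 19.4 %.
bw = bandwidth_ghz(23.8, 19.4)
print(f"{bw:.2f} GHz")  # roughly 4.6 GHz under the stated assumption
```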
-
Thin-film Lithium Niobate on Insulator Surface Acoustic Wave Devices for 6G Centimeter Bands
Authors:
Tzu-Hsuan Hsu,
Joshua Campbell,
Jack Kramer,
Sinwoo Cho,
Zhi-Qiang Lee,
Ming-Huang Li,
Ruochen Lu
Abstract:
In this work, we investigate the frequency scaling of shear-horizontal (SH) surface acoustic wave (SAW) resonators based on a lithium niobate on insulator (LNOI) substrate into the centimeter bands for 6G wireless systems. Prototyped resonators with wavelengths ranging between 240 nm and 400 nm were fabricated, and the experimental results exhibit a successful frequency scaling between 9.05 and 13.37 GHz. However, a noticeable performance degradation can be observed as the resonance frequency (fs) scales. Such an effect is expected to be caused by non-ideal helec/λ for smaller λ devices. The optimized LNOI SH-SAW with a λ of 400 nm exhibits an fs of 9.05 GHz, a keff2 of 15%, a Qmax of 213, and a FoM of 32, which indicates a successful implementation for devices targeting centimeter bands.
Submitted 25 February, 2024; v1 submitted 18 February, 2024;
originally announced February 2024.
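The wavelength and resonance frequency quoted for the optimized device above imply an acoustic phase velocity through the general relation v = f * λ. A small sketch of that conversion, using only the pair the abstract states explicitly (λ = 400 nm at fs = 9.05 GHz); the function name is illustrative:

```python
def phase_velocity_m_per_s(freq_ghz: float, wavelength_nm: float) -> float:
    """v = f * lambda. GHz times nm gives m/s directly (1e9 * 1e-9 = 1)."""
    return freq_ghz * wavelength_nm


# Optimized device from the abstract: fs = 9.05 GHz at a wavelength of 400 nm.
v = phase_velocity_m_per_s(9.05, 400.0)
print(f"{v:.0f} m/s")
```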
-
UGPNet: Universal Generative Prior for Image Restoration
Authors:
Hwayoon Lee,
Kyoungkook Kang,
Hyeongmin Lee,
Seung-Hwan Baek,
Sunghyun Cho
Abstract:
Recent image restoration methods can be broadly categorized into two classes: (1) regression methods that recover the rough structure of the original image without synthesizing high-frequency details and (2) generative methods that synthesize perceptually-realistic high-frequency details even though the resulting image deviates from the original structure of the input. While both directions have been extensively studied in isolation, merging their benefits with a single framework has been rarely studied. In this paper, we propose UGPNet, a universal image restoration framework that can effectively achieve the benefits of both approaches by simply adopting a pair of an existing regression model and a generative model. UGPNet first restores the image structure of a degraded input using a regression model and synthesizes a perceptually-realistic image with a generative model on top of the regressed output. UGPNet then combines the regressed output and the synthesized output, resulting in a final result that faithfully reconstructs the structure of the original image in addition to perceptually-realistic textures. Our extensive experiments on deblurring, denoising, and super-resolution demonstrate that UGPNet can successfully exploit both regression and generative methods for high-fidelity image restoration.
Submitted 30 December, 2023;
originally announced January 2024.
-
PhysRFANet: Physics-Guided Neural Network for Real-Time Prediction of Thermal Effect During Radiofrequency Ablation Treatment
Authors:
Minwoo Shin,
Minjee Seo,
Seonaeng Cho,
Juil Park,
Joon Ho Kwon,
Deukhee Lee,
Kyungho Yoon
Abstract:
Radiofrequency ablation (RFA) is a widely used minimally invasive technique for ablating solid tumors. Achieving precise personalized treatment necessitates feedback information on in situ thermal effects induced by the RFA procedure. While computer simulation facilitates the prediction of electrical and thermal phenomena associated with RFA, its practical implementation in clinical settings is hindered by high computational demands. In this paper, we propose a physics-guided neural network model, named PhysRFANet, to enable real-time prediction of thermal effect during RFA treatment. The networks, designed for predicting temperature distribution and the corresponding ablation lesion, were trained using biophysical computational models that integrated electrostatics, bio-heat transfer, and cell necrosis, alongside magnetic resonance (MR) images of breast cancer patients. Validation of the computational model was performed through experiments on ex vivo bovine liver tissue. Our model demonstrated a 96% Dice score in predicting the lesion volume and an RMSE of 0.4854 for temperature distribution when tested with foreseen tumor images. Notably, even with unforeseen images, it achieved a 93% Dice score for the ablation lesion and an RMSE of 0.6783 for temperature distribution. All networks were capable of inferring results within 10 ms. The presented technique, applied to optimize the placement of the electrode for a specific target region, holds significant promise in enhancing the safety and efficacy of RFA treatments.
Submitted 21 December, 2023;
originally announced December 2023.
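The abstract above reports a Dice score for the lesion and an RMSE for the temperature distribution. As a reminder of what those two metrics measure, here is a minimal sketch using their standard definitions on flat Python lists; it is not the authors' evaluation code, and the function names and toy inputs are illustrative only:

```python
import math


def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks of 0/1 values."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match (a common convention).
    return 1.0 if total == 0 else 2.0 * intersection / total


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))


# Toy inputs (made up for illustration, not data from the paper).
toy_dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
toy_rmse = rmse([1.0, 2.0], [1.0, 4.0])
```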
-
ParamISP: Learned Forward and Inverse ISPs using Camera Parameters
Authors:
Woohyeok Kim,
Geonu Kim,
Junyong Lee,
Seungyong Lee,
Seung-Hwan Baek,
Sunghyun Cho
Abstract:
RAW images are rarely shared, mainly due to their excessive data size compared to their sRGB counterparts obtained by camera ISPs. Learning the forward and inverse processes of camera ISPs has been recently demonstrated, enabling physically-meaningful RAW-level image processing on input sRGB images. However, existing learning-based ISP methods fail to handle the large variations in the ISP processes with respect to camera parameters such as ISO and exposure time, and have limitations when used for various applications. In this paper, we propose ParamISP, a learning-based method for forward and inverse conversion between sRGB and RAW images, that adopts a novel neural-network module, dubbed ParamNet, to utilize camera parameters. Given the camera parameters provided in the EXIF data, ParamNet converts them into a feature vector to control the ISP networks. Extensive experiments demonstrate that ParamISP achieves superior RAW and sRGB reconstruction results compared to previous methods and that it can be effectively used for a variety of applications such as deblurring dataset synthesis, raw deblurring, HDR reconstruction, and camera-to-camera transfer.
Submitted 14 April, 2024; v1 submitted 20 December, 2023;
originally announced December 2023.
-
Fast and accurate sparse-view CBCT reconstruction using meta-learned neural attenuation field and hash-encoding regularization
Authors:
Heejun Shin,
Taehee Kim,
Jongho Lee,
Se Young Chun,
Seungryung Cho,
Dongmyung Shin
Abstract:
Cone beam computed tomography (CBCT) is an emerging medical imaging technique to visualize the internal anatomical structures of patients. During a CBCT scan, several projection images of different angles or views are collectively utilized to reconstruct a tomographic image. However, reducing the number of projections in a CBCT scan while preserving the quality of a reconstructed image is challenging due to the nature of an ill-posed inverse problem. Recently, a neural attenuation field (NAF) method was proposed by adopting a neural radiance field algorithm as a new way for CBCT reconstruction, demonstrating fast and promising results using only 50 views. However, decreasing the number of projections is still preferable to reduce potential radiation exposure, and a faster reconstruction time is required considering a typical scan time. In this work, we propose a fast and accurate sparse-view CBCT reconstruction (FACT) method to provide better reconstruction quality and faster optimization speed with a minimal number of view acquisitions ($<$ 50 views). In the FACT method, we meta-trained a neural network and a hash-encoder using a few scans (= 15), and a new regularization technique is utilized to reconstruct the details of an anatomical structure. In conclusion, we have shown that the FACT method produced better and faster reconstruction results than other conventional algorithms based on CBCT scans of different body parts (chest, head, and abdomen) and CT vendors (Siemens, Philips, and GE).
Submitted 16 January, 2024; v1 submitted 4 December, 2023;
originally announced December 2023.
-
Millimeter Wave Thin-Film Bulk Acoustic Resonator in Sputtered Scandium Aluminum Nitride Using Platinum Electrodes
Authors:
Sinwoo Cho,
Omar Barrera,
Pietro Simeoni,
Ellie Y. Wang,
Jack Kramer,
Vakhtang Chulukhadze,
Joshua Campbell,
Matteo Rinaldi,
Ruochen Lu
Abstract:
This work describes sputtered scandium aluminum nitride (ScAlN) thin-film bulk acoustic resonators (FBAR) at millimeter wave (mmWave) with high quality factor (Q) using platinum (Pt) electrodes. FBARs with combinations of Pt and aluminum (Al) electrodes, i.e., Al top Al bottom, Pt top Al bottom, Al top Pt bottom, and Pt top Pt bottom, are built to study the impact of electrodes on mmWave FBARs. The demonstrated FBAR with Pt top and bottom electrodes achieves electromechanical coupling (k2) of 4.0% and Q of 116 for the first-order symmetric (S1) mode at 13.7 GHz, and k2 of 1.8% and Q of 94 for the third-order symmetric (S3) mode at 61.6 GHz. Through these results, we confirmed that even in the frequency band of approximately 60 GHz, ScAlN FBAR can achieve a Q factor approaching 100 with optimized fabrication and acoustic/EM design. Further development calls for stacks with better quality in piezoelectric and metallic layers.
Submitted 22 November, 2023;
originally announced November 2023.
-
Transferred Thin Film Lithium Niobate as Millimeter Wave Acoustic Filter Platforms
Authors:
Omar Barrera,
Sinwoo Cho,
Kenny Huynh,
Jack Kramer,
Michael Liao,
Vakhtang Chulukhadze,
Lezli Matto,
Mark S. Goorsky,
Ruochen Lu
Abstract:
This paper reports the first high-performance acoustic filters toward millimeter wave (mmWave) bands using transferred single-crystal thin film lithium niobate (LiNbO3). By transferring LiNbO3 on the top of silicon (Si) and sapphire (Al2O3) substrates with an intermediate amorphous Si (aSi) bonding and sacrificial layer, we demonstrate compact acoustic filters with record-breaking performance beyond 20 GHz. In the LN-aSi-Al2O3 platform, the third-order ladder filter exhibits low insertion loss (IL) of 1.62 dB and 3-dB fractional bandwidth (FBW) of 19.8% at 22.1 GHz, while in the LN-aSi-Si platform, the filter shows low IL of 2.38 dB and FBW of 18.2% at 23.5 GHz. Material analysis validates the great crystalline quality of the stacks. The high-resolution x-ray diffraction (HRXRD) shows full width half maximum (FWHM) of 53 arcsec for Al2O3 and 206 arcsec for Si, both remarkably low compared to piezoelectric thin films of similar thickness. The reported results bring the state-of-the-art (SoA) of compact acoustic filters to much higher frequencies, and highlight transferred LiNbO3 as promising platforms for mmWave filters in future wireless front ends.
Submitted 21 November, 2023;
originally announced November 2023.
-
38.7 GHz Thin Film Lithium Niobate Acoustic Filter
Authors:
Omar Barrera,
Sinwoo Cho,
Jack Kramer,
Vakhtang Chulukhadze,
Joshua Campbell,
Ruochen Lu
Abstract:
In this work, a 38.7 GHz acoustic wave ladder filter exhibiting insertion loss (IL) of 5.63 dB and 3-dB fractional bandwidth (FBW) of 17.6% is demonstrated, pushing the frequency limits of thin-film piezoelectric acoustic filter technology. The filter achieves operating frequency up to 5G millimeter wave (mmWave) frequency range 2 (FR2) bands, by thinning thin-film LiNbO3 resonators to sub-50 nm thickness. The high electromechanical coupling (k2) and quality factor (Q) of first-order antisymmetric (A1) mode resonators in 128° Y-cut lithium niobate (LiNbO3) collectively enable the first acoustic filters at mmWave. The key design consideration of electromagnetic (EM) resonances in interdigitated transducers (IDT) is addressed and mitigated. These results indicate that thin-film piezoelectric resonators could be pushed to 5G FR2 bands. Further performance enhancement and frequency scaling call for better resonator technologies and EM-acoustic filter co-design.
Submitted 9 November, 2023;
originally announced November 2023.
-
RTNH+: Enhanced 4D Radar Object Detection Network using Combined CFAR-based Two-level Preprocessing and Vertical Encoding
Authors:
Seung-Hyun Kong,
Dong-Hee Paek,
Sangjae Cho
Abstract:
Four-dimensional (4D) Radar is a useful sensor for 3D object detection and the relative radial speed estimation of surrounding objects under various weather conditions. However, since Radar measurements are corrupted with invalid components such as noise, interference, and clutter, it is necessary to employ a preprocessing algorithm before the 3D object detection with neural networks. In this paper, we propose RTNH+, an enhanced version of the 4D Radar object detection network RTNH, built on two novel algorithms. The first algorithm is the combined constant false alarm rate (CFAR)-based two-level preprocessing (CCTP) algorithm, which generates two filtered measurements of different characteristics from the same 4D Radar measurements and thereby enriches the information of the input to the 4D Radar object detection network. The second is the vertical encoding (VE) algorithm, which effectively encodes vertical features of the road objects from the CCTP outputs. We provide details of RTNH+ and demonstrate that it achieves significant performance improvements of 10.14\% in ${{AP}_{3D}^{IoU=0.3}}$ and 16.12\% in ${{AP}_{3D}^{IoU=0.5}}$ over RTNH.
Submitted 19 October, 2023;
originally announced October 2023.
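For context on the CFAR step mentioned in the abstract above, the following is a minimal sketch of a textbook one-dimensional cell-averaging CFAR detector. It is not the authors' CCTP algorithm, which operates on 4D Radar measurements and produces two filtered outputs; the parameter names (`num_train`, `num_guard`, `scale`) and the toy signal are assumptions made for illustration:

```python
def ca_cfar(power, num_train=4, num_guard=1, scale=3.0):
    """Cell-averaging CFAR on a 1D power profile.

    For each cell under test, the noise level is estimated as the mean of
    `num_train` training cells on each side, skipping `num_guard` guard
    cells next to the cell under test. A detection is declared when the
    cell exceeds `scale` times that estimate.
    """
    detections = []
    half = num_train + num_guard
    for i in range(half, len(power) - half):
        left = power[i - half : i - num_guard]
        right = power[i + num_guard + 1 : i + half + 1]
        noise = (sum(left) + sum(right)) / (len(left) + len(right))
        if power[i] > scale * noise:
            detections.append(i)
    return detections


# Toy profile: flat noise floor with one strong return at index 10.
profile = [1.0] * 20
profile[10] = 10.0
hits = ca_cfar(profile)
```

Because the threshold adapts to the local noise estimate, the same `scale` can be reused across regions with different clutter levels, which is the general motivation for CFAR-style preprocessing.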
-
Fundamental Antisymmetric Mode Acoustic Resonator in Periodically Poled Piezoelectric Film Lithium Niobate
Authors:
Omar Barrera,
Jack Kramer,
Ryan Tetro,
Sinwoo Cho,
Vakhtang Chulukhadze,
Luca Colombo,
Ruochen Lu
Abstract:
Radio frequency (RF) acoustic resonators have long been used for signal processing and sensing. Devices that integrate acoustic resonators benefit from their slow phase velocity (vp), on the order of 3 to 10 km/s, which allows miniaturization of the device. Regarding small form factor, acoustic resonators that operate in the so-called fundamental antisymmetric mode (A0) feature even slower vp (1 to 3 km/s), which allows for smaller devices. This work reports the design and fabrication of A0 mode resonators leveraging the advantages of periodically poled piezoelectric film (P3F) lithium niobate, which includes a pair of piezoelectric layers with opposite polarizations to mitigate the charge cancellation arising from opposite stress of A0 in the top and bottom piezoelectric layers. The fabricated device shows a quality factor (Q) of 800 and an electromechanical coupling (k2) of 3.29%, resulting in a high figure of merit (FoM, Q times k2) of 26.3 at the resonant frequency of 294 MHz, demonstrating the first efficient A0 device in P3F platforms. The proposed A0 platform could enable miniature signal processing, sensing, and ultrasound transducer applications upon optimization.
Submitted 27 August, 2023;
originally announced September 2023.
-
Millimeter Wave Thin-Film Bulk Acoustic Resonator in Sputtered Scandium Aluminum Nitride
Authors:
Sinwoo Cho,
Omar Barrera,
Pietro Simeoni,
Emily N. Marshall,
Jack Kramer,
Keisuke Motoki,
Tzu-Hsuan Hsu,
Vakhtang Chulukhadze,
Matteo Rinaldi,
W. Alan Doolittle,
Ruochen Lu
Abstract:
This work reports a millimeter wave (mmWave) thin-film bulk acoustic resonator (FBAR) in sputtered scandium aluminum nitride (ScAlN). This paper identifies challenges of frequency scaling sputtered ScAlN into mmWave and proposes a stack and new fabrication procedure with a sputtered Sc0.3Al0.7N on Al on Si carrier wafer. The resonator achieves electromechanical coupling (k2) of 7.0% and quality factor (Q) of 62 for the first-order symmetric (S1) mode at 21.4 GHz, along with k2 of 4.0% and Q of 19 for the third-order symmetric (S3) mode at 55.4 GHz, showing higher figures of merit (FoM, k2xQ) than reported AlN/ScAlN-based mmWave acoustic resonators. The ScAlN quality is identified by transmission electron microscopy (TEM) and X-ray diffraction (XRD), identifying the bottlenecks in the existing piezoelectric-metal stack. Further improvement of ScAlN/AlN-based mmWave acoustic resonators calls for better crystalline quality from improved thin-film deposition methods.
Submitted 6 September, 2023;
originally announced September 2023.
-
Thin-Film Lithium Niobate Acoustic Resonator with High Q of 237 and k2 of 5.1% at 50.74 GHz
Authors:
Jack Kramer,
Vakhtang Chulukhadze,
Kenny Huynh,
Omar Barrera,
Michael Liao,
Sinwoo Cho,
Lezli Matto,
Mark S. Goorsky,
Ruochen Lu
Abstract:
This work reports a 50.74 GHz lithium niobate (LiNbO3) acoustic resonator with a high quality factor (Q) of 237 and an electromechanical coupling (k2) of 5.17%, resulting in a figure of merit (FoM, Q x k2) of 12.2. The LiNbO3 resonator employs a novel bilayer periodically poled piezoelectric film (P3F) 128° Y-cut LiNbO3 on amorphous silicon (a-Si) on sapphire stack to achieve low losses and high coupling at millimeter wave (mm-wave). The device also shows a Q of 159, k2 of 65.06%, and FoM of 103.4 for the 16.99 GHz tone. This result shows promising prospects of P3F LiNbO3 towards mm-wave front-end filters.
Submitted 11 July, 2023;
originally announced July 2023.
-
Thin-Film Lithium Niobate Acoustic Filter at 23.5 GHz with 2.38 dB IL and 18.2% FBW
Authors:
Omar Barrera,
Sinwoo Cho,
Lezli Matto,
Jack Kramer,
Kenny Huynh,
Vakhtang Chulukhadze,
Yen-Wei Chang,
Mark S. Goorsky,
Ruochen Lu
Abstract:
This work reports an acoustic filter at 23.5 GHz with a low insertion loss (IL) of 2.38 dB and a 3-dB fractional bandwidth (FBW) of 18.2%, significantly surpassing the state-of-the-art. The device leverages electrically coupled acoustic resonators in 100 nm 128° Y-cut lithium niobate (LiNbO3) piezoelectric thin film, operating in the first-order antisymmetric (A1) mode. A new film stack, namely transferred thin-film LiNbO3 on silicon (Si) substrate with an intermediate amorphous silicon (a-Si) layer, facilitates the record-breaking performance at millimeter-wave (mmWave). The filter features a compact footprint of 0.56 mm2. This letter reports acoustic and EM considerations, along with material characterization by X-ray diffraction, verified with cross-sectional electron microscopy. Upon further development, the reported filter platform can enable various front-end signal-processing functions at mmWave.
Submitted 10 July, 2023;
originally announced July 2023.
-
On the Spatial-Wideband Effects in Millimeter-Wave Cell-Free Massive MIMO
Authors:
Seyoung Ahn,
Soohyeong Kim,
Yongseok Kwon,
Joohan Park,
Jiseung Youn,
Sunghyun Cho
Abstract:
In this paper, we investigate the spatial-wideband effects in cell-free massive MIMO (CF-mMIMO) systems in mmWave bands. The utilization of mmWave frequencies brings challenges such as signal attenuation and the need for denser networks like ultra-dense networks (UDN) to maintain communication performance. CF-mMIMO is introduced as a solution, where distributed access points (APs) transmit signals to a central processing unit (CPU) for joint processing. CF-mMIMO offers advantages in reducing non-line-of-sight (NLOS) conditions and overcoming signal blockage. We investigate the synchronization problem in CF-mMIMO due to time delays between APs and propose a minimum cyclic prefix length to mitigate inter-symbol interference (ISI) in OFDM systems. Furthermore, the spatial correlations of channel responses are analyzed in the frequency-phase domain, and the impact of these correlations on system performance is examined. The findings contribute to improving the performance of CF-mMIMO systems and enhancing the effective utilization of mmWave communication.
△ Less
Submitted 6 July, 2023;
originally announced July 2023.
-
Neural Spectro-polarimetric Fields
Authors:
Youngchan Kim,
Wonjoon **,
Sunghyun Cho,
Seung-Hwan Baek
Abstract:
Modeling the spatial radiance distribution of light rays in a scene has been extensively explored for applications, including view synthesis. Spectrum and polarization, the wave properties of light, are often neglected due to their integration into three RGB spectral bands and their non-perceptibility to human vision. However, these properties are known to encompass substantial material and geometric information about a scene. Here, we propose to model spectro-polarimetric fields, the spatial Stokes-vector distribution of any light ray at an arbitrary wavelength. We present Neural Spectro-polarimetric Fields (NeSpoF), a neural representation that models the physically-valid Stokes vector at given continuous variables of position, direction, and wavelength. NeSpoF manages inherently noisy raw measurements, showcases memory efficiency, and preserves physically vital signals - factors that are crucial for representing the high-dimensional signal of a spectro-polarimetric field. To validate NeSpoF, we introduce the first multi-view hyperspectral-polarimetric image dataset, comprised of both synthetic and real-world scenes. These were captured using our compact hyperspectral-polarimetric imaging system, which has been calibrated for robustness against system imperfections. We demonstrate the capabilities of NeSpoF on diverse scenes.
Submitted 10 December, 2023; v1 submitted 21 June, 2023;
originally announced June 2023.
-
Lightweight Monocular Depth Estimation via Token-Sharing Transformer
Authors:
Dong-Jae Lee,
Jae Young Lee,
Hyounguk Shon,
Eo**dl Yi,
Yeong-Hun Park,
Sung-Sik Cho,
Junmo Kim
Abstract:
Depth estimation is an important task in various robotics systems and applications. In mobile robotics systems, monocular depth estimation is desirable since a single RGB camera can be deployed at a low cost and in a compact size. Due to this significant and growing need, many lightweight monocular depth estimation networks have been proposed for mobile robotics systems. While most lightweight monocular depth estimation methods have been developed using convolutional neural networks, the Transformer has recently been gradually adopted for monocular depth estimation. However, the massive parameter counts and large computational costs of the Transformer hinder deployment on embedded devices. In this paper, we present a Token-Sharing Transformer (TST), an architecture using the Transformer for monocular depth estimation, optimized especially for embedded devices. The proposed TST utilizes global token sharing, which enables the model to obtain an accurate depth prediction with high throughput on embedded devices. Experimental results show that TST outperforms the existing lightweight monocular depth estimation methods. On the NYU Depth v2 dataset, TST can deliver depth maps at up to 63.4 FPS on NVIDIA Jetson Nano and 142.6 FPS on NVIDIA Jetson TX2, with lower errors than the existing methods. Furthermore, TST achieves real-time depth estimation of high-resolution images on Jetson TX2 with competitive results.
Submitted 9 June, 2023;
originally announced June 2023.
-
An empirical study on speech restoration guided by self supervised speech representation
Authors:
Jaeuk Byun,
Youna Ji,
Soo Whan Chung,
Soyeon Choe,
Min Seok Choi
Abstract:
Enhancing speech quality is an indispensable yet difficult task as it is often complicated by a range of degradation factors. In addition to additive noise, reverberation, clipping, and speech attenuation can all adversely affect speech quality. Speech restoration aims to recover speech components from these distortions. This paper focuses on exploring the impact of self-supervised speech representation learning on the speech restoration task. Specifically, we employ speech representation in various speech restoration networks and evaluate their performance under complicated distortion scenarios. Our experiments demonstrate that the contextual information provided by the self-supervised speech representation can enhance speech restoration performance in various distortion scenarios, while also increasing robustness against the duration of speech attenuation and mismatched test conditions.
Submitted 30 May, 2023;
originally announced May 2023.
-
OCELOT: Overlapped Cell on Tissue Dataset for Histopathology
Authors:
Jeongun Ryu,
Aaron Valero Puche,
JaeWoong Shin,
Seonwook Park,
Biagio Brattoli,
**hee Lee,
Wonkyung Jung,
Soo Ick Cho,
Kyunghyun Paeng,
Chan-Young Ock,
Donggeun Yoo,
Sérgio Pereira
Abstract:
Cell detection is a fundamental task in computational pathology that can be used for extracting high-level medical information from whole-slide images. For accurate cell detection, pathologists often zoom out to understand the tissue-level structures and zoom in to classify cells based on their morphology and the surrounding context. However, few efforts have been made to reflect such behaviors by pathologists in cell detection models, mainly due to the lack of datasets containing both cell and tissue annotations with overlapping regions. To overcome this limitation, we propose and publicly release OCELOT, a dataset purposely dedicated to the study of cell-tissue relationships for cell detection in histopathology. OCELOT provides overlapping cell and tissue annotations on images acquired from multiple organs. Within this setting, we also propose multi-task learning approaches that benefit from learning both cell and tissue tasks simultaneously. When compared against a model trained only for the cell detection task, our proposed approaches improve cell detection performance on 3 datasets: proposed OCELOT, public TIGER, and internal CARP datasets. On the OCELOT test set in particular, we show up to 6.79 improvement in F1-score. We believe the contributions of this paper, including the release of the OCELOT dataset at https://lunit-io.github.io/research/publications/ocelot, are a crucial starting point toward the important research direction of incorporating cell-tissue relationships in computational pathology.
Submitted 23 March, 2023; v1 submitted 23 March, 2023;
originally announced March 2023.
-
MLP-SRGAN: A Single-Dimension Super Resolution GAN using MLP-Mixer
Authors:
Samir Mitha,
Seungho Choe,
Pejman Jahbedar Maralani,
Alan R. Moody,
April Khademi
Abstract:
We propose a novel architecture called MLP-SRGAN, which is a single-dimension Super Resolution Generative Adversarial Network (SRGAN) that utilizes Multi-Layer Perceptron Mixers (MLP-Mixers) along with convolutional layers to upsample in the slice direction. MLP-SRGAN is trained and validated using high resolution (HR) FLAIR MRI from the MSSEG2 challenge dataset. The method was applied to three multicentre FLAIR datasets (CAIN, ADNI, CCNA) of images with low spatial resolution in the slice dimension to examine performance on held-out (unseen) clinical data. Upsampled results are compared to several state-of-the-art SR networks. For images with HR ground truths, peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are used to measure upsampling performance. Several new structural, no-reference image quality metrics are proposed to quantify sharpness (edge strength), noise (entropy), and blurriness (low frequency information) in the absence of ground truths. Results show that MLP-SRGAN produces sharper edges and less blurring and preserves more texture and fine anatomical detail, with fewer parameters, faster training/evaluation time, and smaller model size than existing methods. Code for MLP-SRGAN training and inference, data generators, models and no-reference image quality metrics will be available at https://github.com/IAMLAB-Ryerson/MLP-SRGAN.
Submitted 10 March, 2023;
originally announced March 2023.
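As a point of reference for the PSNR metric named in the abstract above, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX² / MSE). It is not taken from the MLP-SRGAN code; the function name `psnr`, the flat-list image representation, and the default `max_value` of 255 are assumptions made for illustration.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    `reference` and `estimate` are flat sequences of intensities
    (hypothetical representation chosen for this sketch).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate an upsampled image closer to the ground truth; the metric is only usable when a high-resolution reference exists, which is why the abstract also mentions no-reference metrics.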
-
Data-Driven Web-Based Patching Management Tool Using Multi-Sensor Pavement Structure Measurements
Authors:
Sneha Jha,
Yaguang Zhang,
Bongsuk Park,
Seonghwan Cho,
James V. Krogmeier,
Tandra Bagchi,
John E. Haddock
Abstract:
Automating pavement maintenance suggestions is challenging, especially for actionable recommendations such as patching location, depth, and priority. It is common practice among State agencies to manually inspect road segments of interest and decide maintenance requirements based on the pavement condition index (PCI). However, standalone PCI only evaluates the pavement surface condition and, coupled with the variability in human perception of pavement distress, limits the accuracy and quality of current pavement maintenance practices. Here, multi-sensor data integrated with standardized pavement distress condition ratings is needed. This study explores the possibility of estimating the appropriate pavement patching strategy (i.e., patching location, depth, and quantity) by integrating pavement structural and surface condition assessment with pavement-specific ratings of distress. Specifically, it combines a pavement structural condition assessment parameter (falling weight deflectometer deflections) with surface condition assessment parameters (international roughness index and cracking density) for a better representation of overall pavement distress condition. Then, a pavement-specific threshold-based patching suggestion algorithm is implemented to convert the overall pavement distress condition into a priority-based patching suggestion. The novelty of the pavement-specific thresholds lies in their data-driven determination from current road condition measurements using a reliability concept validated by the theoretical pavement condition rating, the pavement structural number. A web-based patching manager tool (PMT) was implemented to automate the patching suggestion procedure and visualize the results. Validated with road surface images obtained from three-dimensional laser sensors, PMT could successfully capture localized distresses in existing pavements.
Submitted 10 February, 2023;
originally announced February 2023.
-
Diffusion-based Generative Speech Source Separation
Authors:
Robin Scheibler,
Youna Ji,
Soo-Whan Chung,
Jaeuk Byun,
Soyeon Choe,
Min-Seok Choi
Abstract:
We propose DiffSep, a new single channel source separation method based on score-matching of a stochastic differential equation (SDE). We craft a tailored continuous time diffusion-mixing process starting from the separated sources and converging to a Gaussian distribution centered on their mixture. This formulation lets us apply the machinery of score-based generative modelling. First, we train a neural network to approximate the score function of the marginal probabilities of the diffusion-mixing process. Then, we use it to solve the reverse time SDE that progressively separates the sources starting from their mixture. We propose a modified training strategy to handle model mismatch and source permutation ambiguity. Experiments on the WSJ0 2mix dataset demonstrate the potential of the method. Furthermore, the method is also suitable for speech enhancement and shows performance competitive with prior work on the VoiceBank-DEMAND dataset.
Submitted 2 November, 2022; v1 submitted 31 October, 2022;
originally announced October 2022.
-
Iterative Filter Adaptive Network for Single Image Defocus Deblurring
Authors:
Junyong Lee,
Hyeongseok Son,
Jaesung Rim,
Sunghyun Cho,
Seungyong Lee
Abstract:
We propose a novel end-to-end learning-based approach for single image defocus deblurring. The proposed approach is equipped with a novel Iterative Filter Adaptive Network (IFAN) that is specifically designed to handle spatially-varying and large defocus blur. For adaptively handling spatially-varying blur, IFAN predicts pixel-wise deblurring filters, which are applied to defocused features of an input image to generate deblurred features. For effectively managing large blur, IFAN models deblurring filters as stacks of small-sized separable filters. Predicted separable deblurring filters are applied to defocused features using a novel Iterative Adaptive Convolution (IAC) layer. We also propose a training scheme based on defocus disparity estimation and reblurring, which significantly boosts the deblurring quality. We demonstrate that our method achieves state-of-the-art performance both quantitatively and qualitatively on real-world images.
Submitted 28 March, 2022; v1 submitted 31 August, 2021;
originally announced August 2021.
-
Look Who's Talking: Active Speaker Detection in the Wild
Authors:
You ** Kim,
Hee-Soo Heo,
Soyeon Choe,
Soo-Whan Chung,
Yoohwan Kwon,
Bong-** Lee,
Youngki Kwon,
Joon Son Chung
Abstract:
In this work, we present a novel audio-visual dataset for active speaker detection in the wild. A speaker is considered active when his or her face is visible and the voice is audible simultaneously. Although active speaker detection is a crucial pre-processing step for many audio-visual tasks, there is no existing dataset of natural human speech to evaluate the performance of active speaker detection. We therefore curate the Active Speakers in the Wild (ASW) dataset which contains videos and co-occurring speech segments with dense speech activity labels. Videos and timestamps of audible segments are parsed and adopted from VoxConverse, an existing speaker diarisation dataset that consists of videos in the wild. Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way. Two reference systems, a self-supervised system and a fully supervised one, are evaluated on the dataset to provide the baseline performances of ASW. Cross-domain evaluation is conducted in order to show the negative effect of dubbed videos in the training data.
Submitted 17 August, 2021;
originally announced August 2021.
-
Survey on Aerial Radio Access Networks: Toward a Comprehensive 6G Access Infrastructure
Authors:
Nhu-Ngoc Dao,
Quoc-Viet Pham,
Ngo Hoang Tu,
Tran Thien Thanh,
Vo Nguyen Quoc Bao,
Demeke Shumeye Lakew,
Sungrae Cho
Abstract:
Current network access infrastructures are characterized by heterogeneity, low latency, high throughput, and high computational capability, enabling massive concurrent connections and various services. Unfortunately, this design does not pay significant attention to mobile services in underserved areas. In this context, the use of aerial radio access networks (ARANs) is a promising strategy to complement existing terrestrial communication systems. Involving airborne components such as unmanned aerial vehicles, drones, and satellites, ARANs can quickly establish a flexible access infrastructure on demand. ARANs are expected to support the development of seamless mobile communication systems toward a comprehensive sixth-generation (6G) global access infrastructure. This paper provides an overview of recent studies regarding ARANs in the literature. First, we investigate related work to identify areas for further exploration in terms of recent knowledge advancements and analyses. Second, we define the scope and methodology of this study. Then, we describe ARAN architecture and its fundamental features for the development of 6G networks. In particular, we analyze the system model from several perspectives, including transmission propagation, energy consumption, communication latency, and network mobility. Furthermore, we introduce technologies that enable the success of ARAN implementations in terms of energy replenishment, operational management, and data delivery. Subsequently, we discuss application scenarios envisioned for these technologies. Finally, we highlight ongoing research efforts and trends toward 6G ARANs.
Submitted 27 February, 2021; v1 submitted 14 February, 2021;
originally announced February 2021.
-
Isogeometric Configuration Design Optimization of Three-dimensional Curved Beam Structures for Maximal Fundamental Frequency
Authors:
Myung-** Choi,
Jae-Hyun Kim,
Bonyong Koo,
Seonho Cho
Abstract:
This paper presents a configuration design optimization method for three-dimensional curved beam built-up structures having maximized fundamental eigenfrequency. We develop the method of computation of design velocity field and optimal design of beam structures constrained on a curved surface, where both designs of the embedded beams and the curved surface are simultaneously varied during the optimal design process. A shear-deformable beam model is used in the response analyses of structural vibrations within an isogeometric framework using the NURBS basis functions. An analytical design sensitivity expression of repeated eigenvalues is derived. The developed method is demonstrated through several illustrative examples.
Submitted 23 January, 2021;
originally announced January 2021.
-
Archiver System Management for Belle II Detector Operation
Authors:
Y. -K. Kim,
S. -J. Cho,
S. -H. Park,
M. Nakao,
T. Konno
Abstract:
The Belle II experiment is a high-energy physics experiment at the SuperKEKB electron-positron collider. Using Belle II data, high precision measurement of rare decays and CP-violation in heavy quarks and leptons can be performed to probe New Physics. In this paper, we present the archiver system used to store the monitoring data of the Belle II detector and discuss in particular how we maintain the system that archives the monitoring process variables of the subdetectors. We currently save about 26 thousand variables including the temperature of various subdetector components, status of water leak sensors, high voltage power supply status, data acquisition status, and luminosity information of the colliding beams. For stable data taking, it is essential to collect and archive these variables. We ensure that all the variables from the subdetectors and other systems, as well as the status of the archiver itself, are available, consistent, and regularly updated. To cope with a possible hardware failure, we prepared a backup archiver that is synchronized with the main archiver.
Submitted 30 October, 2020;
originally announced October 2020.
-
Evaluating phase synchronization methods in fMRI: a comparison study and new approaches
Authors:
Hamed Honari,
Ann S. Choe,
Martin A. Lindquist
Abstract:
In recent years there has been growing interest in measuring time-varying functional connectivity between different brain regions using resting-state functional magnetic resonance imaging (rs-fMRI) data. One way to assess the relationship between signals from different brain regions is to measure their phase synchronization (PS) across time. There are several ways to perform such analyses, and here we compare methods that utilize a PS metric together with a sliding window, referred to here as windowed phase synchronization (WPS), with those that directly measure the instantaneous phase synchronization (IPS). In particular, IPS has recently gained popularity as it offers single time-point resolution of time-resolved fMRI connectivity. In this paper, we discuss the underlying assumptions required for performing PS analyses and emphasize the necessity of band-pass filtering the data to obtain valid results. We review various methods for evaluating PS and introduce a new approach within the IPS framework denoted the cosine of the relative phase (CRP). We contrast methods through a series of simulations and application to rs-fMRI data. Our results indicate that CRP outperforms other tested methods and overcomes issues related to undetected temporal transitions from positive to negative associations common in IPS analysis. Further, in contrast to phase coherence, CRP unfolds the distribution of PS measures, which benefits subsequent clustering of PS matrices into recurring brain states.
Submitted 21 September, 2020;
originally announced September 2020.
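The instantaneous phase synchronization idea in the abstract above can be illustrated with a small sketch. It assumes, based only on the name, that the cosine of the relative phase is cos(θx(t) − θy(t)), where θ is the instantaneous phase of each signal's analytic signal. The helper names `analytic_signal` and `cosine_relative_phase` are hypothetical, the analytic signal is built with a naive O(n²) DFT for self-containment, and the inputs are assumed to be already band-pass filtered, as the abstract stresses.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence via a naive DFT (Hilbert-transform method)."""
    n = len(x)
    # Forward DFT of the real input.
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC (and Nyquist for even n), double positive frequencies, zero the rest.
    weights = [0.0] * n
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        for k in range(1, n // 2):
            weights[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            weights[k] = 2.0
    # Inverse DFT of the one-sided spectrum.
    return [
        sum(weights[k] * spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
            for k in range(n)) / n
        for t in range(n)
    ]


def cosine_relative_phase(x, y):
    """Per-time-point cosine of the difference between instantaneous phases."""
    zx, zy = analytic_signal(x), analytic_signal(y)
    return [math.cos(cmath.phase(a) - cmath.phase(b)) for a, b in zip(zx, zy)]
```

Under this assumed definition, values near +1 indicate in-phase signals and values near −1 indicate anti-phase signals, so the sign of the association is retained at each time point.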
-
Separating physically distinct mechanisms in complex infrared plasmonic nanostructures via machine learning enhanced electron energy loss spectroscopy
Authors:
Sergei V. Kalinin,
Kevin M. Roccapriore,
Shin Hum Cho,
Delia J. Milliron,
Rama Vasudevan,
Maxim Ziatdinov,
Jordan A. Hachtel
Abstract:
Low-loss electron energy loss spectroscopy (EELS) has emerged as a technique of choice for exploring the localization of plasmonic phenomena at the nanometer level, necessitating analysis of physical behaviors from 3D spectral data sets. For systems with high localization, linear unmixing methods provide an excellent basis for exploratory analysis, while in more complex systems large numbers of components are needed to accurately capture the true plasmonic response and the physical interpretability of the components becomes uncertain. Here, we explore machine learning based analysis of low-loss EELS data on heterogeneous self-assembled monolayer films of doped-semiconductor nanoparticles, which support infrared resonances. We propose a pathway for supervised analysis of EELS datasets that separates and classifies regions of the films with physically distinct spectral responses. The classifications are shown to be robust, to accurately capture the common spatiospectral tropes of the complex nanostructures, and to be transferable between different datasets to allow high-throughput analysis of large areas of the sample. As such, the approach can be used as a basis for automated experiment workflows based on Bayesian optimization, as demonstrated on the ex situ data. We further demonstrate that non-linear autoencoders (AE) combined with clustering in the latent space of the AE yield highly reduced representations of the system response, giving insight into the relevant physics that does not depend on operator input and bias. The combination of these supervised and unsupervised tools provides complementary insight into the nanoscale plasmonic phenomena.
Submitted 17 September, 2020;
originally announced September 2020.
-
Terahertz Line-Of-Sight MIMO Communication: Theory and Practical Challenges
Authors:
Heedong Do,
Sungmin Cho,
Jeonghun Park,
Ho-** Song,
Namyoon Lee,
Angel Lozano
Abstract:
A relentless trend in wireless communications is the hunger for bandwidth, and fresh bandwidth is only to be found at ever-higher frequencies. While 5G systems are seizing the mmWave band, the attention of researchers is shifting already to the terahertz range. In that distant land of tiny wavelengths, antenna arrays can serve for more than power-enhancing beamforming. Defying lower-frequency wisdom, spatial multiplexing becomes feasible even in line-of-sight conditions. This paper reviews the underpinnings of this phenomenon, and it surveys recent results on the ensuing information-theoretic capacity. Reconfigurable array architectures are put forth that can closely approach such capacity, practical challenges are discussed, and supporting experimental evidence is presented.
Submitted 4 August, 2020;
originally announced August 2020.
-
Overcoming label noise in audio event detection using sequential labeling
Authors:
Jae-Bin Kim,
Seongkyu Mun,
Myungwoo Oh,
Soyeon Choe,
Yong-Hyeok Lee,
Hyung-Min Park
Abstract:
This paper addresses the noisy label issue in audio event detection (AED) by refining strong labels as sequential labels with inaccurate timestamps removed. In AED, strong labels contain the occurrence of a specific event and its timestamps corresponding to the start and end of the event in an audio clip. The timestamps depend on subjectivity of each annotator, and their label noise is inevitable. Contrary to the strong labels, weak labels indicate only the occurrence of a specific event. They do not have the label noise caused by the timestamps, but the time information is excluded. To fully exploit information from available strong and weak labels, we propose an AED scheme to train with sequential labels in addition to the given strong and weak labels after converting the strong labels into the sequential labels. Using sequential labels consistently improved the performance particularly with the segment-based F-score by focusing on occurrences of events. In the mean-teacher-based approach for semi-supervised learning, including an early step with sequential prediction in addition to supervised learning with sequential labels mitigated label noise and inaccurate prediction of the teacher model and improved the segment-based F-score significantly while maintaining the event-based F-score.
Submitted 10 July, 2020;
originally announced July 2020.
-
FaceFilter: Audio-visual speech separation using still images
Authors:
Soo-Whan Chung,
Soyeon Choe,
Joon Son Chung,
Hong-Goo Kang
Abstract:
The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network. Unlike previous works that used lip movement on video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker. In this task, the conditional feature is obtained from facial ap…
▽ More
The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network. Unlike previous works that used lip movement on video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker. In this task, the conditional feature is obtained from facial appearance in cross-modal biometric task, where audio and visual identity representations are shared in latent space. Learnt identities from facial images enforce the network to isolate matched speakers and extract the voices from mixed speech. It solves the permutation problem caused by swapped channel outputs, frequently occurred in speech separation tasks. The proposed method is far more practical than video-based speech separation since user profile images are readily available on many platforms. Also, unlike speaker-aware separation methods, it is applicable on separation with unseen speakers who have never been enrolled before. We show strong qualitative and quantitative results on challenging real-world examples.
Submitted 14 May, 2020;
originally announced May 2020.
-
Quality Prediction on Deep Generative Images
Authors:
Hyunsuk Ko,
Dae Yeol Lee,
Seunghyun Cho,
Alan C. Bovik
Abstract:
In recent years, deep neural networks have been utilized in a wide variety of applications including image generation. In particular, generative adversarial networks (GANs) are able to produce highly realistic pictures as part of tasks such as image compression. As with standard compression, it is desirable to be able to automatically assess the perceptual quality of generative images to monitor and control the encode process. However, existing image quality algorithms are ineffective on GAN generated content, especially on textured regions and at high compressions. Here we propose a new naturalness-based image quality predictor for generative images. Our new GAN picture quality predictor is built using a multi-stage parallel boosting system based on structural similarity features and measurements of statistical similarity. To enable model development and testing, we also constructed a subjective GAN image quality database containing (distorted) GAN images and collected human opinions of them. Our experimental results indicate that our proposed GAN IQA model delivers superior quality predictions on the generative image datasets, as well as on traditional image quality datasets.
Submitted 17 April, 2020;
originally announced April 2020.
-
In defence of metric learning for speaker recognition
Authors:
Joon Son Chung,
Jaesung Huh,
Seongkyu Mun,
Minjae Lee,
Hee Soo Heo,
Soyeon Choe,
Chiheon Ham,
Sunghwan Jung,
Bong-Jin Lee,
Icksang Han
Abstract:
The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance.
A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper, we present an extensive evaluation of most popular loss functions for speaker recognition on the VoxCeleb dataset. We demonstrate that the vanilla triplet loss shows competitive performance compared to classification-based losses, and those trained with our proposed metric learning objective outperform state-of-the-art methods.
Submitted 24 April, 2020; v1 submitted 26 March, 2020;
originally announced March 2020.
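For readers unfamiliar with the vanilla triplet loss mentioned in the abstract above, a minimal sketch follows. It assumes squared Euclidean distance between embeddings and a margin of 0.2; both are illustrative choices rather than the paper's settings, and the function names are hypothetical.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,  # assumed value, for illustration only
) -> float:
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Easy triplet: the same-speaker embedding is already much closer than the other speaker.
print(triplet_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]))  # 0.0
# Hard triplet: swapping positive and negative yields a positive loss.
print(triplet_loss([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is what pushes embeddings toward small intra-speaker and large inter-speaker distances.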
-
An End-to-End Joint Learning Scheme of Image Compression and Quality Enhancement with Improved Entropy Minimization
Authors:
Jooyoung Lee,
Seunghyun Cho,
Munchurl Kim
Abstract:
Recently, learned image compression methods have been actively studied. Among them, entropy-minimization based approaches have achieved superior results compared to conventional image codecs such as BPG and JPEG2000. However, quality enhancement and rate minimization conflict with each other in the process of image compression. That is, maintaining high image quality entails less compression and vice versa. Nevertheless, by jointly training a separate quality enhancement stage in conjunction with image compression, the coding efficiency can be improved. In this paper, we propose a novel joint learning scheme of image compression and quality enhancement, called JointIQ-Net, as well as an improved entropy model, thus achieving significantly improved coding efficiency over previous methods. Our proposed JointIQ-Net combines an image compression sub-network and a quality enhancement sub-network in a cascade, both of which are trained end-to-end in a combined manner within the JointIQ-Net. The JointIQ-Net also benefits from improved entropy minimization that newly adopts a Gaussian Mixture Model (GMM) and further exploits global context to estimate the probabilities of latent representations. To show the effectiveness of our proposed JointIQ-Net, we performed extensive experiments, which showed that the JointIQ-Net achieves a remarkable performance improvement in coding efficiency in terms of both PSNR and MS-SSIM, compared to previous learned image compression methods and conventional codecs such as VVC Intra (VTM 7.1), BPG, and JPEG2000. To the best of our knowledge, this is the first end-to-end optimized image compression method that outperforms VTM 7.1 (Intra), the latest reference software of the VVC standard, in terms of PSNR and MS-SSIM.
Submitted 13 March, 2020; v1 submitted 30 December, 2019;
originally announced December 2019.
-
W-Net: Two-stage U-Net with misaligned data for raw-to-RGB mapping
Authors:
Kwang-Hyun Uhm,
Seung-Wook Kim,
Seo-Won Ji,
Sung-Jin Cho,
Jun-Pyo Hong,
Sung-Jea Ko
Abstract:
Recent research on learning a mapping between raw Bayer images and RGB images has progressed with the development of deep convolutional neural networks. A challenging data set, namely the Zurich Raw-to-RGB data set (ZRR), was released in the AIM 2019 raw-to-RGB mapping challenge. In ZRR, input raw and target RGB images are captured by two different cameras and are thus not perfectly aligned. Moreover, camera metadata such as white balance gains and the color correction matrix are not provided, which makes the challenge more difficult. In this paper, we explore an effective network structure and a loss function to address these issues. We exploit a two-stage U-Net architecture and also introduce a loss function that is less variant to alignment and more sensitive to color differences. In addition, we show that an ensemble of networks trained with different loss functions can bring a significant performance gain. We demonstrate the superiority of our method by achieving the highest score in terms of both the peak signal-to-noise ratio and the structural similarity and obtaining the second-best mean-opinion-score in the challenge.
Submitted 21 November, 2019; v1 submitted 19 November, 2019;
originally announced November 2019.
-
Emotional Voice Conversion using Multitask Learning with Text-to-speech
Authors:
Tae-Ho Kim,
Sungjae Cho,
Shinkook Choi,
Sejik Park,
Soo-Young Lee
Abstract:
Voice conversion (VC) is the task of transforming a person's voice to a different style while conserving linguistic content. The previous state of the art in VC is based on a sequence-to-sequence (seq2seq) model, which can distort linguistic information. An earlier attempt to overcome this used textual supervision, but it requires explicit alignment, which loses the benefit of using a seq2seq model. In this paper, a voice converter using multitask learning with text-to-speech (TTS) is presented. The embedding space of seq2seq-based TTS has abundant information on the text. The role of the TTS decoder is to convert the embedding space to speech, which is the same as in VC. In the proposed model, the whole network is trained to minimize the losses of both VC and TTS. VC is expected to capture more linguistic information and to preserve training stability through multitask learning. VC experiments were performed on a male Korean emotional text-speech dataset, and they show that multitask learning helps preserve linguistic content in VC.
Submitted 27 November, 2019; v1 submitted 11 November, 2019;
originally announced November 2019.
-
The sound of my voice: speaker representation loss for target voice separation
Authors:
Seongkyu Mun,
Soyeon Choe,
Jaesung Huh,
Joon Son Chung
Abstract:
Content and style representations have been widely studied in the field of style transfer. In this paper, we propose a new loss function using speaker content representation for audio source separation, and we call it speaker representation loss. The objective is to extract the target speaker voice from the noisy input and also remove it from the residual components. Compared to the conventional spectral reconstruction, our proposed framework maximizes the use of target speaker information by minimizing the distance between the speaker representations of reference and source separation output. We also propose triplet speaker representation loss as an additional criterion to remove the target speaker information from residual spectrogram output. VoiceFilter framework is adopted to evaluate source separation performance using the VCTK database, and we achieved improved performances compared to the baseline loss function without any additional network parameters.
Submitted 27 February, 2020; v1 submitted 6 November, 2019;
originally announced November 2019.
-
Adaptive Control for Marine Vessels Against Harsh Environmental Variation
Authors:
Fangwen Tu,
Shuzhi Sam Ge,
Yoo Sang Choo,
Chang Chieh Hang
Abstract:
In this paper, robust control with a sea state observer and dynamic thrust allocation is proposed for the Dynamic Positioning (DP) of an accommodation vessel in the presence of unknown hydrodynamic force variation and input time delay. In order to overcome the huge force variation due to the adjoining Floating Production Storage and Offloading (FPSO) and accommodation vessel, a novel sea state observer is designed. The sea observer can effectively monitor the variation of the drift wave-induced force on the vessel and activate a Neural Network (NN) compensator in the controller when a large wave force is identified. Moreover, the wind drag coefficients can be adaptively approximated in the sea observer so that feedforward control can be achieved. Based on this, a robust constrained control is developed to guarantee safe operation. The time delay inside the control input is also considered. A dynamic thrust allocation module is presented to distribute the generalized control input among azimuth thrusters. Under the proposed sea observer and control, the boundedness of all the closed-loop signals is demonstrated via rigorous Lyapunov analysis. A set of simulation studies is conducted to verify the effectiveness of the proposed control scheme.
Submitted 29 September, 2019;
originally announced September 2019.
-
Orthonormal Embedding-based Deep Clustering for Single-channel Speech Separation
Authors:
Soyeon Choe,
Soo-Whan Chung,
Youna Ji,
Hong-Goo Kang
Abstract:
Deep clustering is a deep neural network-based speech separation algorithm that first trains the mixed component of signals with high-dimensional embeddings, and then uses a clustering algorithm to separate each mixture of sources. In this paper, we extend the baseline criterion of deep clustering with an additional regularization term to further improve the overall performance. This term imposes a condition on the embeddings that reduces the correlation between embedding dimensions, leading to better decomposition of the spectral bins. The regularization term helps to mitigate the unavoidable permutation problem in the conventional deep clustering method, which enables better clustering through the formation of optimal embeddings. We evaluate the results by varying embedding dimension, signal-to-interference ratio (SIR), and gender dependency. The performance comparison with the source separation measurement metric, i.e. signal-to-distortion ratio (SDR), confirms that the proposed method outperforms the conventional deep clustering method.
Submitted 15 January, 2019;
originally announced January 2019.
-
Context-adaptive Entropy Model for End-to-end Optimized Image Compression
Authors:
Jooyoung Lee,
Seunghyun Cho,
Seung-Kwon Beack
Abstract:
We propose a context-adaptive entropy model for use in end-to-end optimized image compression. Our model exploits two types of contexts, bit-consuming contexts and bit-free contexts, distinguished based upon whether additional bit allocation is required. Based on these contexts, we allow the model to more accurately estimate the distribution of each latent representation with a more generalized form of the approximation models, which accordingly leads to an enhanced compression performance. Based on the experimental results, the proposed method outperforms the traditional image codecs, such as BPG and JPEG2000, as well as other previous artificial-neural-network (ANN) based approaches, in terms of the peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM) index.
Submitted 6 May, 2019; v1 submitted 27 September, 2018;
originally announced September 2018.
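The abstract above reports results in terms of PSNR. As a reminder of what that metric measures, here is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). It assumes 8-bit images flattened into lists of pixel values; the function name `psnr` is illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], distorted: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(distorted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


original = [100.0, 120.0, 140.0, 160.0]
decoded = [110.0, 110.0, 150.0, 150.0]  # every pixel is off by 10
print(psnr(original, decoded))
print(psnr(original, original))  # inf
```

Higher values mean the decoded image is closer to the original, so codecs are compared by the PSNR they reach at a given bit rate.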
-
Deep-neural-network based sinogram synthesis for sparse-view CT image reconstruction
Authors:
Hoyeon Lee,
Jongha Lee,
Hyeongseok Kim,
Byungchul Cho,
Seungryong Cho
Abstract:
Recently, a number of approaches to low-dose computed tomography (CT) have been developed and deployed in commercialized CT scanners. Tube current reduction is perhaps the most actively explored technology with advanced image reconstruction algorithms. Sparse data sampling is another viable option for low-dose CT, and sparse-view CT has been of particular interest among researchers in the CT community. Since analytic image reconstruction algorithms would lead to severe image artifacts, various iterative algorithms have been developed for reconstructing images from sparsely view-sampled projection data. However, iterative algorithms take much longer computation time than the analytic algorithms, and images are usually prone to different types of image artifacts that heavily depend on the reconstruction parameters. Interpolation methods have also been utilized to fill the missing data in the sinogram of sparse-view CT, thus providing synthetically full data for analytic image reconstruction. In this work, we introduce a deep-neural-network-enabled sinogram synthesis method for sparse-view CT, and show that it outperforms the existing interpolation methods and also the iterative image reconstruction approach.
Submitted 5 March, 2018; v1 submitted 1 March, 2018;
originally announced March 2018.
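The interpolation baselines mentioned in the abstract above can be pictured with a small sketch. It assumes the sinogram is a list of measured views (one list of detector readings per projection angle) and fills the missing views by linear interpolation between neighbouring measured views. This is a generic baseline for illustration, not the network proposed in the paper, and `upsample_views` is a hypothetical name.

```python
from typing import List


def upsample_views(sparse: List[List[float]], factor: int) -> List[List[float]]:
    """Insert (factor - 1) linearly interpolated views between measured views."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    if not sparse:
        return []
    dense: List[List[float]] = []
    for current, following in zip(sparse, sparse[1:]):
        for step in range(factor):
            t = step / factor  # 0 reproduces the measured view itself
            dense.append([(1 - t) * c + t * f for c, f in zip(current, following)])
    dense.append(list(sparse[-1]))  # keep the final measured view
    return dense


measured = [[0.0, 2.0], [4.0, 6.0]]  # two views, two detector bins each
print(upsample_views(measured, 2))   # [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]
```

The sketch does not wrap around from the last angle to the first, which a full implementation would need to handle according to the scan geometry.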
-
Autonomous Power Allocation based on Distributed Deep Learning for Device-to-Device Communication Underlaying Cellular Network
Authors:
Jeehyeong Kim,
Joohan Park,
Jaewon Noh,
Sunghyun Cho
Abstract:
For device-to-device (D2D) communication in an Internet-of-Things (IoT) enabled 5G system, centralized resource allocation that accounts for the complicated interference between different links is limited. If a D2D link is controlled by an enhanced node base station (eNB), it places a burden on the eNB and increases latency. This paper proposes a fully autonomous power allocation method for IoT-D2D communication underlaying cellular networks using deep learning. In the proposed scheme, an IoT-D2D transmitter decides the transmit power independently of the eNB and other IoT-D2D devices. In addition, the power set can be nearly optimized by deep learning in a distributed manner to achieve higher cell throughput. We present a distributed deep learning architecture in which the devices are trained as a group but operate independently. The deep learning approach can attain near-optimal cell throughput while suppressing interference to the eNB.
Submitted 8 June, 2020; v1 submitted 8 February, 2018;
originally announced February 2018.
-
Audio Cover Song Identification using Convolutional Neural Network
Authors:
Sungkyun Chang,
Juheon Lee,
Sang Keun Choe,
Kyogu Lee
Abstract:
In this paper, we propose a new approach to cover song identification using a CNN (convolutional neural network). Most previous studies extract feature vectors that characterize the cover song relation from a pair of songs and use them to compute the (dis)similarity between the two songs. Based on the observation that there is a meaningful pattern between cover songs and that this can be learned, we have reformulated the cover song identification problem in a machine learning framework. To do this, we first build the CNN using as an input a cross-similarity matrix generated from a pair of songs. We then construct the data set composed of cover song pairs and non-cover song pairs, which are used as positive and negative training samples, respectively. The trained CNN outputs the probability of being in the cover song relation given a cross-similarity matrix generated from any two pieces of music and identifies the cover song by ranking by this probability. Experimental results show that the proposed algorithm achieves performance better than or comparable to the state-of-the-art.
Submitted 26 October, 2020; v1 submitted 30 November, 2017;
originally announced December 2017.
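The cross-similarity matrix that serves as CNN input in the abstract above can be sketched as follows. The abstract does not state which features or similarity measure are used, so this example assumes each song is a list of per-frame feature vectors and uses cosine similarity; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two frames; 0.0 if either frame is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_similarity(song_a: List[Sequence[float]],
                     song_b: List[Sequence[float]]) -> List[List[float]]:
    """Entry [i][j] compares frame i of song_a with frame j of song_b."""
    return [[cosine(frame_a, frame_b) for frame_b in song_b] for frame_a in song_a]


song_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # three frames
song_b = [[1.0, 0.0], [0.0, 1.0]]              # two frames
matrix = cross_similarity(song_a, song_b)
for row in matrix:
    print(row)
```

The matrix has one row per frame of the first song and one column per frame of the second, so stretches of high similarity along diagonals indicate passages the two songs share.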