Search | arXiv e-print repository

Multilingual acoustic word embeddings for zero-resource languages

Abstract: This research addresses the challenge of develo** speech applications for zero-resource languages that lack labelled data. It specifically uses acoustic word embedding (AWE) -- fixed-dimensional representations of variable-duration speech segments -- employing multilingual transfer, where labelled data from several well-resourced languages are used for pertaining. The study introduces a new neur… ▽ More This research addresses the challenge of develo** speech applications for zero-resource languages that lack labelled data. It specifically uses acoustic word embedding (AWE) -- fixed-dimensional representations of variable-duration speech segments -- employing multilingual transfer, where labelled data from several well-resourced languages are used for pertaining. The study introduces a new neural network that outperforms existing AWE models on zero-resource languages. It explores the impact of the choice of well-resourced languages. AWEs are applied to a keyword-spotting system for hate speech detection in Swahili radio broadcasts, demonstrating robustness in real-world scenarios. Additionally, novel semantic AWE models improve semantic query-by-example search. △ Less

Submitted 23 January, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

Comments: PhD thesis

arXiv:2401.02192 [pdf]

Nodule detection and generation on chest X-rays: NODE21 Challenge

Authors: Ecem Sogancioglu, Bram van Ginneken, Finn Behrendt, Marcel Bengs, Alexander Schlaefer, Miron Radu, Di Xu, Ke Sheng, Fabien Scalzo, Eric Marcus, Samuele Papa, Jonas Teuwen, Ernst Th. Scholten, Steven Schalekamp, Nils Hendrix, Colin Jacobs, Ward Hendrix, Clara I Sánchez, Keelin Murphy

Abstract: Pulmonary nodules may be an early manifestation of lung cancer, the leading cause of cancer-related deaths among both men and women. Numerous studies have established that deep learning methods can yield high-performance levels in the detection of lung nodules in chest X-rays. However, the lack of gold-standard public datasets slows down the progression of the research and prevents benchmarking of… ▽ More Pulmonary nodules may be an early manifestation of lung cancer, the leading cause of cancer-related deaths among both men and women. Numerous studies have established that deep learning methods can yield high-performance levels in the detection of lung nodules in chest X-rays. However, the lack of gold-standard public datasets slows down the progression of the research and prevents benchmarking of methods for this task. To address this, we organized a public research challenge, NODE21, aimed at the detection and generation of lung nodules in chest X-rays. While the detection track assesses state-of-the-art nodule detection systems, the generation track determines the utility of nodule generation algorithms to augment training data and hence improve the performance of the detection systems. This paper summarizes the results of the NODE21 challenge and performs extensive additional experiments to examine the impact of the synthetically generated nodule training images on the detection algorithm performance. △ Less

Submitted 4 January, 2024; originally announced January 2024.

Comments: 15 pages, 5 figures

arXiv:2311.05032 [pdf, other]

Transfer learning from a sparsely annotated dataset of 3D medical images

Authors: Gabriel Efrain Humpire-Mamani, Colin Jacobs, Mathias Prokop, Bram van Ginneken, Nikolas Lessmann

Abstract: Transfer learning leverages pre-trained model features from a large dataset to save time and resources when training new models for various tasks, potentially enhancing performance. Due to the lack of large datasets in the medical imaging domain, transfer learning from one medical imaging model to other medical imaging models has not been widely explored. This study explores the use of transfer le… ▽ More Transfer learning leverages pre-trained model features from a large dataset to save time and resources when training new models for various tasks, potentially enhancing performance. Due to the lack of large datasets in the medical imaging domain, transfer learning from one medical imaging model to other medical imaging models has not been widely explored. This study explores the use of transfer learning to improve the performance of deep convolutional neural networks for organ segmentation in medical imaging. A base segmentation model (3D U-Net) was trained on a large and sparsely annotated dataset; its weights were used for transfer learning on four new down-stream segmentation tasks for which a fully annotated dataset was available. We analyzed the training set size's influence to simulate scarce data. The results showed that transfer learning from the base model was beneficial when small datasets were available, providing significant performance improvements; where fine-tuning the base model is more beneficial than updating all the network weights with vanilla transfer learning. Transfer learning with fine-tuning increased the performance by up to 0.129 (+28\%) Dice score than experiments trained from scratch, and on average 23 experiments increased the performance by 0.029 Dice score in the new segmentation tasks. The study also showed that cross-modality transfer learning using CT scans was beneficial. The findings of this study demonstrate the potential of transfer learning to improve the efficiency of annotation and increase the accessibility of accurate organ segmentation in medical imaging, ultimately leading to improved patient care. We made the network definition and weights publicly available to benefit other users and researchers. △ Less

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2310.10851 [pdf, other]

doi 10.1007/978-3-031-43129-6_12

Tracking China's cross-strait bot networks against Taiwan

Authors: Charity S. Jacobs, Lynnette Hui Xian Ng, Kathleen M. Carley

Abstract: The cross-strait relationship between China and Taiwan is marked by increasing hostility around potential reunification. We analyze an unattributed bot network and how repeater bots engaged in an influence campaign against Taiwan following US House Speaker Nancy Pelosi's visit to Taiwan in 2022. We examine the message amplification tactics employed by four key bot sub-communities, the widespread d… ▽ More The cross-strait relationship between China and Taiwan is marked by increasing hostility around potential reunification. We analyze an unattributed bot network and how repeater bots engaged in an influence campaign against Taiwan following US House Speaker Nancy Pelosi's visit to Taiwan in 2022. We examine the message amplification tactics employed by four key bot sub-communities, the widespread dissemination of information across multiple platforms through URLs, and the potential targeted audiences of this bot network. We find that URL link sharing reveals circumvention around YouTube suspensions, in addition to the potential effectiveness of algorithmic bot connectivity to appear less bot-like, and detail a sequence of coordination within a sub-community for message amplification. We additionally find the narratives and targeted audience potentially shifting after account activity discrepancies, demonstrating how dynamic these bot networks can operate. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: 10 pages with 5 figures. Published in Conference Proceedings for Social, Cultural, and Behavioral Modeling (SBP-BRiMS 2023)

arXiv:2309.03383 [pdf, other]

Kidney abnormality segmentation in thorax-abdomen CT scans

Authors: Gabriel Efrain Humpire Mamani, Nikolas Lessmann, Ernst Th. Scholten, Mathias Prokop, Colin Jacobs, Bram van Ginneken

Abstract: In this study, we introduce a deep learning approach for segmenting kidney parenchyma and kidney abnormalities to support clinicians in identifying and quantifying renal abnormalities such as cysts, lesions, masses, metastases, and primary tumors. Our end-to-end segmentation method was trained on 215 contrast-enhanced thoracic-abdominal CT scans, with half of these scans containing one or more abn… ▽ More In this study, we introduce a deep learning approach for segmenting kidney parenchyma and kidney abnormalities to support clinicians in identifying and quantifying renal abnormalities such as cysts, lesions, masses, metastases, and primary tumors. Our end-to-end segmentation method was trained on 215 contrast-enhanced thoracic-abdominal CT scans, with half of these scans containing one or more abnormalities. We began by implementing our own version of the original 3D U-Net network and incorporated four additional components: an end-to-end multi-resolution approach, a set of task-specific data augmentations, a modified loss function using top-$k$, and spatial dropout. Furthermore, we devised a tailored post-processing strategy. Ablation studies demonstrated that each of the four modifications enhanced kidney abnormality segmentation performance, while three out of four improved kidney parenchyma segmentation. Subsequently, we trained the nnUNet framework on our dataset. By ensembling the optimized 3D U-Net and the nnUNet with our specialized post-processing, we achieved marginally superior results. Our best-performing model attained Dice scores of 0.965 and 0.947 for segmenting kidney parenchyma in two test sets (20 scans without abnormalities and 30 with abnormalities), outperforming an independent human observer who scored 0.944 and 0.925, respectively. In segmenting kidney abnormalities within the 30 test scans containing them, the top-performing method achieved a Dice score of 0.585, while an independent second human observer reached a score of 0.664, suggesting potential for further improvement in computerized methods. All training data is available to the research community under a CC-BY 4.0 license on https://doi.org/10.5281/zenodo.8014289 △ Less

Submitted 6 September, 2023; originally announced September 2023.

arXiv:2309.02576 [pdf, other]

doi 10.1038/s41598-023-40116-6

Emphysema Subty** on Thoracic Computed Tomography Scans using Deep Neural Networks

Authors: Weiyi Xie, Colin Jacobs, Jean-Paul Charbonnier, Dirk Jan Slebos, Bram van Ginneken

Abstract: Accurate identification of emphysema subtypes and severity is crucial for effective management of COPD and the study of disease heterogeneity. Manual analysis of emphysema subtypes and severity is laborious and subjective. To address this challenge, we present a deep learning-based approach for automating the Fleischner Society's visual score system for emphysema subty** and severity analysis. W… ▽ More Accurate identification of emphysema subtypes and severity is crucial for effective management of COPD and the study of disease heterogeneity. Manual analysis of emphysema subtypes and severity is laborious and subjective. To address this challenge, we present a deep learning-based approach for automating the Fleischner Society's visual score system for emphysema subty** and severity analysis. We trained and evaluated our algorithm using 9650 subjects from the COPDGene study. Our algorithm achieved the predictive accuracy at 52\%, outperforming a previously published method's accuracy of 45\%. In addition, the agreement between the predicted scores of our method and the visual scores was good, where the previous method obtained only moderate agreement. Our approach employs a regression training strategy to generate categorical labels while simultaneously producing high-resolution localized activation maps for visualizing the network predictions. By leveraging these dense activation maps, our method possesses the capability to compute the percentage of emphysema involvement per lung in addition to categorical severity scores. Furthermore, the proposed method extends its predictive capabilities beyond centrilobular emphysema to include paraseptal emphysema subtypes. △ Less

Submitted 5 September, 2023; originally announced September 2023.

Journal ref: Sci Rep. 2023 Aug 29;13(1):14147

arXiv:2308.07179 [pdf, other]

Incorporating Annotator Uncertainty into Representations of Discourse Relations

Authors: S. Magalí López Cortez, Cassandra L. Jacobs

Abstract: Annotation of discourse relations is a known difficult task, especially for non-expert annotators. In this paper, we investigate novice annotators' uncertainty on the annotation of discourse relations on spoken conversational data. We find that dialogue context (single turn, pair of turns within speaker, and pair of turns across speakers) is a significant predictor of confidence scores. We compute… ▽ More Annotation of discourse relations is a known difficult task, especially for non-expert annotators. In this paper, we investigate novice annotators' uncertainty on the annotation of discourse relations on spoken conversational data. We find that dialogue context (single turn, pair of turns within speaker, and pair of turns across speakers) is a significant predictor of confidence scores. We compute distributed representations of discourse relations from co-occurrence statistics that incorporate information about confidence scores and dialogue context. We perform a hierarchical clustering analysis using these representations and show that weighting discourse relation representations with information about confidence and dialogue context coherently models our annotators' uncertainty about discourse relation labels. △ Less

Submitted 14 August, 2023; originally announced August 2023.

arXiv:2307.03645 [pdf, other]

The distribution of discourse relations within and across turns in spontaneous conversation

Authors: S. Magalí López Cortez, Cassandra L. Jacobs

Abstract: Time pressure and topic negotiation may impose constraints on how people leverage discourse relations (DRs) in spontaneous conversational contexts. In this work, we adapt a system of DRs for written language to spontaneous dialogue using crowdsourced annotations from novice annotators. We then test whether discourse relations are used differently across several types of multi-utterance contexts. W… ▽ More Time pressure and topic negotiation may impose constraints on how people leverage discourse relations (DRs) in spontaneous conversational contexts. In this work, we adapt a system of DRs for written language to spontaneous dialogue using crowdsourced annotations from novice annotators. We then test whether discourse relations are used differently across several types of multi-utterance contexts. We compare the patterns of DR annotation within and across speakers and within and across turns. Ultimately, we find that different discourse contexts produce distinct distributions of discourse relations, with single-turn annotations creating the most uncertainty for annotators. Additionally, we find that the discourse relation annotations are of sufficient quality to predict from embeddings of discourse units. △ Less

Submitted 7 July, 2023; originally announced July 2023.

Comments: Proceedings of Computational Approaches to Discourse 2023, collocated with the 2023 meeting of the Association for Computational Linguistics, Toronto, Canada

arXiv:2307.02083 [pdf, other]

Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings

Authors: Christiaan Jacobs, Herman Kamper

Abstract: Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content so that different realisations of the same word have similar embeddings. In this paper we explore semantic AWE modelling. These AWEs should not only capture phonetics but also the meaning of a word (similar to textual word embeddings). We consider the scenario where we only… ▽ More Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content so that different realisations of the same word have similar embeddings. In this paper we explore semantic AWE modelling. These AWEs should not only capture phonetics but also the meaning of a word (similar to textual word embeddings). We consider the scenario where we only have untranscribed speech in a target language. We introduce a number of strategies leveraging a pre-trained multilingual AWE model -- a phonetic AWE model trained on labelled data from multiple languages excluding the target. Our best semantic AWE approach involves clustering word segments using the multilingual AWE model, deriving soft pseudo-word labels from the cluster centroids, and then training a Skipgram-like model on the soft vectors. In an intrinsic word similarity task measuring semantics, this multilingual transfer approach outperforms all previous semantic AWE methods. We also show -- for the first time -- that AWEs can be used for downstream semantic query-by-example search. △ Less

Submitted 5 July, 2023; originally announced July 2023.

Comments: Submitted to IEEE SPL

arXiv:2306.00410 [pdf, other]

Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili

Authors: Christiaan Jacobs, Nathanaël Carraz Rakotonirina, Everlyn Asiko Chimoto, Bruce A. Bassett, Herman Kamper

Abstract: We consider hate speech detection through keyword spotting on radio broadcasts. One approach is to build an automatic speech recognition (ASR) system for the target low-resource language. We compare this to using acoustic word embedding (AWE) models that map speech segments to a space where matching words have similar vectors. We specifically use a multilingual AWE model trained on labelled data f… ▽ More We consider hate speech detection through keyword spotting on radio broadcasts. One approach is to build an automatic speech recognition (ASR) system for the target low-resource language. We compare this to using acoustic word embedding (AWE) models that map speech segments to a space where matching words have similar vectors. We specifically use a multilingual AWE model trained on labelled data from well-resourced languages to spot keywords in data in the unseen target language. In contrast to ASR, the AWE approach only requires a few keyword exemplars. In controlled experiments on Wolof and Swahili where training and test data are from the same domain, an ASR model trained on just five minutes of data outperforms the AWE approach. But in an in-the-wild test on Swahili radio broadcasts with actual hate speech keywords, the AWE model (using one minute of template data) is more robust, giving similar performance to an ASR system trained on 30 hours of labelled data. △ Less

Submitted 1 June, 2023; originally announced June 2023.

Comments: Accepted to Interspeech 2023

arXiv:2208.01561 [pdf, other]

Lost in Space Marking

Authors: Cassandra L. Jacobs, Yuval Pinter

Abstract: We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one… ▽ More We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one trained on raw text benefits from marking word ends. Our findings generalize across domains. △ Less

Submitted 2 August, 2022; originally announced August 2022.

Comments: Submission to SIGMORPHON 2021

arXiv:2201.04532 [pdf, other]

Structure and position-aware graph neural network for airway labeling

Authors: Weiyi Xie, Colin Jacobs, Jean-Paul Charbonnier, Bram van Ginneken

Abstract: We present a novel graph-based approach for labeling the anatomical branches of a given airway tree segmentation. The proposed method formulates airway labeling as a branch classification problem in the airway tree graph, where branch features are extracted using convolutional neural networks (CNN) and enriched using graph neural networks. Our graph neural network is structure-aware by having each… ▽ More We present a novel graph-based approach for labeling the anatomical branches of a given airway tree segmentation. The proposed method formulates airway labeling as a branch classification problem in the airway tree graph, where branch features are extracted using convolutional neural networks (CNN) and enriched using graph neural networks. Our graph neural network is structure-aware by having each node aggregate information from its local neighbors and position-aware by encoding node positions in the graph. We evaluated the proposed method on 220 airway trees from subjects with various severity stages of Chronic Obstructive Pulmonary Disease (COPD). The results demonstrate that our approach is computationally efficient and significantly improves branch classification performance than the baseline method. The overall average accuracy of our method reaches 91.18\% for labeling all 18 segmental airway branches, compared to 83.83\% obtained by the standard CNN method. We published our source code at https://github.com/DIAGNijmegen/spgnn. The proposed algorithm is also publicly available at https://grand-challenge.org/algorithms/airway-anatomical-labeling/. △ Less

Submitted 12 January, 2022; originally announced January 2022.

arXiv:2106.12834 [pdf, other]

Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language

Authors: Christiaan Jacobs, Herman Kamper

Abstract: Acoustic word embedding models map variable duration speech segments to fixed dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained using labelle… ▽ More Acoustic word embedding models map variable duration speech segments to fixed dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained using labelled data from multiple well-resourced languages and then applied to a target zero-resource language (without fine-tuning). However, it is still unclear how the specific choice of training languages affect downstream performance. Concretely, here we ask whether it is beneficial to use training languages related to the target. Using data from eleven languages spoken in Southern Africa, we experiment with adding data from different language families while controlling for the amount of data per language. In word discrimination and query-by-example search evaluations, we show that training on languages from the same family gives large improvements. Through finer-grained analysis, we show that training on even just a single related language gives the largest gain. We also find that adding data from unrelated languages generally doesn't hurt performance. △ Less

Submitted 24 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021

arXiv:2106.01351 [pdf, other]

Deep Clustering Activation Maps for Emphysema Subty**

Authors: Weiyi Xie, Colin Jacobs, Bram van Ginneken

Abstract: We propose a deep learning clustering method that exploits dense features from a segmentation network for emphysema subty** from computed tomography (CT) scans. Using dense features enables high-resolution visualization of image regions corresponding to the cluster assignment via dense clustering activation maps (dCAMs). This approach provides model interpretability. We evaluated clustering resu… ▽ More We propose a deep learning clustering method that exploits dense features from a segmentation network for emphysema subty** from computed tomography (CT) scans. Using dense features enables high-resolution visualization of image regions corresponding to the cluster assignment via dense clustering activation maps (dCAMs). This approach provides model interpretability. We evaluated clustering results on 500 subjects from the COPDGenestudy, where radiologists manually annotated emphysema sub-types according to their visual CT assessment. We achieved a 43% unsupervised clustering accuracy, outperforming our baseline at 41% and yielding results comparable to supervised classification at 45%. The proposed method also offers a better cluster formation than the baseline, achieving0.54 in silhouette coefficient and 0.55 in David-Bouldin scores. △ Less

Submitted 1 June, 2021; originally announced June 2021.

arXiv:2105.11748 [pdf, other]

Dense Regression Activation Maps For Lesion Segmentation in CT scans of COVID-19 patients

Authors: Weiyi Xie, Colin Jacobs, Jean-Paul Charbonnier, Bram van Ginneken

Abstract: Automatic lesion segmentation on thoracic CT enables rapid quantitative analysis of lung involvement in COVID-19 infections. However, obtaining a large amount of voxel-level annotations for training segmentation networks is prohibitively expensive. Therefore, we propose a weakly-supervised segmentation method based on dense regression activation maps (dRAMs). Most weakly-supervised segmentation ap… ▽ More Automatic lesion segmentation on thoracic CT enables rapid quantitative analysis of lung involvement in COVID-19 infections. However, obtaining a large amount of voxel-level annotations for training segmentation networks is prohibitively expensive. Therefore, we propose a weakly-supervised segmentation method based on dense regression activation maps (dRAMs). Most weakly-supervised segmentation approaches exploit class activation maps (CAMs) to localize objects. However, because CAMs were trained for classification, they do not align precisely with the object segmentations. Instead, we produce high-resolution activation maps using dense features from a segmentation network that was trained to estimate a per-lobe lesion percentage. In this way, the network can exploit knowledge regarding the required lesion volume. In addition, we propose an attention neural network module to refine dRAMs, optimized together with the main regression task. We evaluated our algorithm on 90 subjects. Results show our method achieved 70.2% Dice coefficient, substantially outperforming the CAM-based baseline at 48.6%. △ Less

Submitted 18 November, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

arXiv:2103.10731 [pdf, other]

Acoustic word embeddings for zero-resource languages using self-supervised contrastive learning and multilingual adaptation

Authors: Christiaan Jacobs, Yevgen Matusevych, Herman Kamper

Abstract: Acoustic word embeddings (AWEs) are fixed-dimensional representations of variable-length speech segments. For zero-resource languages where labelled data is not available, one AWE approach is to use unsupervised autoencoder-based recurrent models. Another recent approach is to use multilingual transfer: a supervised AWE model is trained on several well-resourced languages and then applied to an un… ▽ More Acoustic word embeddings (AWEs) are fixed-dimensional representations of variable-length speech segments. For zero-resource languages where labelled data is not available, one AWE approach is to use unsupervised autoencoder-based recurrent models. Another recent approach is to use multilingual transfer: a supervised AWE model is trained on several well-resourced languages and then applied to an unseen zero-resource language. We consider how a recent contrastive learning loss can be used in both the purely unsupervised and multilingual transfer settings. Firstly, we show that terms from an unsupervised term discovery system can be used for contrastive self-supervision, resulting in improvements over previous unsupervised monolingual AWE models. Secondly, we consider how multilingual AWE models can be adapted to a specific zero-resource language using discovered terms. We find that self-supervised contrastive adaptation outperforms adapted multilingual correspondence autoencoder and Siamese AWE models, giving the best overall results in a word discrimination task on six zero-resource languages. △ Less

Submitted 19 March, 2021; originally announced March 2021.

Comments: Accepted to SLT 2021

arXiv:2009.09725 [pdf, other]

Improving Automated COVID-19 Grading with Convolutional Neural Networks in Computed Tomography Scans: An Ablation Study

Authors: Coen de Vente, Luuk H. Boulogne, Kiran Vaidhya Venkadesh, Cheryl Sital, Nikolas Lessmann, Colin Jacobs, Clara I. Sánchez, Bram van Ginneken

Abstract: Amidst the ongoing pandemic, several studies have shown that COVID-19 classification and grading using computed tomography (CT) images can be automated with convolutional neural networks (CNNs). Many of these studies focused on reporting initial results of algorithms that were assembled from commonly used components. The choice of these components was often pragmatic rather than systematic. For in… ▽ More Amidst the ongoing pandemic, several studies have shown that COVID-19 classification and grading using computed tomography (CT) images can be automated with convolutional neural networks (CNNs). Many of these studies focused on reporting initial results of algorithms that were assembled from commonly used components. The choice of these components was often pragmatic rather than systematic. For instance, several studies used 2D CNNs even though these might not be optimal for handling 3D CT volumes. This paper identifies a variety of components that increase the performance of CNN-based algorithms for COVID-19 grading from CT images. We investigated the effectiveness of using a 3D CNN instead of a 2D CNN, of using transfer learning to initialize the network, of providing automatically computed lesion maps as additional network input, and of predicting a continuous instead of a categorical output. A 3D CNN with these components achieved an area under the ROC curve (AUC) of 0.934 on our test set of 105 CT scans and an AUC of 0.923 on a publicly available set of 742 CT scans, a substantial improvement in comparison with a previously published 2D CNN. An ablation study demonstrated that in addition to using a 3D CNN instead of a 2D CNN transfer learning contributed the most and continuous output contributed the least to improving the model performance. △ Less

Submitted 21 September, 2020; originally announced September 2020.

Comments: 9 pages, 6 figures

arXiv:2009.09123 [pdf, other]

Will it Unblend?

Authors: Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein

Abstract: Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data. Blends, such as "innoventor", are one particularly challenging class of OOV, as they are formed by fusing together two or more bases that relate to the intended meaning in unpredictable manners and degrees. In this work, we run experiments on a novel dataset of English OOV b… ▽ More Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data. Blends, such as "innoventor", are one particularly challenging class of OOV, as they are formed by fusing together two or more bases that relate to the intended meaning in unpredictable manners and degrees. In this work, we run experiments on a novel dataset of English OOV blends to quantify the difficulty of interpreting the meanings of blends by large-scale contextual language models such as BERT. We first show that BERT's processing of these blends does not fully access the component meanings, leaving their contextual representations semantically impoverished. We find this is mostly due to the loss of characters resulting from blend formation. Then, we assess how easily different models can recognize the structure and recover the origin of blends, and find that context-aware embedding systems outperform character-level and context-free embeddings, although their results are still far from satisfactory. △ Less

Submitted 18 September, 2020; originally announced September 2020.

Comments: Findings of EMNLP 2020

arXiv:2004.07443 [pdf, other]

doi 10.1109/TMI.2020.2995108

Relational Modeling for Robust and Efficient Pulmonary Lobe Segmentation in CT Scans

Authors: Weiyi Xie, Colin Jacobs, Jean-Paul Charbonnier, Bram van Ginneken

Abstract: Pulmonary lobe segmentation in computed tomography scans is essential for regional assessment of pulmonary diseases. Recent works based on convolution neural networks have achieved good performance for this task. However, they are still limited in capturing structured relationships due to the nature of convolution. The shape of the pulmonary lobes affect each other and their borders relate to the… ▽ More Pulmonary lobe segmentation in computed tomography scans is essential for regional assessment of pulmonary diseases. Recent works based on convolution neural networks have achieved good performance for this task. However, they are still limited in capturing structured relationships due to the nature of convolution. The shape of the pulmonary lobes affect each other and their borders relate to the appearance of other structures, such as vessels, airways, and the pleural wall. We argue that such structural relationships play a critical role in the accurate delineation of pulmonary lobes when the lungs are affected by diseases such as COVID-19 or COPD. In this paper, we propose a relational approach (RTSU-Net) that leverages structured relationships by introducing a novel non-local neural network module. The proposed module learns both visual and geometric relationships among all convolution features to produce self-attention weights. With a limited amount of training data available from COVID-19 subjects, we initially train and validate RTSU-Net on a cohort of 5000 subjects from the COPDGene study (4000 for training and 1000 for evaluation). Using models pre-trained on COPDGene, we apply transfer learning to retrain and evaluate RTSU-Net on 470 COVID-19 suspects (370 for retraining and 100 for evaluation). Experimental results show that RTSU-Net outperforms three baselines and performs robustly on cases with severe lung infection due to COVID-19. △ Less

Submitted 12 May, 2020; v1 submitted 15 April, 2020; originally announced April 2020.

arXiv:2003.03444 [pdf, ps, other]

NYTWIT: A Dataset of Novel Words in the New York Times

Authors: Yuval Pinter, Cassandra L. Jacobs, Max Bittker

Abstract: We present the New York Times Word Innovation Types dataset, or NYTWIT, a collection of over 2,500 novel English words published in the New York Times between November 2017 and March 2019, manually annotated for their class of novelty (such as lexical derivation, dialectal variation, blending, or compounding). We present baseline results for both uncontextual and contextual prediction of novelty c… ▽ More We present the New York Times Word Innovation Types dataset, or NYTWIT, a collection of over 2,500 novel English words published in the New York Times between November 2017 and March 2019, manually annotated for their class of novelty (such as lexical derivation, dialectal variation, blending, or compounding). We present baseline results for both uncontextual and contextual prediction of novelty class, showing that there is room for improvement even for state-of-the-art NLP systems. We hope this resource will prove useful for linguists and NLP practitioners by providing a real-world environment of novel word appearance. △ Less

Submitted 23 October, 2020; v1 submitted 6 March, 2020; originally announced March 2020.

Comments: COLING 2020

arXiv:2001.09078 [pdf, other]

Adaptive Low-level Storage of Very Large Knowledge Graphs

Authors: Jacopo Urbani, Ceriel Jacobs

Abstract: The increasing availability and usage of Knowledge Graphs (KGs) on the Web calls for scalable and general-purpose solutions to store this type of data structures. We propose Trident, a novel storage architecture for very large KGs on centralized systems. Trident uses several interlinked data structures to provide fast access to nodes and edges, with the physical storage changing depending on the t… ▽ More The increasing availability and usage of Knowledge Graphs (KGs) on the Web calls for scalable and general-purpose solutions to store this type of data structures. We propose Trident, a novel storage architecture for very large KGs on centralized systems. Trident uses several interlinked data structures to provide fast access to nodes and edges, with the physical storage changing depending on the topology of the graph to reduce the memory footprint. In contrast to single architectures designed for single tasks, our approach offers an interface with few low-level and general-purpose primitives that can be used to implement tasks like SPARQL query answering, reasoning, or graph analytics. Our experiments show that Trident can handle graphs with 10^11 edges using inexpensive hardware, delivering competitive performance on multiple workloads. △ Less

Submitted 24 January, 2020; originally announced January 2020.

Comments: Accepted WWW 2020

arXiv:1901.04056 [pdf, other]

doi 10.1016/j.media.2022.102680

The Liver Tumor Segmentation Benchmark (LiTS)

Authors: Patrick Bilic, Patrick Christ, Hongwei Bran Li, Eugene Vorontsov, Avi Ben-Cohen, Georgios Kaissis, Adi Szeskin, Colin Jacobs, Gabriel Efrain Humpire Mamani, Gabriel Chartrand, Fabian Lohöfer, Julian Walter Holch, Wieland Sommer, Felix Hofmann, Alexandre Hostettler, Naama Lev-Cohain, Michal Drozdzal, Michal Marianne Amitai, Refael Vivantik, Jacob Sosna, Ivan Ezhov, Anjany Sekuboyina, Fernando Navarro, Florian Kofler, Johannes C. Paetzold , et al. (84 additional authors not shown)

Abstract: In this work, we report the set-up and results of the Liver Tumor Segmentation Benchmark (LiTS), which was organized in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI) 2017 and the International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2017 and 2018. The image dataset is diverse and contains primary and secondary tumors with… ▽ More In this work, we report the set-up and results of the Liver Tumor Segmentation Benchmark (LiTS), which was organized in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI) 2017 and the International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2017 and 2018. The image dataset is diverse and contains primary and secondary tumors with varied sizes and appearances with various lesion-to-background levels (hyper-/hypo-dense), created in collaboration with seven hospitals and research institutions. Seventy-five submitted liver and liver tumor segmentation algorithms were trained on a set of 131 computed tomography (CT) volumes and were tested on 70 unseen test images acquired from different patients. We found that not a single algorithm performed best for both liver and liver tumors in the three events. The best liver segmentation algorithm achieved a Dice score of 0.963, whereas, for tumor segmentation, the best algorithms achieved Dices scores of 0.674 (ISBI 2017), 0.702 (MICCAI 2017), and 0.739 (MICCAI 2018). Retrospectively, we performed additional analysis on liver tumor detection and revealed that not all top-performing segmentation algorithms worked well for tumor detection. The best liver tumor detection method achieved a lesion-wise recall of 0.458 (ISBI 2017), 0.515 (MICCAI 2017), and 0.554 (MICCAI 2018), indicating the need for further research. LiTS remains an active benchmark and resource for research, e.g., contributing the liver-related segmentation tasks in \url{http://medicaldecathlon.com/}. In addition, both data and online evaluation are accessible via \url{www.lits-challenge.com}. △ Less

Submitted 25 November, 2022; v1 submitted 13 January, 2019; originally announced January 2019.

Comments: Patrick Bilic, Patrick Christ, Hongwei Bran Li, and Eugene Vorontsov made equal contributions to this work. Published in Medical Image Analysis

Journal ref: Medical Image Analysis (2022) Pg. 102680

arXiv:1811.12789 [pdf, other]

iW-Net: an automatic and minimalistic interactive lung nodule segmentation deep network

Authors: Guilherme Aresta, Colin Jacobs, Teresa Araújo, António Cunha, Isabel Ramos, Bram van Ginneken, Aurélio Campilho

Abstract: We propose iW-Net, a deep learning model that allows for both automatic and interactive segmentation of lung nodules in computed tomography images. iW-Net is composed of two blocks: the first one provides an automatic segmentation and the second one allows to correct it by analyzing 2 points introduced by the user in the nodule's boundary. For this purpose, a physics inspired weight map that takes… ▽ More We propose iW-Net, a deep learning model that allows for both automatic and interactive segmentation of lung nodules in computed tomography images. iW-Net is composed of two blocks: the first one provides an automatic segmentation and the second one allows to correct it by analyzing 2 points introduced by the user in the nodule's boundary. For this purpose, a physics inspired weight map that takes the user input into account is proposed, which is used both as a feature map and in the system's loss function. Our approach is extensively evaluated on the public LIDC-IDRI dataset, where we achieve a state-of-the-art performance of 0.55 intersection over union vs the 0.59 inter-observer agreement. Also, we show that iW-Net allows to correct the segmentation of small nodules, essential for proper patient referral decision, as well as improve the segmentation of the challenging non-solid nodules and thus may be an important tool for increasing the early diagnosis of lung cancer. △ Less

Submitted 30 November, 2018; originally announced November 2018.

Comments: Pre-print submitted to IEEE Transactions on Biomedical Engineering

arXiv:1709.09713 [pdf, other]

Energy efficiency of finite difference algorithms on multicore CPUs, GPUs, and Intel Xeon Phi processors

Authors: Satya P. Jammy, Christian T. Jacobs, David J. Lusher, Neil D. Sandham

Abstract: In addition to hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, an Intel Xeon Phi Knights Lan… ▽ More In addition to hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, an Intel Xeon Phi Knights Landing processor, and an NVIDIA Tesla K40c GPU. The conventional way of storing the discretised derivatives to global arrays for solution advancement is found to be inefficient in terms of energy consumption and runtime. In contrast, a class of algorithms in which the discretised derivatives are evaluated on-the-fly or stored as thread-/process-local variables (yielding high compute intensity) is optimal both with respect to energy consumption and runtime. On all three hardware architectures considered, a speed-up of ~2 and an energy saving of ~2 are observed for the high compute intensive algorithms compared to the memory intensive algorithm. The energy consumption is found to be proportional to runtime, irrespective of the power consumed and the GPU has an energy saving of ~5 compared to the same algorithm on a CPU node. △ Less

Submitted 27 September, 2017; originally announced September 2017.

Comments: Submitted to Computers and Fluids

arXiv:1708.03183 [pdf, other]

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modelling

Authors: Fabio Luporini, Michael Lange, Christian T. Jacobs, Gerard J. Gorman, J. Ramanujam, Paul H. J. Kelly

Abstract: Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]] -- hence the name "sparse". One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of th… ▽ More Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]] -- hence the name "sparse". One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation -- not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing: a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased run-time performance; implementation in a publicly-available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28x on a platform consisting of 896 Intel Broadwell cores. △ Less

Submitted 19 June, 2019; v1 submitted 10 August, 2017; originally announced August 2017.

Comments: 29 pages including supplementary materials and references

ACM Class: D.1.2; G.4

arXiv:1704.08368 [pdf, ps, other]

doi 10.1002/fld.4395

Surface-sampled simulations of turbulent flow at high Reynolds number

Authors: Neil D. Sandham, Roderick Johnstone, Christian T. Jacobs

Abstract: A new approach to turbulence simulation, based on a combination of large-eddy simulation (LES) for the whole flow and an array of non-space-filling quasi-direct numerical simulations (QDNS), which sample the response of near-wall turbulence to large-scale forcing, is proposed and evaluated. The technique overcomes some of the cost limitations of turbulence simulation, since the main flow is treate… ▽ More A new approach to turbulence simulation, based on a combination of large-eddy simulation (LES) for the whole flow and an array of non-space-filling quasi-direct numerical simulations (QDNS), which sample the response of near-wall turbulence to large-scale forcing, is proposed and evaluated. The technique overcomes some of the cost limitations of turbulence simulation, since the main flow is treated with a coarse-grid LES, with the equivalent of wall functions supplied by the near-wall sampled QDNS. Two cases are tested, at friction Reynolds number Re$_τ$=4200 and 20,000. The total grid node count for the first case is less than half a million and less than two million for the second case, with the calculations only requiring a desktop computer. A good agreement with published DNS is found at Re$_τ$=4200, both in terms of the mean velocity profile and the streamwise velocity fluctuation statistics, which correctly show a substantial increase in near-wall turbulence levels due to a modulation of near-wall streaks by large-scale structures. The trend continues at Re$_τ$=20,000, in agreement with experiment, which represents one of the major achievements of the new approach. A number of detailed aspects of the model, including numerical resolution, LES-QDNS coupling strategy and sub-grid model are explored. A low level of grid sensitivity is demonstrated for both the QDNS and LES aspects. Since the method does not assume a law of the wall, it can in principle be applied to flows that are out of equilibrium. △ Less

Submitted 26 April, 2017; originally announced April 2017.

Comments: Author accepted version. Accepted for publication in the International Journal for Numerical Methods in Fluids on 26 April 2017

Journal ref: International Journal for Numerical Methods in Fluids 85(9):525-537, 2017

arXiv:1612.08012 [pdf, other]

doi 10.1016/j.media.2017.06.015

Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge

Authors: Arnaud Arindra Adiyoso Setio, Alberto Traverso, Thomas de Bel, Moira S. N. Berens, Cas van den Bogaard, Piergiorgio Cerello, Hao Chen, Qi Dou, Maria Evelina Fantacci, Bram Geurts, Robbert van der Gugten, Pheng Ann Heng, Bart Jansen, Michael M. J. de Kaste, Valentin Kotov, Jack Yu-Hung Lin, Jeroen T. M. C. Manders, Alexander Sónora-Mengana, Juan Carlos García-Naranjo, Evgenia Papavasileiou, Mathias Prokop, Marco Saletta, Cornelia M Schaefer-Prokop, Ernst T. Scholten, Luuk Scholten , et al. (7 additional authors not shown)

Abstract: Automatic detection of pulmonary nodules in thoracic computed tomography (CT) scans has been an active area of research for the last two decades. However, there have only been few studies that provide a comparative performance evaluation of different systems on a common database. We have therefore set up the LUNA16 challenge, an objective evaluation framework for automatic nodule detection algorit… ▽ More Automatic detection of pulmonary nodules in thoracic computed tomography (CT) scans has been an active area of research for the last two decades. However, there have only been few studies that provide a comparative performance evaluation of different systems on a common database. We have therefore set up the LUNA16 challenge, an objective evaluation framework for automatic nodule detection algorithms using the largest publicly available reference database of chest CT scans, the LIDC-IDRI data set. In LUNA16, participants develop their algorithm and upload their predictions on 888 CT scans in one of the two tracks: 1) the complete nodule detection track where a complete CAD system should be developed, or 2) the false positive reduction track where a provided set of nodule candidates should be classified. This paper describes the setup of LUNA16 and presents the results of the challenge so far. Moreover, the impact of combining individual systems on the detection performance was also investigated. It was observed that the leading solutions employed convolutional networks and used the provided set of nodule candidates. The combination of these solutions achieved an excellent sensitivity of over 95% at fewer than 1.0 false positives per scan. This highlights the potential of combining algorithms to improve the detection performance. Our observer study with four expert readers has shown that the best system detects nodules that were missed by expert readers who originally annotated the LIDC-IDRI data. We released this set of additional nodules for further development of CAD systems. △ Less

Submitted 15 July, 2017; v1 submitted 23 December, 2016; originally announced December 2016.

arXiv:1610.09157 [pdf, other]

doi 10.1038/srep46479

Towards automatic pulmonary nodule management in lung cancer screening with deep learning

Authors: Francesco Ciompi, Kaman Chung, Sarah J. van Riel, Arnaud Arindra Adiyoso Setio, Paul K. Gerke, Colin Jacobs, Ernst Th. Scholten, Cornelia Schaefer-Prokop, Mathilde M. W. Wille, Alfonso Marchiano, Ugo Pastorino, Mathias Prokop, Bram van Ginneken

Abstract: The introduction of lung cancer screening programs will produce an unprecedented amount of chest CT scans in the near future, which radiologists will have to read in order to decide on a patient follow-up strategy. According to the current guidelines, the workup of screen-detected nodules strongly relies on nodule size and nodule type. In this paper, we present a deep learning system based on mult… ▽ More The introduction of lung cancer screening programs will produce an unprecedented amount of chest CT scans in the near future, which radiologists will have to read in order to decide on a patient follow-up strategy. According to the current guidelines, the workup of screen-detected nodules strongly relies on nodule size and nodule type. In this paper, we present a deep learning system based on multi-stream multi-scale convolutional networks, which automatically classifies all nodule types relevant for nodule workup. The system processes raw CT data containing a nodule without the need for any additional information such as nodule segmentation or nodule size and learns a representation of 3D data by analyzing an arbitrary number of 2D views of a given nodule. The deep learning system was trained with data from the Italian MILD screening trial and validated on an independent set of data from the Danish DLCST screening trial. We analyze the advantage of processing nodules at multiple scales with a multi-stream convolutional network architecture, and we show that the proposed deep learning system achieves performance at classifying nodule type that surpasses the one of classical machine learning approaches and is within the inter-observer variability among four experienced human observers. △ Less

Submitted 23 May, 2017; v1 submitted 28 October, 2016; originally announced October 2016.

Comments: Published on Scientific Reports

Journal ref: Sci. Rep. 7, 46479; (2017)

arXiv:1610.09146 [pdf, other]

Performance evaluation of explicit finite difference algorithms with varying amounts of computational and memory intensity

Authors: Satya P. Jammy, Christian T. Jacobs, Neil D. Sandham

Abstract: Future architectures designed to deliver exascale performance motivate the need for novel algorithmic changes in order to fully exploit their capabilities. In this paper, the performance of several numerical algorithms, characterised by varying degrees of memory and computational intensity, are evaluated in the context of finite difference methods for fluid dynamics problems. It is shown that, by… ▽ More Future architectures designed to deliver exascale performance motivate the need for novel algorithmic changes in order to fully exploit their capabilities. In this paper, the performance of several numerical algorithms, characterised by varying degrees of memory and computational intensity, are evaluated in the context of finite difference methods for fluid dynamics problems. It is shown that, by storing some of the evaluated derivatives as single thread- or process-local variables in memory, or recomputing the derivatives on-the-fly, a speed-up of ~2 can be obtained compared to traditional algorithms that store all derivatives in global arrays. △ Less

Submitted 28 October, 2016; originally announced October 2016.

Comments: Author accepted version. Accepted for publication in Journal of Computational Science on 27 October 2016

arXiv:1609.01277 [pdf, other]

doi 10.1016/j.jocs.2016.11.001

OpenSBLI: A framework for the automated derivation and parallel execution of finite difference solvers on a range of computer architectures

Authors: Christian T. Jacobs, Satya P. Jammy, Neil D. Sandham

Abstract: Exascale computing will feature novel and potentially disruptive hardware architectures. Exploiting these to their full potential is non-trivial. Numerical modelling frameworks involving finite difference methods are currently limited by the 'static' nature of the hand-coded discretisation schemes and repeatedly may have to be re-written to run efficiently on new hardware. In contrast, OpenSBLI us… ▽ More Exascale computing will feature novel and potentially disruptive hardware architectures. Exploiting these to their full potential is non-trivial. Numerical modelling frameworks involving finite difference methods are currently limited by the 'static' nature of the hand-coded discretisation schemes and repeatedly may have to be re-written to run efficiently on new hardware. In contrast, OpenSBLI uses code generation to derive the model's code from a high-level specification. Users focus on the equations to solve, whilst not concerning themselves with the detailed implementation. Source-to-source translation is used to tailor the code and enable its execution on a variety of hardware. △ Less

Submitted 14 November, 2016; v1 submitted 5 September, 2016; originally announced September 2016.

Comments: Author accepted version, with a small amendment: the link in the "Code Availability" section has been updated, and now refers to the OpenSBLI source code repository on GitHub. Accepted for publication in the Journal of Computational Science on 8 November 2016

Journal ref: Journal of Computational Science 18 (2017) 12-23

arXiv:1606.05741

Connecting web-based map** services with scientific data repositories: collaborative curation and retrieval of simulation data via a geospatial interface

Authors: Christian T. Jacobs, Alexandros Avdis

Abstract: Increasing quantities of scientific data are becoming readily accessible via online repositories such as those provided by Figshare and Zenodo. Geoscientific simulations in particular generate large quantities of data, with several research groups studying many, often overlap** areas of the world. When studying a particular area, being able to keep track of one's own simulations as well as those… ▽ More Increasing quantities of scientific data are becoming readily accessible via online repositories such as those provided by Figshare and Zenodo. Geoscientific simulations in particular generate large quantities of data, with several research groups studying many, often overlap** areas of the world. When studying a particular area, being able to keep track of one's own simulations as well as those of collaborators can be challenging. This paper describes the design, implementation, and evaluation of a new tool for visually cataloguing and retrieving data associated with a given geographical location through a web-based Google Maps interface. Each data repository is pin-pointed on the map with a marker based on the geographical location that the dataset corresponds to. By clicking on the markers, users can quickly inspect the metadata of the repositories and download the associated data files. The crux of the approach lies in the ability to easily query and retrieve data from multiple sources via a common interface. While many advances are being made in terms of scientific data repositories, the development of this new tool has uncovered several issues and limitations of the current state-of-the-art which are discussed herein, along with some ideas for the future. △ Less

Submitted 21 September, 2016; v1 submitted 18 June, 2016; originally announced June 2016.

Comments: Submission withdrawn from the International Journal of Digital Curation on 9 September 2016 in order to prepare a joint paper with additional colleagues

arXiv:1601.08091 [pdf, other]

On the validity of tidal turbine array configurations obtained from steady-state adjoint optimisation

Authors: Christian T. Jacobs, Matthew D. Piggott, Stephan C. Kramer, Simon W. Funke

Abstract: Extracting the optimal amount of power from an array of tidal turbines requires an intricate understanding of tidal dynamics and the effects of turbine placement on the local and regional scale flow. Numerical models have contributed significantly towards this understanding, and more recently, adjoint-based modelling has been employed to optimise the positioning of the turbines in an array in an a… ▽ More Extracting the optimal amount of power from an array of tidal turbines requires an intricate understanding of tidal dynamics and the effects of turbine placement on the local and regional scale flow. Numerical models have contributed significantly towards this understanding, and more recently, adjoint-based modelling has been employed to optimise the positioning of the turbines in an array in an automated way and improve on simple, regular man-made configurations. Adjoint-based optimisation of high-resolution and ideally 3D transient models is generally a very computationally expensive problem. As a result, existing work on the adjoint optimisation of tidal turbine placement has been mostly limited to steady-state simulations in which very high, non-physical values of the background viscosity are required to ensure that a steady-state solution exists. However, such compromises may affect the reliability of the modelled turbines, their wakes and interactions, and thus bring into question the validity of the computed optimal turbine positions. This work considers a suite of idealised simulations of flow past tidal turbine arrays in a 2D channel. It compares four regular array configurations, detailed by Divett et al. (2013), with the configuration found through adjoint optimisation in a steady-state, high-viscosity setup. The optimised configuration produces considerably more power. The same configurations are then used to produce a suite of transient simulations that do not use constant high-viscosity, and instead use large eddy simulation (LES) to parameterise the resulting turbulent structures. It is shown that the LES simulations produce less power than that predicted by the constant high-viscosity runs. Nevertheless, they still follow the same trends in the power curve throughout time, with optimised layouts continuing to perform significantly better than simplified configurations. △ Less

Submitted 29 January, 2016; originally announced January 2016.

Comments: Conference paper comprising 15 pages and 13 figures. Submitted to the Proceedings of the ECCOMAS Congress 2016 (VII European Congress on Computational Methods in Applied Sciences and Engineering), held in Crete, Greece on 5-10 June 2016

arXiv:1511.08915 [pdf, other]

Column-Oriented Datalog Materialization for Large Knowledge Graphs (Extended Technical Report)

Authors: Jacopo Urbani, Ceriel Jacobs, Markus Krötzsch

Abstract: The evaluation of Datalog rules over large Knowledge Graphs (KGs) is essential for many applications. In this paper, we present a new method of materializing Datalog inferences, which combines a column-based memory layout with novel optimization methods that avoid redundant inferences at runtime. The pro-active caching of certain subqueries further increases efficiency. Our empirical evaluation sh… ▽ More The evaluation of Datalog rules over large Knowledge Graphs (KGs) is essential for many applications. In this paper, we present a new method of materializing Datalog inferences, which combines a column-based memory layout with novel optimization methods that avoid redundant inferences at runtime. The pro-active caching of certain subqueries further increases efficiency. Our empirical evaluation shows that this approach can often match or even surpass the performance of state-of-the-art systems, especially under restricted resources. △ Less

Submitted 11 February, 2016; v1 submitted 28 November, 2015; originally announced November 2015.

ACM Class: I.2.3; H.2.4

arXiv:1510.01560 [pdf, other]

Shoreline and Bathymetry Approximation in Mesh Generation for Tidal Renewable Simulations

Authors: Alexandros Avdis, Christian T. Jacobs, Jon Hill, Matthew D. Piggott, Gerard J. Gorman

Abstract: Due to the fractal nature of the domain geometry in geophysical flow simulations, a completely accurate description of the domain in terms of a computational mesh is frequently deemed infeasible. Shoreline and bathymetry simplification methods are used to remove small scale details in the geometry, particularly in areas away from the region of interest. To that end, a novel method for shoreline an… ▽ More Due to the fractal nature of the domain geometry in geophysical flow simulations, a completely accurate description of the domain in terms of a computational mesh is frequently deemed infeasible. Shoreline and bathymetry simplification methods are used to remove small scale details in the geometry, particularly in areas away from the region of interest. To that end, a novel method for shoreline and bathymetry simplification is presented. Existing shoreline simplification methods typically remove points if the resultant geometry satisfies particular geometric criteria. Bathymetry is usually simplified using traditional filtering techniques, that remove unwanted Fourier modes. Principal Component Analysis (PCA) has been used in other fields to isolate small-scale structures from larger scale coherent features in a robust way, underpinned by a rigorous but simple mathematical framework. Here we present a method based on principal component analysis aimed towards simplification of shorelines and bathymetry. We present the algorithm in detail and show simplified shorelines and bathymetry in the wider region around the North Sea. Finally, the methods are used in the context of unstructured mesh generation aimed at tidal resource assessment simulations in the coastal regions around the UK. △ Less

Submitted 6 October, 2015; originally announced October 2015.

Comments: Pre-print of conference publication accepted in the Proceedings of 11th European Wave & Tidal Energy Conference (EWTEC 2015, http://www.ewtec.org/ewtec2015/ ). This paper was presented at the EWTEC 2015 conference on Tuesday 8 September 2015 in Nantes, France. Number of pages: 7. Number of figures: 6

arXiv:1509.04729 [pdf, other]

Integrating Research Data Management into Geographical Information Systems

Authors: Christian T. Jacobs, Alexandros Avdis, Simon L. Mouradian, Matthew D. Piggott

Abstract: Ocean modelling requires the production of high-fidelity computational meshes upon which to solve the equations of motion. The production of such meshes by hand is often infeasible, considering the complexity of the bathymetry and coastlines. The use of Geographical Information Systems (GIS) is therefore a key component to discretising the region of interest and producing a mesh appropriate to res… ▽ More Ocean modelling requires the production of high-fidelity computational meshes upon which to solve the equations of motion. The production of such meshes by hand is often infeasible, considering the complexity of the bathymetry and coastlines. The use of Geographical Information Systems (GIS) is therefore a key component to discretising the region of interest and producing a mesh appropriate to resolve the dynamics. However, all data associated with the production of a mesh must be provided in order to contribute to the overall recomputability of the subsequent simulation. This work presents the integration of research data management in QMesh, a tool for generating meshes using GIS. The tool uses the PyRDM library to provide a quick and easy way for scientists to publish meshes, and all data required to regenerate them, to persistent online repositories. These repositories are assigned unique identifiers to enable proper citation of the meshes in journal articles. △ Less

Submitted 15 September, 2015; originally announced September 2015.

Comments: Accepted, camera-ready version. To appear in the Proceedings of the 5th International Workshop on Semantic Digital Archives (http://sda2015.dke-research.de/), held in Poznań, Poland on 18 September 2015 as part of the 19th International Conference on Theory and Practice of Digital Libraries (http://tpdl2015.info/)

arXiv:1505.05425 [pdf, other]

doi 10.5408/15-101.1

Experiences with efficient methodologies for teaching computer programming to geoscientists

Authors: Christian T. Jacobs, Gerard J. Gorman, Huw E. Rees, Lorraine Craig

Abstract: Computer programming was once thought of as a skill required only by professional software developers. But today, given the ubiquitous nature of computation and data science it is quickly becoming necessary for all scientists and engineers to have at least a basic knowledge of how to program. Teaching how to program, particularly to those students with little or no computing background, is well-kn… ▽ More Computer programming was once thought of as a skill required only by professional software developers. But today, given the ubiquitous nature of computation and data science it is quickly becoming necessary for all scientists and engineers to have at least a basic knowledge of how to program. Teaching how to program, particularly to those students with little or no computing background, is well-known to be a difficult task. However, there is also a wealth of evidence-based teaching practices for teaching programming skills which can be applied to greatly improve learning outcomes and the student experience. Adopting these practices naturally gives rise to greater learning efficiency - this is critical if programming is to be integrated into an already busy geoscience curriculum. This paper considers an undergraduate computer programming course, run during the last 5 years in the Department of Earth Science and Engineering at Imperial College London. The teaching methodologies that were used each year are discussed alongside the challenges that were encountered, and how the methodologies affected student performance. Anonymised student marks and feedback are used to highlight this, and also how the adjustments made to the course eventually resulted in a highly effective learning environment. △ Less

Submitted 9 June, 2016; v1 submitted 20 May, 2015; originally announced May 2015.

Comments: Second revised version. This version was accepted for publication in the Journal of Geoscience Education on 9 June 2016. Contains 5 figures. The main change is the inclusion of a new section on outlook and future work

Journal ref: Journal of Geoscience Education 64(3):183-198, 2016

arXiv:1405.7290 [pdf]

doi 10.5334/jors.bj

PyRDM: A Python-based library for automating the management and online publication of scientific software and data

Authors: Christian T. Jacobs, Alexandros Avdis, Gerard J. Gorman, Matthew D. Piggott

Abstract: The recomputability and reproducibility of results from scientific software requires access to both the source code and all associated input and output data. However, the full collection of these resources often does not accompany the key findings published in journal articles, thereby making it difficult or impossible for the wider scientific community to verify the correctness of a result or to… ▽ More The recomputability and reproducibility of results from scientific software requires access to both the source code and all associated input and output data. However, the full collection of these resources often does not accompany the key findings published in journal articles, thereby making it difficult or impossible for the wider scientific community to verify the correctness of a result or to build further research on it. This paper presents a new Python-based library, PyRDM, whose functionality aims to automate the process of sharing the software and data via online, citable repositories such as Figshare. The library is integrated into the workflow of an open-source computational fluid dynamics package, Fluidity, to demonstrate an example of its usage. △ Less

Submitted 21 August, 2014; v1 submitted 28 May, 2014; originally announced May 2014.

Comments: Revised version. The main changes are: Added pdfLaTeX to the dependencies list; Improved Figure 1 to show the 'publish' option selected in Diamond; Added two paragraphs to explain why users would want to use PyRDM; Added some content on the PyRDM roadmap, and also some content regarding engagement with libraries and research software engineers

Journal ref: Journal of Open Research Software 2:e28 (2014) 1-6

arXiv:cs/0306110 [pdf]

Run Control and Monitor System for the CMS Experiment

Authors: M. Bellato, L. Berti, V. Brigljevic, G. Bruno, E. Cano, S. Cittolin, A. Csilling, S. Erhan, D. Gigi, F. Glege, R. Gomez-Reino, M. Gulmini, J. Gutleber, C. Jacobs, M. Kozlovszky, H. Larsen, I. Magrans, G. Maron, F. Meijers, E. Meschi, S. Murray, A. Oh, L. Orsini, L. Pollet, A. Racz , et al. (8 additional authors not shown)

Abstract: The Run Control and Monitor System (RCMS) of the CMS experiment is the set of hardware and software components responsible for controlling and monitoring the experiment during data-taking. It provides users with a "virtual counting room", enabling them to operate the experiment and to monitor detector status and data quality from any point in the world. This paper describes the architecture of t… ▽ More The Run Control and Monitor System (RCMS) of the CMS experiment is the set of hardware and software components responsible for controlling and monitoring the experiment during data-taking. It provides users with a "virtual counting room", enabling them to operate the experiment and to monitor detector status and data quality from any point in the world. This paper describes the architecture of the RCMS with particular emphasis on its scalability through a distributed collection of nodes arranged in a tree-based hierarchy. The current implementation of the architecture in a prototype RCMS used in test beam setups, detector validations and DAQ demonstrators is documented. A discussion of the key technologies used, including Web Services, and the results of tests performed with a 128-node system are presented. △ Less

Submitted 18 June, 2003; originally announced June 2003.

Comments: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, Ca, USA, March 2003, 8 pages, PSN THGT002

ACM Class: C.2.4; J.2

Showing 1–38 of 38 results for author: Jacobs, C