Search | arXiv e-print repository

A Comparative Investigation of Compositional Syntax and Semantics in DALL-E 2

Authors: Elliot Murphy, Jill de Villiers, Sofia Lucero Morales

Abstract: In this study we compared how well DALL-E 2 visually represented the meaning of linguistic prompts also given to young children in comprehension tests. Sentences representing fundamental components of grammatical knowledge were selected from assessment tests used with several hundred English-speaking children aged 2-7 years for whom we had collected original item-level data. DALL-E 2 was given the… ▽ More In this study we compared how well DALL-E 2 visually represented the meaning of linguistic prompts also given to young children in comprehension tests. Sentences representing fundamental components of grammatical knowledge were selected from assessment tests used with several hundred English-speaking children aged 2-7 years for whom we had collected original item-level data. DALL-E 2 was given these prompts five times to generate 20 cartoons per item, for 9 adult judges to score. Results revealed no conditions in which DALL-E 2-generated images that matched the semantic accuracy of children, even at the youngest age (2 years). DALL-E 2 failed to assign the appropriate roles in reversible forms; it failed on negation despite an easier contrastive prompt than the children received; it often assigned the adjective to the wrong noun; it ignored implicit agents in passives. This work points to a clear absence of compositional sentence representations for DALL-E 2. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2311.10361 [pdf, other]

doi 10.1016/j.eswa.2024.124156

Video-based Sequential Bayesian Homography Estimation for Soccer Field Registration

Authors: Paul J. Claasen, J. P. de Villiers

Abstract: A novel Bayesian framework is proposed, which explicitly relates the homography of one video frame to the next through an affine transformation while explicitly modelling keypoint uncertainty. The literature has previously used differential homography between subsequent frames, but not in a Bayesian setting. In cases where Bayesian methods have been applied, camera motion is not adequately modelle… ▽ More A novel Bayesian framework is proposed, which explicitly relates the homography of one video frame to the next through an affine transformation while explicitly modelling keypoint uncertainty. The literature has previously used differential homography between subsequent frames, but not in a Bayesian setting. In cases where Bayesian methods have been applied, camera motion is not adequately modelled, and keypoints are treated as deterministic. The proposed method, Bayesian Homography Inference from Tracked Keypoints (BHITK), employs a two-stage Kalman filter and significantly improves existing methods. Existing keypoint detection methods may be easily augmented with BHITK. It enables less sophisticated and less computationally expensive methods to outperform the state-of-the-art approaches in most homography evaluation metrics. Furthermore, the homography annotations of the WorldCup and TS-WorldCup datasets have been refined using a custom homography annotation tool that has been released for public use. The refined datasets are consolidated and released as the consolidated and refined WorldCup (CARWC) dataset. △ Less

Submitted 4 May, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

Comments: Accepted to Expert Systems with Applications

arXiv:2211.00731 [pdf, other]

Comparision Of Adversarial And Non-Adversarial LSTM Music Generative Models

Authors: Moseli Mots'oehli, Anna Sergeevna Bosman, Johan Pieter De Villiers

Abstract: Algorithmic music composition is a way of composing musical pieces with minimal to no human intervention. While recurrent neural networks are traditionally applied to many sequence-to-sequence prediction tasks, including successful implementations of music composition, their standard supervised learning approach based on input-to-output map** leads to a lack of note variety. These models can the… ▽ More Algorithmic music composition is a way of composing musical pieces with minimal to no human intervention. While recurrent neural networks are traditionally applied to many sequence-to-sequence prediction tasks, including successful implementations of music composition, their standard supervised learning approach based on input-to-output map** leads to a lack of note variety. These models can therefore be seen as potentially unsuitable for tasks such as music generation. Generative adversarial networks learn the generative distribution of data and lead to varied samples. This work implements and compares adversarial and non-adversarial training of recurrent neural network music composers on MIDI data. The resulting music samples are evaluated by human listeners, their preferences recorded. The evaluation indicates that adversarial training produces more aesthetically pleasing music. △ Less

Submitted 1 November, 2022; originally announced November 2022.

Comments: Submitted to a 2023 conference, 20 pages, 13 figures

arXiv:2011.10834 [pdf, other]

doi 10.1016/j.eswa.2021.116335

The complementarity of a diverse range of deep learning features extracted from video content for video recommendation

Authors: Adolfo Almeida, Johan Pieter de Villiers, Allan De Freitas, Mergandran Velayudan

Abstract: Following the popularisation of media streaming, a number of video streaming services are continuously buying new video content to mine the potential profit from them. As such, the newly added content has to be handled well to be recommended to suitable users. In this paper, we address the new item cold-start problem by exploring the potential of various deep learning features to provide video rec… ▽ More Following the popularisation of media streaming, a number of video streaming services are continuously buying new video content to mine the potential profit from them. As such, the newly added content has to be handled well to be recommended to suitable users. In this paper, we address the new item cold-start problem by exploring the potential of various deep learning features to provide video recommendations. The deep learning features investigated include features that capture the visual-appearance, audio and motion information from video content. We also explore different fusion methods to evaluate how well these feature modalities can be combined to fully exploit the complementary information captured by them. Experiments on a real-world video dataset for movie recommendations show that deep learning features outperform hand-crafted features. In particular, recommendations generated with deep learning audio features and action-centric deep learning features are superior to MFCC and state-of-the-art iDT features. In addition, the combination of various deep learning features with hand-crafted features and textual metadata yields significant improvement in recommendations compared to combining only the former. △ Less

Submitted 31 December, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

Journal ref: Expert Systems with Applications 192 (2022) 116335

arXiv:2002.11226 [pdf, other]

doi 10.1007/978-3-030-36808-1_50

Deep Learning and Statistical Models for Time-Critical Pedestrian Behaviour Prediction

Authors: Joel Janek Dabrowski, Johan Pieter de Villiers, Ashfaqur Rahman, Conrad Beyers

Abstract: The time it takes for a classifier to make an accurate prediction can be crucial in many behaviour recognition problems. For example, an autonomous vehicle should detect hazardous pedestrian behaviour early enough for it to take appropriate measures. In this context, we compare the switching linear dynamical system (SLDS) and a three-layered bi-directional long short-term memory (LSTM) neural netw… ▽ More The time it takes for a classifier to make an accurate prediction can be crucial in many behaviour recognition problems. For example, an autonomous vehicle should detect hazardous pedestrian behaviour early enough for it to take appropriate measures. In this context, we compare the switching linear dynamical system (SLDS) and a three-layered bi-directional long short-term memory (LSTM) neural network, which are applied to infer pedestrian behaviour from motion tracks. We show that, though the neural network model achieves an accuracy of 80%, it requires long sequences to achieve this (100 samples or more). The SLDS, has a lower accuracy of 74%, but it achieves this result with short sequences (10 samples). To our knowledge, such a comparison on sequence length has not been considered in the literature before. The results provide a key intuition of the suitability of the models in time-critical problems. △ Less

Submitted 25 February, 2020; originally announced February 2020.

Journal ref: In: Gedeon T., Wong K., Lee M. (eds) Neural Information Processing. ICONIP 2019. Communications in Computer and Information Science, vol 1142. Springer, Cham

Showing 1–5 of 5 results for author: de Villiers, J