Concept-aware Data Construction Improves In-context Learning of Language Models

Authors: Michal Štefánik, Marek Kadlčík, Petr Sojka

Abstract: Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs' ability to perform a new task solely from natural-language instruction. Previous work curating in-context learners assumes that ICL emerges from a vast over-parametrization or the scale of multi-task training. However, recent theoretical work attributes the ICL ability to concept-dependent training data and creates functional in-context learners even in small-scale, synthetic settings. In this work, we practically explore this newly identified axis of ICL quality. We propose Concept-aware Training (CoAT), a framework for constructing training scenarios that make it beneficial for the LM to learn to utilize the analogical reasoning concepts from demonstrations. We find that by using CoAT, pre-trained transformers can learn to better utilise new latent concepts from demonstrations and that such ability makes ICL more robust to the functional deficiencies of the previous models. Finally, we show that concept-aware in-context learning is more effective for a majority of new tasks when compared to traditional instruction tuning, resulting in a performance comparable to the previous in-context learners using magnitudes of more training data.

Submitted 28 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

Comments: Long paper to appear in Findings of ACL 2024

arXiv:2305.06841 [pdf, other]

Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

Authors: Lukáš Mikula, Michal Štefánik, Marek Petrovič, Petr Sojka

Abstract: While the Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model robustness by evaluating their models on out-of-distribution (OOD) datasets of the same task, but these datasets might share the bias of the training dataset. We propose a simple method for measuring a scale of models' reliance on any identified spurious feature and assess the robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA). We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods can not be explained by mitigated reliance on biased features, suggesting that biases are shared among different QA datasets. Finally, we evidence this to be the case by measuring that the performance of models trained on different QA datasets relies comparably on the same bias features. We hope these results will motivate future work to refine the reports of LMs' robustness to a level of adversarial samples addressing specific spurious features.

Submitted 6 February, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

Comments: Long paper in Proceedings of EACL 2024: Main track

arXiv:2304.01922 [pdf, other]

Resources and Few-shot Learners for In-context Learning in Slavic Languages

Authors: Michal Štefánik, Marek Kadlčík, Piotr Gramacki, Petr Sojka

Abstract: Despite the rapid recent progress in creating accurate and compact in-context learners, most recent work focuses on in-context learning (ICL) for tasks in English. However, the ability to interact with users of languages outside English presents a great potential for broadening the applicability of language technologies to non-English speakers. In this work, we collect the infrastructure necessary for training and evaluation of ICL in a selection of Slavic languages: Czech, Polish, and Russian. We link a diverse set of datasets and cast these into a unified instructional format through a set of transformations and newly-crafted templates written purely in target languages. Using the newly-curated dataset, we evaluate a set of the most recent in-context learners and compare their results to the supervised baselines. Finally, we train, evaluate and publish a set of in-context learning models that we train on the collected resources and compare their performance to previous work. We find that ICL models tuned in English are also able to learn some tasks from non-English contexts, but multilingual instruction fine-tuning consistently improves the ICL ability. We also find that the massive multitask training can be outperformed by single-task training in the target language, uncovering the potential for specializing in-context learners to the language(s) of their application.

Submitted 4 April, 2023; originally announced April 2023.

Comments: EACL 2023 SlavicNLP Long Paper. New instructional templates and models are available on https://github.com/fewshot-goes-multilingual/slavic-incontext-learning

arXiv:2211.16550 [pdf, other]

Soft Alignment Objectives for Robust Adaptation of Language Generation

Authors: Michal Štefánik, Marek Kadlčík, Petr Sojka

Abstract: Domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. However, the traditional adaptation by further training on in-domain data rapidly weakens the model's ability to generalize to other domains, making the open-ended deployments of the adapted models prone to errors. This work introduces novel training objectives built upon a semantic similarity of the predicted tokens to the reference. Our results show that (1) avoiding the common assumption of a single correct prediction by constructing the training target from tokens' semantic similarity can mitigate catastrophic forgetting during domain adaptation, while (2) preserving the quality of the adaptation, (3) with negligible additions to compute costs. In the broader context, the objectives grounded in a continuous token similarity pioneer the exploration of the middle ground between the efficient but naïve exact-match token-level objectives and expressive but computationally- and resource-intensive sequential objectives.

Submitted 26 May, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: Annual Meeting of The ACL 2023: Main conference long paper

arXiv:2206.06714 [pdf, other]

Interpretable Gait Recognition by Granger Causality

Authors: Michal Balazia, Katerina Hlavackova-Schindler, Petr Sojka, Claudia Plant

Abstract: Which joint interactions in the human gait cycle can be used as biometric characteristics? Most current methods on gait recognition suffer from the lack of interpretability. We propose an interpretable feature representation of gait sequences by the graphical Granger causal inference. Gait sequence of a person in the standardized motion capture format, constituting a set of 3D joint spatial trajectories, is envisaged as a causal system of joints interacting in time. We apply the graphical Granger model (GGM) to obtain the so-called Granger causal graph among joints as a discriminative and visually interpretable representation of a person's gait. We evaluate eleven distance functions in the GGM feature space by established classification and class-separability evaluation metrics. Our experiments indicate that, depending on the metric, the most appropriate distance functions for the GGM are the total norm distance and the Ky-Fan 1-norm distance. Experiments also show that the GGM is able to detect the most discriminative joint interactions and that it outperforms five related interpretable models in correct classification rate and in Davies-Bouldin index. The proposed GGM model can serve as a complementary tool for gait analysis in kinesiology or for gait recognition in video surveillance.

Submitted 7 December, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: Preprint. Full paper accepted at the IEEE/IAPR International Conference on Pattern Recognition (ICPR), Montreal, Canada, August 2022. 7 pages

MSC Class: 68T05; 68T10 ACM Class: I.5

arXiv:2203.03989 [pdf, other]

Adaptor: Objective-Centric Adaptation Framework for Language Models

Authors: Michal Štefánik, Vít Novotný, Nikola Groverová, Petr Sojka

Abstract: Progress in natural language processing research is catalyzed by the possibilities given by the widespread software frameworks. This paper introduces Adaptor library that transposes the traditional model-centric approach composed of pre-training + fine-tuning steps to objective-centric approach, composing the training process by applications of selected objectives. We survey research directions that can benefit from enhanced objective-centric experimentation in multitask training, custom objectives development, dynamic training curricula, or domain adaptation. Adaptor aims to ease reproducibility of these research directions in practice. Finally, we demonstrate the practical applicability of Adaptor in selected unsupervised domain adaptation scenarios.

Submitted 20 May, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: 60th Annual Meeting of the ACL (ACL 2022): System Demonstrations paper

arXiv:2110.04040 [pdf, other]

Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations

Authors: Michal Růžička, Petr Sojka

Abstract: In this paper, we investigate mathematical content representations suitable for the automated classification of and the similarity search in STEM documents using standard machine learning algorithms: the Latent Dirichlet Allocation (LDA) and the Latent Semantic Indexing (LSI). The methods are evaluated on a subset of arXiv.org papers with the Mathematics Subject Classification (MSC) as a reference classification and using the standard precision/recall/F1-measure metrics. The results give insight into how different math representations may influence the performance of the classification and similarity search tasks in STEM repositories. Non-surprisingly, machine learning methods are able to grab distributional semantics from textual tokens. A proper selection of weighted tokens representing math may improve the quality of the results slightly. A structured math representation that imitates successful text-processing techniques with math is shown to yield better results than flat TeX tokens.

Submitted 8 October, 2021; originally announced October 2021.

MSC Class: 97E40 (Primary) 00Axx; 68T50; 97-XX (Secondary) ACM Class: H.3; H.4; I.2; I.7; I.1

arXiv:2109.07242 [pdf, other]

Regressive Ensemble for Machine Translation Quality Evaluation

Authors: Michal Štefánik, Vít Novotný, Petr Sojka

Abstract: This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using a correlation to expert-based MQM scores of the WMT 2021 Metrics workshop. In both monolingual and zero-shot cross-lingual settings, we show a significant performance improvement over single metrics. In the cross-lingual settings, we also demonstrate that an ensemble approach is well-applicable to unseen languages. Furthermore, we identify a strong reference-free baseline that consistently outperforms the commonly-used BLEU and METEOR measures and significantly improves our ensemble's performance.

Submitted 15 September, 2021; originally announced September 2021.

Comments: 8 pages incl. references, Proceedings of EMNLP 2021 Sixth Conference on Machine Translation (WMT 21)

arXiv:2106.00411 [pdf, other]

WebMIaS on Docker: Deploying Math-Aware Search in a Single Line of Code

Authors: Dávid Lupták, Vít Novotný, Michal Štefánik, Petr Sojka

Abstract: Math informational retrieval (MIR) search engines are absent in the wide-spread production use, even though documents in the STEM fields contain many mathematical formulae, which are sometimes more important than text for understanding. We have developed and open-sourced the WebMIaS MIR search engine that has been successfully deployed in the European Digital Mathematics Library (EuDML). However, its deployment is difficult to automate due to the complexity of this task. Moreover, the solutions developed so far to tackle this challenge are imperfect in terms of speed, maintenance, and robustness. In this paper, we will describe the virtualization of WebMIaS using Docker that solves all three problems and allows anyone to deploy containerized WebMIaS in a single line of code. The publicly available Docker image will also help the community push the development of math-aware search engines in the ARQMath workshop series.

Submitted 14 July, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

Comments: Accepted to be published in: Intelligent Computer Mathematics 14th International Conference, CICM 2021, Timisoara, Romania, July 26--31, 2021, Proceedings, Fairouz Kamareddine and Claudio Sacerdotti-Coen (eds.), Lecture Notes in Artificial Intelligence, Springer, Cham, 2021

MSC Class: 68V35 (Primary); 68V30 (Secondary) ACM Class: H.3.3; H.3.4; H.3.5; H.3.6; H.3.7

arXiv:2104.09691 [pdf, other]

doi 10.3897/jucs.69619

When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting

Authors: Vít Novotný, Michal Štefánik, Eniafe Festus Ayetiran, Petr Sojka, Radim Řehůřek

Abstract: In 2018, Mikolov et al. introduced the positional language model, which has characteristics of attention-based neural machine translation models and which achieved state-of-the-art performance on the intrinsic word analogy task. However, the positional model is not practically fast and it has never been evaluated on qualitative criteria or extrinsic tasks. We propose a constrained positional model, which adapts the sparse attention mechanism from neural machine translation to improve the speed of the positional model. We evaluate the positional and constrained positional models on three novel qualitative criteria and on language modeling. We show that the positional and constrained positional models contain interpretable information about the grammatical properties of words and outperform other shallow models on language modeling. We also show that our constrained model outperforms the positional model on language modeling and trains twice as fast.

Submitted 28 February, 2022; v1 submitted 19 April, 2021; originally announced April 2021.

MSC Class: 68T50 ACM Class: I.2.7

Journal ref: J. Univers. Comput. Sci. 28:2 (2022) 181-201

arXiv:2103.00232 [pdf]

doi 10.1016/j.knosys.2021.106902

EDS-MEMBED: Multi-sense embeddings based on enhanced distributional semantic structures via a graph walk over word senses

Authors: Eniafe Festus Ayetiran, Petr Sojka, Vít Novotný

Abstract: Several language applications often require word semantics as a core part of their processing pipeline, either as precise meaning inference or semantic similarity. Multi-sense embeddings (M-SE) can be exploited for this important requirement. M-SE seeks to represent each word by their distinct senses in order to resolve the conflation of meanings of words as used in different contexts. Previous works usually approach this task by training a model on a large corpus and often ignore the effect and usefulness of the semantic relations offered by lexical resources. However, even with large training data, coverage of all possible word senses is still an issue. In addition, a considerable percentage of contextual semantic knowledge are never learned because a huge amount of possible distributional semantic structures are never explored. In this paper, we leverage the rich semantic structures in WordNet using a graph-theoretic walk technique over word senses to enhance the quality of multi-sense embeddings. This algorithm composes enriched texts from the original texts. Furthermore, we derive new distributional semantic similarity measures for M-SE from prior ones. We adapt these measures to word sense disambiguation (WSD) aspect of our experiment. We report evaluation results on 11 benchmark datasets involving WSD and Word Similarity tasks and show that our method for enhancing distributional semantic structures improves embeddings quality on the baselines. Despite the small training data, it achieves state-of-the-art performance on some of the datasets.

Submitted 27 February, 2021; originally announced March 2021.

MSC Class: 68T50 ACM Class: I.2.7

Journal ref: Knowledge-Based Systems. 219 (2021) 106902

arXiv:2102.02585 [pdf, other]

doi 10.26615/978-954-452-072-4_121

One Size Does Not Fit All: Finding the Optimal Subword Sizes for FastText Models across Languages

Authors: Vít Novotný, Eniafe Festus Ayetiran, Dalibor Bačovský, Dávid Lupták, Michal Štefánik, Petr Sojka

Abstract: Unsupervised representation learning of words from large multilingual corpora is useful for downstream tasks such as word sense disambiguation, semantic text similarity, and information retrieval. The representation precision of log-bilinear fastText models is mostly due to their use of subword information. In previous work, the optimization of fastText's subword sizes has not been fully explored, and non-English fastText models were trained using subword sizes optimized for English and German word analogy tasks. In our work, we find the optimal subword sizes on the English, German, Czech, Italian, Spanish, French, Hindi, Turkish, and Russian word analogy tasks. We then propose a simple n-gram coverage model and we show that it predicts better-than-default subword sizes on the Spanish, French, Hindi, Turkish, and Russian word analogy tasks. We show that the optimization of fastText's subword sizes matters and results in a 14% improvement on the Czech word analogy task. We also show that expensive parameter optimization can be replaced by a simple n-gram coverage model that consistently improves the accuracy of fastText models on the word analogy tasks by up to 3% compared to the default subword sizes, and that it is within 1% accuracy of the optimal subword sizes.

Submitted 20 September, 2021; v1 submitted 4 February, 2021; originally announced February 2021.

MSC Class: 68T50 ACM Class: I.2.7

Journal ref: RANLP (2021) 1072-1078

arXiv:2003.05019 [pdf, other]

Text classification with word embedding regularization and soft similarity measure

Authors: Vít Novotný, Eniafe Festus Ayetiran, Michal Štefánik, Petr Sojka

Abstract: Since the seminal work of Mikolov et al., word embeddings have become the preferred word representations for many natural language processing tasks. Document similarity measures extracted from word embeddings, such as the soft cosine measure (SCM) and the Word Mover's Distance (WMD), were reported to achieve state-of-the-art performance on semantic text similarity and text classification. Despite the strong performance of the WMD on text classification and semantic text similarity, its super-cubic average time complexity is impractical. The SCM has quadratic worst-case time complexity, but its performance on text classification has never been compared with the WMD. Recently, two word embedding regularization techniques were shown to reduce storage and memory costs, and to improve training speed, document processing speed, and task performance on word analogy, word similarity, and semantic text similarity. However, the effect of these techniques on text classification has not yet been studied. In our work, we investigate the individual and joint effect of the two word embedding regularization techniques on the document processing speed and the task performance of the SCM and the WMD on text classification. For evaluation, we use the $k$NN classifier and six standard datasets: BBCSPORT, TWITTER, OHSUMED, REUTERS-21578, AMAZON, and 20NEWS. We show 39% average $k$NN test error reduction with regularized word embeddings compared to non-regularized word embeddings. We describe a practical procedure for deriving such regularized embeddings through Cholesky factorization. We also show that the SCM with regularized word embeddings significantly outperforms the WMD on text classification and is over 10,000 times faster.

Submitted 10 March, 2020; originally announced March 2020.

MSC Class: 68P20 ACM Class: F.2.1; G.1.3; H.3.3; I.2.7

arXiv:1808.09224 [pdf, other]

doi 10.1145/3269206.3269233

MIaS: Math-Aware Retrieval in Digital Mathematical Libraries

Authors: Petr Sojka, Michal Růžička, Vít Novotný

Abstract: Digital mathematical libraries (DMLs) such as arXiv, Numdam, and EuDML contain mainly documents from STEM fields, where mathematical formulae are often more important than text for understanding. Conventional information retrieval (IR) systems are unable to represent formulae and they are therefore ill-suited for math information retrieval (MIR). To fill the gap, we have developed, and open-sourced the MIaS MIR system. MIaS is based on the full-text search engine Apache Lucene. On top of text retrieval, MIaS also incorporates a set of tools for preprocessing mathematical formulae. We describe the design of the system and present speed, and quality evaluation results. We show that MIaS is both efficient, and effective, as evidenced by our victory in the NTCIR-11 Math-2 task.

Submitted 28 August, 2018; originally announced August 2018.

Comments: This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in The 27th ACM International Conference on Information and Knowledge Management (CIKM '18), October 22-26, 2018, Torino, Italy, https://doi.org/10.1145/3269206.3269233

arXiv:1708.07755 [pdf, other]

doi 10.1145/3152124

Gait Recognition from Motion Capture Data

Authors: Michal Balazia, Petr Sojka

Abstract: Gait recognition from motion capture data, as a pattern classification discipline, can be improved by the use of machine learning. This paper contributes to the state-of-the-art with a statistical approach for extracting robust gait features directly from raw data by a modification of Linear Discriminant Analysis with Maximum Margin Criterion. Experiments on the CMU MoCap database show that the suggested method outperforms thirteen relevant methods based on geometric features and a method to learn the features by a combination of Principal Component Analysis and Linear Discriminant Analysis. The methods are evaluated in terms of the distribution of biometric templates in respective feature spaces expressed in a number of class separability coefficients and classification metrics. Results also indicate a high portability of learned features, that means, we can learn what aspects of walk people generally differ in and extract those as general gait features. Recognizing people without needing group-specific features is convenient as particular people might not always provide annotated learning data. As a contribution to reproducible research, our evaluation framework and database have been made publicly available. This research makes motion capture technology directly applicable for human recognition.

Submitted 7 December, 2022; v1 submitted 24 August, 2017; originally announced August 2017.

Comments: Preprint. Full paper accepted at the ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), special issue on Representation, Analysis and Recognition of 3D Humans, February 2018. 18 pages. arXiv admin note: substantial text overlap with arXiv:1701.00995, arXiv:1609.04392, arXiv:1609.06936

MSC Class: 68T05; 68T10 ACM Class: I.5

arXiv:1706.09443 [pdf, other]

doi 10.1109/BTAS.2017.8272700

You Are How You Walk: Uncooperative MoCap Gait Identification for Video Surveillance with Incomplete and Noisy Data

Authors: Michal Balazia, Petr Sojka

Abstract: This work offers a design of a video surveillance system based on a soft biometric -- gait identification from MoCap data. The main focus is on two substantial issues of the video surveillance scenario: (1) the walkers do not cooperate in providing learning data to establish their identities and (2) the data are often noisy or incomplete. We show that only a few examples of human gait cycles are required to learn a projection of raw MoCap data onto a low-dimensional sub-space where the identities are well separable. Latent features learned by Maximum Margin Criterion (MMC) method discriminate better than any collection of geometric features. The MMC method is also highly robust to noisy data and works properly even with only a fraction of joints tracked. The overall workflow of the design is directly applicable for a day-to-day operation based on the available MoCap technology and algorithms for gait analysis. In the concept we introduce, a walker's identity is represented by a cluster of gait data collected at their incidents within the surveillance system: They are how they walk.

Submitted 7 December, 2022; v1 submitted 28 June, 2017; originally announced June 2017.

Comments: Preprint. Full paper accepted at the IEEE/IAPR International Joint Conference on Biometrics (IJCB), Denver, USA, October 2017. 8 pages

MSC Class: 68T05; 68T10 ACM Class: I.5

arXiv:1706.00957 [pdf, other]

Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines

Authors: Jan Rygl, Jan Pomikálek, Radim Řehůřek, Michal Růžička, Vít Novotný, Petr Sojka

Abstract: Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to `vector similarity searching' over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this approach allows the indexing and querying of dense vectors in text domains. This opens up exciting avenues for major efficiency gains, along with simpler deployment, scaling and monitoring. The end result is a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch. We empirically demonstrate its querying performance and quality by applying this solution to the task of semantic searching over a dense vector representation of the entire English Wikipedia.

Submitted 3 June, 2017; originally announced June 2017.

Comments: Preprint of the paper accepted to the ACL 2017 (http://acl2017.org/) workshop RepL4NLP 2017 (https://sites.google.com/site/repl4nlp2017/)

arXiv:1701.00995 [pdf, other]

doi 10.1007/978-3-319-56414-2_3

An Evaluation Framework and Database for MoCap-Based Gait Recognition Methods

Authors: Michal Balazia, Petr Sojka

Abstract: As a contribution to reproducible research, this paper presents a framework and a database to improve the development, evaluation and comparison of methods for gait recognition from motion capture (MoCap) data. The evaluation framework provides implementation details and source codes of state-of-the-art human-interpretable geometric features as well as our own approaches where gait features are learned by a modification of Fisher's Linear Discriminant Analysis with the Maximum Margin Criterion, and by a combination of Principal Component Analysis and Linear Discriminant Analysis. It includes a description and source codes of a mechanism for evaluating four class separability coefficients of feature space and four rank-based classifier performance metrics. This framework also contains a tool for learning a custom classifier and for classifying a custom query on a custom gallery. We provide an experimental database along with source codes for its extraction from the general CMU MoCap database.

Submitted 7 December, 2022; v1 submitted 4 January, 2017; originally announced January 2017.

Comments: Preprint. Full paper published at the 1st IAPR Workshop on Proceedings of Reproducible Research in Pattern Recognition (RRPR), Cancun, Mexico, December 2016. 13 pages. arXiv admin note: text overlap with arXiv:1609.06936

MSC Class: 68T05; 68T10 ACM Class: I.5

arXiv:1609.06936 [pdf, other]

doi 10.1007/978-3-319-49055-7_28

Walker-Independent Features for Gait Recognition from Motion Capture Data

Authors: Michal Balazia, Petr Sojka

Abstract: MoCap-based human identification, as a pattern recognition discipline, can be optimized using a machine learning approach. Yet in some applications such as video surveillance new identities can appear on the fly and labeled data for all encountered people may not always be available. This work introduces the concept of learning walker-independent gait features directly from raw joint coordinates by a modification of the Fisher Linear Discriminant Analysis with Maximum Margin Criterion. Our new approach shows not only that these features can discriminate different people than who they are learned on, but also that the number of learning identities can be much smaller than the number of walkers encountered in the real operation.

Submitted 7 December, 2022; v1 submitted 22 September, 2016; originally announced September 2016.

Comments: Preprint. Full paper published at the Joint IAPR International Workshops on Structural and Syntactic Pattern Recognition and Statistical Techniques in Pattern Recognition (S+SSPR), Merida, Mexico, November 2016. 11 pages. arXiv admin note: substantial text overlap with arXiv:1609.04392

MSC Class: 68T05; 68T10 ACM Class: I.5

arXiv:1609.04392 [pdf, other]

doi 10.1109/ICPR.2016.7899750

Learning Robust Features for Gait Recognition by Maximum Margin Criterion

Authors: Michal Balazia, Petr Sojka

Abstract: In the field of gait recognition from motion capture data, designing human-interpretable gait features is a common practice of many fellow researchers. To refrain from ad-hoc schemes and to find maximally discriminative features we may need to explore beyond the limits of human interpretability. This paper contributes to the state-of-the-art with a machine learning approach for extracting robust gait features directly from raw joint coordinates. The features are learned by a modification of Linear Discriminant Analysis with Maximum Margin Criterion so that the identities are maximally separated and, in combination with an appropriate classifier, used for gait recognition. Experiments on the CMU MoCap database show that this method outperforms eight other relevant methods in terms of the distribution of biometric templates in respective feature spaces expressed in four class separability coefficients. Additional experiments indicate that this method is a leading concept for rank-based classifier systems.

Submitted 7 December, 2022; v1 submitted 14 September, 2016; originally announced September 2016.

Comments: Preprint. Full paper published at the 23rd IEEE/IAPR International Conference on Pattern Recognition (ICPR), Cancun, Mexico, December 2016. 6 pages

MSC Class: 68T05; 68T10 ACM Class: I.5

arXiv:1508.01929 [pdf, other]

doi 10.1145/2810355.2810359

Combining Text and Formula Queries in Math Information Retrieval: Evaluation of Query Results Merging Strategies

Authors: Martin Líška, Petr Sojka, Michal Růžička

Abstract: Specific to Math Information Retrieval is combining text with mathematical formulae both in documents and in queries. Rigorous evaluation of query expansion and merging strategies combining math and standard textual keyword terms in a query are given. It is shown that techniques similar to those known from textual query processing may be applied in math information retrieval as well, and lead to a cutting edge performance. Striping and merging partial results from subqueries is one technique that improves results measured by information retrieval evaluation metrics like Bpref.

Submitted 8 August, 2015; originally announced August 2015.

ACM Class: H.3.3; I.7

arXiv:1404.6476 [pdf, other]

Math Indexer and Searcher Web Interface: Towards Fulfillment of Mathematicians' Information Needs

Authors: Martin Líška, Petr Sojka, Michal Růžička

Abstract: We are designing and developing a web user interface for digital mathematics libraries called WebMIaS. It allows queries to be expressed by mathematicians through a faceted search interface. Users can combine standard textual autocompleted keywords with keywords in the form of mathematical formulae in LaTeX or MathML formats. Formulae are shown rendered by the web browser on-the-fly for users' feedback. We describe WebMIaS design principles and our experiences deploying in the European Digital Mathematics Library (EuDML). We further describe the issues addressed by formulae canonicalization and by extending the MIaS indexing engine with Content MathML support.

Submitted 19 May, 2014; v1 submitted 25 April, 2014; originally announced April 2014.

Comments: Preprint of CICM 2014 (http://cicm-conference.org/2014/) paper: S.M. Watt et al. (Eds.): CICM 2014, LNAI 8543, pp. 444-448, Springer International Publishing Switzerland 2014

ACM Class: H.3.7; H.3.6; H.5.3

Showing 1–22 of 22 results for author: Sojka, P