Showing 1–22 of 22 results for author: Sojka, P

  1. arXiv:2403.09703

    cs.CL cs.AI

    Concept-aware Data Construction Improves In-context Learning of Language Models

    Authors: Michal Štefánik, Marek Kadlčík, Petr Sojka

    Abstract: Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs' ability to perform a new task solely from natural-language instruction. Previous work curating in-context learners assumes that ICL emerges from a vast over-parametrization or the scale of multi-task training. However, recent theoretical work attributes the ICL ability to concept-dependent training d…

    Submitted 28 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

    Comments: Long paper to appear in Findings of ACL 2024

  2. arXiv:2305.06841

    cs.CL cs.AI

    Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

    Authors: Lukáš Mikula, Michal Štefánik, Marek Petrovič, Petr Sojka

    Abstract: While the Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model robustness by evaluating their models on out-of-distribution (OOD) datasets of the same task, but these datasets might share the bias of the training dataset.…

    Submitted 6 February, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: Long paper in Proceedings of EACL 2024: Main track

  3. arXiv:2304.01922

    cs.CL

    Resources and Few-shot Learners for In-context Learning in Slavic Languages

    Authors: Michal Štefánik, Marek Kadlčík, Piotr Gramacki, Petr Sojka

    Abstract: Despite the rapid recent progress in creating accurate and compact in-context learners, most recent work focuses on in-context learning (ICL) for tasks in English. However, the ability to interact with users of languages outside English presents a great potential for broadening the applicability of language technologies to non-English speakers. In this work, we collect the infrastructure necessa…

    Submitted 4 April, 2023; originally announced April 2023.

    Comments: EACL 2023 SlavicNLP Long Paper. New instructional templates and models are available on https://github.com/fewshot-goes-multilingual/slavic-incontext-learning

  4. arXiv:2211.16550

    cs.CL cs.AI cs.NE

    Soft Alignment Objectives for Robust Adaptation of Language Generation

    Authors: Michal Štefánik, Marek Kadlčík, Petr Sojka

    Abstract: Domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. However, the traditional adaptation by further training on in-domain data rapidly weakens the model's ability to generalize to other domains, making the open-ended deployments of the adapted models prone to errors. This work introduces novel training objectives built upon…

    Submitted 26 May, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

    Comments: Annual Meeting of The ACL 2023: Main conference long paper

  5. arXiv:2206.06714

    cs.CV

    Interpretable Gait Recognition by Granger Causality

    Authors: Michal Balazia, Katerina Hlavackova-Schindler, Petr Sojka, Claudia Plant

    Abstract: Which joint interactions in the human gait cycle can be used as biometric characteristics? Most current methods on gait recognition suffer from the lack of interpretability. We propose an interpretable feature representation of gait sequences by the graphical Granger causal inference. Gait sequence of a person in the standardized motion capture format, constituting a set of 3D joint spatial trajec…

    Submitted 7 December, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

    Comments: Preprint. Full paper accepted at the IEEE/IAPR International Conference on Pattern Recognition (ICPR), Montreal, Canada, August 2022. 7 pages

    MSC Class: 68T05; 68T10 ACM Class: I.5

  6. arXiv:2203.03989

    cs.CL cs.AI cs.LG

    Adaptor: Objective-Centric Adaptation Framework for Language Models

    Authors: Michal Štefánik, Vít Novotný, Nikola Groverová, Petr Sojka

    Abstract: Progress in natural language processing research is catalyzed by the possibilities given by the widespread software frameworks. This paper introduces Adaptor library that transposes the traditional model-centric approach composed of pre-training + fine-tuning steps to objective-centric approach, composing the training process by applications of selected objectives. We survey research directions th…

    Submitted 20 May, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

    Comments: 60th Annual Meeting of the ACL (ACL 2022): System Demonstrations paper

  7. arXiv:2110.04040

    cs.IR cs.AI cs.CL

    Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations

    Authors: Michal Růžička, Petr Sojka

    Abstract: In this paper, we investigate mathematical content representations suitable for the automated classification of and the similarity search in STEM documents using standard machine learning algorithms: the Latent Dirichlet Allocation (LDA) and the Latent Semantic Indexing (LSI). The methods are evaluated on a subset of arXiv.org papers with the Mathematics Subject Classification (MSC) as a reference…

    Submitted 8 October, 2021; originally announced October 2021.

    MSC Class: 97E40 (Primary) 00Axx; 68T50; 97-XX (Secondary) ACM Class: H.3; H.4; I.2; I.7; I.1

  8. arXiv:2109.07242

    cs.CL

    Regressive Ensemble for Machine Translation Quality Evaluation

    Authors: Michal Štefánik, Vít Novotný, Petr Sojka

    Abstract: This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using a correlation to expert-based MQM scores of the WMT 2021 Metrics workshop. In both monolingual and zero-shot cross-lingual settings, we show a significant performance improvement over single metrics. In the cross-lingual settin…

    Submitted 15 September, 2021; originally announced September 2021.

    Comments: 8 pages incl. references, Proceedings of EMNLP 2021 Sixth Conference on Machine Translation (WMT 21)

  9. arXiv:2106.00411

    cs.DL cs.IR

    WebMIaS on Docker: Deploying Math-Aware Search in a Single Line of Code

    Authors: Dávid Lupták, Vít Novotný, Michal Štefánik, Petr Sojka

    Abstract: Math informational retrieval (MIR) search engines are absent in the wide-spread production use, even though documents in the STEM fields contain many mathematical formulae, which are sometimes more important than text for understanding. We have developed and open-sourced the WebMIaS MIR search engine that has been successfully deployed in the European Digital Mathematics Library (EuDML). However,…

    Submitted 14 July, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

    Comments: Accepted to be published in: Intelligent Computer Mathematics 14th International Conference, CICM 2021, Timisoara, Romania, July 26--31, 2021, Proceedings, Fairouz Kamareddine and Claudio Sacerdotti-Coen (eds.), Lecture Notes in Artificial Intelligence, Springer, Cham, 2021

    MSC Class: 68V35 (Primary); 68V30 (Secondary) ACM Class: H.3.3; H.3.4; H.3.5; H.3.6; H.3.7

  10. When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting

    Authors: Vít Novotný, Michal Štefánik, Eniafe Festus Ayetiran, Petr Sojka, Radim Řehůřek

    Abstract: In 2018, Mikolov et al. introduced the positional language model, which has characteristics of attention-based neural machine translation models and which achieved state-of-the-art performance on the intrinsic word analogy task. However, the positional model is not practically fast and it has never been evaluated on qualitative criteria or extrinsic tasks. We propose a constrained positional model…

    Submitted 28 February, 2022; v1 submitted 19 April, 2021; originally announced April 2021.

    MSC Class: 68T50 ACM Class: I.2.7

    Journal ref: J. Univers. Comput. Sci. 28:2 (2022) 181-201

  11. EDS-MEMBED: Multi-sense embeddings based on enhanced distributional semantic structures via a graph walk over word senses

    Authors: Eniafe Festus Ayetiran, Petr Sojka, Vít Novotný

    Abstract: Several language applications often require word semantics as a core part of their processing pipeline, either as precise meaning inference or semantic similarity. Multi-sense embeddings (M-SE) can be exploited for this important requirement. M-SE seeks to represent each word by their distinct senses in order to resolve the conflation of meanings of words as used in different contexts. Previous wo…

    Submitted 27 February, 2021; originally announced March 2021.

    MSC Class: 68T50 ACM Class: I.2.7

    Journal ref: Knowledge-Based Systems. 219 (2021) 106902

  12. One Size Does Not Fit All: Finding the Optimal Subword Sizes for FastText Models across Languages

    Authors: Vít Novotný, Eniafe Festus Ayetiran, Dalibor Bačovský, Dávid Lupták, Michal Štefánik, Petr Sojka

    Abstract: Unsupervised representation learning of words from large multilingual corpora is useful for downstream tasks such as word sense disambiguation, semantic text similarity, and information retrieval. The representation precision of log-bilinear fastText models is mostly due to their use of subword information. In previous work, the optimization of fastText's subword sizes has not been fully explored,…

    Submitted 20 September, 2021; v1 submitted 4 February, 2021; originally announced February 2021.

    MSC Class: 68T50 ACM Class: I.2.7

    Journal ref: RANLP (2021) 1072-1078

  13. arXiv:2003.05019

    cs.IR cs.CL cs.LG

    Text classification with word embedding regularization and soft similarity measure

    Authors: Vít Novotný, Eniafe Festus Ayetiran, Michal Štefánik, Petr Sojka

    Abstract: Since the seminal work of Mikolov et al., word embeddings have become the preferred word representations for many natural language processing tasks. Document similarity measures extracted from word embeddings, such as the soft cosine measure (SCM) and the Word Mover's Distance (WMD), were reported to achieve state-of-the-art performance on semantic text similarity and text classification. Despit…

    Submitted 10 March, 2020; originally announced March 2020.

    MSC Class: 68P20 ACM Class: F.2.1; G.1.3; H.3.3; I.2.7

  14. MIaS: Math-Aware Retrieval in Digital Mathematical Libraries

    Authors: Petr Sojka, Michal Růžička, Vít Novotný

    Abstract: Digital mathematical libraries (DMLs) such as arXiv, Numdam, and EuDML contain mainly documents from STEM fields, where mathematical formulae are often more important than text for understanding. Conventional information retrieval (IR) systems are unable to represent formulae and they are therefore ill-suited for math information retrieval (MIR). To fill the gap, we have developed, and open-source…

    Submitted 28 August, 2018; originally announced August 2018.

    Comments: This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in The 27th ACM International Conference on Information and Knowledge Management (CIKM '18), October 22-26, 2018, Torino, Italy, https://doi.org/10.1145/3269206.3269233

  15. Gait Recognition from Motion Capture Data

    Authors: Michal Balazia, Petr Sojka

    Abstract: Gait recognition from motion capture data, as a pattern classification discipline, can be improved by the use of machine learning. This paper contributes to the state-of-the-art with a statistical approach for extracting robust gait features directly from raw data by a modification of Linear Discriminant Analysis with Maximum Margin Criterion. Experiments on the CMU MoCap database show that the su…

    Submitted 7 December, 2022; v1 submitted 24 August, 2017; originally announced August 2017.

    Comments: Preprint. Full paper accepted at the ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), special issue on Representation, Analysis and Recognition of 3D Humans, February 2018. 18 pages. arXiv admin note: substantial text overlap with arXiv:1701.00995, arXiv:1609.04392, arXiv:1609.06936

    MSC Class: 68T05; 68T10 ACM Class: I.5

  16. You Are How You Walk: Uncooperative MoCap Gait Identification for Video Surveillance with Incomplete and Noisy Data

    Authors: Michal Balazia, Petr Sojka

    Abstract: This work offers a design of a video surveillance system based on a soft biometric -- gait identification from MoCap data. The main focus is on two substantial issues of the video surveillance scenario: (1) the walkers do not cooperate in providing learning data to establish their identities and (2) the data are often noisy or incomplete. We show that only a few examples of human gait cycles are r…

    Submitted 7 December, 2022; v1 submitted 28 June, 2017; originally announced June 2017.

    Comments: Preprint. Full paper accepted at the IEEE/IAPR International Joint Conference on Biometrics (IJCB), Denver, USA, October 2017. 8 pages

    MSC Class: 68T05; 68T10 ACM Class: I.5

  17. arXiv:1706.00957

    cs.IR

    Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines

    Authors: Jan Rygl, Jan Pomikálek, Radim Řehůřek, Michal Růžička, Vít Novotný, Petr Sojka

    Abstract: Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to 'vector similarity searching' over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this…

    Submitted 3 June, 2017; originally announced June 2017.

    Comments: Preprint of the paper accepted to the ACL 2017 (http://acl2017.org/) workshop RepL4NLP 2017 (https://sites.google.com/site/repl4nlp2017/)

  18. An Evaluation Framework and Database for MoCap-Based Gait Recognition Methods

    Authors: Michal Balazia, Petr Sojka

    Abstract: As a contribution to reproducible research, this paper presents a framework and a database to improve the development, evaluation and comparison of methods for gait recognition from motion capture (MoCap) data. The evaluation framework provides implementation details and source codes of state-of-the-art human-interpretable geometric features as well as our own approaches where gait features are le…

    Submitted 7 December, 2022; v1 submitted 4 January, 2017; originally announced January 2017.

    Comments: Preprint. Full paper published at the 1st IAPR Workshop on Proceedings of Reproducible Research in Pattern Recognition (RRPR), Cancun, Mexico, December 2016. 13 pages. arXiv admin note: text overlap with arXiv:1609.06936

    MSC Class: 68T05; 68T10 ACM Class: I.5

  19. Walker-Independent Features for Gait Recognition from Motion Capture Data

    Authors: Michal Balazia, Petr Sojka

    Abstract: MoCap-based human identification, as a pattern recognition discipline, can be optimized using a machine learning approach. Yet in some applications such as video surveillance new identities can appear on the fly and labeled data for all encountered people may not always be available. This work introduces the concept of learning walker-independent gait features directly from raw joint coordinates b…

    Submitted 7 December, 2022; v1 submitted 22 September, 2016; originally announced September 2016.

    Comments: Preprint. Full paper published at the Joint IAPR International Workshops on Structural and Syntactic Pattern Recognition and Statistical Techniques in Pattern Recognition (S+SSPR), Merida, Mexico, November 2016. 11 pages. arXiv admin note: substantial text overlap with arXiv:1609.04392

    MSC Class: 68T05; 68T10 ACM Class: I.5

  20. Learning Robust Features for Gait Recognition by Maximum Margin Criterion

    Authors: Michal Balazia, Petr Sojka

    Abstract: In the field of gait recognition from motion capture data, designing human-interpretable gait features is a common practice of many fellow researchers. To refrain from ad-hoc schemes and to find maximally discriminative features we may need to explore beyond the limits of human interpretability. This paper contributes to the state-of-the-art with a machine learning approach for extracting robust g…

    Submitted 7 December, 2022; v1 submitted 14 September, 2016; originally announced September 2016.

    Comments: Preprint. Full paper published at the 23rd IEEE/IAPR International Conference on Pattern Recognition (ICPR), Cancun, Mexico, December 2016. 6 pages

    MSC Class: 68T05; 68T10 ACM Class: I.5

  21. Combining Text and Formula Queries in Math Information Retrieval: Evaluation of Query Results Merging Strategies

    Authors: Martin Líška, Petr Sojka, Michal Růžička

    Abstract: Specific to Math Information Retrieval is combining text with mathematical formulae both in documents and in queries. Rigorous evaluation of query expansion and merging strategies combining math and standard textual keyword terms in a query are given. It is shown that techniques similar to those known from textual query processing may be applied in math information retrieval as well, and lead to a…

    Submitted 8 August, 2015; originally announced August 2015.

    ACM Class: H.3.3; I.7

  22. arXiv:1404.6476

    cs.DL

    Math Indexer and Searcher Web Interface: Towards Fulfillment of Mathematicians' Information Needs

    Authors: Martin Líška, Petr Sojka, Michal Růžička

    Abstract: We are designing and developing a web user interface for digital mathematics libraries called WebMIaS. It allows queries to be expressed by mathematicians through a faceted search interface. Users can combine standard textual autocompleted keywords with keywords in the form of mathematical formulae in LaTeX or MathML formats. Formulae are shown rendered by the web browser on-the-fly for users' fee…

    Submitted 19 May, 2014; v1 submitted 25 April, 2014; originally announced April 2014.

    Comments: Preprint of CICM 2014 (http://cicm-conference.org/2014/) paper: S.M. Watt et al. (Eds.): CICM 2014, LNAI 8543, pp. 444-448, Springer International Publishing Switzerland 2014

    ACM Class: H.3.7; H.3.6; H.5.3