Search | arXiv e-print repository

arXiv:1911.00466 [pdf, other]

VR with Older Adults: Participatory Design of a Virtual ATM Training Simulation

Authors: Wiesław Kopeć, Marcin Wichrowski, Krzysztof Kalinowski, Anna Jaskulska, Kinga Skorupska, Daniel Cnotkowski, Jakub Tyszka, Agata Popieluch, Anna Voitenkova, Rafał Masłyk, Piotr Gago, Maciej Krzywicki, Monika Kornacka, Cezary Biele, Paweł Kobyliński, Jarosław Kowalski, Katarzyna Abramczuk, Aldona Zdrodowska, Grzegorz Pochwatko, Jakub Możaryn, Krzysztof Marasek

Abstract: In this paper we report on a study conducted with a group of older adults in which they engaged in participatory design workshops to create a VR ATM training simulation. Based on observation, recordings and the developed VR application we present the results of the workshops and offer considerations and recommendations for organizing opportunities for end users, in this case older adults, to directly engage in co-creation of cutting-edge ICT solutions. These include co-designing interfaces and interaction schemes for emerging technologies like VR and AR. We discuss such aspects as user engagement and hardware and software tools suitable for participatory prototyping of VR applications. Finally, we present ideas for further research in the area of VR participatory prototyping with users of various proficiency levels, taking steps towards developing a unified framework for co-design in AR and VR.

Submitted 1 November, 2019; originally announced November 2019.

ACM Class: H.5; D.2.m; H.5.2

arXiv:1903.07195 [pdf, other]

doi 10.1145/3290607.3312973

Older Adults and Voice Interaction: A Pilot Study with Google Home

Authors: Jarosław Kowalski, Anna Jaskulska, Kinga Skorupska, Katarzyna Abramczuk, Cezary Biele, Wiesław Kopeć, Krzysztof Marasek

Abstract: In this paper we present the results of an exploratory study examining the potential of voice assistants (VA) for some groups of older adults in the context of Smart Home Technology (SHT). To research the aspect of older adults' interaction with voice user interfaces (VUI) we organized two workshops and gathered insights concerning possible benefits and barriers to the use of VA combined with SHT by older adults. Apart from evaluating the participants' interaction with the devices during the two workshops we also discuss some improvements to the VA interaction paradigm.

Submitted 17 March, 2019; originally announced March 2019.

arXiv:1811.02213 [pdf, ps, other]

Hybrid Approach to Automation, RPA and Machine Learning: a Method for the Human-centered Design of Software Robots

Authors: Wiesław Kopeć, Marcin Skibiński, Cezary Biele, Kinga Skorupska, Dominika Tkaczyk, Anna Jaskulska, Katarzyna Abramczuk, Piotr Gago, Krzysztof Marasek

Abstract: One of the more prominent trends within Industry 4.0 is the drive to employ Robotic Process Automation (RPA), especially as one of the elements of the Lean approach. The full implementation of RPA is riddled with challenges relating both to the reality of everyday business operations, from SMEs to SSCs and beyond, and the social effects of the changing job market. To successfully address these points there is a need to develop a solution that would adjust to the existing business operations and at the same time lower the negative social impact of the automation process. To achieve these goals we propose a hybrid, human-centered approach to the development of software robots. This design and implementation method combines the Living Lab approach with empowerment through participatory design to kick-start the co-development and co-maintenance of hybrid software robots which, supported by a variety of AI methods and tools, including interactive and collaborative ML in the cloud, transform menial job posts into higher-skilled positions, allowing former employees to stay on as robot co-designers and maintainers, i.e. as co-programmers who supervise the machine learning processes with the use of tailored high-level RPA Domain Specific Languages (DSLs) to adjust the functioning of the robots and maintain operational flexibility.

Submitted 6 November, 2018; originally announced November 2018.

ACM Class: K.4.3; K.6.3; H.5.2; D.2.11

arXiv:1804.02119 [pdf, other]

Impact of ultrasound image reconstruction method on breast lesion classification with neural transfer learning

Authors: Michal Byra, Tomasz Sznajder, Danijel Korzinek, Hanna Piotrzkowska-Wroblewska, Katarzyna Dobruch-Sobczak, Andrzej Nowicki, Krzysztof Marasek

Abstract: Deep learning algorithms, especially convolutional neural networks, have become a methodology of choice in medical image analysis. However, recent studies in computer vision show that even a small modification of input image intensities may cause a deep learning model to classify the image differently. In medical imaging, the distribution of image intensities is related to the applied image reconstruction algorithm. In this paper we investigate the impact of ultrasound image reconstruction method on breast lesion classification with neural transfer learning. Due to their high dynamic range, raw ultrasonic signals are commonly compressed in order to reconstruct B-mode images. Based on raw data acquired from breast lesions, we reconstruct B-mode images using different compression levels. Next, transfer learning is applied for classification. Differently reconstructed images are employed for training and evaluation. We show that the modification of the reconstruction algorithm leads to a decrease in classification performance. As a remedy, we propose a method of data augmentation. We show that the augmentation of the training set with differently reconstructed B-mode images leads to a more robust and efficient classification. Our study suggests that it is important to take into account image reconstruction algorithms implemented in medical scanners during development of computer aided diagnosis systems.

Submitted 5 April, 2018; originally announced April 2018.

Comments: 6 pages, 5 figures, 3 tables

arXiv:1801.03002 [pdf, other]

DeepStyle: Multimodal Search Engine for Fashion and Interior Design

Authors: Ivona Tautkute, Tomasz Trzcinski, Aleksander Skorupa, Lukasz Brocki, Krzysztof Marasek

Abstract: In this paper, we propose a multimodal search engine that combines visual and textual cues to retrieve items from a multimedia database aesthetically similar to the query. The goal of our engine is to enable intuitive retrieval of fashion merchandise such as clothes or furniture. Existing search engines treat textual input only as an additional source of information about the query image and do not correspond to the real-life scenario where the user looks for 'the same shirt but of denim'. Our novel method, dubbed DeepStyle, mitigates those shortcomings by using a joint neural network architecture to model contextual dependencies between features of different modalities. We prove the robustness of this approach on two different challenging datasets of fashion items and furniture where our DeepStyle engine outperforms baseline methods by 18-21% on the tested datasets. Our search engine is commercially deployed and available through a Web-based application.

Submitted 20 February, 2019; v1 submitted 8 January, 2018; originally announced January 2018.

Comments: Copyright held by IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Journal ref: IEEE Access 2019

arXiv:1707.06907 [pdf, other]

doi 10.15439/2017F56

What Looks Good with my Sofa: Multimodal Search Engine for Interior Design

Authors: Ivona Tautkute, Aleksandra Możejko, Wojciech Stokowiec, Tomasz Trzciński, Łukasz Brocki, Krzysztof Marasek

Abstract: In this paper, we propose a multi-modal search engine for interior design that combines visual and textual queries. The goal of our engine is to retrieve interior objects, e.g. furniture or wall clocks, that share visual and aesthetic similarities with the query. Our search engine allows the user to take a photo of a room and retrieve with a high recall a list of items identical or visually similar to those present in the photo. Additionally, it can return other items that aesthetically and stylistically fit well together. To achieve this goal, our system blends the results obtained using textual and visual modalities. Thanks to this blending strategy, we increase the average style similarity score of the retrieved items by 11%. Our work is implemented as a Web-based application and it is planned to be opened to the public.

Submitted 8 January, 2018; v1 submitted 21 July, 2017; originally announced July 2017.

Comments: FEDCSIS 5th Conference on Multimedia, Interaction, Design and Innovation (MIDI), 2017

Journal ref: Proceedings of the 2017 Federated Conference on Computer Science and Information Systems

arXiv:1707.06806 [pdf, other]

doi 10.1007/978-3-319-60438-1_14

Shallow reading with Deep Learning: Predicting popularity of online content using only its title

Authors: Wojciech Stokowiec, Tomasz Trzcinski, Krzysztof Wolk, Krzysztof Marasek, Przemyslaw Rokita

Abstract: With the ever decreasing attention span of contemporary Internet users, the title of online content (such as a news article or video) can be a major factor in determining its popularity. To take advantage of this phenomenon, we propose a new method based on a bidirectional Long Short-Term Memory (LSTM) neural network designed to predict the popularity of online content using only its title. We evaluate the proposed architecture on two distinct datasets of news articles and news videos distributed in social media that contain over 40,000 samples in total. On those datasets, our approach improves the performance over traditional shallow approaches by a margin of 15%. Additionally, we show that using pre-trained word vectors in the embedding layer improves the results of LSTM models, especially when the training set is small. To our knowledge, this is the first attempt of applying popularity prediction using only textual information from the title.

Submitted 21 July, 2017; originally announced July 2017.

arXiv:1706.00245 [pdf, other]

Polish Read Speech Corpus for Speech Tools and Services

Authors: Danijel Koržinek, Krzysztof Marasek, Łukasz Brocki, Krzysztof Wołk

Abstract: This paper describes the speech processing activities conducted at the Polish consortium of the CLARIN project. The purpose of this segment of the project was to develop specific tools that would allow for automatic and semi-automatic processing of large quantities of acoustic speech data. The tools include the following: grapheme-to-phoneme conversion, speech-to-text alignment, voice activity detection, speaker diarization, keyword spotting and automatic speech transcription. Furthermore, in order to develop these tools, a large high-quality studio speech corpus was recorded and released under an open license, to encourage development in the area of Polish speech research. Another purpose of the corpus was to serve as a reference for studies in phonetics and pronunciation. All the tools and resources were released on the Polish CLARIN website. This paper discusses the current status and future plans for the project.

Submitted 1 June, 2017; originally announced June 2017.

arXiv:1603.06785 [pdf]

Multi-domain machine translation enhancements by parallel data extraction from comparable corpora

Authors: Krzysztof Wołk, Emilia Rejmund, Krzysztof Marasek

Abstract: Parallel texts are a relatively rare language resource, however, they constitute a very useful research material with a wide range of applications. This study presents and analyses new methodologies we developed for obtaining such data from previously built comparable corpora. The methodologies are automatic and unsupervised which makes them good for large scale research. The task is highly practical as non-parallel multilingual data occur much more frequently than parallel corpora and accessing them is easy, although parallel sentences are a considerably more useful resource. In this study, we propose a method of automatic web crawling in order to build topic-aligned comparable corpora, e.g. based on the Wikipedia or Euronews.com. We also developed new methods of obtaining parallel sentences from comparable data and proposed methods of filtration of corpora capable of selecting inconsistent or only partially equivalent translations. Our methods are easily scalable to other languages. Evaluation of the quality of the created corpora was performed by analysing the impact of their use on statistical machine translation systems. Experiments were presented on the basis of the Polish-English language pair for texts from different domains, i.e. lectures, phrasebooks, film dialogues, European Parliament proceedings and texts contained in medicine leaflets. We also tested a second method of creating parallel corpora based on data from comparable corpora which allows for automatically expanding the existing corpus of sentences about a given domain on the basis of analogies found between them. It does not require, therefore, having past parallel resources in order to train a classifier.

Submitted 22 March, 2016; originally announced March 2016.

Comments: parallel corpus, Polish, English, machine learning, comparable corpora, NLP. in Gruszczyńska, Ewa; Leńko-Szymańska, Agnieszka, red. (2016). Polskojęzyczne korpusy równoległe. Polish-language Parallel Corpora. Warszawa: Instytut Lingwistyki Stosowanej. ISBN: 978-83-935320-4

arXiv:1512.01641 [pdf]

Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents

Authors: Krzysztof Wołk, Krzysztof Marasek

Abstract: The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely-available resource, but they are limited and do not provide enough coverage for good quality translation purposes, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which are unfortunately dependent on the quantity and quality of training data. Such systems have a very limited availability especially for some languages and very narrow text domains. In this research we present our improvements to current comparable corpora mining methodologies by re-implementation of the comparison algorithms (using Needleman-Wunsch algorithm), introduction of a tuning script and computation time improvement by GPU acceleration. Experiments are carried out on bilingual data extracted from the Wikipedia, on various domains. For the Wikipedia itself, additional cross-lingual comparison heuristics were introduced. The modifications made a positive impact on the quality and quantity of mined data and on the translation quality.

Submitted 5 December, 2015; originally announced December 2015.

Comments: arXiv admin note: text overlap with arXiv:1509.08639

Journal ref: Proceedings of the 12th IWSLT, Da Nang, Vietnam, December 3-4, 2015, p.118-125

arXiv:1512.01639 [pdf]

PJAIT Systems for the IWSLT 2015 Evaluation Campaign Enhanced by Comparable Corpora

Authors: Krzysztof Wołk, Krzysztof Marasek

Abstract: In this paper, we attempt to improve Statistical Machine Translation (SMT) systems on a very diverse set of language pairs (in both directions): Czech - English, Vietnamese - English, French - English and German - English. To accomplish this, we performed translation model training, created adaptations of training settings for each language pair, and obtained comparable corpora for our SMT systems. Innovative tools and data adaptation techniques were employed. The TED parallel text corpora for the IWSLT 2015 evaluation campaign were used to train language models, and to develop, tune, and test the system. In addition, we prepared Wikipedia-based comparable corpora for use with our SMT system. This data was specified as permissible for the IWSLT 2015 evaluation. We explored the use of domain adaptation techniques, symmetrized word alignment models, the unsupervised transliteration models and the KenLM language modeling tool. To evaluate the effects of different preparations on translation results, we conducted experiments and used the BLEU, NIST and TER metrics. Our results indicate that our approach produced a positive impact on SMT quality.

Submitted 5 December, 2015; originally announced December 2015.

Journal ref: Proceedings of the 12th International Workshop on Spoken Language Translation, Da Nang, Vietnam, December 3-4, 2015, p.101-104

arXiv:1511.09392 [pdf]

Enhancements in statistical spoken language translation by de-normalization of ASR results

Authors: Agnieszka Wołk, Krzysztof Wołk, Krzysztof Marasek

Abstract: Spoken language translation (SLT) has become very important in an increasingly globalized world. Machine translation (MT) for automatic speech recognition (ASR) systems is a major challenge of great interest. This research investigates automatic sentence segmentation of speech, which is important for enriching speech recognition output and for aiding downstream language processing. This article focuses on the automatic sentence segmentation of speech and improving MT results. We explore the problem of identifying sentence boundaries in the transcriptions produced by automatic speech recognition systems in the Polish language. We also experiment with reverse normalization of the recognized speech samples.

Submitted 18 November, 2015; originally announced November 2015.

Comments: International Academy Publishing. arXiv admin note: text overlap with arXiv:1510.04500

Journal ref: Journal of Computers, 2016 VOL 11, ISSN: 1796-203X, p. 33-40, 2016

arXiv:1511.07788 [pdf]

Spoken Language Translation for Polish

Authors: Krzysztof Marasek, Łukasz Brocki, Danijel Korzinek, Krzysztof Wołk, Ryszard Gubrynowicz

Abstract: Spoken language translation (SLT) is becoming more important in the increasingly globalized world, both from a social and economic point of view. It is one of the major challenges for automatic speech recognition (ASR) and machine translation (MT), driving intense research activities in these areas. While past research in SLT, due to technology limitations, dealt mostly with speech recorded under controlled conditions, today's major challenge is the translation of spoken language as it can be found in real life. Considered application scenarios range from portable translators for tourists, lectures and presentations translation, to broadcast news and shows with live captioning. We would like to present PJIIT's experiences in the SLT gained from the Eu-Bridge 7th framework project and the U-Star consortium activities for the Polish/English language pair. Presented research concentrates on ASR adaptation for Polish (state-of-the-art acoustic models: DBN-BLSTM training, Kaldi: LDA+MLLT+SAT+MMI), language modeling for ASR & MT (text normalization, RNN-based LMs, n-gram model domain interpolation) and statistical translation techniques (hierarchical models, factored translation models, automatic casing and punctuation, comparable and bilingual corpora preparation). While results for the well-defined domains (phrases for travelers, parliament speeches, medical documentation, movie subtitling) are very encouraging, less defined domains (presentation, lectures) still form a challenge. Our progress in the IWSLT TED task (MT only) will be presented, as well as current progress in the Polish ASR.

Submitted 24 November, 2015; originally announced November 2015.

Comments: Marasek K., Wołk K., Korzinek D., Brocki Ł., Spoken Language Translation for Polish, Proceedings of Forum Acusticum 2014, Kraków. arXiv admin note: substantial text overlap with arXiv:1509.08909

arXiv:1511.06285 [pdf]

doi 10.1007/978-3-319-25252-0_46

Harvesting comparable corpora and mining them for equivalent bilingual sentences using statistical classification and analogy-based heuristics

Authors: Krzysztof Wołk, Emilia Rejmund, Krzysztof Marasek

Abstract: Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from e.g. Wikipedia dumps and Euronews web page. The improvements in machine translation are shown on Polish-English language pair for various text domains. We also tested another method of building parallel corpora based on comparable corpora data. It automatically broadens an existing corpus of sentences on a given subject based on analogies found between them.

Submitted 18 November, 2015; originally announced November 2015.

Comments: Springer p. 433-441, 2015

arXiv:1510.04600 [pdf]

doi 10.1016/j.compmedimag.2015.09.005

Telemedicine as a special case of Machine Translation

Authors: Krzysztof Wołk, Krzysztof Marasek, Wojciech Glinkowski

Abstract: Machine translation is evolving quite rapidly in terms of quality. Nowadays, we have several machine translation systems available on the web, which provide reasonable translations. However, these systems are not perfect, and their quality may decrease in some specific domains. This paper examines the effects of different training methods when it comes to Polish - English Statistical Machine Translation system used for the medical data. Numerous elements of the EMEA parallel text corpora and not related OPUS Open Subtitles project were used as the ground for creation of phrase tables and different language models including the development, tuning and testing of these translation systems. The BLEU, NIST, METEOR, and TER metrics have been used in order to evaluate the results of various systems. Our experiments deal with the systems that include POS tagging, factored phrase models, hierarchical models, syntactic taggers, and other alignment methods. We also executed a deep analysis of Polish data as preparatory work before automatized data processing such as true casing or punctuation normalization phase. Normalized metrics were used to compare results. Scores lower than 15% mean that Machine Translation engine is unable to provide satisfying quality, scores greater than 30% mean that translations should be understandable without problems and scores over 50% reflect adequate translations. The average results of Polish to English translations scores for BLEU, NIST, METEOR, and TER were relatively high and ranged from 70.58 to 82.72. The lowest score was 64.38. The average results ranges for English to Polish translations were a little lower (67.58 - 78.97). The real-life implementations of presented high quality Machine Translation Systems are anticipated in general medical practice and telemedicine.

Submitted 15 October, 2015; originally announced October 2015.

arXiv:1509.09097 [pdf]

Polish - English Speech Statistical Machine Translation Systems for the IWSLT 2013

Authors: Krzysztof Wołk, Krzysztof Marasek

Abstract: This research explores the effects of various training settings from Polish to English Statistical Machine Translation system for spoken language. Various elements of the TED parallel text corpora for the IWSLT 2013 evaluation campaign were used as the basis for training of language models, and for development, tuning and testing of the translation system. The BLEU, NIST, METEOR and TER metrics were used to evaluate the effects of data preparations on translation results. Our experiments included systems, which use stems and morphological information on Polish words. We also conducted a deep analysis of provided Polish data as preparatory work for the automatic data correction and cleaning phase.

Submitted 30 September, 2015; originally announced September 2015.

Comments: statistical machine translation. arXiv admin note: substantial text overlap with arXiv:1509.08874, arXiv:1509.08909

Journal ref: Proceedings of the 10th International Workshop on Spoken Language Translation, Heidelberg, Germany, p. 113-119, 2013

arXiv:1509.09093 [pdf]

A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation

Authors: Krzysztof Wołk, Krzysztof Marasek

Abstract: Text alignment is crucial to the accuracy of Machine Translation (MT) systems, some NLP tools or any other text processing tasks requiring bilingual data. This research proposes a language independent sentence alignment approach based on Polish (not position-sensitive language) to English experiments. This alignment approach was developed on the TED Talks corpus, but can be used for any text domain or language pair. The proposed approach implements various heuristics for sentence recognition. Some of them value synonyms and semantic text structure analysis as a part of additional information. Minimization of data loss was ensured. The solution is compared to other sentence alignment implementations. Also an improvement in MT system score with text processed with described tool is shown.

Submitted 30 September, 2015; originally announced September 2015.

Comments: corpora filtration, text alignment, corpora improvement. arXiv admin note: text overlap with arXiv:1509.08881

Journal ref: Advances in Intelligent Systems and Computing volume 275, p.107-114, Publisher: Springer, ISSN 2194-5357, ISBN 978-3-319-05950-1, 2014

arXiv:1509.09090 [pdf]

Real-Time Statistical Speech Translation

Authors: Krzysztof Wołk, Krzysztof Marasek

Abstract: This research investigates Statistical Machine Translation approaches to translating speech automatically in real time. Such systems can be used in a pipeline with speech recognition and synthesis software in order to produce a real-time voice communication system between foreigners. We obtained three main data sets from spoken proceedings that represent three different types of human speech. TED, Europarl, and OPUS parallel text corpora were used as the basis for training of language models, for developmental tuning and testing of the translation system. We also conducted experiments involving part of speech tagging, compound splitting, linear language model interpolation, TrueCasing and morphosyntactic analysis. We evaluated the effects of a variety of data preparations on the translation results using the BLEU, NIST, METEOR and TER metrics and tried to determine which metric is most suitable for the PL-EN language pair.

Submitted 30 September, 2015; originally announced September 2015.

Comments: machine translation, polish english

Journal ref: Advances in Intelligent Systems and Computing volume 275, p.107-114, Publisher: Springer, ISSN 2194-5357, ISBN 978-3-319-05950-1, 2014

arXiv:1509.09088 [pdf]

Enhanced Bilingual Evaluation Understudy

Authors: Krzysztof Wołk, Krzysztof Marasek

Abstract: Our research extends the Bilingual Evaluation Understudy (BLEU) evaluation technique for statistical machine translation to make it more adjustable and robust. We intend to adapt it to resemble human evaluation more. We perform experiments to evaluate the performance of our technique against the primary existing evaluation methods. We describe and show the improvements it makes over existing methods as well as correlation to them. When human translators translate a text, they often use synonyms, different word orders or style, and other similar variations. We propose an SMT evaluation technique that enhances the BLEU metric to consider variations such as those.

Submitted 30 September, 2015; originally announced September 2015.

Comments: machine translation evaluation, enhanced bleu. in Lecture Notes on Information Theory, ISSN: 2301-3788, 2014

arXiv:1509.08909 [pdf]

Polish-English Statistical Machine Translation of Medical Texts

Authors: Krzysztof Wołk, Krzysztof Marasek

Abstract: This new research explores the effects of various training methods on a Polish to English Statistical Machine Translation system for medical texts. Various elements of the EMEA parallel text corpora from the OPUS project were used as the basis for training of phrase tables and language models and for development, tuning and testing of the translation system. The BLEU, NIST, METEOR, RIBES and TER metrics have been used to evaluate the effects of various system and data preparations on translation results. Our experiments included systems that used POS tagging, factored phrase models, hierarchical models, syntactic taggers, and many different alignment methods. We also conducted a deep analysis of Polish data as preparatory work for the automatic data correction phase, which includes steps such as true casing and punctuation normalization.

Submitted 29 September, 2015; originally announced September 2015.

Comments: New Research in Multimedia and Internet Systems, Springer. 09/2014, ISSN: 1867-5662. arXiv admin note: text overlap with arXiv:1509.08874

arXiv:1509.08881 [pdf]

doi 10.1016/j.protcy.2014.11.024

Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Authors: Krzysztof Wołk, Krzysztof Marasek

Abstract: Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from Wikipedia articles. We also introduce a method for extracting truly parallel sentences that are filtered out from noisy or just comparable sentence pairs. We describe our implementation of a specialized tool for this task as well as training and adaptation of a machine translation system that supplies our filter with additional information about the similarity of comparable sentence pairs.

Submitted 29 September, 2015; originally announced September 2015.

Journal ref: Procedia Technology, 18, Elsevier, p.126-132, 2014

arXiv:1509.08874 [pdf]

Polish - English Speech Statistical Machine Translation Systems for the IWSLT 2014

Authors: Krzysztof Wołk, Krzysztof Marasek

Abstract: This research explores the effects of various training settings on Polish and English Statistical Machine Translation systems for spoken language. Various elements of the TED parallel text corpora for the IWSLT 2014 evaluation campaign, as well as Wikipedia-based comparable corpora prepared by us, were used as the basis for training of language models, and for development, tuning and testing of the translation system. The BLEU, NIST, METEOR and TER metrics were used to evaluate the effects of data preparations on translation results. Our experiments included systems that use lemma and morphological information on Polish words. We also conducted a deep analysis of the provided Polish data as preparatory work for the automatic data correction and cleaning phase.

Submitted 29 September, 2015; originally announced September 2015.

Comments: Machine Translation, West Slavic, Proceedings of the 11th International Workshop on Spoken Language Translation, Lake Tahoe, USA, 2014. arXiv admin note: text overlap with arXiv:1409.0473 by other authors

arXiv:1509.08644 [pdf]

doi 10.1016/j.procs.2015.08.456

Neural-based machine translation for medical text domain. Based on European Medicines Agency leaflet texts

Authors: Krzysztof Wołk, Krzysztof Marasek

Abstract: The quality of machine translation is rapidly evolving. Today one can find several machine translation systems on the web that provide reasonable translations, although the systems are not perfect. In some specific domains, the quality may decrease. A recently proposed approach to this domain is neural machine translation. It aims at building a jointly-tuned single neural network that maximizes translation performance, a very different approach from traditional statistical machine translation. Recently proposed neural machine translation models often belong to the encoder-decoder family in which a source sentence is encoded into a fixed length vector that is, in turn, decoded to generate a translation. The present research examines the effects of different training methods on a Polish-English Machine Translation system used for medical data. The European Medicines Agency parallel text corpus was used as the basis for training of neural and statistical network-based translation systems. The main machine translation evaluation metrics have also been used in analysis of the systems. A comparison and implementation of a real-time medical translator is the main focus of our experiments.

Submitted 29 September, 2015; originally announced September 2015.

Comments: machine translation, statistical machine translation, neural machine translation, nlp, text processing, medical communication

Journal ref: Procedia Computer Science, 2015, 64: 2-9

arXiv:1509.08639 [pdf]

Tuned and GPU-accelerated parallel data mining from comparable corpora

Authors: Krzysztof Wołk, Krzysztof Marasek

Abstract: The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely-available resource, but they are limited and do not provide enough coverage for good quality translation purposes, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which are unfortunately dependent on the quantity and quality of training data. Such data has very limited availability, especially for some languages and very narrow text domains. In this research we present our improvements to the Yalign mining methodology by reimplementing the comparison algorithm, introducing tuning scripts, and improving performance using GPU computing acceleration. The experiments are conducted on various text domains and bi-data is extracted from the Wikipedia dumps.

Submitted 29 September, 2015; originally announced September 2015.

Comments: Machine translation, comparable corpora, Machine learning, NLP, Knowledge-free learning, Unsupervised bi-lingual data mining

Journal ref: Lecture Notes in Artificial Intelligence, p. 32-40, ISBN: 978-3-319-24032-9, Springer, 2015

Showing 1–24 of 24 results for author: Marasek, K