-
VR with Older Adults: Participatory Design of a Virtual ATM Training Simulation
Authors:
Wiesław Kopeć,
Marcin Wichrowski,
Krzysztof Kalinowski,
Anna Jaskulska,
Kinga Skorupska,
Daniel Cnotkowski,
Jakub Tyszka,
Agata Popieluch,
Anna Voitenkova,
Rafał Masłyk,
Piotr Gago,
Maciej Krzywicki,
Monika Kornacka,
Cezary Biele,
Paweł Kobyliński,
Jarosław Kowalski,
Katarzyna Abramczuk,
Aldona Zdrodowska,
Grzegorz Pochwatko,
Jakub Możaryn,
Krzysztof Marasek
Abstract:
In this paper we report on a study conducted with a group of older adults in which they engaged in participatory design workshops to create a VR ATM training simulation. Based on observation, recordings and the developed VR application we present the results of the workshops and offer considerations and recommendations for organizing opportunities for end users, in this case older adults, to direc…
▽ More
In this paper we report on a study conducted with a group of older adults in which they engaged in participatory design workshops to create a VR ATM training simulation. Based on observation, recordings and the developed VR application we present the results of the workshops and offer considerations and recommendations for organizing opportunities for end users, in this case older adults, to directly engage in co-creation of cutting-edge ICT solutions. These include co-designing interfaces and interaction schemes for emerging technologies like VR and AR. We discuss such aspects as user engagement and hardware and software tools suitable for participatory prototy** of VR applications. Finally, we present ideas for further research in the area of VR participatory prototy** with users of various proficiency levels, taking steps towards develo** a unified framework for co-design in AR and VR.
△ Less
Submitted 1 November, 2019;
originally announced November 2019.
-
Older Adults and Voice Interaction: A Pilot Study with Google Home
Authors:
Jarosław Kowalski,
Anna Jaskulska,
Kinga Skorupska,
Katarzyna Abramczuk,
Cezary Biele,
Wiesław Kopeć,
Krzysztof Marasek
Abstract:
In this paper we present the results of an exploratory study examining the potential of voice assistants (VA) for some groups of older adults in the context of Smart Home Technology (SHT). To research the aspect of older adults' interaction with voice user interfaces (VUI) we organized two workshops and gathered insights concerning possible benefits and barriers to the use of VA combined with SHT…
▽ More
In this paper we present the results of an exploratory study examining the potential of voice assistants (VA) for some groups of older adults in the context of Smart Home Technology (SHT). To research the aspect of older adults' interaction with voice user interfaces (VUI) we organized two workshops and gathered insights concerning possible benefits and barriers to the use of VA combined with SHT by older adults. Apart from evaluating the participants' interaction with the devices during the two workshops we also discuss some improvements to the VA interaction paradigm.
△ Less
Submitted 17 March, 2019;
originally announced March 2019.
-
Hybrid Approach to Automation, RPA and Machine Learning: a Method for the Human-centered Design of Software Robots
Authors:
Wiesław Kopeć,
Marcin Skibiński,
Cezary Biele,
Kinga Skorupska,
Dominika Tkaczyk,
Anna Jaskulska,
Katarzyna Abramczuk,
Piotr Gago,
Krzysztof Marasek
Abstract:
One of the more prominent trends within Industry 4.0 is the drive to employ Robotic Process Automation (RPA), especially as one of the elements of the Lean approach. The full implementation of RPA is riddled with challenges relating both to the reality of everyday business operations, from SMEs to SSCs and beyond, and the social effects of the changing job market. To successfully address these poi…
▽ More
One of the more prominent trends within Industry 4.0 is the drive to employ Robotic Process Automation (RPA), especially as one of the elements of the Lean approach. The full implementation of RPA is riddled with challenges relating both to the reality of everyday business operations, from SMEs to SSCs and beyond, and the social effects of the changing job market. To successfully address these points there is a need to develop a solution that would adjust to the existing business operations and at the same time lower the negative social impact of the automation process.
To achieve these goals we propose a hybrid, human-centered approach to the development of software robots. This design and implementation method combines the Living Lab approach with empowerment through participatory design to kick-start the co-development and co-maintenance of hybrid software robots which, supported by variety of AI methods and tools, including interactive and collaborative ML in the cloud, transform menial job posts into higher-skilled positions, allowing former employees to stay on as robot co-designers and maintainers, i.e. as co-programmers who supervise the machine learning processes with the use of tailored high-level RPA Domain Specific Languages (DSLs) to adjust the functioning of the robots and maintain operational flexibility.
△ Less
Submitted 6 November, 2018;
originally announced November 2018.
-
Impact of ultrasound image reconstruction method on breast lesion classification with neural transfer learning
Authors:
Michal Byra,
Tomasz Sznajder,
Danijel Korzinek,
Hanna Piotrzkowska-Wroblewska,
Katarzyna Dobruch-Sobczak,
Andrzej Nowicki,
Krzysztof Marasek
Abstract:
Deep learning algorithms, especially convolutional neural networks, have become a methodology of choice in medical image analysis. However, recent studies in computer vision show that even a small modification of input image intensities may cause a deep learning model to classify the image differently. In medical imaging, the distribution of image intensities is related to applied image reconstruc…
▽ More
Deep learning algorithms, especially convolutional neural networks, have become a methodology of choice in medical image analysis. However, recent studies in computer vision show that even a small modification of input image intensities may cause a deep learning model to classify the image differently. In medical imaging, the distribution of image intensities is related to applied image reconstruction algorithm. In this paper we investigate the impact of ultrasound image reconstruction method on breast lesion classification with neural transfer learning. Due to high dynamic range raw ultrasonic signals are commonly compressed in order to reconstruct B-mode images. Based on raw data acquired from breast lesions, we reconstruct B-mode images using different compression levels. Next, transfer learning is applied for classification. Differently reconstructed images are employed for training and evaluation. We show that the modification of the reconstruction algorithm leads to decrease of classification performance. As a remedy, we propose a method of data augmentation. We show that the augmentation of the training set with differently reconstructed B-mode images leads to a more robust and efficient classification. Our study suggests that it is important to take into account image reconstruction algorithms implemented in medical scanners during development of computer aided diagnosis systems.
△ Less
Submitted 5 April, 2018;
originally announced April 2018.
-
DeepStyle: Multimodal Search Engine for Fashion and Interior Design
Authors:
Ivona Tautkute,
Tomasz Trzcinski,
Aleksander Skorupa,
Lukasz Brocki,
Krzysztof Marasek
Abstract:
In this paper, we propose a multimodal search engine that combines visual and textual cues to retrieve items from a multimedia database aesthetically similar to the query. The goal of our engine is to enable intuitive retrieval of fashion merchandise such as clothes or furniture. Existing search engines treat textual input only as an additional source of information about the query image and do no…
▽ More
In this paper, we propose a multimodal search engine that combines visual and textual cues to retrieve items from a multimedia database aesthetically similar to the query. The goal of our engine is to enable intuitive retrieval of fashion merchandise such as clothes or furniture. Existing search engines treat textual input only as an additional source of information about the query image and do not correspond to the real-life scenario where the user looks for 'the same shirt but of denim'. Our novel method, dubbed DeepStyle, mitigates those shortcomings by using a joint neural network architecture to model contextual dependencies between features of different modalities. We prove the robustness of this approach on two different challenging datasets of fashion items and furniture where our DeepStyle engine outperforms baseline methods by 18-21% on the tested datasets. Our search engine is commercially deployed and available through a Web-based application.
△ Less
Submitted 20 February, 2019; v1 submitted 8 January, 2018;
originally announced January 2018.
-
What Looks Good with my Sofa: Multimodal Search Engine for Interior Design
Authors:
Ivona Tautkute,
Aleksandra Możejko,
Wojciech Stokowiec,
Tomasz Trzciński,
Łukasz Brocki,
Krzysztof Marasek
Abstract:
In this paper, we propose a multi-modal search engine for interior design that combines visual and textual queries. The goal of our engine is to retrieve interior objects, e.g. furniture or wall clocks, that share visual and aesthetic similarities with the query. Our search engine allows the user to take a photo of a room and retrieve with a high recall a list of items identical or visually simila…
▽ More
In this paper, we propose a multi-modal search engine for interior design that combines visual and textual queries. The goal of our engine is to retrieve interior objects, e.g. furniture or wall clocks, that share visual and aesthetic similarities with the query. Our search engine allows the user to take a photo of a room and retrieve with a high recall a list of items identical or visually similar to those present in the photo. Additionally, it allows to return other items that aesthetically and stylistically fit well together. To achieve this goal, our system blends the results obtained using textual and visual modalities. Thanks to this blending strategy, we increase the average style similarity score of the retrieved items by 11%. Our work is implemented as a Web-based application and it is planned to be opened to the public.
△ Less
Submitted 8 January, 2018; v1 submitted 21 July, 2017;
originally announced July 2017.
-
Shallow reading with Deep Learning: Predicting popularity of online content using only its title
Authors:
Wociech Stokowiec,
Tomasz Trzcinski,
Krzysztof Wolk,
Krzysztof Marasek,
Przemyslaw Rokita
Abstract:
With the ever decreasing attention span of contemporary Internet users, the title of online content (such as a news article or video) can be a major factor in determining its popularity. To take advantage of this phenomenon, we propose a new method based on a bidirectional Long Short-Term Memory (LSTM) neural network designed to predict the popularity of online content using only its title. We eva…
▽ More
With the ever decreasing attention span of contemporary Internet users, the title of online content (such as a news article or video) can be a major factor in determining its popularity. To take advantage of this phenomenon, we propose a new method based on a bidirectional Long Short-Term Memory (LSTM) neural network designed to predict the popularity of online content using only its title. We evaluate the proposed architecture on two distinct datasets of news articles and news videos distributed in social media that contain over 40,000 samples in total. On those datasets, our approach improves the performance over traditional shallow approaches by a margin of 15%. Additionally, we show that using pre-trained word vectors in the embedding layer improves the results of LSTM models, especially when the training set is small. To our knowledge, this is the first attempt of applying popularity prediction using only textual information from the title.
△ Less
Submitted 21 July, 2017;
originally announced July 2017.
-
Polish Read Speech Corpus for Speech Tools and Services
Authors:
Danijel Koržinek,
Krzysztof Marasek,
Łukasz Brocki,
Krzysztof Wołk
Abstract:
This paper describes the speech processing activities conducted at the Polish consortium of the CLARIN project. The purpose of this segment of the project was to develop specific tools that would allow for automatic and semi-automatic processing of large quantities of acoustic speech data. The tools include the following: grapheme-to-phoneme conversion, speech-to-text alignment, voice activity det…
▽ More
This paper describes the speech processing activities conducted at the Polish consortium of the CLARIN project. The purpose of this segment of the project was to develop specific tools that would allow for automatic and semi-automatic processing of large quantities of acoustic speech data. The tools include the following: grapheme-to-phoneme conversion, speech-to-text alignment, voice activity detection, speaker diarization, keyword spotting and automatic speech transcription. Furthermore, in order to develop these tools, a large high-quality studio speech corpus was recorded and released under an open license, to encourage development in the area of Polish speech research. Another purpose of the corpus was to serve as a reference for studies in phonetics and pronunciation. All the tools and resources were released on the the Polish CLARIN website. This paper discusses the current status and future plans for the project.
△ Less
Submitted 1 June, 2017;
originally announced June 2017.
-
Multi-domain machine translation enhancements by parallel data extraction from comparable corpora
Authors:
Krzysztof Wołk,
Emilia Rejmund,
Krzysztof Marasek
Abstract:
Parallel texts are a relatively rare language resource, however, they constitute a very useful research material with a wide range of applications. This study presents and analyses new methodologies we developed for obtaining such data from previously built comparable corpora. The methodologies are automatic and unsupervised which makes them good for large scale research. The task is highly practi…
▽ More
Parallel texts are a relatively rare language resource, however, they constitute a very useful research material with a wide range of applications. This study presents and analyses new methodologies we developed for obtaining such data from previously built comparable corpora. The methodologies are automatic and unsupervised which makes them good for large scale research. The task is highly practical as non-parallel multilingual data occur much more frequently than parallel corpora and accessing them is easy, although parallel sentences are a considerably more useful resource. In this study, we propose a method of automatic web crawling in order to build topic-aligned comparable corpora, e.g. based on the Wikipedia or Euronews.com. We also developed new methods of obtaining parallel sentences from comparable data and proposed methods of filtration of corpora capable of selecting inconsistent or only partially equivalent translations. Our methods are easily scalable to other languages. Evaluation of the quality of the created corpora was performed by analysing the impact of their use on statistical machine translation systems. Experiments were presented on the basis of the Polish-English language pair for texts from different domains, i.e. lectures, phrasebooks, film dialogues, European Parliament proceedings and texts contained medicines leaflets. We also tested a second method of creating parallel corpora based on data from comparable corpora which allows for automatically expanding the existing corpus of sentences about a given domain on the basis of analogies found between them. It does not require, therefore, having past parallel resources in order to train a classifier.
△ Less
Submitted 22 March, 2016;
originally announced March 2016.
-
Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents
Authors:
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely-available resource, but they are limited and do not provide enough coverage for good quality translation purposes, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which are unfortunately dependent on t…
▽ More
The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely-available resource, but they are limited and do not provide enough coverage for good quality translation purposes, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which are unfortunately dependent on the quantity and quality of training data. Such systems have a very limited availability especially for some languages and very narrow text domains. In this research we present our improvements to current comparable corpora mining methodologies by re- implementation of the comparison algorithms (using Needleman-Wunch algorithm), introduction of a tuning script and computation time improvement by GPU acceleration. Experiments are carried out on bilingual data extracted from the Wikipedia, on various domains. For the Wikipedia itself, additional cross-lingual comparison heuristics were introduced. The modifications made a positive impact on the quality and quantity of mined data and on the translation quality.
△ Less
Submitted 5 December, 2015;
originally announced December 2015.
-
PJAIT Systems for the IWSLT 2015 Evaluation Campaign Enhanced by Comparable Corpora
Authors:
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
In this paper, we attempt to improve Statistical Machine Translation (SMT) systems on a very diverse set of language pairs (in both directions): Czech - English, Vietnamese - English, French - English and German - English. To accomplish this, we performed translation model training, created adaptations of training settings for each language pair, and obtained comparable corpora for our SMT systems…
▽ More
In this paper, we attempt to improve Statistical Machine Translation (SMT) systems on a very diverse set of language pairs (in both directions): Czech - English, Vietnamese - English, French - English and German - English. To accomplish this, we performed translation model training, created adaptations of training settings for each language pair, and obtained comparable corpora for our SMT systems. Innovative tools and data adaptation techniques were employed. The TED parallel text corpora for the IWSLT 2015 evaluation campaign were used to train language models, and to develop, tune, and test the system. In addition, we prepared Wikipedia-based comparable corpora for use with our SMT system. This data was specified as permissible for the IWSLT 2015 evaluation. We explored the use of domain adaptation techniques, symmetrized word alignment models, the unsupervised transliteration models and the KenLM language modeling tool. To evaluate the effects of different preparations on translation results, we conducted experiments and used the BLEU, NIST and TER metrics. Our results indicate that our approach produced a positive impact on SMT quality.
△ Less
Submitted 5 December, 2015;
originally announced December 2015.
-
Enhancements in statistical spoken language translation by de-normalization of ASR results
Authors:
Agnieszka Wołk,
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
Spoken language translation (SLT) has become very important in an increasingly globalized world. Machine translation (MT) for automatic speech recognition (ASR) systems is a major challenge of great interest. This research investigates that automatic sentence segmentation of speech that is important for enriching speech recognition output and for aiding downstream language processing. This article…
▽ More
Spoken language translation (SLT) has become very important in an increasingly globalized world. Machine translation (MT) for automatic speech recognition (ASR) systems is a major challenge of great interest. This research investigates that automatic sentence segmentation of speech that is important for enriching speech recognition output and for aiding downstream language processing. This article focuses on the automatic sentence segmentation of speech and improving MT results. We explore the problem of identifying sentence boundaries in the transcriptions produced by automatic speech recognition systems in the Polish language. We also experiment with reverse normalization of the recognized speech samples.
△ Less
Submitted 18 November, 2015;
originally announced November 2015.
-
Spoken Language Translation for Polish
Authors:
Krzysztof Marasek,
Łukasz Brocki,
Danijel Korzinek,
Krzysztof Wołk,
Ryszard Gubrynowicz
Abstract:
Spoken language translation (SLT) is becoming more important in the increasingly globalized world, both from a social and economic point of view. It is one of the major challenges for automatic speech recognition (ASR) and machine translation (MT), driving intense research activities in these areas. While past research in SLT, due to technology limitations, dealt mostly with speech recorded under…
▽ More
Spoken language translation (SLT) is becoming more important in the increasingly globalized world, both from a social and economic point of view. It is one of the major challenges for automatic speech recognition (ASR) and machine translation (MT), driving intense research activities in these areas. While past research in SLT, due to technology limitations, dealt mostly with speech recorded under controlled conditions, today's major challenge is the translation of spoken language as it can be found in real life. Considered application scenarios range from portable translators for tourists, lectures and presentations translation, to broadcast news and shows with live captioning. We would like to present PJIIT's experiences in the SLT gained from the Eu-Bridge 7th framework project and the U-Star consortium activities for the Polish/English language pair. Presented research concentrates on ASR adaptation for Polish (state-of-the-art acoustic models: DBN-BLSTM training, Kaldi: LDA+MLLT+SAT+MMI), language modeling for ASR & MT (text normalization, RNN-based LMs, n-gram model domain interpolation) and statistical translation techniques (hierarchical models, factored translation models, automatic casing and punctuation, comparable and bilingual corpora preparation). While results for the well-defined domains (phrases for travelers, parliament speeches, medical documentation, movie subtitling) are very encouraging, less defined domains (presentation, lectures) still form a challenge. Our progress in the IWSLT TED task (MT only) will be presented, as well as current progress in the Polish ASR.
△ Less
Submitted 24 November, 2015;
originally announced November 2015.
-
Harvesting comparable corpora and mining them for equivalent bilingual sentences using statistical classification and analogy- based heuristics
Authors:
Krzysztof Wołk,
Emilia Rejmund,
Krzysztof Marasek
Abstract:
Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, bu…
▽ More
Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from e.g. Wikipedia dumps and Euronews web page. The improvements in machine translation are shown on Polish-English language pair for various text domains. We also tested another method of building parallel corpora based on comparable corpora data. It lets automatically broad existing corpus of sentences from subject of corpora based on analogies between them.
△ Less
Submitted 18 November, 2015;
originally announced November 2015.
-
Telemedicine as a special case of Machine Translation
Authors:
Krzysztof Wołk,
Krzysztof Marasek,
Wojciech Glinkowski
Abstract:
Machine translation is evolving quite rapidly in terms of quality. Nowadays, we have several machine translation systems available in the web, which provide reasonable translations. However, these systems are not perfect, and their quality may decrease in some specific domains. This paper examines the effects of different training methods when it comes to Polish - English Statistical Machine Trans…
▽ More
Machine translation is evolving quite rapidly in terms of quality. Nowadays, we have several machine translation systems available in the web, which provide reasonable translations. However, these systems are not perfect, and their quality may decrease in some specific domains. This paper examines the effects of different training methods when it comes to Polish - English Statistical Machine Translation system used for the medical data. Numerous elements of the EMEA parallel text corpora and not related OPUS Open Subtitles project were used as the ground for creation of phrase tables and different language models including the development, tuning and testing of these translation systems. The BLEU, NIST, METEOR, and TER metrics have been used in order to evaluate the results of various systems. Our experiments deal with the systems that include POS tagging, factored phrase models, hierarchical models, syntactic taggers, and other alignment methods. We also executed a deep analysis of Polish data as preparatory work before automatized data processing such as true casing or punctuation normalization phase. Normalized metrics was used to compare results. Scores lower than 15% mean that Machine Translation engine is unable to provide satisfying quality, scores greater than 30% mean that translations should be understandable without problems and scores over 50 reflect adequate translations. The average results of Polish to English translations scores for BLEU, NIST, METEOR, and TER were relatively high and ranged from 70,58 to 82,72. The lowest score was 64,38. The average results ranges for English to Polish translations were little lower (67,58 - 78,97). The real-life implementations of presented high quality Machine Translation Systems are anticipated in general medical practice and telemedicine.
△ Less
Submitted 15 October, 2015;
originally announced October 2015.
-
Polish - English Speech Statistical Machine Translation Systems for the IWSLT 2013
Authors:
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
This research explores the effects of various training settings from Polish to English Statistical Machine Translation system for spoken language. Various elements of the TED parallel text corpora for the IWSLT 2013 evaluation campaign were used as the basis for training of language models, and for development, tuning and testing of the translation system. The BLEU, NIST, METEOR and TER metrics we…
▽ More
This research explores the effects of various training settings from Polish to English Statistical Machine Translation system for spoken language. Various elements of the TED parallel text corpora for the IWSLT 2013 evaluation campaign were used as the basis for training of language models, and for development, tuning and testing of the translation system. The BLEU, NIST, METEOR and TER metrics were used to evaluate the effects of data preparations on translation results. Our experiments included systems, which use stems and morphological information on Polish words. We also conducted a deep analysis of provided Polish data as preparatory work for the automatic data correction and cleaning phase.
△ Less
Submitted 30 September, 2015;
originally announced September 2015.
-
A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation
Authors:
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
Text alignment is crucial to the accuracy of Machine Translation (MT) systems, some NLP tools or any other text processing tasks requiring bilingual data. This research proposes a language independent sentence alignment approach based on Polish (not position-sensitive language) to English experiments. This alignment approach was developed on the TED Talks corpus, but can be used for any text domai…
▽ More
Text alignment is crucial to the accuracy of Machine Translation (MT) systems, some NLP tools or any other text processing tasks requiring bilingual data. This research proposes a language independent sentence alignment approach based on Polish (not position-sensitive language) to English experiments. This alignment approach was developed on the TED Talks corpus, but can be used for any text domain or language pair. The proposed approach implements various heuristics for sentence recognition. Some of them value synonyms and semantic text structure analysis as a part of additional information. Minimization of data loss was ensured. The solution is compared to other sentence alignment implementations. Also an improvement in MT system score with text processed with described tool is shown.
△ Less
Submitted 30 September, 2015;
originally announced September 2015.
-
Real-Time Statistical Speech Translation
Authors:
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
This research investigates the Statistical Machine Translation approaches to translate speech in real time automatically. Such systems can be used in a pipeline with speech recognition and synthesis software in order to produce a real-time voice communication system between foreigners. We obtained three main data sets from spoken proceedings that represent three different types of human speech. TE…
▽ More
This research investigates the Statistical Machine Translation approaches to translate speech in real time automatically. Such systems can be used in a pipeline with speech recognition and synthesis software in order to produce a real-time voice communication system between foreigners. We obtained three main data sets from spoken proceedings that represent three different types of human speech. TED, Europarl, and OPUS parallel text corpora were used as the basis for training of language models, for developmental tuning and testing of the translation system. We also conducted experiments involving part of speech tagging, compound splitting, linear language model interpolation, TrueCasing and morphosyntactic analysis. We evaluated the effects of variety of data preparations on the translation results using the BLEU, NIST, METEOR and TER metrics and tried to give answer which metric is most suitable for PL-EN language pair.
△ Less
Submitted 30 September, 2015;
originally announced September 2015.
-
Enhanced Bilingual Evaluation Understudy
Authors:
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
Our research extends the Bilingual Evaluation Understudy (BLEU) evaluation technique for statistical machine translation to make it more adjustable and robust. We intend to adapt it to resemble human evaluation more. We perform experiments to evaluate the performance of our technique against the primary existing evaluation methods. We describe and show the improvements it makes over existing metho…
▽ More
Our research extends the Bilingual Evaluation Understudy (BLEU) evaluation technique for statistical machine translation to make it more adjustable and robust. We intend to adapt it to resemble human evaluation more. We perform experiments to evaluate the performance of our technique against the primary existing evaluation methods. We describe and show the improvements it makes over existing methods as well as correlation to them. When human translators translate a text, they often use synonyms, different word orders or style, and other similar variations. We propose an SMT evaluation technique that enhances the BLEU metric to consider variations such as those.
△ Less
Submitted 30 September, 2015;
originally announced September 2015.
-
Polish -English Statistical Machine Translation of Medical Texts
Authors:
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
This new research explores the effects of various training methods on a Polish to English Statistical Machine Translation system for medical texts. Various elements of the EMEA parallel text corpora from the OPUS project were used as the basis for training of phrase tables and language models and for development, tuning and testing of the translation system. The BLEU, NIST, METEOR, RIBES and TER m…
▽ More
This new research explores the effects of various training methods on a Polish to English Statistical Machine Translation system for medical texts. Various elements of the EMEA parallel text corpora from the OPUS project were used as the basis for training of phrase tables and language models and for development, tuning and testing of the translation system. The BLEU, NIST, METEOR, RIBES and TER metrics have been used to evaluate the effects of various system and data preparations on translation results. Our experiments included systems that used POS tagging, factored phrase models, hierarchical models, syntactic taggers, and many different alignment methods. We also conducted a deep analysis of Polish data as preparatory work for automatic data correction such as true casing and punctuation normalization phase.
△ Less
Submitted 29 September, 2015;
originally announced September 2015.
-
Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs
Authors:
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, but para…
▽ More
Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from Wikipedia articles. We also introduce a method for extracting truly parallel sentences that are filtered out from noisy or just comparable sentence pairs. We describe our implementation of a specialized tool for this task as well as training and adaption of a machine translation system that supplies our filter with additional information about the similarity of comparable sentence pairs.
△ Less
Submitted 29 September, 2015;
originally announced September 2015.
-
Polish - English Speech Statistical Machine Translation Systems for the IWSLT 2014
Authors:
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
This research explores effects of various training settings between Polish and English Statistical Machine Translation systems for spoken language. Various elements of the TED parallel text corpora for the IWSLT 2014 evaluation campaign were used as the basis for training of language models, and for development, tuning and testing of the translation system as well as Wikipedia based comparable cor…
▽ More
This research explores effects of various training settings between Polish and English Statistical Machine Translation systems for spoken language. Various elements of the TED parallel text corpora for the IWSLT 2014 evaluation campaign were used as the basis for training of language models, and for development, tuning and testing of the translation system as well as Wikipedia based comparable corpora prepared by us. The BLEU, NIST, METEOR and TER metrics were used to evaluate the effects of data preparations on translation results. Our experiments included systems, which use lemma and morphological information on Polish words. We also conducted a deep analysis of provided Polish data as preparatory work for the automatic data correction and cleaning phase.
△ Less
Submitted 29 September, 2015;
originally announced September 2015.
-
Neural-based machine translation for medical text domain. Based on European Medicines Agency leaflet texts
Authors:
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
The quality of machine translation is rapidly evolving. Today one can find several machine translation systems on the web that provide reasonable translations, although the systems are not perfect. In some specific domains, the quality may decrease. A recently proposed approach to this domain is neural machine translation. It aims at building a jointly-tuned single neural network that maximizes tr…
▽ More
The quality of machine translation is rapidly evolving. Today one can find several machine translation systems on the web that provide reasonable translations, although the systems are not perfect. In some specific domains, the quality may decrease. A recently proposed approach to this domain is neural machine translation. It aims at building a jointly-tuned single neural network that maximizes translation performance, a very different approach from traditional statistical machine translation. Recently proposed neural machine translation models often belong to the encoder-decoder family in which a source sentence is encoded into a fixed length vector that is, in turn, decoded to generate a translation. The present research examines the effects of different training methods on a Polish-English Machine Translation system used for medical data. The European Medicines Agency parallel text corpus was used as the basis for training of neural and statistical network-based translation systems. The main machine translation evaluation metrics have also been used in analysis of the systems. A comparison and implementation of a real-time medical translator is the main focus of our experiments.
△ Less
Submitted 29 September, 2015;
originally announced September 2015.
-
Tuned and GPU-accelerated parallel data mining from comparable corpora
Authors:
Krzysztof Wołk,
Krzysztof Marasek
Abstract:
The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely-available resource, but they are limited and do not provide enough coverage for good quality translation purposes, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which are unfortunately dependent on t…
▽ More
The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely-available resource, but they are limited and do not provide enough coverage for good quality translation purposes, due to out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which are unfortunately dependent on the quantity and quality of training data. Such has a very limited availability especially for some languages and very narrow text domains. Is this research we present our improvements to Yalign mining methodology by reimplementing the comparison algorithm, introducing a tuning scripts and by improving performance using GPU computing acceleration. The experiments are conducted on various text domains and bi-data is extracted from the Wikipedia dumps.
△ Less
Submitted 29 September, 2015;
originally announced September 2015.