Search | arXiv e-print repository

Attention-based Vocabulary Selection for NMT Decoding

Authors: Baskaran Sankaran, Markus Freitag, Yaser Al-Onaizan

Abstract: Neural Machine Translation (NMT) models usually use large target vocabulary sizes to capture most of the words in the target language. The vocabulary size is a big factor when decoding new sentences, as the final softmax layer normalizes over all possible target words. To address this problem, it is common to restrict the target vocabulary with candidate lists based on the source sentence. Usually, the candidate lists are a combination of external word-to-word aligner output, phrase table entries, or the most frequent words. In this work, we propose a simple yet novel approach to learn candidate lists directly from the attention layer during NMT training. The candidate lists are highly optimized for the current NMT model and do not need any external computation of the candidate pool. We show significant decoding speedup compared with using the entire vocabulary, without losing any translation quality for two language pairs.

Submitted 12 June, 2017; originally announced June 2017.

Comments: Submitted to Second Conference on Machine Translation (WMT-17); 7 pages

arXiv:1702.01802 [pdf, ps, other]

Ensemble Distillation for Neural Machine Translation

Authors: Markus Freitag, Yaser Al-Onaizan, Baskaran Sankaran

Abstract: Knowledge distillation describes a method for training a student network to perform better by learning from a stronger teacher network. Translating a sentence with a Neural Machine Translation (NMT) engine is time-consuming, and having a smaller model speeds up this process. We demonstrate how to transfer the translation quality of an ensemble and an oracle BLEU teacher network into a single NMT system. Further, we present translation improvements from a teacher network that has the same architecture and dimensions as the student network. As the training of the student model is still expensive, we introduce a data filtering method based on the knowledge of the teacher model that not only speeds up the training, but also leads to better translation quality. Our techniques need no code change and can be easily reproduced with any NMT architecture to speed up the decoding process.

Submitted 7 August, 2017; v1 submitted 6 February, 2017; originally announced February 2017.

arXiv:1608.02927 [pdf, other]

Temporal Attention Model for Neural Machine Translation

Authors: Baskaran Sankaran, Haitao Mi, Yaser Al-Onaizan, Abe Ittycheriah

Abstract: Attention-based Neural Machine Translation (NMT) models suffer from attention deficiency issues, as has been observed in recent research. We propose a novel mechanism to address some of these limitations and improve the NMT attention. Specifically, our approach memorizes the alignments temporally (within each sentence) and modulates the attention with the accumulated temporal memory as the decoder generates the candidate translation. We compare our approach against the baseline NMT model and two other related approaches that address this issue either explicitly or implicitly. Large-scale experiments on two language pairs show that our approach achieves better and more robust gains over the baseline and related NMT approaches. Our model further outperforms strong SMT baselines in some settings even without using ensembles.

Submitted 9 August, 2016; originally announced August 2016.

Comments: 8 pages

arXiv:1606.04164 [pdf, ps, other]

Zero-Resource Translation with Multi-Lingual Neural Machine Translation

Authors: Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, Kyunghyun Cho

Abstract: In this paper, we propose a novel finetuning algorithm for the recently introduced multi-way, multilingual neural machine translation model that enables zero-resource machine translation. When used together with novel many-to-one translation strategies, we empirically show that this finetuning algorithm allows the multi-way, multilingual model to translate a zero-resource language pair (1) as well as a single-pair neural translation model trained with up to 1M direct parallel sentences of the same language pair and (2) better than a pivot-based translation strategy, while keeping only one additional copy of attention-related parameters.

Submitted 13 June, 2016; originally announced June 2016.

arXiv:1605.03148 [pdf, other]

Coverage Embedding Models for Neural Machine Translation

Authors: Haitao Mi, Baskaran Sankaran, Zhiguo Wang, Abe Ittycheriah

Abstract: In this paper, we enhance attention-based neural machine translation (NMT) by adding explicit coverage embedding models to alleviate the issues of repeating and dropping translations in NMT. For each source word, our model starts with a full coverage embedding vector to track the coverage status, and then keeps updating it with neural networks as the translation proceeds. Experiments on the large-scale Chinese-to-English task show that our enhanced model improves the translation quality significantly on various test sets over a strong large-vocabulary NMT system.

Submitted 29 August, 2016; v1 submitted 10 May, 2016; originally announced May 2016.

Comments: 6 pages; In Proceedings of EMNLP 2016

arXiv:1604.03670 [pdf, other]

doi 10.1109/TRO.2017.2721939

Interactive Perception: Leveraging Action in Perception and Perception in Action

Authors: Jeannette Bohg, Karol Hausman, Bharath Sankaran, Oliver Brock, Danica Kragic, Stefan Schaal, Gaurav Sukhatme

Abstract: Recent approaches in robotics follow the insight that perception is facilitated by interaction with the environment. These approaches are subsumed under the term Interactive Perception (IP). It provides the following benefits: (i) interaction with the environment creates a rich sensory signal that would otherwise not be present and (ii) knowledge of the regularity in the combined space of sensory data and action parameters facilitates the prediction and interpretation of the signal. In this survey we postulate this as a principle and collect evidence in support by analyzing and categorizing existing work in this area. We also provide an overview of the most important applications of Interactive Perception. We close this survey by discussing remaining open questions. Thereby, we hope to define a field and inspire future work.

Submitted 5 December, 2017; v1 submitted 13 April, 2016; originally announced April 2016.

Comments: Equal contribution by first three authors

Journal ref: IEEE Transactions on Robotics 33 (2017) 1273-1291

arXiv:1505.01576 [pdf, other]

Learning and Optimization with Submodular Functions

Authors: Bharath Sankaran, Marjan Ghazvininejad, Xinran He, David Kale, Liron Cohen

Abstract: In many naturally occurring optimization problems one needs to ensure that the definition of the optimization problem lends itself to solutions that are tractable to compute. In cases where exact solutions cannot be computed tractably, it is beneficial to have strong guarantees on the tractable approximate solutions. In order to operate under these criteria, most optimization problems are cast under the umbrella of convexity or submodularity. In this report we will study design and optimization over a common class of functions called submodular functions. Set functions, and specifically submodular set functions, characterize a wide variety of naturally occurring optimization problems, and the property of submodularity of set functions has deep theoretical consequences with wide-ranging applications. Informally, the property of submodularity of set functions concerns the intuitive "principle of diminishing returns." This property states that adding an element to a smaller set has more value than adding it to a larger set. Common examples of submodular monotone functions are entropies, concave functions of cardinality, and matroid rank functions; non-monotone examples include graph cuts, network flows, and mutual information. In this paper we will review the formal definition of submodularity; the optimization of submodular functions, both maximization and minimization; and finally discuss some applications in relation to learning and reasoning using submodular functions.

Submitted 7 May, 2015; originally announced May 2015.

Comments: Tech Report - USC Computer Science CS-599, Convex and Combinatorial Optimization

arXiv:1503.06375 [pdf, other]

Policy Learning with Hypothesis based Local Action Selection

Authors: Bharath Sankaran, Jeannette Bohg, Nathan Ratliff, Stefan Schaal

Abstract: For robots to be able to manipulate in unknown and unstructured environments, the robot should be capable of operating under partial observability of the environment. Object occlusions and unmodeled environments are some of the factors that result in partial observability. A common scenario where this is encountered is manipulation in clutter. In the case that the robot needs to locate an object of interest and manipulate it, it needs to perform a series of decluttering actions to accurately detect the object of interest. To perform such a series of actions, the robot also needs to account for the dynamics of objects in the environment and how they react to contact. This is a nontrivial problem since one needs to reason not only about robot-object interactions but also object-object interactions in the presence of contact. In the example scenario of manipulation in clutter, the state vector would have to account for the pose of the object of interest and the structure of the surrounding environment. The process model would have to account for all the aforementioned robot-object and object-object interactions. The complexity of the process model grows exponentially as the number of objects in the scene increases. This is commonly the case in unstructured environments. Hence it is not reasonable to attempt to model all object-object and robot-object interactions explicitly.
Under this setting we propose a hypothesis-based action selection algorithm where we construct a hypothesis set of the possible poses of an object of interest given the current evidence in the scene and select actions based on our current set of hypotheses. This hypothesis set tends to represent the belief about the structure of the environment and the number of poses the object of interest can take. The agent's only stopping criterion is when the uncertainty regarding the pose of the object is fully resolved.

Submitted 8 May, 2015; v1 submitted 21 March, 2015; originally announced March 2015.

Comments: RLDM abstract

arXiv:1309.5401 [pdf, other]

Nonmyopic View Planning for Active Object Detection

Authors: Nikolay Atanasov, Bharath Sankaran, Jerome Le Ny, George J. Pappas, Kostas Daniilidis

Abstract: One of the central problems in computer vision is the detection of semantically important objects and the estimation of their pose. Most of the work in object detection has been based on single-image processing, and its performance is limited by occlusions and ambiguity in appearance and geometry. This paper proposes an active approach to object detection by controlling the point of view of a mobile depth camera. When an initial static detection phase identifies an object of interest, several hypotheses are made about its class and orientation. The sensor then plans a sequence of views, which balances the amount of energy used to move with the chance of identifying the correct hypothesis. We formulate an active hypothesis testing problem, which includes sensor mobility, and solve it using a point-based approximate POMDP algorithm. The validity of our approach is verified through simulation and real-world experiments with the PR2 robot. The results suggest that our approach outperforms the widely used greedy viewpoint selection and provides a significant improvement over static object detection.

Submitted 20 September, 2013; originally announced September 2013.

Comments: 12 pages (two-column); 7 figures; 2 tables; Manuscript submitted to the IEEE Transactions on Robotics (TRO)

Showing 1–9 of 9 results for author: Sankaran, B