Efficient multivariate sequence classification
Authors:
Pavel P. Kuksa
Abstract:
Kernel-based approaches for sequence classification have been successfully applied to a variety of domains, including text categorization, image classification, speech analysis, biological sequence analysis, and time series and music classification, where they show some of the most accurate results.
Typical kernel functions for sequences in these domains (e.g., bag-of-words, mismatch, or subsequence kernels) are restricted to {\em discrete univariate} (i.e., one-dimensional) string data, such as sequences of words in text analysis, codeword sequences in image analysis, or nucleotide or amino acid sequences in DNA and protein sequence analysis. However, the original sequence data are often real-valued and multivariate, i.e., neither univariate nor discrete as required by typical $k$-mer based sequence kernel functions.
In this work, we consider the problem of {\em multivariate} sequence classification, such as the classification of multivariate music sequences or multidimensional protein sequence representations. To this end, we extend {\em univariate} kernel functions typically used in sequence analysis and propose an efficient {\em multivariate} similarity kernel method (MVDFQ-SK) based on (1) a direct feature quantization (DFQ) of each sequence dimension in the original {\em real-valued} multivariate sequences and (2) the application of novel multivariate discrete kernel measures to these multivariate discrete DFQ sequence representations, in order to capture similarity relationships among sequences more accurately and improve classification performance.
Experiments using the proposed MVDFQ-SK kernel method show excellent classification performance on three challenging music classification tasks as well as on protein sequence classification, with significant 25-40% improvements over univariate kernel methods and existing state-of-the-art sequence classification methods.
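
The two steps named in the abstract (quantize each dimension, then compare the resulting discrete sequences) can be illustrated with a minimal sketch. This is not the MVDFQ-SK method itself: the abstract does not specify the quantizer or the multivariate kernel measures, so the sketch assumes fixed thresholds shared by all dimensions and substitutes a plain sum of per-dimension $k$-mer spectrum kernels. The names `quantize`, `kmer_counts`, `spectrum_kernel`, and `multivariate_kernel` are hypothetical.

```python
from collections import Counter

import numpy as np


def quantize(seq, edges):
    """Map a (length, dims) real-valued sequence to integer bin ids.

    Assumption: the same interior thresholds `edges` are applied to every
    dimension; the abstract does not say how bins are chosen.
    """
    return np.digitize(np.asarray(seq, dtype=float), edges)


def kmer_counts(symbols, k):
    """Count contiguous k-mers in a one-dimensional discrete sequence."""
    return Counter(tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1))


def spectrum_kernel(a, b, k):
    """Inner product of the k-mer count vectors of two discrete sequences."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(count * cb[kmer] for kmer, count in ca.items() if kmer in cb)


def multivariate_kernel(x, y, edges, k=2):
    """Sum of per-dimension spectrum kernels on the quantized sequences.

    Assumption: summing over dimensions is a stand-in for the multivariate
    discrete kernel measures mentioned in the abstract.
    """
    qx, qy = quantize(x, edges), quantize(y, edges)
    return sum(
        spectrum_kernel(qx[:, d].tolist(), qy[:, d].tolist(), k)
        for d in range(qx.shape[1])
    )


# Two short two-dimensional sequences, one threshold at 0.5 (two bins).
x = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]
y = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
print(multivariate_kernel(x, x, [0.5]))
print(multivariate_kernel(x, y, [0.5]))
```

A sum of positive semidefinite kernels is itself positive semidefinite, so the resulting Gram matrix can be passed to a kernel classifier such as an SVM with a precomputed kernel.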
Submitted 30 September, 2014; v1 submitted 29 September, 2014;
originally announced September 2014.
Natural Language Processing (almost) from Scratch
Authors:
Ronan Collobert,
Jason Weston,
Leon Bottou,
Michael Karlen,
Koray Kavukcuoglu,
Pavel Kuksa
Abstract:
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks, including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
Submitted 2 March, 2011;
originally announced March 2011.