Labeled Morphological Segmentation with Semi-Markov Models

Ryan Cotterell1,2 Thomas Müller2  Alexander Fraser2  Hinrich Schütze2
1Johns Hopkins University  1,2LMU Munich
[email protected][email protected]
Abstract

We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks. From an annotation standpoint, we additionally introduce a new hierarchy of morphotactic tagsets. Finally, we develop Chipmunk, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics. We show that Chipmunk yields improved performance on three tasks for all six languages: (i) morphological segmentation, (ii) stemming and (iii) morphological tag classification. On morphological segmentation, our method shows absolute improvements of 2–6 points $F_1$ over the baseline.

1 Introduction [1]

[1] The datasets created, an additional description of our tagsets and Chipmunk can be found at http://cistern.cis.lmu.de/chipmunk.

gençleşmelerin
UMS genç leş me ler in
Gloss young -ate -ion -s genitive marker
LMS genç leş me ler in
Root:Adjectival Suffix:Deriv:Verb Suffix:Deriv:Noun Suffix:Infl:Noun:Plural Suffix:Infl:Noun:Genitive
Root genç Stem gençleşme Morphological Tag Plural:Genitive
Figure 1: Examples of the tasks addressed for the Turkish word gençleşmelerin (‘of the rejuvenatings’): Traditional unlabeled segmentation (UMS), Labeled morphological segmentation (LMS), stemming / root detection and (inflectional) morphological tag classification. The morphotactic annotations produced by LMS allow us to solve these tasks using a single model.

Morphological processing is often an overlooked problem because many well-researched languages in NLP, e.g., Chinese and English, are morphologically impoverished. However, for languages with complex morphology, e.g., Finnish and Turkish, morphological processing is essential. A specific form of morphological processing, morphological segmentation, has shown its utility for machine translation Dyer et al. (2008), sentiment analysis Abdul-Mageed et al. (2012), bilingual word alignment Eyigöz et al. (2013), speech processing Creutz et al. (2007b) and keyword spotting Narasimhan et al. (2014), inter alia. We advance the state of the art in supervised morphological segmentation by describing a high-performance, data-driven tool for handling complex morphology, even in low-resource settings.

In this work, we make the distinction between unlabeled morphological segmentation (UMS), which is often just called morphological segmentation, and labeled morphological segmentation (LMS). The labels in our supervised discriminative model for LMS capture the distinctions between different types of morphemes and directly model the morphotactics of the language. We further create a hierarchical universal tagset for labeling morphemes, with different levels appropriate for different tasks. Our hierarchical tagset was designed by creating a standard representation from heterogeneous resources for six languages. We give an overview of the tasks addressed in this paper in Figure 1, which shows the expected output for the Turkish word gençleşmelerin (‘of the rejuvenatings’). In particular, it shows the full labeled morphological segmentation, from which three representations can be directly derived: the unlabeled morphological segmentation, the stem (or root) [3] and the morphological tag containing part-of-speech (POS) and inflectional features.

[3] Terminological notes: We use root to refer to a morpheme with concrete meaning, stem to refer to the concatenation of all roots and derivational affixes, root detection to refer to stripping both derivational and inflectional affixes, and stemming to refer to stripping only inflectional affixes.
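The derivation of the three simpler representations from one labeled segmentation can be sketched in a few lines. This is a minimal illustration, assuming tags are colon-separated strings written as in Figure 1; the function name `derive_views` and the dictionary keys are hypothetical and not part of Chipmunk.

```python
def derive_views(labeled_segments):
    """Derive UMS, root(s), stem and morphological tag from an LMS analysis.

    labeled_segments: list of (segment, tag) pairs, with tags written as in
    Figure 1, e.g. "Suffix:Infl:Noun:Plural". (Assumed input format.)
    """
    def parts(tag):
        return tag.split(":")

    # UMS: keep the segments, drop the labels.
    ums = [seg for seg, _ in labeled_segments]
    # Root detection: keep only segments tagged as roots.
    roots = [seg for seg, tag in labeled_segments if parts(tag)[0] == "Root"]
    # Stemming: keep roots and derivational affixes, strip inflection.
    stem = "".join(seg for seg, tag in labeled_segments
                   if "Infl" not in parts(tag))
    # Morphological tag: the values carried by the inflectional affixes.
    morph_tag = ":".join(parts(tag)[-1] for _, tag in labeled_segments
                         if "Infl" in parts(tag))
    return {"ums": ums, "roots": roots, "stem": stem, "tag": morph_tag}


# The Turkish example from Figure 1.
example = [
    ("genç", "Root:Adjectival"),
    ("leş", "Suffix:Deriv:Verb"),
    ("me", "Suffix:Deriv:Noun"),
    ("ler", "Suffix:Infl:Noun:Plural"),
    ("in", "Suffix:Infl:Noun:Genitive"),
]
views = derive_views(example)
```

For this input the sketch yields the stem gençleşme, the single root genç and the tag Plural:Genitive, matching the last row of Figure 1.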

We model these tasks with Chipmunk, a semi-Markov conditional random field (semi-CRF; Sarawagi and Cohen, 2004), a model that is well-suited for morphological segmentation. We provide an evaluation and analysis on six languages; Chipmunk yields strong results on all three tasks, including state-of-the-art accuracy on morphological segmentation.

Paper Outline.

Section 2 presents our LMS framework and the morphotactic tagsets we develop, i.e., the labels of the sequence prediction task Chipmunk solves. Section 3 introduces our semi-CRF model. Section 4 presents our novel features. Section 5 compares Chipmunk to previous work. Section 6 presents experiments on the three complementary tasks of segmentation (UMS), stemming, and morphological tag classification. Section 7 briefly discusses finite-state morphology.

2 Labeled Segmentation and Tagset

We define the framework of labeled morphological segmentation, an enhancement of morphological segmentation that, in addition to identifying the boundaries of segments, assigns a fine-grained morphotactic tag to each segment. LMS both leads to better modeling of segmentation and subsumes several other tasks, e.g., stemming.

Most previous approaches to morphological segmentation are either unlabeled or use a small, coarse-grained set such as $\{\textsc{prefix}, \textsc{root}, \textsc{suffix}\}$. In contrast, our labels are fine-grained. This finer granularity has two advantages. (i) The labels are needed for many tasks; in sentiment analysis, for instance, detecting morphologically encoded negation, as in Turkish, is crucial. In other words, for many applications UMS is insufficient. (ii) The LMS framework allows us to learn a probabilistic model of morphotactics, and working with LMS results in higher UMS accuracy. Thus, LMS is beneficial even in applications that need only segments and no labels. Note that the concatenation of labels across segments yields a bundle of morphological attributes similar to those found in the CoNLL datasets often used to train morphological taggers Buchholz and Marsi (2006); thus LMS helps to unify UMS and morphological tagging. We believe that LMS is a needed extension of current work in morphological segmentation. Our framework concisely allows the model to capture interdependencies among various morphemes and model relations between entire morpheme classes, a neglected aspect of the problem.

5 Prefix:Deriv:Verb Root:Noun Suffix:Deriv:Noun Suffix:Infl:Noun:Plural
4 Prefix:Deriv:Verb Root:Noun Suffix:Deriv:Noun Suffix:Infl:Noun:Number
3 Prefix:Deriv:Verb Root:Noun Suffix:Deriv:Noun Suffix:Infl:Noun
2 Prefix:Deriv Root Suffix:Deriv Suffix:Infl
1 Prefix Root Suffix Suffix
0 Segment Segment Segment Segment
German Ent eis ung en
English de frost ing s
Figure 2: Example of the different morphotactic tagset granularities for German Enteisungen ‘defrostings’.

We first create a hierarchical tagset with increasing granularity by analyzing heterogeneous resources for the six languages we work on. The optimal level of granularity is task- and language-dependent: the level is a trade-off between simplicity and expressivity. We illustrate our tagset with the decomposition of the German word Enteisungen ‘defrostings’ (Figure 2):

  • Level 0: The level 0 tagset involves a single tag indicating a segment. It ignores morphotactics completely and is similar to previous work.

  • Level 1: The level 1 tagset crudely approximates morphotactics: it consists of the tags {Prefix, Root, Suffix}. This scheme has been successfully used by unsupervised segmenters, e.g., Morfessor Cat-MAP Creutz et al. (2007a). It allows the model to learn simple morphotactics, for instance that a prefix cannot be followed by a suffix. This makes a decomposition like reed $\mapsto$ re+ed unlikely. We also add an additional Unknown tag for morphemes that do not fit into this scheme.

  • Level 2: The level 2 tagset splits affixes into Derivational and Inflectional, effectively increasing the maximal tagset size from 4 to 6. These tags can encode that many languages allow for transitions from derivational to inflectional endings, but rarely the opposite. This makes the incorrect decomposition of German Offenheit (‘openness’) into Off, inflectional en and derivational heit unlikely. [4] This tagset is also useful for building statistical stemmers.

    [4] In open (English) and Offen (German), the en is part of the root.

  • Level 3: The level 3 tagset adds the part of speech, i.e., whether a root is Verbal, Nominal or Adjectival, and the part of speech of the word that an affix derives.

  • Level 4: The level 4 tagset includes the inflectional feature a suffix adds, e.g., Case or Number. This is helpful for certain agglutinative languages, in which, e.g., Case must follow Number.

  • Level 5: The level 5 tagset adds the actual value of the inflectional feature, e.g., Plural, and corresponds to the annotation in the datasets. In preliminary experiments we found that the level 5 tagset is too rich and does not yield consistent improvements; we thus do not report experimental results using it.
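The relationship between the levels can be made concrete with a small projection function that maps a level 5 tag to any coarser level. This is only a sketch of the hierarchy as it appears in Figure 2: the function name `project` and the `VALUE_TO_FEATURE` table are hypothetical, and the table lists just a few illustrative values rather than the full inventory used in the datasets.

```python
# Hypothetical mapping from inflectional values (level 5) to the feature
# they instantiate (level 4). Only a few illustrative entries are listed.
VALUE_TO_FEATURE = {
    "Plural": "Number",
    "Singular": "Number",
    "Genitive": "Case",
}


def project(tag, level):
    """Project a level 5 tag such as 'Suffix:Infl:Noun:Plural' to `level`."""
    if level == 0:
        return "Segment"          # level 0: a single tag for every segment
    parts = tag.split(":")
    if parts[0] == "Root":
        # Roots carry only a part of speech, which first appears at level 3.
        return tag if level >= 3 else "Root"
    if level >= 5:
        return tag
    if level == 4:
        # Replace the inflectional value by its feature name, if there is one.
        if len(parts) == 4 and parts[1] == "Infl":
            parts[3] = VALUE_TO_FEATURE.get(parts[3], parts[3])
        return ":".join(parts)
    # Levels 1-3: affix type, then Deriv/Infl, then part of speech.
    return ":".join(parts[:level])


# Level 5 analysis of German Enteisungen from Figure 2.
enteisungen = ["Prefix:Deriv:Verb", "Root:Noun",
               "Suffix:Deriv:Noun", "Suffix:Infl:Noun:Plural"]
rows = {lvl: [project(t, lvl) for t in enteisungen] for lvl in range(6)}
```

Applied to the level 5 row of Figure 2, the projection reproduces the other five rows of the figure.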

Table 1 shows tagset sizes for the six languages. [5]

[5] As converting segmentation datasets to tagsets is not always straightforward, we include tags that lack some features, e.g., some level 4 German tags lack POS because our German data does not specify it.

3 Model

Chipmunk is a supervised model based on a semi-Markov conditional random field (semi-CRF) Sarawagi and Cohen (2004) that naturally fits the task of LMS. Semi-CRFs generalize linear-chain CRFs and model segmentation jointly with sequence labeling. Just as linear-chain CRFs are discriminative adaptations of hidden Markov models Lafferty et al. (2001), semi-CRFs are an analogous adaptation of hidden semi-Markov models Murphy (2002). Semi-CRFs allow us to integrate new features that look at complete segments, which is not possible with CRFs, making semi-CRFs a natural choice for morphology.

A semi-CRF represents $\boldsymbol{w}$ (a word) as a sequence of segments $\boldsymbol{s} = \langle s_1, \ldots, s_N \rangle$, each of which is assigned a label $\ell_n$. The concatenation of all segments equals $\boldsymbol{w}$. We seek a log-linear distribution $p_{\boldsymbol{\theta}}(\boldsymbol{s}, \boldsymbol{\ell} \mid \boldsymbol{w})$ over all possible segmentations and label sequences for $\boldsymbol{w}$, where $\boldsymbol{\theta}$ is the parameter vector. Note that we recover the standard CRF if we restrict the segment length to 1. Formally, we define the probability distribution $p_{\boldsymbol{\theta}}$ as

$$p_{\boldsymbol{\theta}}(\boldsymbol{s}, \boldsymbol{\ell} \mid \boldsymbol{w}) \stackrel{\text{def}}{=} \frac{1}{Z_{\boldsymbol{\theta}}(\boldsymbol{w})} \prod_{n=1}^{N} \exp\left(\boldsymbol{\theta} \bullet \boldsymbol{f}_n\right), \tag{1}$$

where $\boldsymbol{f}_n \stackrel{\text{def}}{=} \boldsymbol{f}(s_n, \ell_n, \ell_{n-1}, n)$ is the feature function and $Z_{\boldsymbol{\theta}}(\boldsymbol{w})$ is the partition function. We use a generalization of the forward-backward algorithm for efficient gradient computation Sarawagi and Cohen (2004). Inspection of the semi-Markov forward recursion,

\begin{align*}
\boldsymbol{\alpha}(0, \ell) &= 1 \qquad (\forall \ell \in [L]) \tag{2} \\
\boldsymbol{\alpha}(n, \ell) &= \sum_{t=1}^{n} \sum_{\ell'=1}^{L} \exp(\boldsymbol{\theta} \bullet \boldsymbol{f}_n) \cdot \boldsymbol{\alpha}(n-t, \ell'),
\end{align*}

shows that the algorithm runs in $\mathcal{O}(N^2 L^2)$ time, where $N$ is the length of the word $\boldsymbol{w}$ and $L$ is the number of labels (size of the tagset). The partition function is then $Z_{\boldsymbol{\theta}}(\boldsymbol{w}) = \sum_{\ell=1}^{L} \boldsymbol{\alpha}(N, \ell)$. A similar recursion, generalizing the Viterbi algorithm for hidden Markov models (Rabiner, 1989), allows us to find the one-best labeled segmentation in $\mathcal{O}(N^2 L^2)$ time as well.
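To make the recursion concrete, the following sketch computes the partition function by dynamic programming and checks it against brute-force enumeration of all labeled segmentations. It is an illustration under simplifying assumptions, not Chipmunk's implementation: the score function `toy_score` stands in for $\boldsymbol{\theta} \bullet \boldsymbol{f}_n$ with hand-set values, it ignores the segment index, and a single start symbol (`None`) replaces the initialization over all labels.

```python
import itertools
import math


def forward_partition(word, labels, score):
    """Partition function via the semi-Markov forward recursion.

    score(segment, label, prev_label) plays the role of theta . f_n.
    alpha[n][l] sums exp-scores over all labeled segmentations of word[:n]
    whose last segment carries label l.
    """
    n_chars = len(word)
    alpha = [dict() for _ in range(n_chars + 1)]
    alpha[0] = {None: 1.0}                    # assumed single start symbol
    for n in range(1, n_chars + 1):           # end position of last segment
        for label in labels:
            total = 0.0
            for t in range(1, n + 1):         # length of last segment
                segment = word[n - t:n]
                for prev, value in alpha[n - t].items():
                    total += math.exp(score(segment, label, prev)) * value
            alpha[n][label] = total
    return sum(alpha[n_chars].values())


def brute_force_partition(word, labels, score):
    """Same quantity by enumerating every segmentation and labeling."""
    n_chars = len(word)
    total = 0.0
    for cuts in itertools.product([False, True], repeat=n_chars - 1):
        bounds = [0] + [i + 1 for i, c in enumerate(cuts) if c] + [n_chars]
        segments = [word[a:b] for a, b in zip(bounds, bounds[1:])]
        for labeling in itertools.product(labels, repeat=len(segments)):
            s, prev = 0.0, None
            for segment, label in zip(segments, labeling):
                s += score(segment, label, prev)
                prev = label
            total += math.exp(s)
    return total


def toy_score(segment, label, prev):
    """Hand-set toy weights, purely for illustration."""
    s = 0.0
    if label == "Root" and len(segment) >= 3:
        s += 1.0
    if label == "Suffix" and segment in {"s", "es", "ing"}:
        s += 1.5
    if label == "Suffix" and prev is None:
        s -= 2.0
    return s


LABELS = ["Root", "Suffix"]
z_dp = forward_partition("sees", LABELS, toy_score)
z_brute = brute_force_partition("sees", LABELS, toy_score)
```

The two nested loops over end positions and segment lengths, together with the two loops over labels, account for the quadratic dependence on both the word length and the tagset size.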

We employ the maximum-likelihood criterion to estimate the parameters with L-BFGS Liu and Nocedal (1989), a gradient-based optimization algorithm. As in all exponential family models, the gradient of the log-likelihood takes the form of the difference between the observed and expected feature counts Wainwright and Jordan (2008) and can be computed efficiently with the semi-Markov extension of the forward-backward algorithm. We use $L_2$ regularization with a regularization coefficient tuned during cross-validation.
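Written out, the objective and its gradient take the following form. This is a sketch under assumptions not stated in the text: $M$ training words indexed by $i$, a regularization coefficient $\lambda$ entering as $\frac{\lambda}{2}\lVert\boldsymbol{\theta}\rVert_2^2$, and the shorthand $\boldsymbol{F}(\boldsymbol{s}, \boldsymbol{\ell}) = \sum_{n=1}^{N} \boldsymbol{f}_n$ for the summed feature vector of a labeled segmentation.

```latex
\begin{align*}
\mathcal{L}(\boldsymbol{\theta})
  &= \sum_{i=1}^{M} \log p_{\boldsymbol{\theta}}\bigl(\boldsymbol{s}^{(i)}, \boldsymbol{\ell}^{(i)} \mid \boldsymbol{w}^{(i)}\bigr)
     - \frac{\lambda}{2} \lVert \boldsymbol{\theta} \rVert_2^2, \\
\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})
  &= \sum_{i=1}^{M} \Bigl( \underbrace{\boldsymbol{F}\bigl(\boldsymbol{s}^{(i)}, \boldsymbol{\ell}^{(i)}\bigr)}_{\text{observed counts}}
     - \underbrace{\mathbb{E}_{p_{\boldsymbol{\theta}}(\boldsymbol{s}, \boldsymbol{\ell} \mid \boldsymbol{w}^{(i)})}\bigl[\boldsymbol{F}(\boldsymbol{s}, \boldsymbol{\ell})\bigr]}_{\text{expected counts}} \Bigr)
     - \lambda \boldsymbol{\theta}.
\end{align*}
```

The expectation follows from $\nabla_{\boldsymbol{\theta}} \log Z_{\boldsymbol{\theta}}(\boldsymbol{w}) = \mathbb{E}[\boldsymbol{F}]$, which is why the forward-backward quantities suffice to compute the gradient.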

We note that semi-Markov models have the potential to avoid typical errors made by standard Markovian sequence models with an IOB labeling scheme over characters. For instance, consider the incorrect segmentation of the English verb sees into se+es. These are reasonable split positions as many English stems end in se (e.g., consider abuse+s). Semi-CRFs have a major advantage because they admit segmental features that allow them to learn that se is not a good morph.
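The effect of segmental features on this example can be illustrated with a small semi-Markov Viterbi decoder. The sketch below is not Chipmunk's decoder: the weights in `toy_score` are set by hand rather than learned, and the two word lists are hypothetical stand-ins for whatever segment-level evidence a trained model would rely on.

```python
KNOWN_ROOTS = {"see"}          # hypothetical segment-level evidence
KNOWN_SUFFIXES = {"s", "es"}


def toy_score(segment, label, prev):
    """Hand-set score of one labeled segment given the previous label."""
    if label == "Root":
        s = 2.0 if segment in KNOWN_ROOTS else -1.0
    else:
        s = 1.0 if segment in KNOWN_SUFFIXES else -1.0
    if label == "Suffix" and prev is None:
        s -= 5.0               # a word should not start with a suffix
    if label == "Root" and prev == "Suffix":
        s -= 5.0               # a root should not follow a suffix
    return s


def viterbi(word, labels, score):
    """One-best labeled segmentation under a semi-Markov model."""
    n_chars = len(word)
    # best[n][label] = (best score, (segment start, previous label))
    best = [dict() for _ in range(n_chars + 1)]
    best[0] = {None: (0.0, None)}
    for n in range(1, n_chars + 1):
        for label in labels:
            candidates = []
            for t in range(1, n + 1):
                start = n - t
                for prev, (prev_score, _) in best[start].items():
                    total = prev_score + score(word[start:n], label, prev)
                    candidates.append((total, (start, prev)))
            best[n][label] = max(candidates, key=lambda c: c[0])
    # Follow the back-pointers from the best final label.
    label = max(best[n_chars], key=lambda l: best[n_chars][l][0])
    n, output = n_chars, []
    while n > 0:
        _, (start, prev) = best[n][label]
        output.append((word[start:n], label))
        n, label = start, prev
    return output[::-1]


analysis = viterbi("sees", ["Root", "Suffix"], toy_score)
```

Because the score looks at the whole candidate segment, se is penalized as an unknown root while see is rewarded, so the decoder prefers see+s over se+es. A character-level IOB model has no direct way to express this preference.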

tagset level
language 0 1 2 3 4
English 1 4 5 13 16
Finnish 1 4 6 14 17
German 1 4 6 13 17
Indonesian 1 4 4 8 8
Turkish 1 3 4 10 20
Zulu 1 4 6 14 17
Table 1: Morphotactic tagset size at each level of granularity.

4 Features

We introduce several novel features for LMS. We exploit existing resources, e.g., spell checkers and Wiktionary, to create straightforward and effective features and we incorporate ideas from related areas: named-entity recognition (NER) and morphological tagging.

Affix Features and Gazetteers.

In contrast to syntax and semantics, the morphology of a language is often simple to document and a list of the most common morphs can be found in any good grammar. Wiktionary, for example, contains affix lists for all six languages used in our experiments. [6] Providing a supervised learner with such a list is a great boon, just as gazetteer features aid NER Smith and Osborne (2006). The benefit is perhaps even greater than in applications like NER because suffixes and prefixes are generally closed-class, and hence these lists are likely to be comprehensive. These features are binary and fire if a given substring occurs in the gazetteer list. In this paper, we simply use suffix lists from English Wiktionary, except for Zulu, for which we use a prefix list (see Table 2). We also include a feature that fires on the conjunction of tags and substrings observed in the training data. In the level 5 tagset, this allows us to link all allomorphs of a given morpheme. In the lower-level tagsets, this links related morphemes. Virpioja et al. (2010) explored this idea for unsupervised segmentation. Linking allomorphs together under a single tag helps combat sparsity in modeling the morphotactics.

[6] A good example of such a resource is en.wiktionary.org/wiki/Category:Turkish_suffixes.
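The two kinds of features described here can be sketched as follows. The feature names, the function `affix_features` and the set of observed tag-substring pairs are hypothetical; the gazetteer entries are the Turkish examples from Table 2 with the leading hyphen stripped.

```python
def affix_features(segment, tag, suffix_gazetteer, seen_tag_morphs):
    """Binary features for one candidate segment (names are illustrative)."""
    features = []
    # Gazetteer feature: fires if the substring occurs in the affix list.
    if segment in suffix_gazetteer:
        features.append("in-suffix-gazetteer")
    # Conjunction feature: fires if this tag/substring pair was observed
    # in the training data.
    if (tag, segment) in seen_tag_morphs:
        features.append(f"seen:{tag}+{segment}")
    return features


# Turkish suffix examples from Table 2, stored without the leading hyphen.
turkish_suffixes = {a.lstrip("-")
                    for a in ["-ten", "-suz", "-mek", "-den", "-t", "-ünüz"]}
# Hypothetical tag/substring pairs collected from a training set.
seen_pairs = {("Suffix:Infl", "ler")}

f_den = affix_features("den", "Suffix:Infl", turkish_suffixes, seen_pairs)
f_ler = affix_features("ler", "Suffix:Infl", turkish_suffixes, seen_pairs)
f_root = affix_features("kitap", "Root", turkish_suffixes, seen_pairs)
```

Since several allomorphs can share one tag, the second feature type gives the model a place to pool evidence across them.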

# affixes random examples
English 394 -ard -taxy -odon -en -otic -fold
Finnish 120 - -llä -ja -t -nen -hön - -ton
German 112 -nomie -lichenes -ell -en -yl -iv
Indonesian 5 -kau -an -nya -ku -mu
Turkish 263 -ten -suz -mek -den -t -ünüz
Zulu 72 i- u- za- tsh- mi- obu- olu-
Table 2: Sizes of the various affix gazetteers.

Stem Features.

A major problem in statistical segmentation is the reluctance to posit morphs not observed in training; this particularly affects roots, which are open-class. This makes it nearly impossible to correctly segment compounds that contain unseen roots, e.g., to correctly segment homework you need to know that home and work are independent English words. We solve this problem by incorporating spell-check features: binary features that fire if a segment is valid for a given spell checker. Spell-check features act as an effective proxy for a root detector. We use the open-source aspell dictionaries as they are freely available in 91 languages. Table 3 shows the coverage.
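A minimal sketch of the idea, assuming the spell checker is approximated by a plain set of valid word forms (the paper uses aspell dictionaries; no aspell API is called here, and the function names are hypothetical):

```python
def dict_feature(segment, dictionary):
    """Binary spell-check feature: is the segment a valid word form?"""
    return segment.lower() in dictionary


def compound_splits(word, dictionary):
    """Two-way splits in which both halves fire the spell-check feature."""
    return [(word[:i], word[i:])
            for i in range(1, len(word))
            if dict_feature(word[:i], dictionary)
            and dict_feature(word[i:], dictionary)]


# Tiny stand-in word list for an English spell-checker dictionary.
english = {"home", "work", "homework", "me"}
splits = compound_splits("homework", english)
```

Only the split home+work has both halves in the word list, which is the signal that lets a model posit roots it has never seen as morphs in training.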

Integrating the Features.

Our model uses the features discussed in this section and additionally the simple $n$-gram context features of Ruokolainen et al. (2013). The $n$-gram features look at variable length substrings of the word on both the right and left side of each boundary. We create conjunctive features from the cross-product between the morphotactic tagset (Section 2) and the features.
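The context features and their conjunction with tags might look as follows. The exact feature templates of Ruokolainen et al. (2013) are not given in this paper, so the naming scheme (`L2=lk`, `R2=in`) and both function names are assumptions made for illustration.

```python
def boundary_ngram_features(word, boundary, max_n=3):
    """Character n-grams ending at and starting from a boundary position."""
    features = []
    for n in range(1, max_n + 1):
        left = word[max(0, boundary - n):boundary]
        right = word[boundary:boundary + n]
        # Skip n-grams that would run past the edge of the word.
        if len(left) == n:
            features.append(f"L{n}={left}")
        if len(right) == n:
            features.append(f"R{n}={right}")
    return features


def conjoin(features, tag):
    """Cross-product of context features with one morphotactic tag."""
    return [f"{tag}&{feature}" for feature in features]


# Context around the boundary talk|ing, conjoined with a level 1 tag.
context = boundary_ngram_features("talking", 4, max_n=2)
tagged = conjoin(context, "Suffix")
```

Conjoining every context feature with every tag multiplies the feature count by the tagset size, which is one reason the richest tagset is not automatically the best choice.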

5 Related Work

Memory-based Learning.

van den Bosch and Daelemans (1999) and Marsi et al. (2005) present memory-based approaches to discriminative learning of morphological segmentation and both address the problem of LMS. We distinguish our work from theirs in that we define a cross-lingual schema for defining a hierarchical tagset for LMS. Moreover, we tackle the problem with a feature-rich, log-linear model, allowing us to easily incorporate disparate sources of knowledge into a single framework.

language # words
English 119,839
Finnish 6,690,417
German 364,564
Indonesian 35,269
Turkish 80,261
Zulu 73,525
Table 3: Number of words covered by the aspell dictionary

Unsupervised UMS.

UMS has been mainly addressed by unsupervised algorithms. Linguistica Goldsmith (2001) and Morfessor Creutz and Lagus (2002) are built around the idea of optimally encoding the data, in the sense of minimal description length (MDL). Morfessor Cat-MAP Creutz et al. (2007a) formulates the model as sequence prediction based on HMMs over a morph dictionary and MAP estimation. The model also attempts to induce basic morphotactic categories (Prefix, Root, Suffix). Kohonen et al. (2010b, a) and Grönroos et al. (2014) present variations of Morfessor for semi-supervised learning. Poon et al. (2009) introduce a Bayesian state-space model with corpus-wide priors. The model resembles a semi-CRF, but dynamic programming is no longer tractable. They employ the three-state tagset of Creutz and Lagus (2004) (row 1 of Figure 2) for Arabic and Hebrew UMS. Their gradient and objective computation is based on an enumeration of a heuristically chosen subset of the exponentially many segmentations. This limits the model's applicability to languages with complex concatenative morphology, e.g., Turkish and Finnish.

Supervised UMS.

Ruokolainen et al. (2013) present an averaged perceptron Collins (2002), a discriminative structured prediction method, for UMS. The model outperforms the semi-supervised model of Poon et al. (2009) on Arabic and Hebrew morpheme segmentation as well as the semi-supervised model of Kohonen et al. (2010a) on English, Finnish and Turkish. Ruokolainen et al. (2014) get further empirical improvements by using features extracted from large corpora, based on the letter successor variety (LSV) model Harris (1995) and on unsupervised segmentation models such as Morfessor Cat-MAP Creutz et al. (2007a). The idea behind LSV is that, for example, talking should be split into talk and ing, because talk can also be followed by letters other than i, such as e (talked) and s (talks).
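The LSV intuition can be sketched as a count of distinct letters that follow a prefix in a word list. This is a bare-bones illustration with a hypothetical function name and a toy vocabulary, not the feature extraction used by Ruokolainen et al. (2014).

```python
def successor_variety(prefix, vocabulary):
    """Number of distinct letters that follow `prefix` in the vocabulary."""
    return len({word[len(prefix)]
                for word in vocabulary
                if word.startswith(prefix) and len(word) > len(prefix)})


# Toy vocabulary; a real system would use a large unannotated word list.
vocabulary = ["talk", "talking", "talked", "talks", "tall"]
after_talk = successor_variety("talk", vocabulary)    # followed by i, e, s
after_talki = successor_variety("talki", vocabulary)  # followed only by n
```

A high count after talk and a low count after talki suggests a boundary after talk rather than one character later.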

Chinese Word Segmentation.

Chinese word segmentation (CWS) is related to UMS. Andrew (2006) successfully applies semi-CRFs to CWS. Joint CWS and POS tagging Ng and Low (2004); Zhang and Clark (2008) is related to LMS.

un. data train tune dev test
English 878k 800 100 100 694
Finnish 2,928k 800 100 100 835
German 2,338k 800 100 100 751
Indonesian 88k 800 100 100 2500
Turkish 617k 800 100 100 763
Zulu 123k 800 100 100 9040
Table 4: Dataset sizes (number of types).
+Affix +Dict,+Affix
Level 0 90.11 90.13 91.66
Level 1 90.73 90.68 92.80
Level 2 89.80 90.46 92.04
Level 3 91.03 90.83 92.31
Level 4 91.80 92.19 93.21
Table 5: Example of the effect of larger tagsets (Figure 2) on Turkish segmentation measured on our development set. As Turkish is an agglutinative language with hundreds of affixes, the efficacy of our approach is expected to be particularly salient here. Recall we optimized for the best tagset granularity for our experiments on Tune.
English Finnish Indonesian German Turkish Zulu
CRF-Morph 83.23 81.98 93.09 84.94 88.32 88.48
CRF-Morph +LSV 84.45 84.35 93.50 86.90 89.98 89.06
First-order CRF 84.66 85.05 93.31 85.47 90.03 88.99
Higher-order CRF 84.66 84.78 93.88 85.40 90.65 88.85
Chipmunk 84.40 84.40 93.76 85.53 89.72 87.80
Chipmunk +Morph 83.27 84.71 93.17 84.84 90.48 90.03
Chipmunk +Affix 83.81 86.02 93.51 85.81 89.72 89.64
Chipmunk +Dict 86.10 86.11 95.39 87.76 90.45 88.66
Chipmunk +Dict,+Affix,+Morph 86.31 88.38 95.41 87.85 91.36 90.16
Table 6: Test $F_1$ for UMS. Features: LSV = letter successor variety, Affix = affix, Dict = dictionary, Morph = optimal (on Tune) morphotactic tagset.

6 Experiments

We experiment on six languages from diverse language families. The segmentation data for English, Finnish and Turkish was taken from MorphoChallenge 2010 Kurimo et al. (2010). [7] Despite typically being used for UMS tasks, the MorphoChallenge datasets do contain morpheme-level labels. The German data was extracted from the CELEX2 collection Baayen et al. (1993), which contains all the requisite information. The Zulu data was taken from the Ukwabelana corpus Spiegler et al. (2010). Finally, the Indonesian portion was created by applying the rule-based analyzer MorphInd Larasati et al. (2011) to the Indonesian portion of an Indonesian–English bilingual corpus. [8]

[7] http://research.ics.aalto.fi/events/morphochallenge2010/
[8] https://github.com/desmond86/Indonesian-English-Bilingual-Corpus

We did not have access to the MorphoChallenge test set, and, thus, we used the original development set as our final evaluation set (Test). We developed Chipmunk using 10-fold cross-validation on the 1000-word training set and split every fold into training (Train), tuning (Tune) and development sets (Dev). [9] For German, Indonesian and Zulu, we randomly selected 1000 word forms as training set and used the rest as evaluation set. For our final evaluation we trained Chipmunk on the concatenation of Train, Tune and Dev (the original 1000-word training set), using the optimal parameters from the cross-validation, and tested on Test. Table 4 shows the important statistics of our datasets. One of our baselines also uses unlabeled training data. MorphoChallenge provides word lists for English, Finnish, German and Turkish. We use the unannotated part of Ukwabelana for Zulu; and for Indonesian, data from Wikipedia and the corpus of Krisnawati and Schulz (2013).

[9] We used both Tune and Dev in order to both optimize hyperparameters on held-out data (Tune) and perform qualitative error analysis on separate held-out data (Dev).

In all evaluations, we use variants of the standard MorphoChallenge evaluation approach. Importantly, for word types with multiple correct segmentations, this involves finding the maximum score by comparing the one-best segmentation under Chipmunk, as computed by the Viterbi algorithm, with each correct segmentation, as is standardly done in MorphoChallenge.
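The scoring of a word type with several correct segmentations can be sketched as below, assuming the score is the $F_1$ of segmentation boundaries, as used for the UMS experiments. The function names and the convention of returning 1.0 when neither side has a boundary are assumptions of this sketch; the official MorphoChallenge scripts may differ in such details.

```python
def boundaries(segments):
    """Character offsets of the internal boundaries of a segmentation."""
    position, result = 0, set()
    for segment in segments[:-1]:
        position += len(segment)
        result.add(position)
    return result


def boundary_f1(predicted, gold):
    """F1 over boundary positions for one predicted/gold pair."""
    p, g = boundaries(predicted), boundaries(gold)
    if not p and not g:
        return 1.0                 # assumed convention: nothing to find
    correct = len(p & g)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_f1(predicted, gold_alternatives):
    """Score against every correct segmentation and keep the maximum."""
    return max(boundary_f1(predicted, gold) for gold in gold_alternatives)


# One missed boundary out of three: precision 1, recall 2/3.
score_a = best_f1(["Ent", "eisung", "en"], [["Ent", "eis", "ung", "en"]])
# Two acceptable analyses; the prediction matches the second exactly.
score_b = best_f1(["see", "s"], [["sees"], ["see", "s"]])
```

Taking the maximum means a system is never penalized for choosing one valid analysis over another.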

6.1 UMS Experiments

We first evaluate Chipmunk on UMS, by predicting LMS and then discarding the labels. Our primary baseline is the state-of-the-art supervised system CRF-Morph of Ruokolainen et al. (2013). We ran the version of the system that the authors published on their website. [10] We optimized the model’s two hyperparameters on Tune: the number of epochs and the maximal length of $n$-gram character features. The system also supports Harris’s (1995) letter successor variety (LSV) features, extracted from large unannotated corpora. For completeness, we also compare Chipmunk with a first-order CRF and a higher-order CRF Müller et al. (2013), both of which use the same $n$-gram features as CRF-Morph, but without the LSV features. [11] We evaluate all models using the traditional macro $F_1$ of the segmentation boundaries.

[10] http://users.ics.tkk.fi/tpruokol/software/crfs_morph.zip
[11] Model order, maximal character $n$-gram length and regularization coefficients were optimized on Tune.

General Discussion.

The UMS results on held-out data are displayed in Table 6. Our most complex model beats the best baseline by between 1 (German) and 3 (Finnish) points $F_1$ on all six languages. We additionally provide extensive ablation studies to highlight the contribution of our novel features. We find that the properties of each specific language strongly influence which features are most effective. For the agglutinative languages, i.e., Finnish, Turkish and Zulu, the affix-based features (+Affix) and the morphotactic tagset (+Morph) yield consistent improvements over the semi-CRF models with a single state. Improvements for the affix features range from 0.2 for Turkish to 2.14 for Zulu. The morphological tagset yields improvements of 0.77 for Finnish, 1.89 for Turkish and 2.10 for Zulu. We optimized tagset granularity on Tune and found that level 4 and level 2 yielded the best results for the three agglutinative and the three other languages, respectively. The dictionary features (+Dict) help universally, but their effects are particularly salient in languages with productive compounding, i.e., English, Finnish and German, where we see improvements of $>1.7$. In comparison with previous work Ruokolainen et al. (2013), we find that our most complex model yields consistent improvements over CRF-Morph +LSV for all languages: the improvements range from $>1$ for German, through $>1.5$ for Zulu, English and Indonesian, to $>2$ for Turkish and $>4$ for Finnish.

The Role of Morphotactics.

To illustrate the effect of modeling morphotactics through the larger morphotactic tagset on performance, we provide a detailed analysis of Turkish. See Table 5. We consider three different feature sets and increase the size of the morphotactic tagsets depicted in Figure 2. The results evince the general trend that improved morphotactic modeling benefits segmentation. Additionally, we observe that the improvements are complementary to those from the other features.

Novel Roots and Affixes.

As discussed earlier, a key problem in UMS, especially in low-resource settings, is the detection of novel roots and affixes. Since many of our features were designed to combat this problem specifically, we investigated this aspect independently. Table 7 shows the number of novel roots and affixes found by our best model and the baseline. In all languages, Chipmunk correctly identifies between 5% (English) and 22% (Finnish) more novel roots than the baseline. We do not see major improvements for affixes, but this is of less interest as there are far fewer novel affixes.

Boundaries.

We further explore how Chipmunk and the baseline perform on different boundary types by looking at missing boundaries between different morphotactic types; this error type is also known as undersegmentation. Figure 3 shows a heatmap that gives an overview of errors broken down by morphotactic tag. We see that, across all languages, most errors occur at boundaries between roots and suffixes. This is related to the problem of finding new roots, as a new root is often mistaken for a root-affix composition.

Figure 3: This figure represents a comparative analysis of undersegmentation. Each column (labels at the bottom) shows how often CRF-Morph +LSV (top number in heatmap) and Chipmunk (bottom number in heatmap) select a segment that is two separate segments in the gold standard. E.g., Rt-Sx indicates how a root and a suffix were treated as a single segment. The color depends on the difference of the two counts.
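The counts behind such a breakdown could be collected roughly as follows. This sketch assumes level 1 tags on the gold side and an unlabeled predicted segmentation; the function name is hypothetical and the actual analysis script used for Figure 3 is not described in the paper.

```python
from collections import Counter


def undersegmentation_counts(gold_labeled, predicted_segments):
    """Count gold boundaries the prediction misses, by adjacent tag pair."""
    # Offsets of the boundaries the system did predict.
    position, predicted_bounds = 0, set()
    for segment in predicted_segments[:-1]:
        position += len(segment)
        predicted_bounds.add(position)
    # Walk over gold boundaries and record the tag pair of each missed one.
    counts, position = Counter(), 0
    for (segment, tag), (_, next_tag) in zip(gold_labeled, gold_labeled[1:]):
        position += len(segment)
        if position not in predicted_bounds:
            counts[(tag, next_tag)] += 1
    return counts


gold = [("Ent", "Prefix"), ("eis", "Root"), ("ung", "Suffix"), ("en", "Suffix")]
missed = undersegmentation_counts(gold, ["Ent", "eisung", "en"])
```

Here the single missed boundary lies between a root and a suffix, the Rt-Sx case of the figure.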

6.2 Root Detection and Stemming

Root detection and stemming (see the terminological notes in Section 1) are two important NLP problems that are closely related to morphological segmentation and used in applications such as MT, information retrieval, parsing and information extraction. Here we explore the utility of Chipmunk as a statistical stemmer and root detector. Stemming is closely related to the task of lemmatization, which involves the additional step of normalizing to the canonical form. [12] Consider the German particle verb participle auf-ge-schrieb-en ‘written down’. The participle is built by applying an alternation to the verbal root schreib ‘write’, adding the participial circumfix ge-en and finally adding the verb particle auf. In our segmentation-based definition, we would consider schrieb ‘write’ as its root and auf-schrieb as its stem. In order to additionally restore the lemma, we would also have to reverse the stem alternation that replaced ei with ie and add the infinitival ending en, yielding the infinitive auf-schreib-en.

[12] In our experiments there are no stem alternations. The output is equivalent to that of the Porter stemmer Porter (1980).

            CRF-Morph           Chipmunk
            Roots   Affixes     Roots   Affixes
English     614     6           644     12
Finnish     502     10          613     11
German      360     6           414     9
Indonesian  593     0           639     0
Turkish     435     22          514     19
Zulu        146     10          160     11
Table 7: Number of unseen root and affix types on Dev correctly identified by CRF-Morph +LSV and Chipmunk +Affix,+Dict,+Morph.

Our baseline, morfette Chrupała et al. (2008), is a statistical transducer that first extracts edit paths between input and output and then uses a perceptron classifier to decide which edit path to apply. In short, morfette treats the task as a string-to-string transduction problem, whereas we view it as a labeled segmentation problem. [Footnote 13: Note that morfette is a pipeline that first tags and then lemmatizes. We only make use of the second part of morfette, for which it is a strong string-to-string transduction baseline.] Note that morfette would in principle be able to handle stem alternations, although these usually lead to an increase in the number of edit paths. We use level 2 tagsets for all experiments (the smallest tagsets complex enough for stemming) and extract the relevant segments.
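As an illustration of how roots and stems might be read off a labeled segmentation, consider the sketch below. It is an assumption-laden example rather than Chipmunk's actual code: the two-level tag strings are obtained by truncating the tags shown in Figure 1, the tag Prefix:Deriv for the German particle auf is our guess, and `root_and_stem` is a hypothetical helper.

```python
def root_and_stem(segments):
    """Return (root, stem) from a list of (text, tag) pairs.

    Assumed convention: the root is the text of all Root segments, and the
    stem is the word with every inflectional segment removed.
    """
    root = "".join(text for text, tag in segments if tag.split(":")[0] == "Root")
    stem = "".join(text for text, tag in segments if "Infl" not in tag.split(":"))
    return root, stem


# Turkish example from Figure 1, with tags truncated to two levels (assumed).
turkish = [
    ("genç", "Root"),
    ("leş", "Suffix:Deriv"),
    ("me", "Suffix:Deriv"),
    ("ler", "Suffix:Infl"),
    ("in", "Suffix:Infl"),
]

# German example from the text; the tag for "auf" is an assumption.
german = [
    ("auf", "Prefix:Deriv"),
    ("ge", "Prefix:Infl"),
    ("schrieb", "Root"),
    ("en", "Suffix:Infl"),
]

turkish_result = root_and_stem(turkish)
german_result = root_and_stem(german)
print(turkish_result, german_result)
```

For the Turkish word this yields the root genç and the stem gençleşme, as in Figure 1; for the German participle it yields schrieb and aufschrieb. Words with several roots (compounds) would need a more careful definition than this sketch provides.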

                           English  Finnish  German  Indonesian  Turkish  Zulu
Root Detection  morfette   62.82    39.28    43.81   86.00       26.08    30.76
                Chipmunk   70.31    69.85    67.37   90.00       75.62    62.23
Stemming        morfette   91.35    51.74    79.49   86.00       28.57    58.12
                Chipmunk   94.24    79.23    85.75   89.36       85.06    67.64
Table 8: Accuracy on root detection and stemming on Test.

                        Finnish  Turkish
$F_1$   MaxEnt          75.61    69.92
        MaxEnt +Split   74.02    76.61
        Chipmunk +All   80.34    85.07
Acc.    MaxEnt          60.96    37.88
        MaxEnt +Split   59.04    44.30
        Chipmunk +All   65.00    56.06
Table 9: Test results on morphological tag classification.

Discussion.

Our results are shown in Table 8. We see consistent improvements on both tasks in all six languages. For the fusional languages (English, German and Indonesian) we see modest gains in performance on both root detection and stemming. However, for the agglutinative languages (Finnish, Turkish and Zulu) we see much larger absolute gains in accuracy, as high as 49.5 points on root detection and 56.5 points on stemming (both Turkish). We attribute this improvement to the complexity of the tasks in these languages: their productive morphology increases sparsity and makes the unstructured string-to-string transduction approach suboptimal. We view this as solid evidence that labeled segmentation has utility in many components of the NLP pipeline.
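The absolute gains discussed here can be recomputed directly from Table 8. The short sketch below only restates the table's accuracies and subtracts them; the names `TABLE_8` and `absolute_gains` are ours and purely illustrative.

```python
# Test accuracies copied from Table 8; each pair is (morfette, Chipmunk).
TABLE_8 = {
    "Root detection": {
        "English": (62.82, 70.31),
        "Finnish": (39.28, 69.85),
        "German": (43.81, 67.37),
        "Indonesian": (86.00, 90.00),
        "Turkish": (26.08, 75.62),
        "Zulu": (30.76, 62.23),
    },
    "Stemming": {
        "English": (91.35, 94.24),
        "Finnish": (51.74, 79.23),
        "German": (79.49, 85.75),
        "Indonesian": (86.00, 89.36),
        "Turkish": (28.57, 85.06),
        "Zulu": (58.12, 67.64),
    },
}


def absolute_gains(table):
    """Return Chipmunk minus morfette accuracy per task and language, in points."""
    return {
        task: {lang: round(chipmunk - morfette, 2)
               for lang, (morfette, chipmunk) in rows.items()}
        for task, rows in table.items()
    }


gains = absolute_gains(TABLE_8)
for task, per_language in gains.items():
    print(task, per_language)
```

The largest differences are for Turkish: 49.54 points on root detection and 56.49 points on stemming.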

6.3 Morphological Tag Classification

The joint modeling of segmentation and morphotactic tags allows us to use Chipmunk for a crude form of morphological analysis: the task of morphological tag classification, which we define as annotation of a word with its most likely inflectional features. [Footnote 14: We recognize that this task is best performed with sentential context (token-based). Integration with a POS tagger, however, is beyond the scope of this paper.] To be concrete, our task is to predict the inflectional features of a word type based only on its character sequence and not its sentential context. To this end, we take Finnish and Turkish as two examples of languages that should suit our approach particularly well, as both have highly complex inflectional morphologies. We use the level 4 tagset and replace all non-inflectional tags with a simple segment tag. The tagset sizes are listed in Table 10.
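To make the mapping from a labeled segmentation to a word-level tag concrete, here is a minimal sketch. It assumes that the inflectional feature is the last component of each inflectional tag (as in Figure 1), uses the placeholder name Segment for the simple segment tag, and introduces a hypothetical helper `word_tag`; none of these names are taken from the Chipmunk implementation.

```python
def word_tag(segments):
    """Join the inflectional feature of every inflectional segment, in order."""
    features = []
    for _text, tag in segments:
        parts = tag.split(":")
        if "Infl" in parts:
            # Assumed convention: the last tag component names the feature.
            features.append(parts[-1])
    return ":".join(features)


# Turkish example from Figure 1; non-inflectional tags are collapsed to
# the placeholder "Segment" (an assumed name).
segments = [
    ("genç", "Segment"),
    ("leş", "Segment"),
    ("me", "Segment"),
    ("ler", "Suffix:Infl:Noun:Plural"),
    ("in", "Suffix:Infl:Noun:Genitive"),
]
tag = word_tag(segments)
print(tag)
```

For the example word this reproduces the morphological tag Plural:Genitive given in Figure 1.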

We use the same experimental setup as in Section 6.2 and compare Chipmunk to a maximum entropy classifier (MaxEnt), whose features are character $n$-grams of up to a maximal length of $k$. [Footnote 15: Prefixes and suffixes are explicitly marked.] The maximum entropy classifier is $L_1$-regularized, and its regularization coefficient as well as the value for $k$ are optimized on Tune. As a second, stronger baseline we use a MaxEnt classifier that splits tags into their constituents and concatenates the features with every constituent as well as the complete tag (MaxEnt +Split). Both of the baselines in Table 9 are 0th-order versions of the state-of-the-art CRF-based morphological tagger MarMoT Müller et al. (2013) (since our model is type-based), making them strong baselines. We report full analysis accuracy and macro $F_1$ on the set of individual inflectional features.
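The character $n$-gram features of the MaxEnt baseline could look roughly like the following sketch. The exact feature templates are not specified in the text, so the feature-name prefixes (`ngram=`, `prefix=`, `suffix=`) and the function `char_ngram_features` are assumptions; in particular, we interpret "explicitly marked" as giving word-initial and word-final $n$-grams their own features.

```python
def char_ngram_features(word, k):
    """Return the set of character n-gram features of `word` for n = 1..k.

    Word-initial n-grams additionally fire a "prefix=" feature and
    word-final n-grams a "suffix=" feature (an assumed marking scheme).
    """
    features = set()
    length = len(word)
    for n in range(1, k + 1):
        for start in range(length - n + 1):
            gram = word[start:start + n]
            features.add("ngram=" + gram)
            if start == 0:
                features.add("prefix=" + gram)
            if start + n == length:
                features.add("suffix=" + gram)
    return features


features = char_ngram_features("abc", 2)
print(sorted(features))
```

With this scheme the classifier can distinguish, for instance, a word-final character sequence from the same sequence occurring word-internally, which matters for suffixing languages such as Finnish and Turkish.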

Discussion.

The results in Table 9 show that our proposed method outperforms both baselines on both performance metrics. Relative to the better baseline in each language, accuracy improves by about 4 points for Finnish (65.00 vs. 60.96) and about 12 points for Turkish (56.06 vs. 44.30). This is evidence that our proposed approach could be successfully integrated into a morphological tagger to give a stronger character-based signal.
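The two metrics in Table 9 can be sketched as below. This is one plausible reading rather than the authors' evaluation code: we assume a word tag is a colon-separated list of inflectional features, that full analysis accuracy requires an exact match of the whole tag, and that macro $F_1$ averages per-feature $F_1$ scores with equal weight. The function name `evaluate` and the toy data are hypothetical.

```python
def evaluate(gold_tags, pred_tags):
    """Return (full-analysis accuracy, macro F1 over individual features)."""
    assert len(gold_tags) == len(pred_tags) and gold_tags
    accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)

    # Split every tag into its set of features; an empty tag has no features.
    gold_sets = [set(t.split(":")) if t else set() for t in gold_tags]
    pred_sets = [set(t.split(":")) if t else set() for t in pred_tags]
    all_features = set().union(*gold_sets, *pred_sets)

    f1_scores = []
    for feature in sorted(all_features):
        tp = sum(feature in g and feature in p for g, p in zip(gold_sets, pred_sets))
        fp = sum(feature not in g and feature in p for g, p in zip(gold_sets, pred_sets))
        fn = sum(feature in g and feature not in p for g, p in zip(gold_sets, pred_sets))
        # The denominator is positive because the feature occurs at least once.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    macro_f1 = sum(f1_scores) / len(f1_scores) if f1_scores else 0.0
    return accuracy, macro_f1


# Invented toy data: the first prediction misses the Genitive feature.
accuracy, macro_f1 = evaluate(["Plural:Genitive", "Singular"], ["Plural", "Singular"])
print(accuracy, macro_f1)
```

In the toy data the first word counts as wrong for accuracy, although one of its two features is correct; this is why the feature-level $F_1$ scores in Table 9 are higher than the full-analysis accuracies.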

          Morpheme Tags   Full Word Tags
Finnish   43              172
Turkish   50              636
Table 10: Number of full word and morpheme tags.

7 Comparison to Finite-State Morphology

A morphological finite-state analyzer is customarily a hand-crafted tool that generates all the possible morphological readings with their associated features. We believe that, for many applications, high-quality finite-state morphological analysis is superior to Chipmunk. Finite-state morphological analyzers output a small set of linguistically valid analyses of a type, typically with only limited overgeneration. However, finite-state morphological analyzers have two limitations. The first is that significant effort is required to develop the transducers that model the morphological grammar, and that creating and updating the lexicon is laborious. The second is that it is difficult to use a finite-state analyzer to guess analyses involving roots not covered in the lexicon. [Footnote 16: While one can in theory put in wildcard root states, this does not work in practice due to overgeneration.] In fact, this is usually solved by viewing it as a different problem, morphological guessing, where linguistic knowledge similar to the features we have presented is used to try to guess the POS and morphological analysis for types with no analysis in the finite-state analyzer.

In contrast, our training procedure learns a probabilistic transducer, which is a soft version of the type of hand-engineered grammar that is used in finite-state analyzers. The 1-best labeled morphological segmentation our model produces offers a simple and clean representation which could be of use in many downstream applications. Furthermore, our model unifies analysis and guessing into a single simple framework. Nevertheless, finite-state morphologies are still extremely useful, high-precision tools. A primary goal of future work will be to use Chipmunk to attempt to induce higher-quality morphological processing systems.

8 Conclusion and Future Work

In this paper, we have presented labeled morphological segmentation (LMS), a new approach to morphological processing. LMS unifies three existing tasks in the literature: unlabeled morphological segmentation, stemming, and morphological tag classification. Our hierarchy of labeled morphological segmentation tagsets can be used to map the heterogeneous data in the six languages we work with to universal representations of different granularities. In future work, we plan to create gold standard segmentations in more languages using our annotation scheme.

We further presented Chipmunk, a semi-CRF-based model for LMS that allows for the integration of various linguistic features and consistently outperforms previously presented approaches to unlabeled morphological segmentation. An important extension of Chipmunk is embedding it in a context-sensitive POS tagger. Current state-of-the-art models only employ character-level $n$-gram features to model word-internal structure Müller et al. (2013). We have demonstrated that our structured approach outperforms this baseline. We leave this natural extension to future work.

Acknowledgments

We would like to thank Jason Eisner, Helmut Schmid, Özlem Çetinoğlu as well as the anonymous reviewers for their comments. This material is based upon work supported by a Fulbright fellowship awarded to the first author by the German–American Fulbright Commission and the National Science Foundation under Grant No. 1423276. The second author is a recipient of the Google Europe Fellowship in Natural Language Processing, and this research is supported by this Google Fellowship. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644402 (HimL) and the DFG grant Models of Morphosyntax for Statistical Machine Translation.

Retrospective

This version, prepared in April 2024, is a lightly edited version of the original CoNLL 2015 paper. The tables are now rendered with the booktabs package, and a base case was added to the semi-Markov recursion (Eq. 2). Finally, a few typos and infelicities in the writing were cleaned up. All in all, 9 years later, the first author at least still finds the computational analysis of morphology a challenging and unsolved problem.

References