Labeled Morphological Segmentation with Semi-Markov Models

Ryan Cotterell¹,² Thomas Müller² Alexander Fraser² Hinrich Schütze²
¹Johns Hopkins University ²LMU Munich
[email protected] [email protected]

Abstract

We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks. From an annotation standpoint, we additionally introduce a new hierarchy of morphotactic tagsets. Finally, we develop Chipmunk, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics. We show that Chipmunk yields improved performance on three tasks for all six languages: (i) morphological segmentation, (ii) stemming and (iii) morphological tag classification. On morphological segmentation, our method shows absolute improvements of 2–6 points $F_{1}$ over the baseline.

1 Introduction¹

¹ The datasets created, an additional description of our tagsets and Chipmunk can be found at http://cistern.cis.lmu.de/chipmunk.

gençleşmelerin
UMS	genç	leş	me	ler	in
Gloss	young	-ate	-ion	-s	genitive marker
LMS	genç	leş	me	ler	in
LMS	Root:Adjectival	Suffix:Deriv:Verb	Suffix:Deriv:Noun	Suffix:Infl:Noun:Plural	Suffix:Infl:Noun:Genitive
Root	genç	Stem	gençleşme	Morphological Tag	Plural:Genitive

Figure 1: Examples of the tasks addressed for the Turkish word gençleşmelerin (‘of the rejuvenatings’): Traditional unlabeled segmentation (UMS), Labeled morphological segmentation (LMS), stemming / root detection and (inflectional) morphological tag classification. The morphotactic annotations produced by LMS allow us to solve these tasks using a single model.

Morphological processing is often an overlooked problem because many well-researched languages in NLP, e.g., Chinese and English, are morphologically impoverished. However, for languages with complex morphology, e.g., Finnish and Turkish, morphological processing is essential. A specific form of morphological processing, morphological segmentation, has shown its utility for machine translation (Dyer et al., 2008), sentiment analysis (Abdul-Mageed et al., 2012), bilingual word alignment (Eyigöz et al., 2013), speech processing (Creutz et al., 2007b) and keyword spotting (Narasimhan et al., 2014), inter alia. We advance the state of the art in supervised morphological segmentation by describing a high-performance, data-driven tool for handling complex morphology, even in low-resource settings.

In this work, we make the distinction between unlabeled morphological segmentation (UMS), which is often just called morphological segmentation, and labeled morphological segmentation (LMS). The labels in our supervised discriminative model for LMS capture the distinctions between different types of morphemes and directly model the morphotactics of the language. We further create a hierarchical universal tagset for labeling morphemes, with different levels appropriate for different tasks. Our hierarchical tagset was designed by creating a standard representation from heterogeneous resources for six languages. We give an overview of the tasks addressed in this paper in Figure 1, which shows the expected output for the Turkish word gençleşmelerin (‘of the rejuvenatings’). In particular, it shows the full labeled morphological segmentation, from which three representations can be directly derived: the unlabeled morphological segmentation, the stem (or root)³ and the morphological tag containing part-of-speech (POS) and inflectional features.

³ Terminological notes: We use root to refer to a morpheme with concrete meaning, stem to refer to the concatenation of all roots and derivational affixes, root detection to refer to stripping both derivational and inflectional affixes, and stemming to refer to stripping only inflectional affixes.
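To make the derivation concrete, the following minimal Python sketch reads the other representations off the labeled segmentation of Figure 1. The function name `derive_views` and the rule that the morphological tag is the concatenation of the last field of each inflectional label are illustrative assumptions, not the released Chipmunk code.

```python
def derive_views(segments, labels):
    """Derive UMS, root, stem and morphological tag from a labeled segmentation.

    Assumes labels follow the colon-separated pattern shown in Figure 1,
    e.g. 'Suffix:Infl:Noun:Plural' (an assumption of this sketch).
    """
    # UMS: simply drop the labels.
    ums = list(segments)
    # Root detection: keep only segments labeled as roots.
    root = "".join(s for s, l in zip(segments, labels) if l.startswith("Root"))
    # Stemming: strip inflectional affixes, keep roots and derivational affixes.
    stem = "".join(s for s, l in zip(segments, labels) if ":Infl" not in l)
    # Morphological tag: last field of every inflectional label, in order.
    tag = ":".join(l.split(":")[-1] for l in labels if ":Infl" in l)
    return ums, root, stem, tag


segments = ["genç", "leş", "me", "ler", "in"]
labels = ["Root:Adjectival", "Suffix:Deriv:Verb", "Suffix:Deriv:Noun",
          "Suffix:Infl:Noun:Plural", "Suffix:Infl:Noun:Genitive"]
ums, root, stem, tag = derive_views(segments, labels)
```

On this input the sketch reproduces the last row of Figure 1: root genç, stem gençleşme and tag Plural:Genitive.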

We model these tasks with Chipmunk, a semi-Markov conditional random field (semi-CRF; Sarawagi and Cohen, 2004), a model that is well-suited for morphological segmentation. We provide an evaluation and analysis on six languages; Chipmunk yields strong results on all three tasks, including state-of-the-art accuracy on morphological segmentation.

Paper Outline.

Section 2 presents our LMS framework and the morphotactic tagsets we develop, i.e., the labels of the sequence prediction task Chipmunk solves. Section 3 introduces our semi-CRF model. Section 4 presents our novel features. Section 5 compares Chipmunk to previous work. Section 6 presents experiments on the three complementary tasks of segmentation (UMS), stemming, and morphological tag classification. Section 7 briefly discusses finite-state morphology.

2 Labeled Segmentation and Tagset

We define the framework of labeled morphological segmentation, an enhancement of morphological segmentation that, in addition to identifying the boundaries of segments, assigns a fine-grained morphotactic tag to each segment. LMS both leads to better modeling of segmentation and subsumes several other tasks, e.g., stemming.

Most previous approaches to morphological segmentation are either unlabeled or use a small, coarse-grained set such as $\{\textsc{prefix},\textsc{root},\textsc{suffix}\}$. In contrast, our labels are fine-grained. This finer granularity has two advantages. (i) The labels are needed for many tasks; for instance, in sentiment analysis, detecting morphologically encoded negation, as in Turkish, is crucial. In other words, for many applications UMS is insufficient. (ii) The LMS framework allows us to learn a probabilistic model of morphotactics. Working with LMS results in higher UMS accuracy. Thus, even in applications that only need segments and no labels, LMS is beneficial. Note that the concatenation of labels across segments yields a bundle of morphological attributes similar to those found in the CoNLL datasets often used to train morphological taggers (Buchholz and Marsi, 2006); thus LMS helps to unify UMS and morphological tagging. We believe that LMS is a needed extension of current work in morphological segmentation. Our framework concisely allows the model to capture interdependencies among various morphemes and to model relations between entire morpheme classes, a neglected aspect of the problem.

5	Prefix:Deriv:Verb	Root:Noun	Suffix:Deriv:Noun	Suffix:Infl:Noun:Plural
4	Prefix:Deriv:Verb	Root:Noun	Suffix:Deriv:Noun	Suffix:Infl:Noun:Number
3	Prefix:Deriv:Verb	Root:Noun	Suffix:Deriv:Noun	Suffix:Infl:Noun
2	Prefix:Deriv	Root	Suffix:Deriv	Suffix:Infl
1	Prefix	Root	Suffix	Suffix
0	Segment	Segment	Segment	Segment
German	Ent	eis	ung	en
English	de	frost	ing	s

Figure 2: Example of the different morphotactic tagset granularities for German Enteisungen ‘defrostings’.

We first create a hierarchical tagset of increasing granularity by analyzing heterogeneous resources for the six languages we work on. The optimal level of granularity is task- and language-dependent: the level is a trade-off between simplicity and expressivity. We illustrate our tagset with the decomposition of the German word Enteisungen ‘defrostings’ (Figure 2):

• Level 0: The level 0 tagset involves a single tag indicating a segment. It ignores morphotactics completely and is similar to previous work.

• Level 1: The level 1 tagset crudely approximates morphotactics: it consists of the tags $\{$Prefix, Root, Suffix$\}$. This scheme has been successfully used by unsupervised segmenters, e.g., Morfessor Cat-MAP (Creutz et al., 2007a). It allows the model to learn simple morphotactics, for instance that a prefix cannot be followed by a suffix. This makes a decomposition like reed $\mapsto$ re+ed unlikely. We also add an additional Unknown tag for morphemes that do not fit into this scheme.

• Level 2: The level 2 tagset splits affixes into Derivational and Inflectional, effectively increasing the maximal tagset size from 4 to 6. These tags can encode that many languages allow for transitions from derivational to inflectional endings, but rarely the opposite. This makes the incorrect decomposition of German Offenheit (‘openness’) into Off, inflectional en and derivational heit unlikely.⁴ This tagset is also useful for building statistical stemmers.

• Level 3: The level 3 tagset adds the part of speech, i.e., whether a root is Verbal, Nominal or Adjectival, and the part of speech of the word that an affix derives.

• Level 4: The level 4 tagset includes the inflectional feature a suffix adds, e.g., Case or Number. This is helpful for certain agglutinative languages, in which, e.g., Case must follow Number.

• Level 5: The level 5 tagset adds the actual value of the inflectional feature, e.g., Plural, and corresponds to the annotation in the datasets. In preliminary experiments we found that the level 5 tagset is too rich and does not yield consistent improvements; we thus do not report experimental results using it.

⁴ In open (English) and Offen (German), the en is part of the root.
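The relation between the levels can be illustrated with a small Python sketch that projects a level 5 tag onto each coarser level. The function `coarsen` and the value-to-feature table `VALUE_TO_FEATURE` are hypothetical helpers written for this illustration; the colon-separated tag layout is taken from Figure 2.

```python
# Hypothetical mapping from level 5 feature values to level 4 feature names.
VALUE_TO_FEATURE = {"Plural": "Number", "Genitive": "Case"}


def coarsen(tag, level):
    """Project a level 5 tag such as 'Suffix:Infl:Noun:Plural' to a coarser level."""
    parts = tag.split(":")
    if level == 0:
        return "Segment"
    if level == 1:
        return parts[0]
    if level == 2:
        # Roots have no Deriv/Infl field; their second field is the POS.
        return parts[0] if parts[0] == "Root" else ":".join(parts[:2])
    if level == 3:
        return ":".join(parts[:3])
    if level == 4:
        kept = parts[:3]
        if len(parts) > 3:
            # Replace the feature value by the name of the feature it instantiates.
            kept.append(VALUE_TO_FEATURE.get(parts[3], parts[3]))
        return ":".join(kept)
    return tag


enteisungen = ["Prefix:Deriv:Verb", "Root:Noun",
               "Suffix:Deriv:Noun", "Suffix:Infl:Noun:Plural"]
rows = {level: [coarsen(t, level) for t in enteisungen] for level in range(6)}
```

Applied to the level 5 tags of Enteisungen, `rows` reproduces the six rows of Figure 2.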

Table 1 shows tagset sizes for the six languages.⁵

⁵ As converting segmentation datasets to tagsets is not always straightforward, we include tags that lack some features, e.g., some level 4 German tags lack POS because our German data does not specify it.

3 Model

Chipmunk is a supervised model based on a semi-Markov conditional random field (semi-CRF; Sarawagi and Cohen, 2004) that naturally fits the task of LMS. Semi-CRFs generalize linear-chain CRFs and model segmentation jointly with sequence labeling. Just as linear-chain CRFs are discriminative adaptations of hidden Markov models (Lafferty et al., 2001), semi-CRFs are an analogous adaptation of hidden semi-Markov models (Murphy, 2002). Semi-CRFs allow us to integrate new features that look at complete segments, which is not possible with CRFs, making semi-CRFs a natural choice for morphology.

A semi-CRF represents $\boldsymbol{w}$ (a word) as a sequence of segments $\boldsymbol{s}=\langle s_{1},\ldots,s_{N}\rangle$ , each of which is assigned a label $\ell_{n}$ . The concatenation of all segments equals $\boldsymbol{w}$ . We seek a log-linear distribution $p_{\boldsymbol{\theta}}(\boldsymbol{s},\boldsymbol{\ell}\mid\boldsymbol{w})$ over all possible segmentations and label sequences for $\boldsymbol{w}$ , where $\boldsymbol{\theta}$ is the parameter vector. Note that we recover the standard CRF if we restrict the segment length to 1. Formally, we define the probability distribution $p_{\boldsymbol{\theta}}$ as

$$p_{\boldsymbol{\theta}}(\boldsymbol{s},\boldsymbol{\ell}\mid\boldsymbol{w})\stackrel{\text{def}}{=}\frac{1}{Z_{\boldsymbol{\theta}}(\boldsymbol{w})}\prod_{n=1}^{N}\exp\left(\boldsymbol{\theta}\bullet\boldsymbol{f}_{n}\right), \tag{1}$$

where $\boldsymbol{f}_{n}\stackrel{\text{def}}{=}\boldsymbol{f}(s_{n},\ell_{n},\ell_{n-1},n)$ is the feature function and $Z_{\boldsymbol{\theta}}(\boldsymbol{w})$ is the partition function. We use a generalization of the forward-backward algorithm for efficient gradient computation (Sarawagi and Cohen, 2004). Inspection of the semi-Markov forward recursion,

$$\begin{aligned}
\boldsymbol{\alpha}(0,\ell) &= 1 \qquad (\forall\ell\in[L]) \\
\boldsymbol{\alpha}(n,\ell) &= \sum_{t=1}^{n}\sum_{\ell^{\prime}=1}^{L}\exp(\boldsymbol{\theta}\bullet\boldsymbol{f}_{n})\cdot\boldsymbol{\alpha}(n-t,\ell^{\prime}),
\end{aligned} \tag{2}$$

shows that the algorithm runs in $\mathcal{O}(N^{2}L^{2})$ time, where $N$ is the length of the word $\boldsymbol{w}$ and $L$ is the number of labels (the size of the tagset). The partition function is then $Z_{\boldsymbol{\theta}}(\boldsymbol{w})=\sum_{\ell=1}^{L}\boldsymbol{\alpha}(N,\ell)$. A similar recursion, generalizing the Viterbi algorithm for hidden Markov models (Rabiner, 1989), allows us to find the one-best labeled segmentation in $\mathcal{O}(N^{2}L^{2})$ as well.
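The recursion can be checked on a toy example. The Python sketch below computes the partition function with a semi-Markov forward pass and compares it against brute-force enumeration of all labeled segmentations. It is an illustration under stated assumptions: the function `score` is a hypothetical stand-in for $\boldsymbol{\theta}\bullet\boldsymbol{f}_{n}$ with made-up weights, and position 0 carries a single dummy start label (a common variant of the initialization) so that each first segment is counted once.

```python
import math
from itertools import product

LABELS = ["Prefix", "Root", "Suffix"]
START = "<s>"  # dummy label for position 0 (assumption of this sketch)


def score(segment, label, prev_label):
    """Toy stand-in for theta . f(s_n, l_n, l_{n-1}); the weights are made up."""
    s = 0.0
    if label == "Root" and len(segment) >= 3:
        s += 1.0
    if label == "Suffix" and segment in ("s", "ed", "ing"):
        s += 2.0
    if prev_label == "Prefix" and label == "Suffix":
        s -= 3.0  # discourage a prefix directly followed by a suffix
    return s


def partition_forward(word):
    """Z(w) via the semi-Markov forward recursion: O(N^2 L^2) score evaluations."""
    n_chars = len(word)
    alpha = [dict() for _ in range(n_chars + 1)]
    alpha[0][START] = 1.0
    for n in range(1, n_chars + 1):          # end position of the last segment
        for label in LABELS:                 # label of the last segment
            total = 0.0
            for t in range(1, n + 1):        # length of the last segment
                seg = word[n - t:n]
                for prev, a in alpha[n - t].items():
                    total += math.exp(score(seg, label, prev)) * a
            alpha[n][label] = total
    return sum(alpha[n_chars].values())


def partition_brute_force(word):
    """Z(w) by enumerating every segmentation and every label sequence."""
    n, total = len(word), 0.0
    for cuts in product([False, True], repeat=n - 1):
        bounds = [0] + [i + 1 for i, c in enumerate(cuts) if c] + [n]
        segs = [word[a:b] for a, b in zip(bounds, bounds[1:])]
        for labels in product(LABELS, repeat=len(segs)):
            prev, log_score = START, 0.0
            for seg, lab in zip(segs, labels):
                log_score += score(seg, lab, prev)
                prev = lab
            total += math.exp(log_score)
    return total


z_dp = partition_forward("talks")
z_enum = partition_brute_force("talks")
```

The two values agree up to floating-point error, while the brute-force variant grows exponentially with word length.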

We employ the maximum-likelihood criterion to estimate the parameters with L-BFGS (Liu and Nocedal, 1989), a gradient-based optimization algorithm. As in all exponential family models, the gradient of the log-likelihood takes the form of the difference between the observed and expected feature counts (Wainwright and Jordan, 2008) and can be computed efficiently with the semi-Markov extension of the forward-backward algorithm. We use $L_{2}$ regularization with a regularization coefficient tuned during cross-validation.
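For reference, the objective and its gradient can be written out as follows. This is a sketch under assumptions: $\mathcal{D}$ (the training set), $\lambda$ (the regularization coefficient) and $\mathcal{L}$ are symbols introduced here for illustration, and the regularizer is assumed to take the common form $\frac{\lambda}{2}\lVert\boldsymbol{\theta}\rVert_{2}^{2}$.

```latex
\mathcal{L}(\boldsymbol{\theta})
  = \sum_{(\boldsymbol{w},\boldsymbol{s},\boldsymbol{\ell})\in\mathcal{D}}
      \log p_{\boldsymbol{\theta}}(\boldsymbol{s},\boldsymbol{\ell}\mid\boldsymbol{w})
    \;-\; \frac{\lambda}{2}\,\lVert\boldsymbol{\theta}\rVert_{2}^{2}

\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})
  = \sum_{(\boldsymbol{w},\boldsymbol{s},\boldsymbol{\ell})\in\mathcal{D}}
      \Bigg[
        \underbrace{\sum_{n=1}^{N}\boldsymbol{f}_{n}}_{\text{observed counts}}
        \;-\;
        \underbrace{\mathbb{E}_{(\boldsymbol{s}',\boldsymbol{\ell}')\sim
          p_{\boldsymbol{\theta}}(\cdot\mid\boldsymbol{w})}
          \Big[\sum_{n'}\boldsymbol{f}_{n'}(\boldsymbol{s}',\boldsymbol{\ell}')\Big]}_{\text{expected counts}}
      \Bigg]
    \;-\; \lambda\,\boldsymbol{\theta}
```

The expectation is the term that the forward-backward pass supplies; the observed counts are read directly off the gold labeled segmentation.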

We note that semi-Markov models have the potential to obviate typical errors made by standard Markovian sequence models with an IOB labeling scheme over characters. For instance, consider the incorrect segmentation of the English verb sees into se+es. These are reasonable split positions, as many English stems end in se (e.g., consider abuse+s). Semi-CRFs have a major advantage because they admit segmental features that allow them to learn that se is not a good morph.
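The sees example can be replayed with a small semi-Markov Viterbi decoder. The Python sketch below is illustrative only: the two-label tagset, the word lists `KNOWN_ROOTS` and `SUFFIXES` and all weights are made up, and the point is merely that a feature looking at a whole segment can reward see and abuse as roots, which a character-level IOB model cannot express directly.

```python
LABELS = ["Root", "Suffix"]
START = "<s>"  # dummy label for position 0 (assumption of this sketch)

KNOWN_ROOTS = {"see", "abuse"}  # hypothetical segment-level evidence
SUFFIXES = {"s", "es"}          # hypothetical suffix list


def score(segment, label, prev_label):
    """Toy segmental score with made-up weights."""
    s = 0.0
    if label == "Root" and segment in KNOWN_ROOTS:
        s += 2.0  # feature over the complete segment
    if label == "Suffix" and segment in SUFFIXES:
        s += 1.0
    if label == "Suffix" and prev_label == START:
        s -= 5.0  # a word should not start with a suffix
    return s


def viterbi(word):
    """One-best labeled segmentation under `score` (semi-Markov Viterbi)."""
    n_chars = len(word)
    # best[n][label] = (score, (start, previous label)) for prefixes word[:n]
    best = [dict() for _ in range(n_chars + 1)]
    best[0][START] = (0.0, None)
    for n in range(1, n_chars + 1):
        for label in LABELS:
            candidates = []
            for t in range(1, n + 1):
                seg = word[n - t:n]
                for prev, (prev_score, _) in best[n - t].items():
                    candidates.append((prev_score + score(seg, label, prev),
                                       (n - t, prev)))
            best[n][label] = max(candidates, key=lambda c: c[0])
    # Follow back-pointers from the best final label.
    label = max(best[n_chars], key=lambda l: best[n_chars][l][0])
    n, out = n_chars, []
    while n > 0:
        _, (start, prev) = best[n][label]
        out.append((word[start:n], label))
        n, label = start, prev
    return out[::-1]


sees = viterbi("sees")
abuses = viterbi("abuses")
```

With these toy weights the decoder prefers see+s over se+es while still segmenting abuses as abuse+s.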

	tagset level
language	0	1	2	3	4
English	1	4	5	13	16
Finnish	1	4	6	14	17
German	1	4	6	13	17
Indonesian	1	4	4	8	8
Turkish	1	3	4	10	20
Zulu	1	4	6	14	17

Table 1: Morphotactic tagset size at each level of granularity.

4 Features

We introduce several novel features for LMS. We exploit existing resources, e.g., spell checkers and Wiktionary, to create straightforward and effective features and we incorporate ideas from related areas: named-entity recognition (NER) and morphological tagging.

Affix Features and Gazetteers.

In contrast to syntax and semantics, the morphology of a language is often simple to document and a list of the most common morphs can be found in any good grammar. Wiktionary, for example, contains affix lists for all six languages used in our experiments.⁶ Providing a supervised learner with such a list is a great boon, just as gazetteer features aid NER (Smith and Osborne, 2006). The benefit is perhaps even greater than in applications like NER because suffixes and prefixes are generally closed-class, and hence these lists are likely to be comprehensive. These features are binary and fire if a given substring occurs in the gazetteer list. In this paper, we simply use suffix lists from English Wiktionary, except for Zulu, for which we use a prefix list (see Table 2). We also include a feature that fires on the conjunction of tags and substrings observed in the training data. In the level 5 tagset, this allows us to link all allomorphs of a given morpheme. In the lower-level tagsets, this links related morphemes. Virpioja et al. (2010) explored this idea for unsupervised segmentation. Linking allomorphs together under a single tag helps combat sparsity in modeling the morphotactics.

⁶ A good example of such a resource is en.wiktionary.org/wiki/Category:Turkish_suffixes.
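A minimal Python sketch of these two feature types follows. The function `affix_features`, the feature names and the set `seen_pairs` are hypothetical; the gazetteer contains only the Turkish examples printed in Table 2, not the full 263-entry list.

```python
# Only the Turkish example suffixes shown in Table 2, not the full gazetteer.
TURKISH_SUFFIXES = {"ten", "suz", "mek", "den", "t", "ünüz"}


def affix_features(segment, label, gazetteer, seen_pairs):
    """Binary features for one candidate segment (names are illustrative)."""
    feats = {}
    # Gazetteer feature: fires if the substring occurs in the affix list.
    if segment in gazetteer:
        feats["in_gazetteer"] = 1.0
    # Tag-substring conjunction: fires only for pairs observed in training.
    if (label, segment) in seen_pairs:
        feats["seen:" + label + "|" + segment] = 1.0
    return feats


seen_pairs = {("Suffix", "den")}  # hypothetical training observations
hit = affix_features("den", "Suffix", TURKISH_SUFFIXES, seen_pairs)
miss = affix_features("xyz", "Suffix", TURKISH_SUFFIXES, seen_pairs)
```

Because the features are plain dictionary entries, they can be added to the weight vector lookup of a log-linear model without further changes.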

	# affixes	random examples
English	394	-ard -taxy -odon -en -otic -fold
Finnish	120	-tä -llä -ja -t -nen -hön -jä -ton
German	112	-nomie -lichenes -ell -en -yl -iv
Indonesian	5	-kau -an -nya -ku -mu
Turkish	263	-ten -suz -mek -den -t -ünüz
Zulu	72	i- u- za- tsh- mi- obu- olu-

Table 2: Sizes of the various affix gazetteers.

Stem Features.

A major problem in statistical segmentation is the reluctance to posit morphs not observed in training; this particularly affects roots, which are open-class. This makes it nearly impossible to correctly segment compounds that contain unseen roots, e.g., to correctly segment homework you need to know that home and work are independent English words. We solve this problem by incorporating spell-check features: binary features that fire if a segment is valid for a given spell checker. Spell-check features act as an effective proxy for a root detector. We use the open-source aspell dictionaries as they are freely available in 91 languages. Table 3 shows the coverage.

Integrating the Features.

Our model uses the features discussed in this section and additionally the simple $n$-gram context features of Ruokolainen et al. (2013). The $n$-gram features look at variable-length substrings of the word on both the right and left side of each boundary. We create conjunctive features from the cross-product between the morphotactic tagset (Section 2) and the features.
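The combination can be sketched in a few lines of Python. Everything here is illustrative: the helper names, the `L:`/`R:` feature prefixes and the default `max_len` are assumptions, and a plain Python set stands in for an aspell dictionary lookup.

```python
def context_ngrams(word, boundary, max_len):
    """Substrings ending at (L:) and starting at (R:) a boundary position."""
    feats = []
    for k in range(1, max_len + 1):
        if boundary - k >= 0:
            feats.append("L:" + word[boundary - k:boundary])
        if boundary + k <= len(word):
            feats.append("R:" + word[boundary:boundary + k])
    return feats


def conjoin(feats, label):
    """Cross-product of a morphotactic tag with a list of base features."""
    return [label + "&" + f for f in feats]


def segment_features(word, start, end, label, dictionary, max_len=2):
    """Features for the segment word[start:end] carrying the given label."""
    base = context_ngrams(word, end, max_len)
    # Spell-check feature; the set `dictionary` replaces a real aspell query.
    if word[start:end] in dictionary:
        base.append("in_dictionary")
    return conjoin(base, label)


dictionary = {"home", "work"}  # hypothetical stand-in for an aspell word list
feats = segment_features("homework", 0, 4, "Root", dictionary)
```

For the candidate segment home in homework, the sketch yields the context substrings around the boundary after home plus the dictionary indicator, each conjoined with the tag Root.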

5 Related Work

Memory-based Learning.

van den Bosch and Daelemans (1999) and Marsi et al. (2005) present memory-based approaches to discriminative learning of morphological segmentation and both address the problem of LMS. We distinguish our work from theirs in that we define a cross-lingual schema for defining a hierarchical tagset for LMS. Moreover, we tackle the problem with a feature-rich, log-linear model, allowing us to easily incorporate disparate sources of knowledge into a single framework.

language	# words
English	119,839
Finnish	6,690,417
German	364,564
Indonesian	35,269
Turkish	80,261
Zulu	73,525

Table 3: Number of words covered by the aspell dictionary

Unsupervised UMS.

UMS has been mainly addressed by unsupervised algorithms. Linguistica (Goldsmith, 2001) and Morfessor (Creutz and Lagus, 2002) are built around the idea of optimally encoding the data, in the sense of minimum description length (MDL). Morfessor Cat-MAP (Creutz et al., 2007a) formulates the model as sequence prediction based on HMMs over a morph dictionary and MAP estimation. The model also attempts to induce basic morphotactic categories (Prefix, Root, Suffix). Kohonen et al. (2010b, a) and Grönroos et al. (2014) present variations of Morfessor for semi-supervised learning. Poon et al. (2009) introduce a Bayesian state-space model with corpus-wide priors. The model resembles a semi-CRF, but dynamic programming is no longer tractable. They employ the three-state tagset of Creutz and Lagus (2004) (row 1 of Figure 2) for Arabic and Hebrew UMS. Their gradient and objective computation is based on an enumeration of a heuristically chosen subset of the exponentially many segmentations. This limits its applicability to languages with complex concatenative morphology, e.g., Turkish and Finnish.

Supervised UMS.

Ruokolainen et al. (2013) present an averaged perceptron (Collins, 2002), a discriminative structured prediction method, for UMS. The model outperforms the semi-supervised model of Poon et al. (2009) on Arabic and Hebrew morpheme segmentation as well as the semi-supervised model of Kohonen et al. (2010a) on English, Finnish and Turkish. Ruokolainen et al. (2014) get further empirical improvements by using features extracted from large corpora, based on the letter successor variety (LSV) model (Harris, 1995) and on unsupervised segmentation models such as Morfessor Cat-MAP (Creutz et al., 2007a). The idea behind LSV is that, for example, talking should be split into talk and ing, because talk can also be followed by letters other than i, such as e (talked) and s (talks).
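The LSV intuition is easy to reproduce. The Python sketch below counts, for every proper prefix in a word list, the distinct letters that follow it; the function name and the toy word list are illustrative, and treating the end of a word as an additional successor is deliberately left out.

```python
from collections import defaultdict


def successor_variety(words):
    """Map each proper prefix to the set of letters that follow it in `words`."""
    successors = defaultdict(set)
    for w in words:
        for i in range(1, len(w)):
            successors[w[:i]].add(w[i])
    return successors


words = ["talk", "talks", "talked", "talking", "tall"]  # toy word list
sv = successor_variety(words)
variety_after_talk = len(sv["talk"])    # letters e, i, s
variety_after_talki = len(sv["talki"])  # only n
```

A high count after talk and a low count after talki is the kind of signal that suggests a boundary after talk rather than inside ing.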

Chinese Word Segmentation.

Chinese word segmentation (CWS) is related to UMS. Andrew (2006) successfully applies semi-CRFs to CWS. Joint CWS and POS tagging (Ng and Low, 2004; Zhang and Clark, 2008) is related to LMS.

	un. data	train	tune	dev	test
English	878k	800	100	100	694
Finnish	2,928k	800	100	100	835
German	2,338k	800	100	100	751
Indonesian	88k	800	100	100	2500
Turkish	617k	800	100	100	763
Zulu	123k	800	100	100	9040

Table 4: Dataset sizes (number of types).

		+Affix	+Dict,+Affix
Level 0	90.11	90.13	91.66
Level 1	90.73	90.68	92.80
Level 2	89.80	90.46	92.04
Level 3	91.03	90.83	92.31
Level 4	91.80	92.19	93.21

Table 5: Example of the effect of larger tagsets (Figure 2) on Turkish segmentation measured on our development set. As Turkish is an agglutinative language with hundreds of affixes, the efficacy of our approach is expected to be particularly salient here. Recall we optimized for the best tagset granularity for our experiments on Tune.

	English	Finnish	Indonesian	German	Turkish	Zulu
CRF-Morph	83.23	81.98	93.09	84.94	88.32	88.48
CRF-Morph +LSV	84.45	84.35	93.50	86.90	89.98	89.06
First-order CRF	84.66	85.05	93.31	85.47	90.03	88.99
Higher-order CRF	84.66	84.78	93.88	85.40	90.65	88.85
Chipmunk	84.40	84.40	93.76	85.53	89.72	87.80
Chipmunk +Morph	83.27	84.71	93.17	84.84	90.48	90.03
Chipmunk +Affix	83.81	86.02	93.51	85.81	89.72	89.64
Chipmunk +Dict	86.10	86.11	95.39	87.76	90.45	88.66
Chipmunk +Dict,+Affix,+Morph	86.31	88.38	95.41	87.85	91.36	90.16

Table 6: Test $F_{1}$ for UMS. Features: LSV = letter successor variety, Affix = affix, Dict = dictionary, Morph = optimal (on Tune) morphotactic tagset.

6 Experiments

We experiment on six languages from diverse language families. The segmentation data for English, Finnish and Turkish was taken from MorphoChallenge 2010 (Kurimo et al., 2010).⁷ Despite typically being used for UMS tasks, the MorphoChallenge datasets do contain morpheme-level labels. The German data was extracted from the CELEX2 collection (Baayen et al., 1993), which contains all the requisite information. The Zulu data was taken from the Ukwabelana corpus (Spiegler et al., 2010). Finally, the Indonesian portion was created by applying the rule-based analyzer MorphInd (Larasati et al., 2011) to the Indonesian portion of an Indonesian–English bilingual corpus.⁸

⁷ http://research.ics.aalto.fi/events/morphochallenge2010/

⁸ https://github.com/desmond86/Indonesian-English-Bilingual-Corpus

We did not have access to the MorphoChallenge test set and thus used the original development set as our final evaluation set (Test). We developed Chipmunk using 10-fold cross-validation on the 1000-word training set and split every fold into training (Train), tuning (Tune) and development sets (Dev).⁹ For German, Indonesian and Zulu, we randomly selected 1000 word forms as the training set and used the rest as the evaluation set. For our final evaluation we trained Chipmunk on the concatenation of Train, Tune and Dev (the original 1000-word training set), using the optimal parameters from the cross-validation, and tested on Test. Table 4 shows the important statistics of our datasets. One of our baselines also uses unlabeled training data. MorphoChallenge provides word lists for English, Finnish, German and Turkish. We use the unannotated part of Ukwabelana for Zulu; and for Indonesian, data from Wikipedia and the corpus of Krisnawati and Schulz (2013).

⁹ We used both Tune and Dev in order to both optimize hyperparameters on held-out data (Tune) and perform qualitative error analysis on separate held-out data (Dev).

In all evaluations, we use variants of the standard MorphoChallenge evaluation approach. Importantly, for word types with multiple correct segmentations, this involves finding the maximum score by comparing the one-best segmentation under Chipmunk, as computed by the Viterbi algorithm, with each correct segmentation, as is standardly done in MorphoChallenge.

6.1 UMS Experiments

We first evaluate Chipmunk on UMS, by predicting LMS and then discarding the labels. Our primary baseline is the state-of-the-art supervised system CRF-Morph of Ruokolainen et al. (2013). We ran the version of the system that the authors published on their website.¹⁰ We optimized the model’s two hyperparameters on Tune: the number of epochs and the maximal length of $n$-gram character features. The system also supports Harris’s (1995) letter successor variety (LSV) features, extracted from large unannotated corpora. For completeness, we also compare Chipmunk with a first-order CRF and a higher-order CRF (Müller et al., 2013), both of which used the same $n$-gram features as CRF-Morph, but without the LSV features.¹¹ We evaluate all models using the traditional macro $F_{1}$ of the segmentation boundaries.

¹⁰ http://users.ics.tkk.fi/tpruokol/software/crfs_morph.zip

¹¹ Model order, maximal character $n$-gram length and regularization coefficients were optimized on Tune.
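As a rough illustration of boundary-level scoring, the Python sketch below computes the $F_{1}$ of predicted versus gold segmentation boundaries for a single word and takes the maximum over alternative gold segmentations. The helper names are hypothetical, the treatment of words without any boundary is an assumption, and how per-word scores are aggregated into the reported macro $F_{1}$ is not specified here.

```python
def boundaries(segments):
    """Internal boundary positions, e.g. ['talk', 'ing'] -> {4}."""
    pos, out = 0, set()
    for seg in segments[:-1]:
        pos += len(seg)
        out.add(pos)
    return out


def boundary_f1(pred, gold):
    """F1 over boundary positions for one predicted and one gold segmentation."""
    p, g = boundaries(pred), boundaries(gold)
    if not p and not g:
        return 1.0  # assumption: correctly leaving a word unsegmented is perfect
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


def best_f1(pred, gold_alternatives):
    """Score against each correct segmentation and keep the maximum."""
    return max(boundary_f1(pred, g) for g in gold_alternatives)


oversegmented = boundary_f1(["ta", "lk", "ing"], ["talk", "ing"])
```

In the last line one of two predicted boundaries is correct and the single gold boundary is found, so precision is 0.5 and recall is 1.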

General Discussion.

The UMS results on held-out data are displayed in Table 6. Our most complex model beats the best baseline by between 1 (German) and 3 (Finnish) points $F_{1}$ on all six languages. We additionally provide extensive ablation studies to highlight the contribution of our novel features. We find that the properties of each specific language strongly influence which features are most effective. For the agglutinative languages, i.e., Finnish, Turkish and Zulu, the affix-based features (+Affix) and the morphotactic tagset (+Morph) yield consistent improvements over the semi-CRF models with a single state. Improvements for the affix features range from 0.2 for Turkish to 2.14 for Zulu. The morphological tagset yields improvements of 0.77 for Finnish, 1.89 for Turkish and 2.10 for Zulu. We optimized tagset granularity on Tune and found that level 4 and level 2 yielded the best results for the three agglutinative and the three other languages, respectively. The dictionary features (+Dict) help universally, but their effects are particularly salient in languages with productive compounding, i.e., English, Finnish and German, where we see improvements of $>1.7$. In comparison with previous work (Ruokolainen et al., 2013), we find that our most complex model yields consistent improvements over CRF-Morph +LSV for all languages: the improvements range from $>1$ for German over $>1.5$ for Zulu, English, and Indonesian to $>2$ for Turkish and $>4$ for Finnish.

The Role of Morphotactics.

To illustrate the effect of modeling morphotactics through the larger morphotactic tagset on performance, we provide a detailed analysis of Turkish. See Table 5. We consider three different feature sets and increase the size of the morphotactic tagsets depicted in Figure 2. The results evince the general trend that improved morphotactic modeling benefits segmentation. Additionally, we observe that the improvements are complementary to those from the other features.

Novel Roots and Affixes.

As discussed earlier, a key problem in UMS, especially in low-resource settings, is the detection of novel roots and affixes. Since many of our features were designed to combat this problem specifically, we investigated this aspect independently. Table 7 shows the number of novel roots and affixes found by our best model and the baseline. In all languages, Chipmunk correctly identifies between 5% (English) and 22% (Finnish) more novel roots than the baseline. We do not see major improvements for affixes, but this is of less interest as there are far fewer novel affixes.

Boundaries.

We further explore how Chipmunk and the baseline perform on different boundary types by looking at missing boundaries between different morphotactic types; this error type is also known as undersegmentation. Figure 3 shows a heatmap that gives an overview of errors broken down by morphotactic tag. We see that, across all languages, most errors occur between roots and suffixes. This is related to the problem of finding new roots, as a new root is often mistaken for a root-affix composition.

Figure 3: Comparative analysis of undersegmentation. Each column (labels at the bottom) shows how often CRF-Morph +LSV (top number in heatmap) and Chipmunk (bottom number in heatmap) select a segment that is two separate segments in the gold standard. E.g., Rt-Sx indicates how often a root and a suffix were treated as a single segment. The color depends on the difference of the two counts.

6.2 Root Detection and Stemming

Root detection³ and stemming³ are two important NLP problems that are closely related to morphological segmentation and used in applications such as MT, information retrieval, parsing and information extraction. Here we explore the utility of Chipmunk as a statistical stemmer and root detector. Stemming is closely related to the task of lemmatization, which involves the additional step of normalizing to the canonical form.¹² Consider the German particle verb participle auf-ge-schrieb-en ‘written down’. The participle is built by applying an alternation to the verbal root schreib ‘write’, adding the participial circumfix ge-en, and finally adding the verb particle auf. In our segmentation-based definition, we would consider schrieb ‘write’ as its root and auf-schrieb as its stem. In order to additionally restore the lemma, we would also have to reverse the stem alternation that replaced ei with ie and add the infinitival ending en, yielding the infinitive auf-schreib-en.

¹² In our experiments there are no stem alternations. The output is equivalent to that of the Porter stemmer (Porter, 1980).

	CRF-Morph		Chipmunk
	Roots	Affixes	Roots	Affixes
English	614	6	644	12
Finnish	502	10	613	11
German	360	6	414	9
Indonesian	593	0	639	0
Turkish	435	22	514	19
Zulu	146	10	160	11

Table 7: Dev number of unseen root and affix types correctly identified by CRF-Morph +LSV and Chipmunk +Affix,+Dict,+Morph.

Our baseline morfette (Chrupała et al., 2008) is a statistical transducer that first extracts edit paths between input and output and then uses a perceptron classifier to decide which edit path to apply. In short, morfette treats the task as a string-to-string transduction problem, whereas we view it as a labeled segmentation problem.¹³ Note that morfette would in principle be able to handle stem alternations, although these usually lead to an increase in the number of edit paths. We use level 2 tagsets for all experiments (the smallest tagsets complex enough for stemming) and extract the relevant segments.

¹³ Note that Morfette is a pipeline that first tags and then lemmatizes. We only make use of this second part of Morfette, for which it is a strong string-to-string transduction baseline.

		English	Finnish	German	Indonesian	Turkish	Zulu
Root Detection	morfette	62.82	39.28	43.81	86.00	26.08	30.76
	Chipmunk	70.31	69.85	67.37	90.00	75.62	62.23
Stemming	morfette	91.35	51.74	79.49	86.00	28.57	58.12
	Chipmunk	94.24	79.23	85.75	89.36	85.06	67.64

Table 8: Accuracy on the root detection and stemming on Test.

		Finnish	Turkish
$F_{1}$	MaxEnt	75.61	69.92
	MaxEnt +Split	74.02	76.61
	Chipmunk +All	80.34	85.07
Acc.	MaxEnt	60.96	37.88
	MaxEnt +Split	59.04	44.30
	Chipmunk +All	65.00	56.06

Table 9: Test results on morphological tag classification.

Discussion.

Our results are shown in Table 8. We see consistent improvements across all tasks. For the fusional languages (English, German and Indonesian) we see modest gains in performance on both root detection and stemming. However, for the agglutinative languages (Finnish, Turkish and Zulu) we see absolute gains as high as 50% (Turkish) in accuracy. This significant improvement is due to the complexity of the tasks in these languages—their productive morphology increases sparsity and makes the unstructured string-to-string transduction approach suboptimal. We view this as solid evidence that labeled segmentation has utility in many components of the NLP pipeline.

6.3 Morphological Tag Classification

The joint modeling of segmentation and morphotactic tags allows us to use Chipmunk for a crude form of morphological analysis: the task of morphological tag classification, which we define as annotation of a word with its most likely inflectional features.¹⁴ To be concrete, our task is to predict the inflectional features of a word type based only on its character sequence and not its sentential context. To this end, we take Finnish and Turkish as two examples of languages that should suit our approach particularly well, as both have highly complex inflectional morphologies. We use the level 4 tagset and replace all non-inflectional tags with a simple segment tag. The tagset sizes are listed in Table 10.

¹⁴ We recognize that this task is best performed with sentential context (token-based). Integration with a POS tagger, however, is beyond the scope of this paper.

We use the same experimental setup as in Section 6.2 and compare Chipmunk to a maximum entropy classifier (MaxEnt), whose features are character $n$-grams of up to a maximal length of $k$.¹⁵ The maximum entropy classifier is $L_{1}$-regularized, and its regularization coefficient as well as the value for $k$ are optimized on Tune. As a second, stronger baseline we use a MaxEnt classifier that splits tags into their constituents and concatenates the features with every constituent as well as the complete tag (MaxEnt +Split). Both baselines in Table 9 are $0^{\text{th}}$-order versions of the state-of-the-art CRF-based morphological tagger MarMoT Müller et al. (2013) ($0^{\text{th}}$-order because our model is type-based), which makes them strong baselines. We report full analysis accuracy and macro $F_{1}$ on the set of individual inflectional features.

¹⁵ Prefixes and suffixes are explicitly marked.
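The two baseline feature sets can be illustrated with a short sketch. It rests on several assumptions that the text does not spell out: the explicit marking of prefixes and suffixes is modeled with the boundary symbols `^` and `$`, tag constituents are obtained by splitting on `:`, and "concatenating" a feature with a tag constituent is modeled as a (feature, label) pair. The function names are hypothetical and do not come from MarMoT.

```python
from typing import Set, Tuple


def char_ngrams(word: str, k: int) -> Set[str]:
    """All character n-grams of length 1..k over the boundary-marked word."""
    marked = "^" + word + "$"  # assumed boundary markers
    return {
        marked[i:i + n]
        for n in range(1, k + 1)
        for i in range(len(marked) - n + 1)
    }


def split_features(word: str, tag: str, k: int) -> Set[Tuple[str, str]]:
    """Pair every n-gram with each tag constituent and with the full tag."""
    labels = set(tag.split(":")) | {tag}
    return {(gram, label) for gram in char_ngrams(word, k) for label in labels}


feats = split_features("evlerin", "Plural:Genitive", 3)
print(("in$", "Genitive") in feats)  # True
```

Pairing features with individual constituents lets evidence for, say, a genitive suffix be shared across all full tags that contain `Genitive`, which is one plausible reason the +Split variant helps most where the full-tag inventory is large.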

Discussion.

The results in Table 9 show that our proposed method outperforms both baselines on both performance metrics. Relative to the better baseline for each language, accuracy improves by about 4 points for Finnish (60.96 to 65.00) and about 12 points for Turkish (44.30 to 56.06). This is evidence that our proposed approach could be successfully integrated into a morphological tagger to give a stronger character-based signal.
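The two metrics reported in Table 9 can be sketched as follows. This is one plausible reading, not the evaluation script used for the paper: full analysis accuracy is taken to be the fraction of words whose predicted tag matches the gold tag exactly, and macro $F_{1}$ is taken to be the unweighted mean of per-feature $F_{1}$ scores, where a feature counts as present for a word if it appears in that word's colon-separated tag. The function names are hypothetical.

```python
from typing import List, Set


def _features(tag: str) -> Set[str]:
    """Split a full word tag into its individual inflectional features."""
    return set(tag.split(":")) if tag else set()


def full_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of words whose complete tag is predicted exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-feature F1 over all observed features."""
    gold_sets = [_features(g) for g in gold]
    pred_sets = [_features(p) for p in pred]
    all_feats = set().union(*gold_sets, *pred_sets)
    scores = []
    for f in sorted(all_feats):
        tp = sum(f in g and f in p for g, p in zip(gold_sets, pred_sets))
        fp = sum(f not in g and f in p for g, p in zip(gold_sets, pred_sets))
        fn = sum(f in g and f not in p for g, p in zip(gold_sets, pred_sets))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


gold = ["Plural:Genitive", "Singular"]
pred = ["Plural", "Singular"]
print(full_accuracy(gold, pred))  # 0.5
print(macro_f1(gold, pred))
```

The example shows why the two numbers diverge: a prediction that recovers only some of a word's features counts as fully wrong for accuracy but still earns partial credit under the feature-level $F_{1}$.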

	Morpheme Tags	Full Word Tags
Finnish	43	172
Turkish	50	636

Table 10: Number of full word and morpheme tags.

7 Comparison to Finite-State Morphology

A morphological finite-state analyzer is customarily a hand-crafted tool that generates all the possible morphological readings with their associated features. We believe that, for many applications, high-quality finite-state morphological analysis is superior to Chipmunk. Finite-state morphological analyzers output a small set of linguistically valid analyses of a type, typically with only limited overgeneration. However, finite-state morphological analyzers have two limitations. The first is that significant effort is required to develop the transducers that model the morphological grammar, and creating and updating the lexicon is laborious. The second is that it is difficult to use a finite-state analyzer to guess analyses involving roots not covered in the lexicon.¹⁶ In fact, this is usually solved by viewing it as a different problem, morphological guessing, where linguistic knowledge similar to the features we have presented is used to try to guess the POS and morphological analysis for types with no analysis in the finite-state analyzer.

¹⁶ While one can in theory put in wildcard root states, this does not work in practice due to overgeneration.

In contrast, our training procedure learns a probabilistic transducer, which is a soft version of the type of hand-engineered grammar that is used in finite-state analyzers. The 1-best labeled morphological segmentation our model produces offers a simple and clean representation which could be of use in many downstream applications. Furthermore, our model unifies analysis and guessing into a single simple framework. Nevertheless, finite-state morphologies are still extremely useful, high-precision tools. A primary goal of future work will be to use Chipmunk to attempt to induce higher-quality morphological processing systems.

8 Conclusion and Future Work

In this paper, we have presented labeled morphological segmentation (LMS), a new approach to morphological processing. LMS unifies three existing tasks in the literature: unlabeled morphological segmentation, stemming, and morphological tag classification. Our hierarchy of labeled morphological segmentation tagsets can be used to map the heterogeneous data in the six languages we work with to universal representations of different granularities. In future work, we plan to create gold standard segmentations in more languages using our annotation scheme.

We further presented Chipmunk, a semi-CRF-based model for LMS that allows for the integration of various linguistic features and consistently outperforms previously presented approaches to unlabeled morphological segmentation. An important extension of Chipmunk is embedding it in a context-sensitive POS tagger. Current state-of-the-art models only employ character-level $n$-gram features to model word-internal structure Müller et al. (2013). We have demonstrated that our structured approach outperforms this baseline. We leave this natural extension to future work.

Acknowledgments

We would like to thank Jason Eisner, Helmut Schmid, Özlem Çetinoğlu as well as the anonymous reviewers for their comments. This material is based upon work supported by a Fulbright fellowship awarded to the first author by the German–American Fulbright Commission and the National Science Foundation under Grant No. 1423276. The second author is a recipient of the Google Europe Fellowship in Natural Language Processing, and this research is supported by this Google Fellowship. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644402 (HimL) and the DFG grant Models of Morphosyntax for Statistical Machine Translation.

Retrospective

This version, prepared in April 2024, is a lightly edited version of the original CoNLL 2015 paper. The tables are now rendered with the booktabs package, and a base case was added to the semi-Markov recursion (Eq. 2). Finally, a few typos and infelicities in the writing were cleaned up. All in all, 9 years later, the first author at least still finds the computational analysis of morphology a challenging and unsolved problem.

References

Abdul-Mageed et al. (2012) Muhammad Abdul-Mageed, Sandra Kuebler, and Mona Diab. 2012. SAMAR: A system for subjectivity and sentiment analysis of Arabic social media. In Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, pages 19–28, Jeju, Korea. Association for Computational Linguistics.
Andrew (2006) Galen Andrew. 2006. A hybrid Markov/semi-Markov conditional random field for sequence segmentation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 465–472, Sydney, Australia. Association for Computational Linguistics.
Baayen et al. (1993) R. Harald Baayen, Richard Piepenbrock, and Leon Gulikers. 1993. The CELEX lexical database on CD-ROM. Technical report, Linguistic Data Consortium.
van den Bosch and Daelemans (1999) Antal van den Bosch and Walter Daelemans. 1999. Memory-based morphological analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 285–292, College Park, Maryland, USA. Association for Computational Linguistics.
Buchholz and Marsi (2006) Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 149–164, New York City. Association for Computational Linguistics.
Chrupała et al. (2008) Grzegorz Chrupała, Georgiana Dinu, and Josef van Genabith. 2008. Learning morphology with Morfette. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco. European Language Resources Association (ELRA).
Collins (2002) Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 1–8. Association for Computational Linguistics.
Creutz et al. (2007a) Mathias Creutz, Teemu Hirsimäki, Mikko Kurimo, Antti Puurula, Janne Pylkkönen, Vesa Siivola, Matti Varjokallio, Ebru Arisoy, Murat Saraçlar, and Andreas Stolcke. 2007a. Analysis of morph-based speech recognition and the modeling of out-of-vocabulary words across languages. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 380–387, Rochester, New York. Association for Computational Linguistics.
Creutz et al. (2007b) Mathias Creutz, Teemu Hirsimäki, Mikko Kurimo, Antti Puurula, Janne Pylkkönen, Vesa Siivola, Matti Varjokallio, Ebru Arisoy, Murat Saraçlar, and Andreas Stolcke. 2007b. Morph-based speech recognition and modeling of out-of-vocabulary words across languages. ACM Transactions on Speech and Language Processing.
Creutz and Lagus (2002) Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 21–30. Association for Computational Linguistics.
Creutz and Lagus (2004) Mathias Creutz and Krista Lagus. 2004. Induction of a simple morphology for highly-inflecting languages. In Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology, pages 43–51, Barcelona, Spain. Association for Computational Linguistics.
Dyer et al. (2008) Christopher Dyer, Smaranda Muresan, and Philip Resnik. 2008. Generalizing word lattice translation. In Proceedings of ACL-08: HLT, Columbus, Ohio. Association for Computational Linguistics.
Eyigöz et al. (2013) Elif Eyigöz, Daniel Gildea, and Kemal Oflazer. 2013. Simultaneous word-morpheme alignment for statistical machine translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 32–40, Atlanta, Georgia. Association for Computational Linguistics.
Goldsmith (2001) John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198.
Grönroos et al. (2014) Stig-Arne Grönroos, Sami Virpioja, Peter Smit, and Mikko Kurimo. 2014. Morfessor FlatCat: An HMM-based method for unsupervised and semi-supervised learning of morphology. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1177–1185, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
Harris (1995) Zellig Harris. 1995. From phoneme to morpheme. Language.
Kohonen et al. (2010a) Oskar Kohonen, Sami Virpioja, and Krista Lagus. 2010a. Semi-supervised learning of concatenative morphology. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 78–86, Uppsala, Sweden. Association for Computational Linguistics.
Kohonen et al. (2010b) Oskar Kohonen, Sami Virpioja, Laura Leppänen, and Krista Lagus. 2010b. Semi-supervised extensions to Morfessor baseline. In Proceedings of the Morpho Challenge Workshop.
Krisnawati and Schulz (2013) Lucia D. Krisnawati and Klaus U. Schulz. 2013. Plagiarism detection for Indonesian texts. In Proceedings of iiWAS.
Kurimo et al. (2010) Mikko Kurimo, Sami Virpioja, Ville Turunen, and Krista Lagus. 2010. Morpho Challenge 2005–2010: Evaluations and results. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 87–95, Uppsala, Sweden. Association for Computational Linguistics.
Lafferty et al. (2001) John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.
Larasati et al. (2011) Septina Dian Larasati, Vladislav Kuboň, and Daniel Zeman. 2011. Indonesian morphology tool (MorphInd): Towards an Indonesian corpus. In Systems and Frameworks for Computational Morphology. Springer.
Liu and Nocedal (1989) Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming.
Marsi et al. (2005) Erwin Marsi, Antal van den Bosch, and Abdelhadi Soudi. 2005. Memory-based morphological analysis generation and part-of-speech tagging of Arabic. In Proceedings of ACL Workshop: Computational Approaches to Semitic Languages.
Müller et al. (2013) Thomas Müller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient higher-order CRFs for morphological tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA. Association for Computational Linguistics.
Murphy (2002) Kevin P. Murphy. 2002. Hidden semi-Markov models (HSMMs). Technical report, Massachusetts Institute of Technology.
Narasimhan et al. (2014) Karthik Narasimhan, Damianos Karakos, Richard Schwartz, Stavros Tsakalidis, and Regina Barzilay. 2014. Morphological segmentation for keyword spotting. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 880–885, Doha, Qatar. Association for Computational Linguistics.
Ng and Low (2004) Hwee Tou Ng and Jin Kiat Low. 2004. Chinese part-of-speech tagging: One-at-a-time or all-at-once? Word-based or character-based? In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 277–284, Barcelona, Spain. Association for Computational Linguistics.
Poon et al. (2009) Hoifung Poon, Colin Cherry, and Kristina Toutanova. 2009. Unsupervised morphological segmentation with log-linear models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 209–217, Boulder, Colorado. Association for Computational Linguistics.
Porter (1980) Martin F. Porter. 1980. An algorithm for suffix stripping. Program.
Rabiner (1989) Lawrence Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.
Ruokolainen et al. (2013) Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2013. Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 29–37, Sofia, Bulgaria. Association for Computational Linguistics.
Ruokolainen et al. (2014) Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2014. Painless semi-supervised morphological segmentation using conditional random fields. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, Gothenburg, Sweden. Association for Computational Linguistics.
Sarawagi and Cohen (2004) Sunita Sarawagi and William W. Cohen. 2004. Semi-Markov conditional random fields for information extraction. In Advances in Neural Information Processing Systems, volume 17. MIT Press.
Smith and Osborne (2006) Andrew Smith and Miles Osborne. 2006. Using gazetteers in discriminative information extraction. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 133–140, New York City. Association for Computational Linguistics.
Spiegler et al. (2010) Sebastian Spiegler, Andrew van der Spuy, and Peter A. Flach. 2010. Ukwabelana - an open-source morphological Zulu corpus. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 1020–1028, Beijing, China. Coling 2010 Organizing Committee.
Virpioja et al. (2010) Sami Virpioja, Oskar Kohonen, and Krista Lagus. 2010. Unsupervised morpheme analysis with Allomorfessor. In Multilingual Information Access Evaluation. Springer.
Wainwright and Jordan (2008) Martin J. Wainwright and Michael I. Jordan. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning.
Zhang and Clark (2008) Yue Zhang and Stephen Clark. 2008. Joint word segmentation and POS tagging using a single perceptron. In Proceedings of ACL-08: HLT, pages 888–896, Columbus, Ohio. Association for Computational Linguistics.
