\onecolumn
\section{Tree Extraction}\seclabel{ext}

\begin{algorithmic}[1]
\Function{Tree}{$x$, $y$}
  \State $i_s, i_e, j_s, j_e \leftarrow \text{lcs}(x, y)$
  \If{$i_e - i_s = 0$}
    \State \Return Sub($x$, $y$)
  \Else
    \State \Return (Tree($x_0^{i_s}$, $y_0^{j_s}$), $i_s$, Tree($x_{i_e}^{|x|}$, $y_{j_e}^{|y|}$), $|x| - i_e$)
  \EndIf
\EndFunction
\end{algorithmic}

Create a tree given a form-lemma pair $\langle x, y \rangle$. LCS returns the start and end indexes of the LCS (longest common substring) in $x$ and $y$. $x_{i_s}^{i_e}$ denotes the substring of $x$ starting at index $i_s$ (inclusive) and ending at index $i_e$ (exclusive). $i_e - i_s$ thus equals the length of this substring. $|x|$ denotes the length of $x$. Note that the tree does not store the LCS, but only the lengths of the prefix and suffix. This way the tree for \word{umgeschaut} can also be applied to transform \word{umgebaut} \gloss{renovated} into \word{umbauen} \gloss{to renovate}.

For the example \word{umgeschaut}-\word{umschauen}, the LCS is the stem \word{schau}. The function then recursively transforms \word{umge} into \word{um} and \word{t} into \word{en}. The prefix and suffix lengths of the form are $4$ and $1$, respectively. The left sub-node needs to transform \word{umge} into \word{um}. The new LCS is \word{um}. The new prefix and suffix lengths are $0$ and $2$, respectively. As the new prefix is empty, there is nothing more to do. The suffix node needs to transform \word{ge} into the empty string $\epsilon$. As the new LCS of the suffix is empty, because \word{ge} and $\epsilon$ have no character in common, the node is represented as a substitution node. The remaining transformation of \word{t} into \word{en} is also represented as a substitution, resulting in the tree in \figref{edit-tree}:
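The extraction procedure above can be sketched in Python as follows. This is a minimal illustration and not the original implementation: the tuple encoding of nodes (\texttt{("sub", x, y)} and \texttt{("lcs", left, prefix\_len, right, suffix\_len)}) and the function names \texttt{lcs} and \texttt{tree} are assumptions made for this example, and the LCS is taken to be the first longest common substring found by a standard dynamic program.

```python
def lcs(x, y):
    """Return (i_s, i_e, j_s, j_e): bounds of a longest common substring.

    x[i_s:i_e] == y[j_s:j_e]; all four indexes are 0 if nothing is shared.
    Ties are broken in favour of the first longest match found (assumption).
    """
    best_len, i_end, j_end = 0, 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common substring ending at x[i-2], y[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, i_end, j_end = cur[j], i, j
        prev = cur
    return i_end - best_len, i_end, j_end - best_len, j_end


def tree(x, y):
    """Build an edit tree for the form-lemma pair (x, y)."""
    i_s, i_e, j_s, j_e = lcs(x, y)
    if i_e - i_s == 0:
        # No common character: store the pair as a substitution node.
        return ("sub", x, y)
    # Store only prefix and suffix lengths of the form, not the LCS itself.
    return (
        "lcs",
        tree(x[:i_s], y[:j_s]),
        i_s,
        tree(x[i_e:], y[j_e:]),
        len(x) - i_e,
    )


if __name__ == "__main__":
    print(tree("umgeschaut", "umschauen"))
```

Because only lengths are stored, \texttt{tree("umgebaut", "umbauen")} yields the same tree as \texttt{tree("umgeschaut", "umschauen")} in this sketch, which mirrors the observation in the text.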

\includegraphics[width=0.40\textwidth]{figures/edit_tree}
\figlabel{edit-tree}

Figure \thefigure: Edit tree for the inflected form \word{umgeschaut} \gloss{looked around} and its lemma \word{umschauen} \gloss{to look around}. The right tree is the actual edit tree we use in our model; the left tree visualizes what each node corresponds to. Note how the root node stores the length of the prefix \word{umge} and the suffix \word{t}.

\section{Tree Application}\seclabel{app}

\begin{algorithmic}[1]
\Function{Apply}{$\tree$, $x$}
  \If{$\tree$ is an LCS node}
    \State $\tree \rightarrow \tree_i, i_l, \tree_j, j_l$
    \If{$|x| < i_l + j_l$} \Comment{Prefix and suffix do not fit.}
      \State \Return $\bot$
    \EndIf
    \State $p = $ Apply($\tree_i$, $x_0^{i_l}$) \Comment{Create prefix.}
    \If{$p$ is $\bot$} \Comment{Prefix tree cannot be applied.}
      \State \Return $\bot$
    \EndIf
    \State $s = $ Apply($\tree_j$, $x_{|x|-j_l}^{|x|}$) \Comment{Create suffix.}
    \If{$s$ is $\bot$} \Comment{Suffix tree cannot be applied.}
      \State \Return $\bot$
    \EndIf
    \State \Return $p + x_{i_l}^{|x|-j_l} + s$ \Comment{Concatenate prefix, LCS and suffix.}
  \Else \Comment{$\tree$ is a Sub node}
    \State $\tree \rightarrow u, v$
    \If{$x = u$} \Comment{If $x$ and $u$ match return $v$}
      \State \Return $v$
    \EndIf
    \State \Return $\bot$ \Comment{$\tree$ cannot be applied.}
  \EndIf
\EndFunction
\end{algorithmic}

In the code, $+$ represents string concatenation and $\bot$ a null string, meaning that the tree cannot be applied to the form. We first run the tree depicted in \figref{edit-tree} on the form \word{angebaut} \gloss{attached (to a building)}. The first node is an LCS node specifying that prefix and suffix should have length $4$ and $1$, respectively. We thus recursively apply the left child node to the prefix string \word{ange}. This is done by matching the length-two prefix \word{an} and deleting \word{ge}, yielding the intermediate result \word{anbaut}. We continue on the right side of the tree and replace \word{t} with \word{en}. This yields the final (correct) result \word{anbauen}. The application of the tree to the form \word{eingebaut} \gloss{installed} would fail, as we would try to substitute a \word{ge} in \word{eing}: the last two characters of this length-four prefix are \word{ng}, not \word{ge}.
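The application procedure above can be sketched in Python as follows. This is an illustrative sketch rather than the original implementation: the tuple encoding of nodes, the name \texttt{apply\_tree}, and the hand-built constant \texttt{EDIT\_TREE} (our transcription of the tree described for \word{umgeschaut}-\word{umschauen}) are assumptions of this example, and \texttt{None} plays the role of $\bot$.

```python
# Hand-built edit tree for umgeschaut -> umschauen (assumed encoding):
#   ("lcs", left_subtree, prefix_len, right_subtree, suffix_len)
#   ("sub", u, v) replaces exactly the string u by v
EDIT_TREE = (
    "lcs",
    ("lcs", ("sub", "", ""), 0, ("sub", "ge", ""), 2),  # umge -> um
    4,
    ("sub", "t", "en"),  # t -> en
    1,
)


def apply_tree(tree, x):
    """Apply an edit tree to the form x; return None if it does not fit."""
    if tree[0] == "lcs":
        _, left, i_l, right, j_l = tree
        if len(x) < i_l + j_l:
            return None  # prefix and suffix do not fit into x
        p = apply_tree(left, x[:i_l])  # create prefix
        if p is None:
            return None  # prefix tree cannot be applied
        s = apply_tree(right, x[len(x) - j_l:])  # create suffix
        if s is None:
            return None  # suffix tree cannot be applied
        # Concatenate new prefix, untouched middle part and new suffix.
        return p + x[i_l:len(x) - j_l] + s
    _, u, v = tree
    return v if x == u else None  # substitution node


if __name__ == "__main__":
    for form in ("umgeschaut", "angebaut", "eingebaut"):
        print(form, "->", apply_tree(EDIT_TREE, form))
```

In this sketch the suffix is sliced as \texttt{x[len(x) - j\_l:]} rather than \texttt{x[-j\_l:]}, since the latter would return the whole string when the suffix length is zero.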

\section{Development Results}\seclabel{dev}

\begin{tabular}{lll*{12}{r}}
 & & & \multicolumn{2}{c}{cs} & \multicolumn{2}{c}{de} & \multicolumn{2}{c}{en} & \multicolumn{2}{c}{es} & \multicolumn{2}{c}{hu} & \multicolumn{2}{c}{la} \\
 & & & all & unk & all & unk & all & unk & all & unk & all & unk & all & unk \\
\multirow{15}{*}{\begin{sideways}Baselines\end{sideways}} & \sw{simple} & tag & 86.08 & 69.83 & 79.05 & 56.02 & 94.72 & 87.95 & 96.29 & 87.21 & 94.37 & 85.20 & 83.86 & 57.77 \\
 & \sw{morfette} & lemma & 87.22 & 32.95 & 93.27 & 62.15 & 97.60 & 75.64 & 92.92 & 59.68 & 86.09 & 42.02 & 85.19 & 14.06 \\
 & & joint & 77.18 & 22.56 & 75.60 & 36.43 & 93.15 & 67.58 & 90.28 & 54.00 & 82.99 & 37.03 & 75.00 & 5.39 \\
 & \sw{simple} & tag & \ul{89.82} & \ul{76.83} & \ul{85.05} & \ul{65.22} & \ul{\tb{95.71}} & \ul{\tb{90.29}} & \ul{96.83} & \ul{89.00} & \ul{\tb{95.46}} & \ul{88.17} & \ul{86.35} & \ul{65.21} \\
 & \marmot & lemma & 87.36 & 32.95 & 93.28 & 62.15 & 97.66 & 75.64 & 92.99 & 59.68 & 86.11 & 42.02 & 85.35 & 14.06 \\
 & & joint & 79.99 & 23.98 & 80.71 & 40.26 & 94.09 & 69.48 & 90.75 & 54.66 & 83.47 & 37.16 & 76.58 & 6.61 \\
 & JCK & tag & 86.08 & 69.83 & 79.05 & 56.02 & 94.72 & 87.95 & 96.29 & 87.21 & 94.37 & 85.20 & 83.86 & 57.77 \\
 & \sw{morfette} & lemma & 96.24 & 82.59 & 97.67 & 88.80 & 98.71 & 92.50 & 97.61 & 86.76 & 97.48 & 91.16 & \ul{93.26} & \ul{63.09} \\
 & & joint & 84.32 & 61.67 & 78.10 & 51.42 & 94.43 & 84.61 & 94.67 & 78.39 & 93.15 & 80.73 & 81.34 & 43.71 \\
 & JCK & tag & \ul{89.82} & \ul{76.83} & \ul{85.05} & \ul{65.22} & \ul{\tb{95.71}} & \ul{\tb{90.29}} & \ul{96.83} & \ul{89.00} & \ul{\tb{95.46}} & \ul{88.17} & \ul{86.35} & \ul{65.21} \\
 & \marmot & lemma & \ul{96.26} & \ul{82.88} & \ul{97.74} & 89.11 & \ul{98.81} & \ul{93.06} & 97.78 & 87.10 & \ul{97.68} & \ul{91.91} & 93.06 & 62.64 \\
 & & joint & \ul{88.17} & \ul{68.43} & \ul{84.16} & \ul{60.39} & \ul{95.41} & \ul{86.95} & \ul{95.37} & 80.25 & \ul{94.33} & \ul{83.70} & \ul{83.57} & \ul{48.84} \\
 & \sw{morfette} & tag & 86.08 & 69.83 & 79.05 & 56.02 & 94.72 & 87.95 & 96.29 & 87.21 & 94.37 & 85.20 & 83.86 & 57.77 \\
 & \sw{morfette} & lemma & 96.25 & 82.54 & 97.12 & \ul{89.90} & 98.43 & 92.54 & \ul{97.97} & \ul{89.94} & 97.22 & 90.04 & 91.89 & 55.13 \\
 & & joint & 84.39 & 61.94 & 77.60 & 52.87 & 94.16 & 84.70 & 95.09 & \ul{81.95} & 93.11 & 80.60 & 80.09 & 36.39 \\
\multirow{27}{*}{\begin{sideways}\software{lemming}\end{sideways}} & edit tree & tag & 86.08 & 69.83 & 79.05 & 56.02 & 94.72 & 87.95 & 96.29 & 87.21 & 94.37 & 85.20 & 83.86 & 57.77 \\
 & \sw{morfette} & lemma & 96.29 & 82.93 & 97.84 & 89.78 & 98.71 & 92.63 & 97.91 & 89.00 & 97.31 & 90.49 & 93.00 & 61.68 \\
 & & joint & 84.45 & 62.36 & 78.32 & 52.75 & 94.43 & 84.79 & 94.97 & 80.68 & 93.08 & 80.48 & 80.92 & 41.27 \\
 & align & tag & 86.08 & 69.83 & 79.05 & 56.02 & 94.72 & 87.95 & 96.29 & 87.21 & 94.37 & 85.20 & 83.86 & 57.77 \\
 & \sw{morfette} & lemma & 96.74 & 85.38 & 98.17 & 91.61 & 98.76 & 93.11 & 98.05 & 90.13 & 97.70 & 92.15 & 93.76 & 66.30 \\
 & & joint & 84.72 & 63.88 & 78.47 & 53.52 & 94.47 & 85.09 & 95.05 & 81.29 & 93.31 & 81.51 & 81.49 & 44.67 \\
 & dict & tag & 86.08 & 69.83 & 79.05 & 56.02 & 94.72 & 87.95 & 96.29 & 87.21 & 94.37 & 85.20 & 83.86 & 57.77 \\
 & \sw{morfette} & lemma & 97.50 & 89.38 & 98.36 & 92.66 & 98.84 & 94.02 & 98.39 & 92.56 & 97.98 & 93.30 & 94.64 & 71.57 \\
 & & joint & 85.06 & 65.60 & 78.53 & 53.91 & 94.53 & 85.83 & 95.27 & 82.89 & 93.46 & 82.10 & 81.87 & 46.92 \\
 & morph & tag & 86.08 & 69.83 & 79.05 & 56.02 & \NA & \NA & 96.29 & 87.21 & 94.37 & 85.20 & 83.86 & 57.77 \\
 & \sw{morfette} & lemma & 96.59 & 87.28 & 97.43 & 89.96 & \NA & \NA & 98.46 & 92.98 & 97.77 & 92.62 & 93.60 & 66.30 \\
 & & joint & 85.64 & 67.73 & 78.87 & 55.17 & \NA & \NA & 95.63 & 84.69 & 94.10 & 84.02 & 82.29 & 48.72 \\
 & edit tree & tag & 89.82 & 76.83 & 85.05 & 65.22 & \tb{95.71} & \tb{90.29} & 96.83 & 89.00 & \tb{95.46} & 88.17 & 86.35 & 65.21 \\
 & \marmot & lemma & 96.33 & 83.04 & 97.93 & 90.05 & 98.82 & 93.02 & 98.14 & 89.61 & 97.53 & 91.22 & 92.85 & 60.98 \\
 & & joint & 88.23 & 68.85 & 84.41 & 61.89 & 95.40 & 87.00 & 95.75 & 83.06 & 94.30 & 83.58 & 83.20 & 46.73 \\
 & align & tag & 89.82 & 76.83 & 85.05 & 65.22 & \tb{95.71} & \tb{90.29} & 96.83 & 89.00 & \tb{95.46} & 88.17 & 86.35 & 65.21 \\
 & \marmot & lemma & 96.80 & 85.72 & 98.24 & 91.94 & 98.88 & 93.71 & 98.27 & 90.75 & 97.91 & 92.91 & 93.50 & 64.89 \\
 & & joint & 88.57 & 70.69 & 84.59 & 62.95 & 95.45 & 87.47 & 95.81 & 83.64 & 94.52 & 84.60 & 83.83 & 50.32 \\
 & dict & tag & 89.82 & 76.83 & 85.05 & 65.22 & \tb{95.71} & \tb{90.29} & 96.83 & 89.00 & \tb{95.46} & 88.17 & 86.35 & 65.21 \\
 & \marmot & lemma & 97.62 & 89.88 & 98.43 & 92.95 & 98.94 & \tb{94.28} & 98.62 & 93.23 & 98.23 & 94.16 & 94.49 & 70.60 \\
 & & joint & 88.97 & 72.79 & 84.63 & 63.16 & \tb{95.50} & \tb{88.08} & 96.02 & 85.24 & 94.73 & 85.46 & 84.41 & 53.79 \\
 & morph & tag & 89.82 & 76.83 & 85.05 & 65.22 & \NA & \NA & 96.83 & 89.00 & \tb{95.46} & 88.17 & 86.35 & 65.21 \\
 & \marmot & lemma & 97.38 & 89.55 & 98.24 & 92.42 & \NA & \NA & \tb{98.71} & 93.97 & 98.30 & 94.38 & 94.39 & 70.54 \\
 & & joint & 89.30 & 74.32 & 84.81 & 64.08 & \NA & \NA & 96.17 & 86.29 & 95.15 & 86.87 & 84.62 & 55.07 \\
 & dict & tag & \tb{90.47} & 78.37 & 85.31 & 65.94 & 95.66 & 88.60 & 96.94 & 89.25 & 95.29 & 87.55 & \tb{86.77} & 65.98 \\
 & joint & lemma & 98.34 & 92.96 & 98.66 & 94.07 & \tb{99.00} & 94.23 & 98.68 & 94.08 & 98.53 & 95.46 & \tb{96.68} & \tb{82.99} \\
 & & joint & 89.82 & 75.47 & 84.94 & 64.19 & 95.48 & 86.52 & 96.11 & 85.43 & 94.71 & 85.40 & 85.85 & 60.98 \\
 & morph & tag & 90.35 & \tb{79.61} & \tb{85.52} & \tb{67.69} & 95.66 & 88.60 & \tb{96.95} & \tb{89.55} & \tb{95.46} & \tb{88.85} & 86.70 & \tb{66.82} \\
 & joint & lemma & \tb{98.55} & \tb{94.12} & \tb{98.74} & \tb{94.47} & \tb{99.00} & 94.23 & 98.68 & \tb{94.53} & \tb{98.57} & \tb{95.65} & 96.51 & 82.22 \\
 & & joint & \tb{90.05} & \tb{78.37} & \tb{85.28} & \tb{66.58} & 95.48 & 86.52 & \tb{96.20} & \tb{86.51} & \tb{95.30} & \tb{88.22} & \tb{85.97} & \tb{62.97} \\
\end{tabular}

Table A1: Development accuracies for the baselines, the different pipeline versions, and a joint version of \software{lemming}. The two numbers in each cell are the general token accuracy (all) and the token accuracy on unknown forms (unk). Each cell specifies either a baseline $\in \{$baseline, JCK, \sw{Morfette}$\}$ or a \sw{Lemming} feature set $\in \{$edit tree, align, dict, morph$\}$ and a tagger $\in \{$\sw{Morfette}, \marmot, joint$\}$.

\section{Test Results}\seclabel{tst}

\begin{tabular}{lll*{12}{r}}
 & & & \multicolumn{2}{c}{cs} & \multicolumn{2}{c}{de} & \multicolumn{2}{c}{en} & \multicolumn{2}{c}{es} & \multicolumn{2}{c}{hu} & \multicolumn{2}{c}{la} \\
 & & & all & unk & all & unk & all & unk & all & unk & all & unk & all & unk \\
\multirow{15}{*}{\begin{sideways}Baselines\end{sideways}} & \sw{simple} & tag & 86.00 & 68.88 & 76.80 & 52.86 & 95.35 & 87.58 & 96.30 & 87.65 & 92.33 & 81.32 & 79.70 & 47.06 \\
 & \sw{morfette} & lemma & 86.81 & 31.16 & 91.42 & 58.68 & 97.88 & 78.95 & 92.86 & 59.34 & 84.67 & 38.95 & 83.60 & 20.43 \\
 & & joint & 76.96 & 21.22 & 72.83 & 33.27 & 93.82 & 68.98 & 90.14 & 53.45 & 80.41 & 33.08 & 71.62 & 7.52 \\
 & \sw{simple} & tag & \ul{89.75} & \ul{76.83} & \ul{82.81} & \ul{61.60} & \ul{\tb{96.45}} & \ul{\tb{90.68}} & \ul{97.05} & \ul{90.07} & \ul{93.64} & \ul{84.65} & \ul{82.37} & \ul{53.73} \\
 & \marmot & lemma & 86.94 & 31.16 & 91.48 & 58.68 & 97.92 & 78.95 & 92.92 & 59.34 & 84.73 & 38.95 & 83.80 & 20.43 \\
 & & joint & 79.66 & 22.86 & 77.78 & 36.49 & 94.87 & 71.62 & 90.81 & 54.98 & 81.06 & 33.27 & 73.51 & 8.58 \\
 & JCK & tag & 86.00 & 68.88 & 76.80 & 52.86 & 95.35 & 87.58 & 96.30 & 87.65 & 92.33 & 81.32 & 79.70 & 47.06 \\
 & \sw{morfette} & lemma & 95.87 & 80.79 & 96.50 & 85.54 & 98.95 & 93.76 & 97.58 & 86.80 & 96.52 & 88.09 & \ul{90.84} & 58.13 \\
 & & joint & 84.00 & 59.46 & 75.60 & 47.77 & 95.09 & 84.34 & 94.54 & 77.97 & 90.71 & 75.62 & 76.76 & 33.56 \\
 & JCK & tag & \ul{89.75} & \ul{76.83} & \ul{82.81} & \ul{61.60} & \ul{\tb{96.45}} & \ul{\tb{90.68}} & \ul{97.05} & \ul{90.07} & \ul{93.64} & \ul{84.65} & \ul{82.37} & \ul{53.73} \\
 & \marmot & lemma & \ul{95.95} & \ul{81.28} & \ul{96.63} & 85.84 & \ul{99.08} & \ul{94.28} & 97.69 & 87.19 & \ul{96.69} & \ul{88.66} & 90.79 & \ul{58.23} \\
 & & joint & \ul{87.85} & \ul{67.00} & \ul{81.60} & \ul{55.97} & \ul{96.17} & \ul{87.32} & \ul{95.44} & 80.62 & \ul{92.15} & \ul{78.89} & \ul{79.51} & \ul{39.07} \\
 & \sw{morfette} & tag & 86.00 & 68.88 & 76.80 & 52.86 & 95.35 & 87.58 & 96.30 & 87.65 & 92.33 & 81.32 & 79.70 & 47.06 \\
 & \sw{morfette} & lemma & 95.88 & 80.70 & 95.88 & \ul{87.55} & 98.73 & 93.56 & \ul{98.02} & \ul{90.16} & 96.06 & 86.21 & 90.08 & 54.16 \\
 & & joint & 84.10 & 59.92 & 74.87 & 49.22 & 94.89 & 84.44 & 95.03 & \ul{81.68} & 90.60 & 75.11 & 76.57 & 32.50 \\
\multirow{27}{*}{\begin{sideways}\software{lemming}\end{sideways}} & edit tree & tag & 86.00 & 68.88 & 76.80 & 52.86 & 95.35 & 87.58 & 96.30 & 87.65 & 92.33 & 81.32 & 79.70 & 47.06 \\
 & \sw{morfette} & lemma & 95.91 & 81.07 & 96.82 & 87.31 & 98.95 & 93.69 & 97.87 & 88.97 & 96.15 & 86.68 & 91.20 & 59.87 \\
 & & joint & 84.15 & 60.27 & 75.86 & 49.22 & 95.08 & 84.37 & 94.86 & 80.38 & 90.50 & 74.86 & 77.00 & 34.73 \\
 & align & tag & 86.00 & 68.88 & 76.80 & 52.86 & 95.35 & 87.58 & 96.30 & 87.65 & 92.33 & 81.32 & 79.70 & 47.06 \\
 & \sw{morfette} & lemma & 96.43 & 83.83 & 97.22 & 89.25 & 98.99 & 94.15 & 98.04 & 90.28 & 96.87 & 89.72 & 91.64 & 62.36 \\
 & & joint & 84.47 & 61.97 & 76.05 & 50.07 & 95.11 & 84.73 & 94.98 & 81.30 & 90.94 & 76.76 & 77.18 & 35.79 \\
 & dict & tag & 86.00 & 68.88 & 76.80 & 52.86 & 95.35 & 87.58 & 96.30 & 87.65 & 92.33 & 81.32 & 79.70 & 47.06 \\
 & \sw{morfette} & lemma & 97.24 & 88.13 & 97.58 & 91.03 & 99.07 & 95.06 & 98.32 & 92.36 & 97.28 & 91.32 & 92.97 & 69.14 \\
 & & joint & 84.88 & 64.14 & 76.13 & 50.48 & 95.19 & 85.55 & 95.19 & 82.91 & 91.14 & 77.48 & 77.52 & 37.59 \\
 & morph & tag & 86.00 & 68.88 & 76.80 & 52.86 & \NA & \NA & 96.30 & 87.65 & 92.33 & 81.32 & 79.70 & 47.06 \\
 & \sw{morfette} & lemma & 96.52 & 86.69 & 96.57 & 88.26 & \NA & \NA & 98.53 & 93.73 & 96.70 & 89.30 & 91.53 & 62.31 \\
 & & joint & 85.57 & 66.73 & 76.56 & 51.90 & \NA & \NA & 95.63 & 85.02 & 91.92 & 79.73 & 78.18 & 39.65 \\
 & edit tree & tag & 89.75 & 76.83 & 82.81 & 61.60 & \tb{96.45} & \tb{90.68} & 97.05 & 90.07 & 93.64 & 84.65 & 82.37 & 53.73 \\
 & \marmot & lemma & 96.05 & 81.68 & 96.99 & 87.69 & 99.08 & 94.25 & 98.00 & 89.56 & 96.38 & 87.54 & 91.37 & 60.88 \\
 & & joint & 88.03 & 68.04 & 81.96 & 57.91 & 96.17 & 87.38 & 95.80 & 83.42 & 91.93 & 78.15 & 79.75 & 40.34 \\
 & align & tag & 89.75 & 76.83 & 82.81 & 61.60 & \tb{96.45} & \tb{90.68} & 97.05 & 90.07 & 93.64 & 84.65 & 82.37 & 53.73 \\
 & \marmot & lemma & 96.56 & 84.49 & 97.38 & 89.68 & 99.12 & 94.51 & 98.17 & 90.89 & 97.11 & 90.61 & 91.80 & 63.26 \\
 & & joint & 88.41 & 70.03 & 82.20 & 59.02 & 96.19 & 87.61 & 95.94 & 84.38 & 92.41 & 80.14 & 80.03 & 41.87 \\
 & dict & tag & 89.75 & 76.83 & 82.81 & 61.60 & \tb{96.45} & \tb{90.68} & 97.05 & 90.07 & 93.64 & 84.65 & 82.37 & 53.73 \\
 & \marmot & lemma & 97.46 & 89.14 & 97.70 & 91.27 & \tb{99.21} & \tb{95.59} & 98.48 & 92.98 & 97.53 & 92.10 & 93.07 & 69.83 \\
 & & joint & 88.86 & 72.51 & 82.27 & 59.42 & \tb{96.27} & \tb{88.49} & 96.12 & 85.80 & 92.59 & 80.77 & 80.49 & 44.26 \\
 & morph & tag & 89.75 & 76.83 & 82.81 & 61.60 & \NA & \NA & 97.05 & 90.07 & 93.64 & 84.65 & 82.37 & 53.73 \\
 & \marmot & lemma & 97.29 & 88.98 & 97.51 & 90.85 & \NA & \NA & 98.68 & 94.32 & 97.53 & 92.15 & 92.54 & 67.81 \\
 & & joint & 89.23 & 74.24 & 82.49 & 60.42 & \NA & \NA & 96.35 & 87.25 & 93.11 & 82.56 & 80.67 & 45.21 \\
 & dict & tag & \tb{90.34} & 78.47 & \tb{83.10} & 62.36 & 96.32 & 89.70 & 97.11 & 90.13 & 93.64 & 84.78 & 82.89 & 54.69 \\
 & joint & lemma & 98.27 & 92.67 & \tb{98.10} & 92.79 & \tb{99.21} & 95.23 & 98.67 & 94.07 & 98.02 & 94.15 & \tb{95.58} & \tb{81.74} \\
 & & joint & 89.69 & 75.44 & 82.64 & 60.49 & 96.17 & 87.87 & 96.23 & 86.19 & 92.84 & 81.89 & 81.92 & 49.97 \\
 & morph & tag & 90.20 & \tb{79.72} & \tb{83.10} & \tb{63.10} & 96.32 & 89.70 & \tb{97.17} & \tb{90.66} & \tb{93.67} & \tb{85.12} & \tb{83.49} & \tb{58.76} \\
 & joint & lemma & \tb{98.42} & \tb{93.46} & \tb{98.10} & \tb{93.02} & \tb{99.21} & 95.23 & \tb{98.78} & \tb{94.86} & \tb{98.08} & \tb{94.26} & 95.36 & 80.94 \\
 & & joint & \tb{89.90} & \tb{78.34} & \tb{82.84} & \tb{62.10} & 96.17 & 87.87 & \tb{96.41} & \tb{87.47} & \tb{93.40} & \tb{84.15} & \tb{82.57} & \tb{54.63} \\
\end{tabular}

Table A2: Test accuracies for the baselines, the different pipeline versions, and a joint version of \software{lemming}. The two numbers in each cell are the general token accuracy (all) and the token accuracy on unknown forms (unk). Each cell specifies either a baseline $\in \{$baseline, JCK, \sw{Morfette}$\}$ or a \sw{Lemming} feature set $\in \{$edit tree, align, dict, morph$\}$ and a tagger $\in \{$\sw{Morfette}, \marmot, joint$\}$.
\begin{tabular}{l*{5}{r}}
 & cs & de & es & hu & la \\
+dict & 98.35 & 99.04 & 98.92 & 98.83 & 96.14 \\
+morph & \tb{99.03} & \tb{99.47} & \tb{99.09} & \tb{99.41} & \tb{96.73} \\
\end{tabular}

Table A3: Development accuracies for \sw{Lemming} with and without morphological attributes using gold tags.
\tablabel{gold}

\begin{tabular}{l*{12}{r}}
 & \multicolumn{4}{c}{train} & \multicolumn{4}{c}{dev} & \multicolumn{4}{c}{test} \\
 & sent & token & pos & morph & sent & token & form unk & lemma unk & sent & token & form unk & lemma unk \\
cs & 5979 & 100012 & 12 & 266 & 5228 & 87988 & 19.86 & 9.79 & 4213 & 70348 & 19.89 & 9.73 \\
de & 5662 & 100009 & 51 & 204 & 5000 & 76704 & 17.11 & 13.53 & 5000 & 92004 & 19.51 & 15.64 \\
en & 4028 & 100012 & 46 & & 1336 & 32092 & 8.06 & 6.16 & 1640 & 39590 & 8.52 & 6.56 \\
es & 3431 & 100027 & 12 & 226 & 1655 & 50368 & 13.37 & 8.85 & 1725 & 50630 & 13.41 & 9.16 \\
hu & 4390 & 100014 & 22 & 572 & 1051 & 29989 & 23.51 & 14.36 & 1009 & 19908 & 24.80 & 14.64 \\
la & 7122 & 59992 & 23 & 474 & 890 & 9475 & 17.59 & 6.43 & 891 & 9922 & 19.82 & 7.56 \\
\end{tabular}

Table A4: Dataset statistics: number of sentences (sent), tokens (token), POS tags (pos), and morphological tags (morph), together with the token-based unknown form (form unk) and unknown lemma (lemma unk) rates.
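As a rough illustration of the token-based unknown rates reported in Table A4, the following Python sketch counts the share of evaluation tokens whose form (or lemma) does not occur in the training data. This reading of the rates as percentages of evaluation tokens is our assumption, and the function name \texttt{unknown\_rate} and the toy word lists are hypothetical.

```python
def unknown_rate(train_items, eval_items):
    """Percentage of evaluation tokens whose item is unseen in training.

    Token-based: every evaluation token counts, including repeats.
    The interpretation as a percentage is an assumption of this sketch.
    """
    if not eval_items:
        raise ValueError("eval_items must not be empty")
    seen = set(train_items)
    unknown = sum(1 for item in eval_items if item not in seen)
    return 100.0 * unknown / len(eval_items)


if __name__ == "__main__":
    # Toy data, not taken from the paper's corpora.
    train_forms = ["der", "Hund", "bellt"]
    dev_forms = ["der", "Hund", "schläft", "der"]
    print(unknown_rate(train_forms, dev_forms))
```

The same function can be called on lemma sequences instead of form sequences to obtain the corresponding lemma rate.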