Abstracts
- Invited Talk: Beyond Morphology – Pattern Matching with FST
Lauri Karttunen, Stanford University
FST stands for “Finite-State Toolkit.” It is an enhanced version of the XFST tool described in the 2003 Beesley and Karttunen book “Finite State Morphology.” Like XFST, FST serves two purposes. It is a development tool for compiling finite-state networks and a runtime tool that applies networks to input strings or files. XFST is limited to morphological analysis and generation. FST can also be used for other applications. This article describes the new features of the FST regular expression formalism and illustrates their use for named-entity recognition, relation extraction, tokenization, and parsing. The FST pattern matching algorithm ('pmatch') operates on a single pattern network but the network can be a union of any number of distinct pattern definitions. Any number of patterns can be matched simultaneously in one pass over a text. This is a distinct FST advantage over pattern matching facilities in languages such as Perl and Python.
- A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer
Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi and Josef Van Genabith (Ireland)
We build a large-scale, corpus-based lexical database that is representative of Modern Standard Arabic (MSA). We use a corpus of 1,089,111,204 words, a pre-annotation tool, machine learning techniques, and knowledge-based templatic matching to automatically acquire lexical knowledge about morpho-syntactic attributes and inflection paradigms of each entry. Our lexical database is scalable, interoperable and suitable for constructing a morphological analyser, regardless of the design approach and programming language used. The database is formatted according to the international ISO standard in lexical resource representation, Lexical Markup Framework (LMF). This lexical database is used in developing an open-source finite-state morphological processing toolkit. We build a system, AraComLex (Arabic Computer Lexicon), for managing and maintaining the standardized and scalable lexical database.
- A user-oriented approach to Evaluation and Documentation of a Morphological Analyser
Gertrud Faaß (Germany)
This article describes a user-oriented approach to evaluate and
extensively document a morphological analyser with a view to the normative
descriptions of ISO and EAGLES. While current state-of-the-art work in
this field usually describes task-based evaluation, our users
(presumably NLP non-experts who use the tool anonymously as part
of a web service) expect full documentation of the tool itself, the
test suite that was used to validate it, and the results of the validation
process. ISO and EAGLES offer a good starting point when attempting to
find the attributes that are to be evaluated. The documentation
introduced in this article describes the analyser in a way comparable to
others by defining its features as attribute-value pairs (encoded in
DocBook XML). Furthermore, the evaluation itself and its results are
described. All documentation is online and can be used as a template for
similar tools (reference deleted).
- Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus
Septina Dian Larasati, Daniel Zeman and Vladislav Kubon (Czech Republic)
This paper describes a robust finite-state morphology tool for
Indonesian (MorphInd), which handles both morphological analysis and
lemmatization for a given surface word form so that it is suitable for
further language processing. MorphInd has wider coverage of
Indonesian derivational and inflectional morphology than an
existing Indonesian morphological analyzer, along with a more
detailed tagset. MorphInd outputs the analysis in the form of segmented
morphemes along with the morphological tags. The implementation uses
finite-state technology, adopting the two-level morphology
approach implemented in FOMA. It achieved 84.6% coverage on a
preliminary-stage Indonesian corpus; as initially expected, it mostly fails on proper nouns and foreign words.
- HFST – Framework for Compiling and Applying Morphologies
Krister Lindén, Erik Axelson, Sam Hardwick, Tommi Pirinen and Miikka Silfverberg (Finland)
HFST (Helsinki Finite-State Technology) is a framework for compiling and
applying linguistic descriptions with finite-state methods. HFST
currently collects some of the most important finite-state tools for
creating morphologies and spellers into one open-source platform and
supports extending and improving the descriptions with weights to
accommodate the modeling of statistical information. HFST offers a path
from language descriptions to efficient language applications in key
environments and operating systems. HFST also provides an opportunity to
exchange transducers between different software providers in order to
get the best out of each library.
- Morphology to the Rescue Redux: Resolving Borrowings and Code-mixing in Machine Translation
Esme Manandise and Claudia Gdaniec (United States, Germany)
In the IBM LMT machine translation system, derivational morphological
rules recognize and analyze words that are not found in its source
lexicons, and generate default transfers for these unlisted words.
Unfound words with no inflectional or derivational affixes are by
default nouns. These rules are now expanded to provide lexical coverage
of a particular set of words created on the fly in emails by bilingual
Spanish-English speakers. What characterizes the approach is the
generation of additional default parts-of-speech, and the use of
morphological, semantic, and syntactic features from both source and
target lexicons for analysis and transfer. A built-in rule-based
strategy to handle language borrowing and code-mixing allows for the
recognition of words with variable and unpredictable frequency of
occurrence, which would remain otherwise unfound, thus affecting the
accuracy of parsing and the quality of translation output.
- Maximum Entropy Model for Disambiguation of Rich Morphological Tags
Mārcis Pinnis and Kārlis Goba (Latvia)
In this work we describe a statistical morphological tagger for Latvian,
Lithuanian and Estonian languages based on morphological tag
disambiguation. These languages have rich tagsets and very high rates of
morphological ambiguity. We model the distribution of possible tags with an
exponential probabilistic model, which allows features from the
surrounding context to be selected and used. Results show significant improvement
in error rates over baseline, agreeing with results for Czech. In
comparison with the simplified parameter estimation method applied for
Czech, we show that maximum entropy weight estimation achieves
considerably better results.
- Non-canonical inflection: data, formalisation and complexity measures
Benoît Sagot and Géraldine Walther (France)
Non-canonical inflection (suppletion, deponency, heteroclisis…) is
extensively studied in formal morphology. However, these studies often
lack practical implementations associated with large-scale lexica. Yet
these are precisely the requirements for objective comparative studies
on the complexity of morphological descriptions. We introduce a model of
inflectional morphology which can represent many non-canonical
phenomena, as well as a formalisation and an implementation thereof. We
illustrate it with data about French, Latin, Italian, Persian and Sorani
Kurdish verbs and about noun classes from Croatian and Slovak. In
particular, we evaluate the complexity of four competing descriptions of
French verbal inflection using the information-theoretic concept of
description length and show that the new concepts introduced in our
model reduce the complexity of morphological modelisations with respect to
traditional or more recent models.
- Morphology generation for Swiss German dialects
Yves Scherrer (Switzerland)
We present a morphology generation system that derives inflected Swiss
German dialect forms from Standard German input. Besides generation of
inflectional affixes, this system also deals with the phonetic
adaptation of cognate stems, and with lexical substitution of
non-cognate stems. Most of its rules are parametrized by probability
maps extracted from dialectological atlases, thereby providing a large
dialectal coverage.
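The abstract above says only that most rules are parametrized by probability maps. As a rough illustration of what that could mean, the following minimal Python sketch attaches a per-region probability to each candidate realisation of a suffix and picks the most probable one for a given region. Everything in it is an assumption made for illustration: the `Variant` class, the `generate` function, the region names and the probability values are hypothetical and are not taken from the paper or from any dialect atlas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One candidate dialectal realisation of a Standard German suffix."""
    suffix: str
    prob_by_region: dict  # hypothetical region name -> probability


# Toy rule for the Standard German infinitive ending "-en".
# The regions and probabilities are invented for this sketch.
INFINITIVE_RULE = (
    Variant("e", {"region_A": 0.9, "region_B": 0.2}),
    Variant("ä", {"region_A": 0.1, "region_B": 0.8}),
)


def generate(stem: str, rule, region: str) -> str:
    """Attach the variant with the highest probability in the given region."""
    best = max(rule, key=lambda v: v.prob_by_region.get(region, 0.0))
    return stem + best.suffix


print(generate("mach", INFINITIVE_RULE, "region_A"))  # mache
print(generate("mach", INFINITIVE_RULE, "region_B"))  # machä
```

Choosing the single most probable variant is only one option; a system could equally sample from the map or return all variants with their probabilities.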