FST stands for “Finite-State Toolkit.” It is an enhanced version of the XFST tool described in the 2003 Beesley and Karttunen book “Finite State Morphology.” Like XFST, FST serves two purposes. It is a development tool for compiling finite-state networks and a runtime tool that applies networks to input strings or files. XFST is limited to morphological analysis and generation. FST can also be used for other applications. This article describes the new features of the FST regular expression formalism and illustrates their use for named-entity recognition, relation extraction, tokenization, and parsing. The FST pattern matching algorithm ('pmatch') operates on a single pattern network but the network can be a union of any number of distinct pattern definitions. Any number of patterns can be matched simultaneously in one pass over a text. This is a distinct FST advantage over pattern matching facilities in languages such as Perl and Python.
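The idea of matching a union of labeled patterns in one left-to-right pass can be illustrated with a small Python sketch. This is not the pmatch algorithm or the FST formalism: it assumes literal string patterns only, and the names `build_trie`, `match_all`, and the `TOOL`/`LANG` labels are hypothetical, chosen for illustration. A single trie stands in for the union network, and the scanner takes the longest match at each position.

```python
def build_trie(patterns):
    """Merge all labeled literal patterns into one trie (a stand-in for a union network)."""
    root = {}
    for label, phrases in patterns.items():
        for phrase in phrases:
            node = root
            for ch in phrase:
                node = node.setdefault(ch, {})
            node[None] = label  # the None key marks an accepting node and stores its label
    return root


def match_all(text, trie):
    """Scan the text once, left to right, keeping the longest match at each position."""
    results = []
    i = 0
    while i < len(text):
        node, j, best = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                best = (j, node[None])  # remember the longest accepting point so far
        if best is not None:
            end, label = best
            results.append((text[i:end], label))
            i = end  # resume after the match
        else:
            i += 1
    return results


# Hypothetical pattern set: two labeled definitions matched together.
PATTERNS = {
    "TOOL": ["FST", "XFST"],
    "LANG": ["Perl", "Python"],
}

if __name__ == "__main__":
    trie = build_trie(PATTERNS)
    for span, label in match_all("XFST and FST differ from Perl and Python.", trie):
        print(label, span)
```

Adding another pattern definition only extends the trie; the scanning loop is unchanged, which is the property the abstract attributes to matching against a single union network.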
We build a large-scale, corpus-based lexical database that is representative of Modern Standard Arabic (MSA). We use a corpus of 1,089,111,204 words, a pre-annotation tool, machine learning techniques, and knowledge-based templatic matching to automatically acquire lexical knowledge about morpho-syntactic attributes and inflection paradigms of each entry. Our lexical database is scalable, interoperable and suitable for constructing a morphological analyser, regardless of the design approach and programming language used. The database is formatted according to the international ISO standard in lexical resource representation, Lexical Markup Framework (LMF). This lexical database is used in developing an open-source finite-state morphological processing toolkit. We build a system, AraComLex (Arabic Computer Lexicon), for managing and maintaining the standardized and scalable lexical database.
This article describes a user-oriented approach to evaluating and extensively documenting a morphological analyser with reference to the normative descriptions of ISO and EAGLES. While current state-of-the-art work in this field usually describes task-based evaluation, our users (presumably non-experts in NLP who use the tool anonymously as part of a web service) expect full documentation of the tool itself, the test suite that was used to validate it, and the results of the validation process. ISO and EAGLES offer a good starting point for identifying the attributes to be evaluated. The documentation introduced in this article describes the analyser in a way comparable to others by defining its features as attribute-value pairs (encoded in DocBook XML). Furthermore, the evaluation itself and its results are described. All documentation is online and can be used as a template for similar tools (reference deleted).
This paper describes a robust finite-state morphology tool for Indonesian (MorphInd), which handles both morphological analysis and lemmatization of a given surface word form so that it is suitable for further language processing. MorphInd has wider coverage of Indonesian derivational and inflectional morphology than an existing Indonesian morphological analyzer, along with a more detailed tagset. MorphInd outputs the analysis in the form of segmented morphemes along with the morphological tags. The implementation uses finite-state technology, adopting the two-level morphology approach implemented in FOMA. It achieved 84.6% coverage on a preliminary-stage Indonesian corpus, where it mostly fails to capture proper nouns and foreign words, as initially expected.
HFST (Helsinki Finite-State Technology) is a framework for compiling and applying linguistic descriptions with finite-state methods. HFST currently collects some of the most important finite-state tools for creating morphologies and spellers into one open-source platform and supports extending and improving the descriptions with weights to accommodate the modeling of statistical information. HFST offers a path from language descriptions to efficient language applications in key environments and operating systems. HFST also provides an opportunity to exchange transducers between different software providers in order to get the best out of each library.
In the IBM LMT machine translation system, derivational morphological rules recognize and analyze words that are not found in its source lexicons, and generate default transfers for these unlisted words. Unfound words with no inflectional or derivational affixes are by default nouns. These rules are now expanded to provide lexical coverage of a particular set of words created on the fly in emails by bilingual Spanish-English speakers. What characterizes the approach is the generation of additional default parts-of-speech, and the use of morphological, semantic, and syntactic features from both source and target lexicons for analysis and transfer. A built-in rule-based strategy to handle language borrowing and code-mixing allows for the recognition of words with variable and unpredictable frequency of occurrence, which would remain otherwise unfound, thus affecting the accuracy of parsing and the quality of translation output.
In this work we describe a statistical morphological tagger for Latvian, Lithuanian and Estonian based on morphological tag disambiguation. These languages have rich tagsets and very high rates of morphological ambiguity. We model the distribution of possible tags with an exponential probabilistic model, which makes it possible to select and use features from the surrounding context. Results show a significant improvement in error rates over the baseline, in agreement with results for Czech. In comparison with the simplified parameter estimation method applied for Czech, we show that maximum entropy weight estimation achieves considerably better results.
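As a rough illustration of tag disambiguation with an exponential (maximum entropy) model, the Python sketch below scores each candidate tag of an ambiguous word by summing the weights of the (context feature, tag) pairs that fire, then normalizes with a softmax. It is not the tagger described above: the function name `tag_distribution`, the feature strings, the tag names, and the weights are all hypothetical, and in a real system the weights would be estimated from annotated data.

```python
import math


def tag_distribution(candidate_tags, context_features, weights):
    """Return p(tag | context) under a log-linear model restricted to the candidate tags."""
    # Score of a tag = sum of weights of the (feature, tag) pairs active in this context.
    scores = {
        tag: sum(weights.get((feat, tag), 0.0) for feat in context_features)
        for tag in candidate_tags
    }
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores.values())
    exps = {tag: math.exp(score - top) for tag, score in scores.items()}
    z = sum(exps.values())  # normalization constant over the candidate tags only
    return {tag: value / z for tag, value in exps.items()}


# Hypothetical weights; real values would come from maximum entropy training.
WEIGHTS = {
    ("prev_tag=preposition", "noun-genitive"): 1.5,
    ("prev_tag=preposition", "noun-nominative"): -0.5,
    ("suffix=as", "noun-nominative"): 0.4,
}

if __name__ == "__main__":
    candidates = ["noun-nominative", "noun-genitive"]
    context = ["prev_tag=preposition", "suffix=as"]
    dist = tag_distribution(candidates, context, WEIGHTS)
    print(max(dist, key=dist.get), dist)
```

Restricting the normalization to the tags the morphological analyser proposes for the word is what makes this disambiguation rather than open tagging: the model only has to rank the analyses that are actually possible.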
Non-canonical inflection (suppletion, deponency, heteroclisis…) is extensively studied in formal morphology. However, these studies often lack practical implementations associated with large-scale lexica. Yet these are precisely the requirements for objective comparative studies on the complexity of morphological descriptions. We introduce a model of inflectional morphology which can represent many non-canonical phenomena, as well as a formalisation and an implementation thereof. We illustrate it with data about French, Latin, Italian, Persian and Sorani Kurdish verbs and about noun classes from Croatian and Slovak. In particular, we evaluate the complexity of four competing descriptions of French verbal inflection using the information-theoretic concept of description length and show that the new concepts introduced in our model reduce the complexity of morphological descriptions relative to traditional or more recent models.
We present a morphology generation system that derives inflected Swiss German dialect forms from Standard German input. Besides generation of inflectional affixes, this system also deals with the phonetic adaptation of cognate stems, and with lexical substitution of non-cognate stems. Most of its rules are parametrized by probability maps extracted from dialectological atlases, thereby providing a large dialectal coverage.