Dealing with spelling variation in Swedish text in order to improve lemmatization, part-of-speech tagging and parsing.
Språkbanken <http://språkbanken.gu.se> uses a large in-house lexical resource that doubles as a morphological analyzer, together with an off-the-shelf part-of-speech (POS) tagger and dependency parser, to annotate its online corpora. These tools expect standardized spellings in the texts they analyze, although the data-driven tools (the POS tagger and the parser) can handle out-of-vocabulary items that the morphological analyzer does not recognize.
Many of the texts in Språkbanken nevertheless contain non-standard spellings, either because they represent a pre-standardization language stage (medieval and 17th-century texts) or because they are full of spelling errors and variants, as is often the case with modern blog texts. The task is to develop and implement a (partial) solution for discovering and handling spelling variation in modern texts, for which we already have sufficiently large-scale language analysis tools. Preferably, the solution should be general and extensible to other text types. The work therefore includes a good deal of linguistic analysis of lemmatizer, POS tagger and parser output.
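As a rough illustration of what a first baseline for such a solution might look like, the following Python sketch maps tokens that are missing from a lexicon to their closest standard form using string similarity. It is not Språkbanken's method: the tiny word list `LEXICON`, the function `normalize` and the similarity cutoff are all hypothetical, and a real system would draw on the full lexical resource and on context.

```python
from difflib import get_close_matches

# Hypothetical toy lexicon of standard Swedish word forms (an assumption
# for illustration; a real lexicon would be far larger).
LEXICON = {"mycket", "också", "någon", "människa", "tycker"}


def normalize(token, lexicon=LEXICON, cutoff=0.8):
    """Return (form, status) for a token.

    status is "known" if the lower-cased token is in the lexicon,
    "normalized" if a sufficiently similar lexicon entry was found,
    and "unknown" otherwise (the token is then returned unchanged).
    """
    lower = token.lower()
    if lower in lexicon:
        return lower, "known"
    # Pick the single most similar lexicon entry above the cutoff, if any.
    candidates = get_close_matches(lower, sorted(lexicon), n=1, cutoff=cutoff)
    if candidates:
        return candidates[0], "normalized"
    return token, "unknown"


if __name__ == "__main__":
    for tok in ["Mycket", "mycke", "xyzzy"]:
        print(tok, normalize(tok))
```

With this toy lexicon, the colloquial spelling "mycke" is mapped to "mycket", while a token with no close match is passed through unchanged, so the data-driven tagger and parser could still treat it as an out-of-vocabulary item. Pure string similarity will miss variants that differ more radically from the standard form, which is one reason the task calls for linguistic analysis of the tools' output.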
Lars Borin and possibly others, Språkbanken