Corpora enriched with linguistic annotation, such as parts of speech, syntactic constituents, or syntactic dependencies, provide access to linguistic examples and patterns in a way that raw text corpora do not (Kübler & Zinsmeister 2014). Manual annotation is very time-consuming, so many projects rely on automatic annotation tools to enrich their data. The question, however, is how reliable this annotation is. In this presentation, I’ll discuss the pros and cons of automatic annotation on the basis of statistical part-of-speech taggers and briefly sketch the annotation of further linguistic levels (syntax, semantics, and discourse). In addition, I’ll address the question of descriptive adequacy, that is, how well a tagset captures the phenomena in the actual data. I’ll do this by applying the German Stuttgart-Tübingen Tagset (STTS, Schiller et al. 1999) to different varieties of German, including texts by second language learners.
References
Kübler, Sandra and Heike Zinsmeister. 2014. Corpus Linguistics and Linguistically Annotated Corpora. London: Bloomsbury.
Schiller, Anne, Simone Teufel, Christine Stöckert, and Christine Thielen. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical Report, Universities of Stuttgart and Tübingen.