Zinsmeister, Heike: Doing corpus linguistics with linguistically annotated corpora

Corpora enriched with linguistic annotation such as parts of speech, syntactic constituents, or syntactic dependencies provide access to linguistic examples and patterns in a way that raw text corpora do not (Kübler & Zinsmeister 2014). Manual annotation is very time-consuming, so many projects use automatic annotation tools to enrich their data. The question, however, is how reliable this annotation is. In this presentation, I’ll discuss the pros and cons of automatic annotation on the basis of statistical part-of-speech taggers and briefly sketch the annotation of further linguistic levels (syntax, semantics, and discourse). In addition, I’ll address the question of descriptive adequacy, that is, how well a tagset captures the phenomena in the actual data. I’ll do this by applying the German Stuttgart-Tübingen Tagset (STTS, Schiller et al. 1999) to different varieties of German, including texts written by second language learners.

References

Kübler, Sandra and Heike Zinsmeister. 2014. Corpus Linguistics and Linguistically Annotated Corpora. London: Bloomsbury.

Schiller, Anne, Simone Teufel, Christine Stöckert, and Christine Thielen. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical Report, Universities of Stuttgart and Tübingen.
