Exploring data from language documentation

Dates: 10.05.2013 - 11.05.2013
Venue: ZAS Berlin
Languages: English, German

Description: Language documentation has produced a large number of extensive spoken-language corpora. These corpora consist of time-aligned and annotated audio and video recordings of endangered and often lesser-known languages. The typological diversity and the variety of these data pose new and interesting technological and methodological challenges. Moreover, in the last ten years, a considerable infrastructure has been developed to create and archive larger corpora of time-aligned and annotated primary data. This infrastructure involves digital archives such as the TLA at the MPI in Nijmegen and tools such as ELAN, Toolbox, FLEx, Praat and Transcriber.

But to unlock the full potential of spoken-language corpora, researchers often face unique challenges: depending on the properties of the documented language, the primary research questions, and the nature of the workflow, the tools listed above might not fully meet the researchers' needs. Also, in studies that draw on data from different documentation projects, it may be difficult to integrate a variety of formats and standards. This workshop, which is funded by the CLARIN-D project (F-AG3), invites experts from language documentation and linguistic typology as well as language technology and corpus linguistics to present and discuss the problems posed by the analysis of typologically diverse spoken-language corpora, solutions to these problems, and relevant practices and technologies from related fields.


  • Felix Rau
  • Kilu von Prince