FLANDERS, Julia (Northeastern University, United States of America); SOERING, Sibylle (Georg-August-Universität Göttingen); KOLLATZ, Thomas (Steinheim Institute for German-Jewish History); ROMARY, Laurent (INRIA)

Keywords: TAPAS, TextGrid, DARIAH, infrastructure, repository

  • Date: 2015-10-29
  • Time: 09:00 – 10:30
  • Room: Amphi Laprade

In recent years the TEI community has been making steady progress in developing common infrastructure to support essential tasks such as validation, document conversion, editing, data curation, and publication. While this infrastructure is designed primarily to meet the needs of individual TEI projects and researchers, it also serves an important role for the TEI community itself, in two ways: first, by supporting the aggregation and study of TEI data more broadly; and second, by creating additional foci of community activity and interaction. The challenge lies in designing these tools so that they balance these important goals.

We propose a panel session focused on a discussion of the needs of individual researchers and of the TEI research community, including representatives of three major infrastructural projects: the TEI Archiving, Publishing, and Access Service (TAPAS), TextGrid – Virtual Research Environment for the Humanities, and DARIAH – Digital Research Infrastructure for the Arts and Humanities. Panelists will briefly describe the recent work of their organizations, which will serve as background for a more general discussion (involving both panelists and the audience) on the following questions:

  • How could these infrastructural projects collaborate in ways that would benefit the TEI community?
  • How can these organizations stay abreast of changing user needs, and what TEI community needs are not yet being met by existing infrastructure?
  • What tools could be added to make this infrastructure more valuable to users – both discipline-specific and from a more generic perspective?
  • What value would an international TEI repository have for the TEI community, and how might it be used?
  • What are the technical and social challenges involved in developing this kind of infrastructure, and what kinds of long-term support are most appropriate and effective?

Thomas Kollatz is a researcher at the Steinheim Institute for German-Jewish History, Essen, Germany. He has been involved in several DH projects, including TextGrid and DARIAH-DE, and is also responsible for “epidat,” the epigraphical database of Jewish sepulchral inscriptions.

Julia Flanders is the co-director of TAPAS and the director of the Digital Scholarship Group at Northeastern University.

Laurent Romary is Directeur de Recherche at INRIA, France, and guest scientist at Humboldt University in Berlin, Germany. He currently contributes to the establishment and coordination of the European DARIAH infrastructure.

Sibylle Söring is part of the academic management and coordination of “TextGrid – Virtual Research Environment for the Humanities” and “DARIAH-DE – Digital Research Infrastructure for the Arts and Humanities” at Göttingen State and University Library, Germany. With a degree in literature studies, she has been involved in various DH projects such as TEI-based digital editions.



BEIßWENGER, Michael (TU Dortmund University, Germany); CHANIER, Thierry (Université Blaise Pascal (Clermont 2), France)

Keywords: social media, computer-mediated communication, cmc, digital genres

  • Date: 2015-10-29
  • Time: 11:00 – 12:30
  • Room: Amphi Fugier


The panel presents results and ongoing work from corpus projects in which TEI-P5 has been adopted for the representation and linguistic annotation of genres of social media and computer-mediated communication (CMC). It relates to the work of the TEI-SIG “computer-mediated communication” which is developing TEI models for the representation of CMC genres and testing these models for a broad range of genres (ranging from “text-only” genres such as chat and SMS to multimodal genres such as learning environments and Second Life) and in corpus building initiatives for various European languages.

The goal of the panel is to give an overview of models and practices in representing CMC in TEI, using German and French CMC corpora as examples. Documentation and ODD files of the schemas developed by the group will be made available in the TEI wiki and announced via the TEI mailing list before the conference, so that everybody who is interested in participating in the discussion can examine the CMC models in advance.

The panel discussion is intended as an opportunity to collect feedback on these models and schema drafts from the broader TEI community interested in adapting TEI-P5 for the representation of new (digital) genres. This feedback will be taken into consideration when revising the models and – as a next step after the conference – preparing feature requests for adapting the TEI for CMC.

Panel slot #1: PAPER: TITLE: TEI across corpora, languages and genres: How TEI models will enhance the toolkit of CMC research in the Humanities (20 minutes)

AUTHORS: Michael Beißwenger, Thierry Chanier

The internet and social media have given rise to a broad range of new communicative genres which are subsumed under the term computer-mediated communication (CMC) – genres such as chats, forums, text messaging (SMS, WhatsApp), interaction on wiki talk pages and in blog comments, via Twitter, on social network sites, and in multimodal 3D environments. A TEI standard for the representation of those genres and their structural and linguistic peculiarities is a desideratum both in digital humanities and in computer science. Such a standard would foster interoperability between language resources as well as the analysis and automatic exploitation of resources of that kind in several respects:

  • It would allow scholars to build interoperable CMC corpora for different languages and thus enhance the empirical basis for doing CMC research across languages and cultures.
  • It would allow scholars to build CMC resources which are interoperable with text and speech corpora that are already represented in TEI and thus pave the way for corpus-based research on language use across different types of corpora (= comparative analysis of the language use in CMC, in edited text and in spoken language).
  • By including models for the description of not only verbal but also non-verbal acts, it would allow scholars to describe and analyse CMC across different modalities.

The paper sets out the rationale for including models for CMC in a future version of the TEI Guidelines. It outlines the requirements that a framework for the representation of CMC should meet so that corpus providers and researchers can realize the benefits listed above. It also gives an overview of challenges and general issues in designing such a framework, thereby setting the stage for the models and practices presented in paper 2 and for the discussion that follows.

Panel slot #2: PAPER: TITLE: Modeling social media and CMC genres in TEI: Models and practices from French and German corpus projects (40 minutes)

AUTHORS: Michael Beißwenger, Thierry Chanier, Eric Ehrhardt, Alexander Geyken, Axel Herold, Marc Kupietz, Lothar Lemnitzer, Harald Lüngen, Céline Poudat, Angelika Storrer, Andreas

The second paper discusses how the requirements and challenges outlined in paper 1 have been handled in customized TEI schemas that were developed for the representation of CMC and social media genres in French and German corpus projects in 2011–2015. The schemas developed in these projects are interrelated rather than independent: fostered by discussions in the TEI-SIG “computer-mediated communication”, in the German DFG network Empirikom and in the French corpus network CoMeRe, the projects have been building iteratively on each other’s work with the goal of creating a schema that fits diverse projects in several languages:

  • (1) A first TEI schema for CMC (Beißwenger et al. 2012) was developed during exploratory work for a reference corpus of German CMC within the DWDS corpus collection at the BBAW Berlin (DeRiK, Beißwenger et al. 2013).
  • (2) Margaretha & Lüngen (2014) adopted the basic models introduced in (1) and tested their suitability for the annotation of a corpus of Wikipedia talk pages as part of the DEREKO corpus collection at IDS Mannheim.
  • (3) Building on the results of two meetings of the TEI-CMC-SIG in Rome (2013) and Dortmund (2014) and on requirements from corpora collected in the French CoMeRe network, Chanier et al. (2014) developed a TEI schema which significantly expanded the models suggested in (1) for a corpus collection with highly heterogeneous genres (covering a broad range from text-only to multimodal genres). This schema has been used for the representation of the CoMeRe repository of French CMC corpora (access to corpora via CoMeRe, 2015).
  • (4) The schema developed for the CoMeRe corpora (3) as well as the experiences from (2) are presently used as the starting point for defining a schema for use in several German CMC corpus projects: the CLARIN-D curation project ChatCorpus2CLARIN, in which the Dortmund Chat Corpus is being re-modeled in TEI; a corpus of German Usenet postings which is presently being collected for integration in DEREKO (Schröck, in prep.); and a corpus of German WhatsApp messages that was collected in 2014/15 (Project “Whats Up, Deutschland”, initiated and coordinated by Beat Siebenhaar, University of Leipzig). The schemas used for the Wikipedia corpus in DEREKO and in the DeRiK project (see above) will subsequently be adapted to this new schema version.

The schema versions (3) and (4) will be documented in the TEI wiki before the conference.

Panel slot #3: DISCUSSION: TITLE: Towards a basic schema for the representation of CMC in TEI (30 minutes)

In keeping with the goals outlined in the introduction of this proposal, the panel includes a 30-minute discussion slot instead of a third paper. The discussion will be introduced by short statements from two invited discussants, who contribute the perspective of modeling related genres and text types in TEI and that of “experts” in the process of discussing and implementing new features in the TEI Guidelines:

  • Peter Stadler, Carl-Maria-von-Weber-Gesamtausgabe, Detmold (member of the TEI Technical Council / TEI-SIG “Correspondence”)
  • Thomas Schmidt, Institut für deutsche Sprache, Mannheim (representation of spoken language corpora / transcribed speech)

The statements by the discussants will be followed by a moderated open discussion with the plenary (which may continue at the meeting of the SIG “computer-mediated communication”).

Documentation and ODD files of the schemas presented in paper 2 will be made available in the TEI wiki and announced via the TEI mailing list before the conference in order to allow the discussants and other participants to examine the CMC models in advance.

  • Beißwenger, Michael; Ermakova, Maria; Geyken, Alexander; Lemnitzer, Lothar; Storrer, Angelika (2012): A TEI Schema for the Representation of Computer-mediated Communication. Journal of the Text Encoding Initiative (jTEI) 3. (DOI: 10.4000/jtei.476).
  • Beißwenger, Michael; Ermakova, Maria; Geyken, Alexander; Lemnitzer, Lothar; Storrer, Angelika (2013): DeRiK: A German Reference Corpus of Computer-Mediated Communication. In: Literary and Linguistic Computing (LLC).
  • Chanier, Thierry; Poudat, Celine; Sagot, Benoit; Antoniadis, Georges; Wigham, Ciara; Hriba, Linda; Longhi, Julien; Seddah, Djamé (2014): The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres. In: Beißwenger, Michael; Oostdijk, Nelleke; Storrer, Angelika; van den Heuvel, Henk (Eds.): Building and Annotating Corpora of Computer-Mediated Communication: Issues and Challenges at the Interface of Corpus and Computational Linguistics. Special Issue, Journal of Language Technology and Computational Linguistics (JLCL 2/2014), 1–30.
  • CoMeRe (2015): CoMeRe Repository: Corpora of Computer-Mediated Communication in French. Ortolang: Nancy.
  • Margaretha, Eliza; Lüngen, Harald (2014): Building Linguistic Corpora from Wikipedia Articles and Discussions. In: Beißwenger, Michael; Oostdijk, Nelleke; Storrer, Angelika; van den Heuvel, Henk (Eds.): Building and Annotating Corpora of Computer-Mediated Communication: Issues and Challenges at the Interface of Corpus and Computational Linguistics. Special Issue, Journal of Language Technology and Computational Linguistics (JLCL 2/2014), 59–82.
  • Schröck, Jasmin (in prep.): Erstellung eines deutschsprachigen Usenet-Newsgroup-Korpus und Annotation von Phänomenen internetbasierter Kommunikation. Universität Heidelberg.
  • [TEI P5] TEI Consortium (eds) (2007). TEI P5: Guidelines for Electronic Text Encoding and Interchange. (accessed 22 March 2013).
