cover

Peter Menke

The FiESTA data model.

A novel approach to the representation of heterogeneous multimodal interaction data.

Von der Fakultät für Linguistik und Literaturwissenschaft der Universität Bielefeld zur Erlangung des Grades eines Doctor philosophiae zugelassene Dissertation.

Tag der Promotion: 24. September 2015

ISBN: 9783741217920

Herstellung und Verlag: BoD – Books on Demand GmbH, Norderstedt

Bibliografische Information der Deutschen Nationalbibliothek

Die Deutsche Nationalbibliothek verzeichnet diese Publikation in der Deutschen Nationalbibliografie; detaillierte bibliografische Daten sind im Internet über www.dnb.de abrufbar.

Bibliographic information published by the Deutsche Nationalbibliothek

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at www.dnb.de.

Planned and written with Scrivener.

Typeset with LaTeX using the tufte-latex class, using the fbb, Roboto and Roboto Mono font families, with the aid of BibLaTeX, makeindex, and TikZ.

Printed on age-resistant paper according to ISO 9706.

Gedruckt auf alterungsbeständigem Papier ∞ ISO 9706.

Contents

Abstract

This thesis presents a data model for the representation of experimental data collections from the field of research on multimodal communication. We collected evidence for the existence of problems and shortcomings in the work with several annotation tools with (a) an analysis of existing multimodal corpora and (b) a survey among researchers working with multimodal data. On this basis we argue that, despite the fact that there are numerous data models and formalisms for the representation of classic text-based corpora, these are not suited for multimodal data collections. As a consequence, we developed a data model that takes into account the properties of such multimodal corpora. In particular, it supports temporal and spatial references, flexible data types, and controlled transformations between the file formats of several annotation tools.

Zusammenfassung

Diese Dissertation stellt ein Datenmodell zur Repräsentation experimentbasierter Datensätze aus dem Forschungsgebiet der multimodalen Kommunikation vor. Es werden Belege für die Existenz verschiedener Probleme und Unzulänglichkeiten in der Arbeit mit multimodalen Datensammlungen aufgezeigt. Diese resultieren aus (a) einer Analyse bestehender multimodaler Korpora und (b) einer Umfrage, an der Wissenschaftler_innen teilgenommen haben, die zu konkreten Problemen in der Arbeit mit ihren multimodalen Datensammlungen befragt wurden. Auf dieser Grundlage wird herausgearbeitet, dass trotz der Existenz einer Vielzahl von Datenmodellen und Formalismen zur Darstellung klassischer Textkorpora sich diese nicht eignen, um die den multimodalen Korpora eigenen Besonderheiten abbilden zu können. Aus diesem Grund wird ein Datenmodell entwickelt, das all jene spezifischen Eigenschaften multimodaler Korpora zu berücksichtigen sucht. Dieses Datenmodell bietet Lösungen speziell für die Arbeit mit einer oder mehreren Zeitachsen und Raumkoordinaten, für die Darstellung komplexer Annotationswerte, und für die Transformation zwischen verschiedenen (bisher inkompatiblen) Dateiformaten verbreiteter Annotationswerkzeuge.

Acknowledgments

There are many people to whom I am grateful with respect to this thesis, and I cannot list them all. I therefore hope not to disappoint anybody by providing the following explicit enumeration (and thereby giving the false impression of a closed set). I am grateful to

ALEXANDER MEHLER, kickstarter of this endeavour.
DAVID SCHLANGEN, supervisor and provider of several helpful hints and comments.
PETRA WAGNER, for being willing to be the second reviewer of this thesis.
HANS-JÜRGEN EIKMEYER, for highly helpful insights and suggestions.
BARBARA JOB, for the repeated misappropriation of the “Forschungsseminar Sprache und Kommunikation / Linguistik romanischer Sprachen” in order to present snapshots of this thesis project (I hope the audience will pardon the mathematical shock treatment).
PHILIPP CIMIANO, one of the principal investigators in Project X1, who procured room for my work on this thesis whenever it was necessary.
KIRSTEN BERGMANN, ANOUSCHKA FOLTZ, FARINA FREIGANG, PETRA JAECKS, ALEXANDER NEUMANN, CHRISTIAN SCHNIER, AND PETRA WEISS, the people (in alphabetical order) who provided me with insights into the data structures of the corpora I used as my example collection. Many thanks to those who helped to unveil the mysteries of many of the yet-to-be-published data collections in the CRC.
FARINA FREIGANG, FLORIAN HAHN, AND SARA MARIA HELLMANN, beta testers of the pilot version of the multimodal corpus survey. I am grateful for several helpful comments that led to substantial improvements of the questionnaire.
MY FAMILY, who had to cut back, especially in the last months.
SEBASTIAN MENKE, for endless patience, support and supply — be it with (i. a.) encouragement, coffee, mathematical emergencies, or proofreading. Thank you.

Peter Menke

Bielefeld, October 2015

Part I

Introduction

1

Introduction

Every step is a first step if it’s a step in the right direction.

TERRY PRATCHETT: I Shall Wear Midnight

1.1 Overview

THIS THESIS INTRODUCES and describes FiESTA, a new data model and library that assists researchers in creating, managing, and analyzing multimodal data collections. In this introductory chapter, we clarify the motivation for this project and, in parallel, give a commented overview of how each chapter contributes to the big picture. The visual roadmap in Figure 1 on the following page accompanies and illustrates this outline.

SECTION 1.2 describes our motivation and contains pointers to the respective chapters of the thesis.

SECTION 1.3 connects this thesis to other publications and projects from the wider context of multimodal corpora and data sets.

Figure 1: Visual roadmap for this thesis. Large, dashed boxes indicate parts, nested solid boxes stand for chapters. The narrative flow is shown as arrow connections between the chapters. Italic texts next to chapters outline the goals or accomplishments of their respective chapter(s). “GEF” (due to the restricted space in the diagram) stands for “generic exchange format”.

SECTION 1.4 introduces some conventions used in this thesis, along with some remarks about mathematical notations.

1.2 Motivation

Multi-modal records allow us not only to approach old research problems in new ways, but also open up entirely new avenues of research.

Wittenburg, Levinson, et al., 2002 : 176

THIS STATEMENT DESCRIBES a central development in linguistics and its neighbouring disciplines over the last decades: The focus of research is no longer exclusively on the purely linguistic component of communicative interaction. Instead, interaction is understood as a complex interplay between linguistic events (typically, spoken utterances) and events in other modalities, such as gesture, gaze, or facial expressions (cf. Kress and Leeuwen, 2001; Knapp, Hall, and Horgan, 2013).

A couple of decades ago, technology could only provide limited support to this branch of research. Microlevel video analysis, for instance, originated in the last century: Back then, researchers used purpose-built film projectors that could play film reels “at a variety of speeds, from very slow to extremely fast, effectively achieving slow motion vision and sound” (Condon and Ogston, 1967 : 227). This served as the basis for detailed, yet hand-written, analyses of interaction on the level of single video frames.

Since then, researchers have benefited from various developments and technological shifts, such as easily available computing facilities and the digitization of video and audio recordings: digitized media recordings can be copied without any loss of quality. This is an improvement over analog media, where copies often were expensive and lossy at the same time, which also limited the number of generations of copies that could be produced (cf. Draxler, 2010 : 11 f.). In addition, year by year, computational power, disk space, and storage devices (such as working memory or hard disks) become more affordable (Gray and Graefe, 1997).

In addition, the advent of high-level programming languages (in general, and especially in the scientific context) and the ever-growing supply of modular, reusable programming libraries containing solutions to many problems enabled the community to create annotation tools. These are special pieces of software suited to the needs of researchers in the field of multimodal interaction, such as the EUDICO Linguistic Annotator ELAN (Wittenburg, Brugman, et al., 2006), or Anvil (Kipp, 2001). Both tools support the playback and navigation of video and audio recordings, as a basis for the creation and temporal localisation of additional data. Similarly, for detailed phonological and phonetic analyses of sound files, the tool Praat (Boersma and Weenink, 2013, 2001) was developed.

With these tools and their wide range of possible operations, scientists work on a diverse range of research questions, investigating phenomena such as

• the synchronicity and cross-modal construction of meaning in speech and gesture signals (Lücking et al., 2010; Bergmann and Kopp, 2012; Bergmann, 2012; Lücking et al., 2013),

• the use of speech-accompanying behaviour signalling emotion, and its possible differences in patients and healthy subjects (Jaecks et al., 2012),

• the interaction of speech and actions in object arrangement games, with a focus on the positioning of critical objects in a two-dimensional target space (Schaffranietz et al., 2007; Lücking, Abramov, et al., 2011; Menke, McCrae, and Cimiano, 2013),

• or the multimodal behaviour in negotiation processes concerning object configurations in miniature models of buildings or landscapes (Pitsch et al., 2010; Dierker, Pitsch, and Hermann, 2011; Schnier et al., 2011).

IN ALL DIALOGICAL1 situations investigated in these experiments, interlocutors produced several series of interaction signals over time – such as speech, gestures, facial expressions or manipulations of objects located in the shared space between interlocutors. These streams of interactive patterns are sometimes independent of each other. Often, however, multiple streams are coupled in a single interlocutor (e. g., in speech-accompanying gestures), and in other cases, the streams of different interlocutors are (at least locally) coupled (e. g., in co-constructions of speech, where a fragmentary segment of a linguistic construction is continued or completed by another interlocutor).

Figure 2: Schema of the data generation workflow in the research of multimodal interaction. Left: The different levels of data, and information about how subsequent layers are generated out of prior ones. Right: An example of primary and secondary annotations based on the segment of a recording (containing a speech transcription, an annotation of gesture, a syntactic analysis of the speech, and a secondary annotation expressing a hypothesis about how items from both speech and gesture form a joint semantic unit).

A detailed and thorough analysis of such dialogues typically pursues the following course (cf. Figure 2; the following description indicates numbers in the diagram for easier reference):

  1. First, video and audio recordings of the interaction are created ().
  2. To simplify work, further references to these recordings (and, indirectly, to the events of the original situation) refer to an abstraction in the shape of a so-called timeline (). Points and intervals on this timeline are the only link to the underlying media files, since (under the assumption that all media have been synchronised) every segment in the media files can unambiguously be referenced with such time stamp information.
  3. Then, researchers create primary annotations on the basis of these media recordings (). This is done by identifying points or intervals on the timeline and associating them with a coded representation (typically, by using text) of the observed phenomenon.
  4. In addition, it can be necessary to generate annotations on an additional level, so-called secondary annotations (). These do not refer to temporal information directly. Instead, they point to one or multiple annotations (cf. Brugman and Russel, 2004 : 2065). They typically assign a property or category to an annotation, or model a certain kind of relation between two or more annotations.
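The four levels described above can be sketched as a minimal data structure. This is an illustrative sketch only (class and field names are hypothetical, not part of the FiESTA specification, which is given in Chapter 10): primary annotations anchor to intervals on the timeline, while secondary annotations point to other annotations instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A span on the timeline, in seconds relative to media start (step 2)."""
    start: float
    end: float


@dataclass
class PrimaryAnnotation:
    """Anchored directly on the timeline (step 3)."""
    span: Interval
    value: str  # coded representation, e.g. a transcribed word


@dataclass
class SecondaryAnnotation:
    """Points to one or more annotations, not to the timeline (step 4)."""
    targets: list
    value: str  # e.g. a category, or a relation between annotations


# A transcribed word and a co-occurring gesture stroke ...
word = PrimaryAnnotation(Interval(3.20, 3.55), "there")
gesture = PrimaryAnnotation(Interval(3.10, 3.70), "pointing")

# ... joined by a secondary annotation expressing the hypothesis
# that both form a joint semantic unit.
ensemble = SecondaryAnnotation([word, gesture], "deictic ensemble")

print(len(ensemble.targets))  # 2
```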

“DATA” AND “MODALITY” are two terms of which researchers have an intuitive understanding, but which are often deficiently defined. Therefore, we prepend two chapters to this thesis that attempt to clarify the exact definitions of terms from the two fields of data (Chapter 2) and of modalities (Chapter 3).

While most investigations of multimodal interaction follow the basic schema described above, its concrete instantiations can diverge substantially from project to project. This is mostly due to the fact that different research questions often require idiosyncratic data structures and different descriptive categories (as, for instance, for the description of non-linguistic behaviour).

IN ORDER TO give a more detailed overview of how these data structures can be designed, and how they diverge against the background of varying research questions, descriptions of a sample of multimodal data collections, along with the underlying research questions, are presented in Chapter 4, starting on page →. This is accompanied by an introduction to the graphical user interfaces and the file formats of two annotation tools that were repeatedly used for creating the example data collection: Praat and ELAN (Chapter 5, starting on page →).

FIRST AND FOREMOST, the annotation tools mentioned in the previous section provide a solid basis for the research of multimodal interaction. And yet, as will be shown in the following chapters, there are still areas and specific tasks where these general-purpose tools fail, and where creative, but ad-hoc solutions are implemented. Examples of such problematic tasks are

• the creation of a certain connection inside the annotation structure for which the developers of the tool did not provide a solution,

• the automated creation and seamless integration of an additional layer containing part-of-speech tags (a task that can almost effortlessly be performed when working with text-based corpora),

• a calculation of quantitative relations of interesting patterns in multiple layers,

• or a customizable visualisation of such patterns.

Thus, as a starting point for our investigation we take the following claim that summarises these (and other) issues:

CLAIM 1

Investigators of multimodal interaction need better support in various areas for the collection, analysis, visualisation, exchange, storage, and machine comprehensibility of their corpora.

An analysis of the example data collection reveals first pieces of evidence for this claim and complements them with observations and results from the literature. This analysis is given in Chapter 6 (page →).

HOWEVER, INFORMATION CONCERNING issues in data generation and analysis is sparse in scientific publications, and our analysis of the example data can only produce hypotheses about potential problems and issues. Therefore, a survey among creators and producers of multimodal data collections was conducted in addition. In particular, this survey examined what kinds of problems researchers faced and which of them impeded their work most, what tasks they needed to perform in order to answer specific research questions, and what features and solutions capable of assisting them in answering those questions should look like. This survey, its design considerations, its realisation, and its evaluation are presented in Chapter 7 (starting on page →).

IN SEVERAL OTHER areas with competing data formats, a solution to such a set of problems was to develop a common exchange file format – one central format that can model the data structures and represent the information contained in all other formats. The advantage of such a central format is that, once data conversion routines between the common format and third-party formats are established, any subsequent task only needs to be implemented once – for the common format. In the past, such exchange (or pivotal) formats have successfully been created in different areas. Also, several exchange formats have been developed and (sometimes) standardised in the field of linguistics, such as the modular schemas for representing different sorts of texts by the Text Encoding Initiative (TEI; TEI Consortium, 2008), the Linguistic Annotation Framework (LAF; Ide, Romary, and Clergerie, 2003; Ide and Romary, 2006), the PAULA format (Potsdamer Austauschformat für linguistische Annotationen; Zeldes, Zipser, and Neumann, 2013), and many more.
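The economic argument for a central exchange format can be made explicit with a standard back-of-the-envelope calculation (the functions below are illustrative, not taken from the thesis): with n tool formats, direct pairwise converters grow quadratically in number, while converters to and from a single pivot format grow only linearly.

```python
def pairwise_converters(n: int) -> int:
    """Direct conversion routines between every ordered pair of n formats."""
    return n * (n - 1)


def pivot_converters(n: int) -> int:
    """One importer and one exporter per format, via a central pivot format."""
    return 2 * n


# The gap widens quickly as more annotation tools enter the picture:
for n in (3, 5, 10):
    print(n, pairwise_converters(n), pivot_converters(n))
# 3 6 6
# 5 20 10
# 10 90 20
```

Already at five formats, the pivot approach halves the implementation effort, and every analysis or visualisation task implemented against the pivot format works for all of them.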

An obvious choice would be to identify and use an exchange format that is suited for representing complex multimodal data sets. In order to evaluate candidates for such a purpose, we analysed the collected evidence (both from the literature review and the survey), and transformed it into a catalogue of criteria for exchange formats. This catalogue will then be used to evaluate exchange format candidates. The advantages of generic exchange formats and the resulting catalogue of criteria are described in detail in Chapter 8 (starting on page →).

AT FIRST GLANCE, many file formats that have successfully been used to represent textual data seem to be promising candidates also for the representation of multimodal interaction data. However, an evaluation of these formats revealed that multimodal corpora have conceptual and structural differences from text-based corpora that make it difficult to apply such a formalism.

One of these problems is that many of the common formats presuppose the existence of one single, flat stream of primary data – typically, a text or a non-overlapping sequence of transcribed utterances. In these approaches, often either tokens are marked in the primary text which can then be referred to from annotations, or locations in the text are described using numeric character offsets. However, this approach cannot express multimodal interaction data in an adequate way. While there are numerous reasons, we present three of them that we consider especially important:

  1. In classical corpus linguistics, the primary data is already present in the shape of the finalised (thus, immutable) text to be annotated. In contrast, in multimodal interaction studies such a text does not exist a priori, but it has to be produced in the form of a transcription. Such a transcription must itself refer to a kind of axis more suited to the situation, that is, a timeline, and optionally, also to spatial coordinates. Approaches that use only character sequences as their primary axis therefore have no adequate way of representing these temporal and spatial coordinates.
  2. A textual primary axis, be it segmented on the word or character level, has a much lower resolution than a timeline. In addition, the axis distorts temporal relations and makes the comparison of durations and distances impossible, because character-based lengths and distances cannot be compared to temporal or spatial ones.
  3. In multimodal interaction, multiple streams of events often co-occur which cannot be flattened into a single sequence without discarding large amounts of important ordering information.
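The third point can be illustrated with a small sketch (illustrative values and names only): two event streams anchored on a shared timeline overlap freely, whereas in a flattened character sequence a gesture has no character extent at all and the overlap cannot be expressed.

```python
# Two co-occurring event streams, anchored on a shared timeline.
# Each event is a (start, end, label) tuple, times in seconds.
speech = [(0.0, 0.8, "take"), (0.8, 1.4, "this"), (1.4, 2.0, "one")]
gesture = [(0.5, 1.9, "pointing")]  # overlaps several words at once


def overlaps(a, b):
    """True if the two intervals share any portion of the timeline."""
    return a[0] < b[1] and b[0] < a[1]


# On a timeline, co-occurrence is a simple interval test:
overlapping_words = [w[2] for w in speech if overlaps(w, gesture[0])]
print(overlapping_words)  # ['take', 'this', 'one']
```

A character-offset representation of the transcript ("take this one") could locate the words, but offers no position at which the pointing gesture could be anchored without arbitrarily serialising it before or after some word.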

These and other problems have become evident in the literature review as well as the survey. In order to reliably assess the degree to which existing solutions meet the requirements in the field of multimodal interaction studies, an evaluation was conducted of known data models, libraries, and file formats that have been proposed or used for the handling of linguistic (and possibly multimodal) corpora. The result of the evaluation shows that none of these data models meets a sufficient number of criteria from the catalogue collected earlier. Chapter 9 (beginning on page →) contains this detailed evaluation of existing solutions.

THE RESULTS FROM this evaluation underpin the second central claim of this work:

CLAIM 2

There is (to our best knowledge) no known solution (in the shape of a theoretical or implemented data model) that meets the important requirements that researchers have when investigating multimodal interaction.

Since the evaluation of all solutions under examination did not expose a suitable candidate, the final part of this thesis describes the design and development of FiESTA, a novel data model intended to solve as many of the aforementioned issues as possible. Since this data model has been designed specifically to match the criteria catalogue, it is expected to provide better solutions to the problems in multimodality research. We exemplify the usefulness of FiESTA by describing implemented and potential improvements to the workflow of scientists, along with an evaluation of the formalism against the criteria catalogue.

The formal specification and documentation of the so-called Format for extensive spatio-temporal annotations (FiESTA) can be found in Chapter 10 (beginning on page →). Summaries of the XML-based file format and the pilot implementation are given in Chapter 11 (page →), and a conclusive evaluation in Chapter 12 (page →).

SINCE A SINGLE thesis does not provide enough space for a thorough description and documentation of such an ambitious project, the conclusion evaluates on a meta-level what has been achieved in the thesis, and what further developments and improvements appear promising. This conclusion and the outlook are presented in Chapter 13 (page →). In addition, Appendix A (page →) presents the questions of the survey in detail.

1.3 Relations to other works or publications

This thesis project is closely related to and embedded within the endeavours of Project X1 “Multimodal Alignment Corpora” within the Collaborative Research Centre (CRC)2 673 “Alignment in Communication”3. X1 provided solutions for both low-level storage and sharing and high-level administration and analysis of a variety of data collections dealing with multimodal dialogues. The models and products of this thesis provided the theoretical and structural basis for several of these solutions. Sometimes, early versions of models and implementations have already been used and integrated into services and applications (one of them being the Phoibos corpus manager, which is mentioned in Chapter 12).

A DRAFT OF THE MEXICO MODEL (which is a sister project of FiESTA that aims at representing whole multimodal corpora; MExiCo stands for “Multimodal Experiment Corpora”) for managing multimodal corpora has been summarized and published in Menke and Cimiano (2013).

THE GENERAL IDEA OF A CORPUS MANAGEMENT APPLICATION (and also of the underlying functionality that eventually resulted in the FiESTA and MExiCo libraries) has been outlined in Menke and Mehler (2010) and Menke and Mehler (2011), and plans were to integrate the X1 solutions with the eHumanities Desktop System (Gleim, Waltinger, Mehler, et al., 2009). However, due to personnel reorganisation issues within Project X1, this agenda was abandoned in favor of the development of the current approach.

THE NOTION OF A GENERIC SCALE-BASED APPROACH to modelling multimodal annotations (as described in this thesis) is based on earlier work: on Menke (2007), where a general scale concept for an improved modelling of overlapping discourse segments, based on Stevens’ levels of measurement (Stevens, 1946), is described, and on Menke (2009), where metrics for the calculation of the synchronicity of interval-based annotations, especially for multimodal ensembles4 consisting of speech and gesture parts, are introduced.

FIESTA, while being a central subject of this thesis, is present as a draft in various earlier publications (not necessarily bearing this name, but possibly under its previous working title “ToE”, short for “time-oriented events”), among them Menke and Mehler (2010), Menke and Mehler (2011), and Menke and Cimiano (2012).

PRELIMINARY INVESTIGATIONS TOWARD A MACHINE-READABLE, RDF-BASED ONTOLOGY OF MULTIMODAL DATA UNITS AND PHENOMENA based on the works of Chapter 3 have been discussed at a workshop on multimodality in the context of the LREC 2012 conference, and have been published afterwards in Menke and Cimiano (2012).

THE CHAT GAME CORPUS developed in Project B6 of the CRC 673 has been summarized and described in Menke, McCrae, and Cimiano (2013).

THE ATTEMPT OF DEVELOPING A PROTOTYPE FOR A MULTIMODAL TOOL CHAIN using FiESTA as the central exchange file format is described in Menke, Freigang, et al. (2014).

1.4 Conventions and declarations

1.4.1 Mathematical conventions

MEDIAN AND QUARTILES. We denote the median of a distribution as μ, and the interval between the lower and upper quartile of a distribution as Q1,3.

SIGNIFICANCE. The results of a statistical significance test are called highly significant and marked with ** if p ≤ 0.01, and they are called significant and marked with * if 0.01 < p ≤ 0.05.

BOOLEAN VALUES. 𝔹 = {T, F} is the set of Boolean values true (T) and false (F).

ACCESS TO SEQUENCE ELEMENTS. The nth element of a sequence s is denoted by s[n].

DOT NOTATION. For structures that have subordinate components, a dot notation inspired by the member access syntax of several programming languages is used. For instance, if b is a book, then b.TITLE retrieves its title, and b.CHAPTERS returns an enumeration of its chapters.

We consider this notation more readable than multiple nested predicate expressions.
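These conventions map naturally onto programming-language constructs, which may be helpful to keep in mind when reading the formal parts of the thesis: s[n] corresponds to sequence indexing, and b.TITLE to member access. The following sketch is purely illustrative (the Book class and its fields are hypothetical, not part of the formalism):

```python
from dataclasses import dataclass, field


@dataclass
class Book:
    """Illustrative structure with subordinate components."""
    title: str
    chapters: list = field(default_factory=list)


b = Book("The FiESTA data model", ["Introduction", "Data", "Modality"])

print(b.title)        # dot notation: b.TITLE
print(b.chapters[0])  # combined with sequence access: s[n]
```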

1.4.2 Bibliographic conventions

Since this thesis originates from a research project at a German university, we assume that there will not be any need for translations of German quotations. Translations from languages other than English or German were created by us, if not explicitly stated otherwise by the addition of a source of the translation. Highlighting and structure in quoted passages originate from the original authors, if not explicitly stated otherwise.

Pages on the World Wide Web are used in two different functions in this document: as a means for the mere localisation of a resource, and as evidence for an argument.

  1. If a page serves as the entry page or frontdoor page of a product, object, or other resource that the text refers to, then the link to the page is given in a footnote.
  2. If a page contains information that serves as evidence for arguments in the text, then it is inserted as an ordinary citation. Title information is taken from the TITLE element in the HTML header. Author information is obtained from the text body or from the HTML header. If no author could be detected, the citation displays a custom shorthand in the text and as the label in the bibliography, preceded with the ° character, such as °SyncWriter1.

All web resources have been checked for functionality on 26 May 2015, unless we provide a different date in the reference.


1 Throughout this thesis, “dialogue” and “dialogical” explicitly include communicative situations with more than two participants (for which sometimes the term “multilogue” is used).

2 This official English translation of the German term “Sonderforschungsbereich” (SFB) is imprecise; a more accurate rendering would be “specialised research department”. However, due to the official status of the first term, we will use it (or its abbreviation).

3 http://www.sfb673.org

4 Multimodal ensembles were introduced in Mehler and Lücking (2009) and Lücking (2013), see also below.

Part II

Background: What are multimodal data?

2

Data

MERE ACCUMULATION OF OBSERVATIONAL EVIDENCE IS NOT PROOF.

TERRY PRATCHETT: Hogfather

2.1 Introduction

WHEN TALKING ABOUT solutions for the management and analysis of multimodal data5, it is advisable to have at least a basic agreement on the terms used in this phrase. However, although most researchers from fields where recorded dialogues and communicative situations are analysed have an intuitive agreement on these terms (especially primary data and secondary data), it is not trivial to find definitions for them that really match their usage.

As a consequence, this chapter clarifies what is to be understood by the superordinate term data, and then summarises the most prominent readings and definitions of primary and secondary data. It will become apparent that some are quite closely related, while others employ the same terms although they are conceptually unrelated to the rest. These different levels of relation will be analysed in the conclusion of this chapter.

Figure 3: A fictitious experiment, visualised as a graph, where nodes represent different representations of data, and edges express the relations between them (an edge from one node to another indicates that the creation of the second data set was influenced by or based on the first).

2.2 An examination object

FOR THE REST of this chapter we will refer to the example configuration of a multimodal experiment depicted in Figure 3. This figure contains a schematic overview of the data sets resulting from a typical (yet fictitious) experiment. The original situation and the representations derived from it are given numbers, while the data mapping operations between them are given letters for easier reference. This setup may seem artificial, yet it contains several relations and data types that will help in understanding the differences between the variations in data nomenclature. The representations are:

• THE ORIGINAL SITUATION. This is the sequence of communicative events in reality (which, as will later be shown, is volatile, and has to be recorded and documented).

• DIRECT VIDEO AND AUDIO RECORDINGS. These are conventional, immediate recordings of the visual impression and the sound that occurred in the original situation. They are stored using discrete frames or samples, which, when played, create an impression of moving pictures and continuous sound.

• A LIVE TRANSCRIPT. This is a log created by an observer who was present in the real communicative situation and wrote down immediately what was said in the dialogue, and by whom.

• A CONVERTED VIDEO FILE. This is a video file that has been automatically converted from . This could be a video file that is reduced in file size, image resolution, or that has been compressed using a video codec.

• GESTURE ANNOTATION. Based on audio and video recordings, specially trained annotators create time-based markings and classifications of the gestures produced by interlocutors in the original situation.

• SPEECH TRANSCRIPT. Based mainly on the audio recordings, transcriptions of the speech (to be more exact, of utterances and words) of the interlocutors in the original situation are created. These transcriptions are created with special software that allows for the marking of temporal position and duration of the utterances and words.

• PART-OF-SPEECH ANNOTATION, AUTOMATED. A piece of software takes speech transcripts as input and assigns part-of-speech information to the units on the word level, according to the algorithm implemented and the underlying data sets (these can be lexica, dictionaries, corpora, etc.).

• PART-OF-SPEECH ANNOTATION, MANUAL. An annotator investigates the units on the word level found in the speech transcripts and assigns part-of-speech information to them, based on his grammatical competence and material he has at his disposal (again, lexica, dictionaries, etc.).

aspect           Kertész & Rákosi (2012)               Lehmann (2005)
representation   statement                             representation
validity         with a positive plausibility value    which is taken for granted
based on         originating from some direct source   of the epistemic object of some empirical research

Table 1: Comparison of elements in definitions of the concept “data”.

 UNIFIED SPEECH TRANSCRIPT. The data contained in the live transcript and the recording-based speech transcript are analysed and combined into a single resource, in order to reduce errors and disambiguate missing or disputable portions.

2.3 Data

THERE ARE SEVERAL definitions and interpretations of data. Although they are all related and refer to similar things, they still differ slightly, depending on the disciplines involved and the purposes pursued.

According to Kertész and Rákosi (2012 : 169), “[a] datum is a statement with a positive plausibility value originating from some direct source”. Christian Lehmann defines the term in a similar way. For him, “[a] datum is a representation of an aspect of the epistemic object of some empirical research which is taken for granted” (Lehmann, 2005 : 182). In these two definitions, three relevant components can be identified (cf. Table 1):

  1. Data are representations of the objects of study. This also implies that a datum is tangible, that is, it is materially manifested and can be accessed and used as the basis for an analysis.
  2. The objects of study (entities, their properties and relations interesting to researchers) serve as a basis for the creation of data.
  3. People usually agree that these representations have a certain quality that makes them usable as a basis for a scientific argumentation, proofs, and similar things (often this quality is called validity, cf. Menke, 2012 : 288 f.).

There are, however, some minor issues with both definitions. First, we consider “statement” too narrow a concept to model several of the types of data multimodal research deals with, especially raw signal data in audio or video recordings. These are not statements, because they do not have any propositional content, at least not without an additional level of interpretation. Second, for an object to serve as a valid datum, it needs more than just any positive plausibility value. Plausibility should be high enough that one can rely on it in order to draw valid conclusions. There is, however, no absolute threshold; it depends on the situation. At least, the plausibility value should be significantly higher than a chance value or baseline (cf. Menke, 2012 : 304 f.).

A working definition based on the two definitions cited above could be:

DEFINITION 1

A scientific datum is a valid and processable representation of an object of study, or of one or more of its aspects or properties.

In this definition, “valid and processable” mean the following: The validity value must be high enough to minimize errors and doubts, and data must be in a state such that the chosen measurements and operations can be applied. This presupposes that data is tangible and durable. As a consequence, actions and events, being immaterial (thus, non-tangible) entities, are not considered data according to this definition.

On this basis, some additional aspects and properties of data will be summarized and discussed in the following subsections.

2.3.1 Data are produced by mappings

According to Stachowiak (1965, 1989), data sets are the result of a modelling process – that is, representations of an original which come into existence by a mapping operation. In a good model, relations and properties of the original are reflected in corresponding relations and properties of the representation. If this holds for all relations and properties under consideration, we are dealing with isomorphic mappings.

Stachowiak’s model concept is rather general; it covers scientific modelling processes as well as examples from completely different fields, e. g., photographs of real situations (Stachowiak, 1965 : 439), or globes and spheres as models of planets (Stachowiak, 1965 : 444). Scientific models are one special case of models for which he postulates some additional, special properties (Stachowiak, 1989 : 219). Here, objects of study (which are interesting to a certain scientific field) are mapped to representations that make it possible to apply research methods. Lehmann (2005) describes how such a configuration looks in the area of dendrochronology:

The data on which dendrochronology builds its theories are series of numbers, each of which represents the width of an annual ring of some tree and is associated with one in a series of years. The cross-sections of the tree may be stored somewhere for measurements, because they constitute the ultimate basis of reference for certain relevant observations. The data, however, are those series of numbers insofar as they represent facts about these objects.

Lehmann, 2005 : 179

In this example, the cross-sections of trees are the objects of study. They serve as originals for the mapping operation that assigns numbers to them which express distances and lengths. These series of numbers are the data that are used in analyses and evaluations.

2.3.2 Communicative events are transient

Disciplines such as dendrochronology or archaeology regularly deal with physical, material objects of study (tree sections, shards, bones) which exist or existed in the real world, and, thus, are rather easily accessible. Lyons (1977 : 442) calls these first-order entities. However, other disciplines do not have this opportunity. Historians and linguists both are interested in events (Lyons, 1977 : 443: second-order entities) rather than objects (cf. Lehmann, 2005 : 180). In empirical linguistics especially, a typical object of study is the set of events that occurs in a dialogue or another communicative situation. Such events cannot serve as direct material because they do not exist in a physical way – they occur rather than exist, and they are volatile, meaning that they are lost immediately after they occur (cf. Lehmann, 2005; Allwood, 2008; Menke and Mehler, 2010). However, their physical manifestation can be preserved (for instance, in video and audio recordings, or in live transcripts, as in the example configuration). As a consequence, the material in these areas does not consist of the objects of study themselves, but of recordings of them (cf. Lehmann, 2005 : 179 f.). These recordings then form the most important resource conserving the events from reality. Figure 4 shows an adapted version of Figure 3, in which the original situations are no longer included. The remaining subset displays those entities that are tangible and, therefore, count among actual data sets. From now on, we will follow Figure 4 and cease to categorise the original situation as a data set.

Figure 4: The fictitious experimental setup from Figure 3, at a time point after completion of the dialogue situation.

2.3.3 Complex mappings

The creation of data from an object of study can consist of a single, simple operation, such as a direct measurement of distances in the cross-section of a tree. Frequently, however, more than one operation needs to be applied for the creation of a representation that can be used for obtaining a final result. In this case, we deal with a more complex configuration of mappings. This could be a sequence of mappings, where the result of a mapping step serves as an original in the following step (cf. Stachowiak, 1965 :438). For instance, researchers could decide to first take a picture of the cross-section to preserve spatial relations (after all, the cross-section could break, dry or wither over time and therefore the proportions of its annual rings could be altered), and then perform the distance measurements on the photograph rather than the original, resulting in the sequence

cross-section ↦ photograph ↦ measured distances
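This two-step mapping can be sketched as function composition, where the result of each step serves as the original of the next. The following is a minimal Python sketch; the function names and the placeholder measurement values are illustrative, not part of any actual dendrochronological workflow:

```python
# Sketch of a sequence of mappings: each mapping's result serves as
# the original of the next step (all names are hypothetical).

def photograph(cross_section):
    # First mapping: preserve the spatial relations of the original
    # before it can break, dry, or wither.
    return {"kind": "photograph", "of": cross_section}

def measure_distances(photo):
    # Second mapping: measurements are taken on the photograph,
    # not on the physical cross-section itself.
    # The ring widths below are illustrative placeholder values.
    return {"kind": "measurements", "of": photo, "ring_widths_mm": [2.1, 1.8, 2.4]}

# cross-section |-> photograph |-> measured distances
data = measure_distances(photograph("cross-section of tree #1"))
```

Note that the intermediate representation (the photograph) remains accessible through the final data set, which already hints at the provenance questions discussed later in this chapter.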

In our example configuration, we can find different sequences manifested as paths in the graph. For instance, the original situation is recorded as an audio signal, which then serves as an original for the creation of a derived representation in the shape of a speech transcript:

original situation ↦ audio recording ↦ speech transcript

In other cases, however, more complex, graph-like structures of representations can be the result. The example configuration contains several examples of graph structures more complex than linear sequences. The original dialogue situation serves as an original for three modelling processes: the creation of the video recording, the creation of the audio recording, and the creation of the live transcript. On the other hand, the unified transcript is based on two originals: the live transcript and the recording-based speech transcript.
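Such a graph-like structure – several resources derived from one original, and one resource derived from several – can be sketched as a mapping from each derived resource to the set of its originals. This is a hedged sketch; the resource names merely paraphrase the example configuration:

```python
# Each derived resource is listed with the originals it was created from;
# together these edges form a directed acyclic graph, not a linear chain.
ORIGINALS = {
    "video recording": ["original situation"],
    "audio recording": ["original situation"],
    "live transcript": ["original situation"],
    "speech transcript": ["audio recording"],
    "unified transcript": ["live transcript", "speech transcript"],
}

def provenance_chain(resource):
    """Collect all originals, direct and transitive, of a resource."""
    result = set()
    for parent in ORIGINALS.get(resource, []):
        result.add(parent)
        result |= provenance_chain(parent)
    return result

# The unified transcript is transitively based on the live transcript,
# the speech transcript, the audio recording, and the original situation.
chain = provenance_chain("unified transcript")
```

Traversing the graph in this way makes explicit which resources a given representation ultimately depends on.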

Authors agree that there is no such thing as the one, single, correct model of an original, and this holds also for the special cases of scientific models. Stachowiak claims that models “usually represent their originals only for specific subjects [. . . ] (the users of the model), during specific time intervals [. . . ] and with restrictions to specific (mental or physical) operations”6. Similarly, according to Lehmann, “[n]othing is, in and of itself, a datum; instead, it is a datum for somebody (or for a scientific community) in some perspective” (Lehmann, 2005 : 181).

In other words, there is no single, correct way of creating a scientific data representation for an original of a certain type. In a scientific context, especially the identity of the creator and the underlying theories, assumptions and interpretations are important which guide the design of the operations to be used when creating a representation. For an example, see Figure 5. It contains a series of representations of an utterance, consisting of the single word “Mexico”. Each of these representations is useful in a different context, and to different groups of researchers investigating different questions: While phoneticians can pose and investigate research questions solely by looking at and evaluating, e. g., pitch patterns or formant progression, conversation analysts might be interested in word-based speech transcripts only.

Figure 5: Different representations of an original, consisting of the utterance of the word “Mexico”: (a) an oscillogram; (b) a sonagram; visualisations of (c) pitch and (d) formants; time-aligned textual transcriptions of (e) phonemes, (f) syllables, and (g) words.

2.3.4 Provenance of mappings

When we look at the two part-of-speech representations, it becomes apparent that both represent the same aspect of the linguistic events, namely, part-of-speech information for the words uttered during the dialogue. The example does not state whether these two representations are equal, similar, or whether they diverge. They could be identical, but there is another important aspect to them that is not related to some quality inherent in the representation itself: it is the genesis and provenance of the two representations – it is important who created the data, which techniques, algorithms, and methods were used, and on which theories, assumptions, and prerequisites the process was based.

While the two annotations share their structure and appearance, the automated one was created by the implementation of an algorithm that may or may not have based its work on actual grammar information, while the manual one is the product of a human annotator who selected values according to his or her interpretation of the material against the background of his or her grammatical competence. This difference can cause large, systematic differences in the data. Thus, in order to fully interpret a data set, information about its provenance is often required.
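One way to make such provenance information available for interpretation is to store it alongside the representation itself. The following minimal sketch illustrates the idea; the field names and values are assumptions for illustration, not part of the FiESTA model:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A representation together with its provenance metadata."""
    values: list                                   # the annotation content itself
    creator: str                                   # who (or what software) created it
    method: str                                    # technique or algorithm used
    based_on: list = field(default_factory=list)   # theories, lexica, etc.

# Two structurally identical annotations with very different provenance:
automatic = Annotation(values=["DET", "NOUN", "VERB"],
                       creator="tagger v1.0 (hypothetical)",
                       method="automated, lexicon lookup",
                       based_on=["toy lexicon"])
manual = Annotation(values=["DET", "NOUN", "VERB"],
                    creator="human annotator",
                    method="manual, grammatical competence",
                    based_on=["reference grammar"])
```

Here the two annotations carry identical content, yet the attached metadata keeps their genesis distinguishable, which is exactly what interpretation of the data may require.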

2.3.5 Conclusion

Up to now, the following properties of data sets have been assembled:

  1. Data sets are processable representations of the thing to be analysed.
  2. Data can be the result of subsequent, and, sometimes, very complex processes involving several originals.
  3. A data set is often not analysed, interpreted, and stored in isolation, but rather with the context of its creation in mind. In other words, the related provenance information should often be documented as well.

On this basis we will now try to distinguish different kinds of data sets. An audio recording, for instance, is different from a speech transcript in many ways. Many people would call the audio recording a primary datum, and the transcript a secondary datum. However, the criteria for this distinction are not always clearly defined. In the next sections we will enumerate items in a category system that can help distinguish different kinds of data on a clearly defined basis.