Contents

List of Figures

List of Tables

Notes on Contributors

Preface

Introduction

Part I Formal Foundations

1 Formal Language Theory

1 Introduction

2 Basic Notions

3 Language Classes and Linguistic Formalisms

4 Regular Languages

5 Context-Free Languages

6 The Chomsky Hierarchy

7 Mildly Context-Sensitive Languages

8 Further Reading

2 Computational Complexity in Natural Language

1 A Brief Review of Complexity Theory

2 Parsing and Recognition

3 Complexity and Semantics

4 Determining Logical Relationships between Sentences

3 Statistical Language Modeling

1 Introduction to Statistical Language Modeling

2 Structured Language Model

3 Speech Recognition Lattice Rescoring Using the Structured Language Model

4 Richer Syntactic Dependencies

5 Comparison with Other Approaches

6 Conclusion

4 Theory of Parsing

1 Introduction

2 Context-Free Grammars and Recognition

3 Context-Free Parsing

4 Probabilistic Parsing

5 Lexicalized Context-Free Grammars

6 Dependency Grammars

7 Tree Adjoining Grammars

8 Translation

9 Further Reading

Part II Current Methods

5 Maximum Entropy Models

1 Introduction

2 Maximum Entropy and Exponential Distributions

3 Parameter Estimation

4 Regularization

5 Model Applications

6 Prospects

6 Memory-Based Learning

1 Introduction

2 Memory-Based Language Processing

3 NLP Applications

4 Exemplar-Based Computational Psycholinguistics

5 Generalization and Abstraction

6 Generalizing Examples

7 Further Reading

7 Decision Trees

1 NLP and Classification

2 Induction of Decision Trees

3 NLP Applications

4 Advantages and Disadvantages of Decision Trees

5 Further Reading

8 Unsupervised Learning and Grammar Induction

1 Overview

2 Computational Learning Theory

3 Empirical Learning

4 Unsupervised Grammar Induction and Human Language Acquisition

5 Conclusion

9 Artificial Neural Networks

1 Introduction

2 Background

3 Contemporary Research

4 Further Reading

10 Linguistic Annotation

1 Introduction

2 Review of Selected Annotation Schemes

3 The Annotation Process

4 Conclusion

11 Evaluation of NLP Systems

1 Introduction

2 Fundamental Concepts

3 Evaluation Paradigms in Common Evaluation Settings

4 Case Study: Evaluation of Word-Sense Disambiguation

5 Case Study: Evaluation of Question Answering Systems

6 Summary

Part III Domains of Application

12 Speech Recognition

1 Introduction

2 Acoustic Modeling

3 Search

4 Case Study: The AMI System

5 Current Topics

6 Conclusions

13 Statistical Parsing

1 Introduction

2 History

3 Generative Parsing Models

4 Discriminative Parsing Models

5 Transition-Based Approaches

6 Statistical Parsing with CCG

7 Other Work

8 Conclusion

14 Segmentation and Morphology

1 Introduction

2 Unsupervised Learning of Words

3 Unsupervised Learning of Morphology

4 Implementing Computational Morphologies

5 Conclusions

15 Computational Semantics

1 Introduction

2 Background

3 State of the Art

4 Research Issues

5 Corpus-Based and Machine Learning Methods

6 Concluding Remarks

16 Computational Models of Dialogue

1 Introduction

2 The Challenges of Dialogue

3 Approaches to Dialogue System Design

4 Interaction and Meaning

5 Extensions

6 Conclusions

17 Computational Psycholinguistics

1 Introduction

2 Computational Models of Human Language Processing

3 Symbolic Models

4 Probabilistic Models

5 Connectionist Models of Sentence Processing

6 Hybrid Models

7 Concluding Remarks

Part IV Applications

18 Information Extraction

1 Introduction

2 Historical Background

3 Name Extraction

4 Entity Extraction

5 Relation Extraction

6 Event Extraction

7 Concluding Remarks

19 Machine Translation

1 Introduction

2 The State of the Art: Phrase-Based Statistical MT

3 Other Approaches to MT

4 MT Applications

5 Machine Translation at DCU

6 Concluding Remarks and Future Directions

7 Further Reading

20 Natural Language Generation

1 High-Level Perspective: Making Choices about Language

2 Two NLG Systems: SumTime and SkillSum

3 NLG Choices and Tasks

4 NLG Evaluation

5 Some NLG Research Topics

6 NLG Resources

21 Discourse Processing

1 Discourse: Basic Notions and Terminology

2 Discourse Structure

3 Discourse Coherence

4 Anaphora Resolution

5 Applications

6 Further Reading

22 Question Answering

1 What is Question Answering?

2 Current State of the Art in Open Domain QA

3 Current Directions

4 Further Reading

References

Author Index

Subject Index

Praise for The Handbook of Computational Linguistics and Natural Language Processing

“All in all, this is a very well compiled book, which effectively balances the width and depth of theories and applications in two very diverse yet closely related fields of language research.”

Machine Translation

“This Handbook is exceptionally broad and exceptionally deep in its coverage. The contributions, by noted experts, cover all aspects of the field, from fundamental theory to concrete applications. Clark, Fox and Lappin have performed a great service by compiling this volume.”

Richard Sproat, Oregon Health & Science University

For Camilla

Blackwell Handbooks in Linguistics

This outstanding multi-volume series covers all the major subdisciplines within linguistics today and, when complete, will offer a comprehensive survey of linguistics as a whole.

Already published:

The Handbook of Child Language
Edited by Paul Fletcher and Brian MacWhinney

The Handbook of Phonological Theory, Second Edition
Edited by John A. Goldsmith, Jason Riggle, and Alan C. L. Yu

The Handbook of Contemporary Semantic Theory
Edited by Shalom Lappin

The Handbook of Sociolinguistics
Edited by Florian Coulmas

The Handbook of Phonetic Sciences, Second Edition
Edited by William J. Hardcastle and John Laver

The Handbook of Morphology
Edited by Andrew Spencer and Arnold Zwicky

The Handbook of Japanese Linguistics
Edited by Natsuko Tsujimura

The Handbook of Linguistics
Edited by Mark Aronoff and Janie Rees-Miller

The Handbook of Contemporary Syntactic Theory
Edited by Mark Baltin and Chris Collins

The Handbook of Discourse Analysis
Edited by Deborah Schiffrin, Deborah Tannen, and Heidi E. Hamilton

The Handbook of Language Variation and Change
Edited by J. K. Chambers, Peter Trudgill, and Natalie Schilling-Estes

The Handbook of Historical Linguistics
Edited by Brian D. Joseph and Richard D. Janda

The Handbook of Language and Gender
Edited by Janet Holmes and Miriam Meyerhoff

The Handbook of Second Language Acquisition
Edited by Catherine J. Doughty and Michael H. Long

The Handbook of Bilingualism and Multilingualism, Second Edition
Edited by Tej K. Bhatia and William C. Ritchie

The Handbook of Pragmatics
Edited by Laurence R. Horn and Gregory Ward

The Handbook of Applied Linguistics
Edited by Alan Davies and Catherine Elder

The Handbook of Speech Perception
Edited by David B. Pisoni and Robert E. Remez

The Handbook of the History of English
Edited by Ans van Kemenade and Bettelou Los

The Handbook of English Linguistics
Edited by Bas Aarts and April McMahon

The Handbook of World Englishes
Edited by Braj B. Kachru, Yamuna Kachru, and Cecil L. Nelson

The Handbook of Educational Linguistics
Edited by Bernard Spolsky and Francis M. Hult

The Handbook of Clinical Linguistics
Edited by Martin J. Ball, Michael R. Perkins, Nicole Müller, and Sara Howard

The Handbook of Pidgin and Creole Studies
Edited by Silvia Kouwenberg and John Victor Singler

The Handbook of Language Teaching
Edited by Michael H. Long and Catherine J. Doughty

The Handbook of Language Contact
Edited by Raymond Hickey

The Handbook of Language and Speech Disorders
Edited by Jack S. Damico, Nicole Müller, and Martin J. Ball

The Handbook of Computational Linguistics and Natural Language Processing
Edited by Alexander Clark, Chris Fox, and Shalom Lappin

The Handbook of Language and Globalization
Edited by Nikolas Coupland

The Handbook of Hispanic Sociolinguistics
Edited by Manuel Díaz-Campos

The Handbook of Language Socialization
Edited by Alessandro Duranti, Elinor Ochs, and Bambi B. Schieffelin

The Handbook of Intercultural Discourse and Communication
Edited by Christina Bratt Paulston, Scott F.
Kiesling, and Elizabeth S. Rangel

The Handbook of Historical Sociolinguistics
Edited by Juan Manuel Hernández-Campoy and Juan Camilo Conde-Silvestre

The Handbook of Hispanic Linguistics
Edited by José Ignacio Hualde, Antxon Olarrea, and Erin O’Rourke

The Handbook of Conversation Analysis
Edited by Jack Sidnell and Tanya Stivers

The Handbook of English for Specific Purposes
Edited by Brian Paltridge and Sue Starfield

List of Figures

1.1 Chomsky’s hierarchy of languages.
2.1 Architecture of a multi-tape Turing machine.
2.2 A derivation in the Lambek calculus.
2.3 Productions of a DCG recognizing the language {a^n b^n c^n d^n e^n | n ≥ 0}.
2.4 Derivation of the string aabbccddee in the DCG of Figure 2.3.
2.5 Semantically annotated CFG generating the language of the syllogistic.
2.6 Meaning derivation in a semantically annotated CFG.
2.7 Productions for extending the syllogistic with transitive verbs.
3.1 Recursive linear interpolation.
3.2 ARPA format for language model representation.
3.3 Partial parse.
3.4 A word-and-parse k-prefix.
3.5 Complete parse.
3.6 Before an adjoin operation.
3.7 Result of adjoin-left under NTlabel.
3.8 Result of adjoin-right under NTlabel.
3.9 Language model operation as a finite state machine.
3.10 SLM operation.
3.11 One search extension cycle.
3.12 Binarization schemes.
3.13 Structured language model maximum depth distribution.
3.14 Comparison of PPL, WER, labeled recall/precision error.
4.1 The CKY recognition algorithm.
4.2 Table T obtained by the CKY algorithm.
4.3 The CKY recognition algorithm, expressed as a deduction system.
4.4 The Earley recognition algorithm.
4.5 Deduction system for Earley’s algorithm.
4.6 Table T obtained by Earley’s algorithm.
4.7 Parse forest associated with table T from Figure 4.2.
4.8 Knuth’s generalization of Dijkstra’s algorithm, applied to finding the most probable parse in a probabilistic context-free grammar G.
4.9 The probabilistic CKY algorithm.
4.10 A parse of ‘our company is training workers,’ assuming a bilexical context-free grammar.
4.11 Deduction system for recognition with a 2-LCFG. We assume w = a1 · · · an, an+1 = $.
4.12 Illustration of the use of inference rules (f), (c), and (g) of bilexical recognition.
4.13 A projective dependency tree.
4.14 A non-projective dependency tree.
4.15 Deduction system for recognition with PDGs. We assume w = a1 · · · an, and disregard the recognition of an+1 = $.
4.16 Substitution (a) and adjunction (b) in a tree adjoining grammar.
4.17 The TAG bottom-up recognition algorithm, expressed as a deduction system.
4.18 A pair of trees associated with a derivation in an SCFG.
4.19 An algorithm for the left composition of a sentence w and an SCFG G.
6.1 An example 2D space with six examples labeled white or black.
6.2 Two examples of the generation of a new hyper-rectangle in NGE.
6.3 An example of an induced rule in RISE, displayed on the right, with the set of examples that it covers (and from which it was generated) on the left.
6.4 An example of a family in a two-dimensional example space and ranked in the order of distance.
6.5 An example of family creation in Fambl.
6.6 Pseudo-code of the family extraction procedure in Fambl.
6.7 Generalization accuracies (in terms of percentage of correctly classified test instances) and F-scores, where appropriate, of MBL with increasing k parameter, and Fambl with k = 1 and increasing K parameter.
6.8 Compression rates (percentages) of families as opposed to the original number of examples, produced by Fambl at different maximal family sizes (represented by the x-axis, displayed at a log scale).
7.1 A simple decision tree for period disambiguation.
7.2 State of the decision tree after the expansion of the root node.
7.3 Decision tree learned from the example data.
7.4 Partitions of the two-dimensional feature subspace spanned by the features ‘color’ and ‘shape.’
7.5 Data with overlapping classes and the class boundaries found by a decision tree.
7.6 Decision tree induced from the data in Figure 7.5 before and after pruning.
7.7 Decision tree with node numbers and information gain scores.
7.8 Decision tree with classification error counts.
7.9 Probabilistic decision tree induced from the data in Figure 7.5.
7.10 Part of a probabilistic decision tree for the nominative case of nouns.
9.1 A multi-layered perceptron.
9.2 Category probabilities estimated by an MLP.
9.3 A recurrent MLP, specifically a simple recurrent network.
9.4 A recurrent MLP unfolded over the sequence.
9.5 The SSN architecture, unfolded over a derivation sequence, with derivation decisions Dt and hidden layers St.
9.6 An SSN unfolded over a constituency structure.
10.1 An example PTB tree.
10.2 A labeled dependency structure.
10.3 OntoNotes: a model for multi-layer annotation.
12.1 Waveform (top) and spectrogram (bottom) of conversational utterance ‘no right I didn’t mean to imply that.’
12.2 HMM-based hierarchical modeling of speech.
12.3 Representation of an HMM as a parameterized stochastic finite state automaton (left) and in terms of probabilistic dependences between variables (right).
12.4 Forward recursion to estimate αt(qj) = p(x1, …, xt, qt = qj | λ).
12.5 Hidden Markov models for phonemes can be concatenated to form models for words.
12.6 Connected word recognition with a bigram language model.
12.7 Block processing diagram showing the AMI 2006 system for meeting transcription (Hain et al., 2006).
12.8 Word error rate (%) results in the NIST RT’06 evaluations of the AMI 2006 system on the evaluation test set, for the four decoding passes.
13.1 Example lexicalized parse-tree.
13.2 Example tree with complements distinguished from adjuncts.
13.3 Example tree containing a trace and the gap feature.
13.4 Example unlabeled dependency tree.
13.5 Generic algorithm for online learning taken from McDonald et al. (2005b).
13.6 The perceptron update.
13.7 Example derivation using forward and backward application.
13.8 Example derivation using type-raising and forward composition.
13.9 Example CCG derivation for the sentence Under new features, participants can transfer money from the new funds.
14.1 The two problems of word segmentation.
14.2 Word discovery from an MDL point of view.
14.3 A signature for two verbs in English.
14.4 Morphology discovery as local descent.
14.5 Building an FST from two FSAs.
15.1 Derivation of semantic representation with storage.
16.1 Basic components of a spoken dialogue system.
16.2 Finite state machine for a simple ticket booking application.
16.3 A simple frame.
16.4 Goal-oriented action schema.
16.5 A single utterance gives rise to distinct updates of the DGB for distinct participants.
17.1 Relative clause attachment ambiguity.
17.2 An example of the parse-trees generated by a probabilistic context-free grammar (PCFG) (adapted from Crocker & Keller 2006).
17.3 The architecture of the SynSem-Integration model, from Pado et al. (2009).
17.4 A simple recurrent network.
17.5 CIANet: a network featuring scene–language interaction with a basic attentional gating mechanism to select relevant events in a scene with respect to an unfolding utterance.
17.6 The competitive integration model (Spivey-Knowlton & Sedivy 1995).
18.1 Example dependency tree.
19.1 A sentence-aligned corpus.
19.2 A non-exact alignment.
19.3 In the word-based translation on the left we see that the noun–adjective reordering into English is missed. On the right, the noun and adjective are translated as a single phrase and the correct ordering is modeled in the phrase-based translation.
19.4 Merging source-to-target and target-to-source alignments (from Koehn 2010).
19.5 All possible source segmentations with all possible target translations (from Koehn 2004).
19.6 Hypothesis expansion via stack decoding (from Koehn 2004).
19.7 An aligned tree pair in DOT for the sentence pair: he chose the ink cartridge, il a choisi la cartouche d’encre.
19.8 Composition in tree-DOT.
20.1 Human and corpus wind descriptions for September 19, 2000.
20.2 An example literacy screener question (SkillSum input).
20.3 Example text produced by SkillSum.
20.4 Example SumTime document plan.
20.5 Example SumTime deep syntactic structure.
21.1 Example of the RST relation evidence.
22.1 Basic QA system architecture.
22.2 An ARDA scenario (from Small & Strzalkowski 2009).
22.3 An answer model for the question: Where is Glasgow? (Dalmas & Webber 2007), showing both Scotland and Britain as possible answers.
22.4 Example interaction taken from a live demonstration to the ARDA AQUAINT community in 2005.
22.5 Goal frame for the question: What is the status of the Social Security system?
22.6 Two cluster seed passages and their corresponding frames relative to the retirement clarification question.
22.7 Two cluster passages and their corresponding frames relative to the private accounts clarification question.

List of Tables

3.1 Headword percolation rules
3.2 Binarization rules
3.3 Parameter re-estimation results
3.4 Interpolation with trigram results
3.5 Maximum depth evolution during training
6.1 Examples generated for the letter–phoneme conversion task, from the word–phonemization pair booking–[bukIN], aligned as [b-ukI-N]
6.2 Number of extracted families at a maximum family size of 100, the average number of family members, and the raw memory compression, for four tasks
6.3 Two example families (represented by their members) extracted from the PP and CHUNK data sets respectively
7.1 Training data consisting of seven objects which are characterized by the features ‘size,’ ‘color,’ and ‘shape.’ The first four items belong to class ‘+,’ the others to class ‘−’
8.1 Comparison of different tag sets on IPSM data
8.2 Cross-linguistic evaluation: 64 clusters, left all words, right f < 5
11.1 Structure of a typical summary of evaluation results
11.2 Contingency table for a document retrieval task
16.1 NSUs in a subcorpus of the BNC
16.2 Comparison of dialogue management approaches
17.1 Conditional probability of a verb frame given a particular verb, as estimated using the Penn Treebank
19.1 Number of fragments for English-to-French and French-to-English HomeCentre experiments
20.1 Numerical wind forecast for September 19, 2000

Notes on Contributors

Ciprian Chelba is a Research Scientist with Google. Between 2000 and 2006 he worked as a Researcher in the Speech Technology Group at Microsoft Research.

He received his Diploma Engineer degree in 1993 from the Faculty of Electronics and Telecommunications at “Politehnica” University, Bucharest, Romania, his M.S. in 1996, and his PhD in 2000 from the Electrical and Computer Engineering Department at the Johns Hopkins University.

His research interests are in statistical modeling of natural language and speech, as well as related areas such as machine learning and information theory as applied to natural language problems.

Recent projects include language modeling for large-vocabulary speech recognition (discriminative model estimation, compact storage for large models), search in spoken document collections (spoken content indexing, ranking, and snippeting), as well as speech and text classification.

Alexander Clark is an Honorary Research Fellow in the Department of Computer Science at Royal Holloway, University of London. His first degree was in Mathematics from the University of Cambridge, and his PhD is from the University of Sussex. He did postdoctoral research at the University of Geneva. In 2007 he was a Professeur invité at the University of Marseille. He is on the editorial board of the journal Research on Language and Computation, and a member of the steering committee of the International Colloquium on Grammatical Inference. His research is on unsupervised learning in computational linguistics, and in grammatical inference; he has won several prizes and competitions for his research. He has co-authored with Shalom Lappin a book entitled Linguistic Nativism and the Poverty of the Stimulus, which is being published by Wiley-Blackwell in 2010.

Stephen Clark is a Senior Lecturer at the University of Cambridge Computer Laboratory, where he is a member of the Natural Language and Information Processing Research Group. From 2004 to 2008 he was a University Lecturer at the Oxford University Computing Laboratory, and before that spent four years as a postdoctoral researcher at the University of Edinburgh’s School of Informatics, working with Prof. Mark Steedman. He has a PhD in Artificial Intelligence from the University of Sussex and a first degree in Philosophy from the University of Cambridge. His main research interest is statistical parsing, with a focus on the grammar formalism combinatory categorial grammar. In 2009 he led a team at the Johns Hopkins University Summer Workshop working on “Large Scale Syntactic Processing: Parsing the Web.” He is on the editorial boards of Computational Linguistics and the Journal of Natural Language Engineering, and is a Program Co-Chair for the 2010 Annual Meeting of the Association for Computational Linguistics.

Matthew W. Crocker obtained his PhD in Artificial Intelligence from the University of Edinburgh in 1992, where he subsequently held appointments as Lecturer in Artificial Intelligence and Cognitive Science and as an ESRC Research Fellow. In January 2000, Dr Crocker was appointed to a newly established Chair in Psycholinguistics, in the Department of Computational Linguistics at Saarland University, Germany. His current research brings together the experimental investigation of real-time human language processing and situated cognition in the development of computational cognitive models.

Matthew Crocker co-founded the annual conference on Architectures and Mechanisms for Language Processing (AMLaP) in 1995. He is currently an associate editor for Cognition, on the editorial board of Springer’s Studies in Theoretical Psycholinguistics, and has been a member of the editorial board for Computational Linguistics.

Walter Daelemans (MA, University of Leuven, Belgium, 1982; PhD, Computational Linguistics, University of Leuven, 1987) held research and teaching positions at the Radboud University Nijmegen, the AI-LAB at the University of Brussels, and Tilburg University, where he founded the ILK (Induction of Linguistic Knowledge) research group, and where he remained part-time Full Professor until 2006. Since 1999, he has been a Full Professor at the University of Antwerp (UA), teaching Computational Linguistics and Artificial Intelligence courses and co-directing the CLiPS research center. His current research interests are in machine learning of natural language, computational psycholinguistics, and text mining. He was elected fellow of ECCAI in 2003 and has graduated 11 PhD students as supervisor.

Raquel Fernández is a Postdoctoral Researcher at the Institute for Logic, Language and Computation, University of Amsterdam. She holds a PhD in Computer Science from King’s College London for work on formal and computational modeling of dialogue and has published numerous peer-reviewed articles on dialogue research. She has worked as a Research Fellow in the Center for the Study of Language and Information (CSLI) at Stanford University and in the Linguistics Department at the University of Potsdam.

Dr Chris Fox is a Reader in the School of Computer Science and Electronic Engineering at the University of Essex. He started his research career as a Senior Research Officer in the Department of Language and Linguistics at the University of Essex. He subsequently worked in the Computer Science Department, where he obtained his PhD in 1993. After that he spent a brief period as a Visiting Researcher at Saarbrücken before becoming a Lecturer at Goldsmiths College, University of London, and then King’s College London. He returned to Essex in 2003. At the time of writing, he is serving as Deputy Mayor of Wivenhoe.

Much of his research is in the area of logic and formal semantics, with a particular emphasis on issues of formal expressiveness, and proof-theoretic approaches to characterizing intuitions about natural language semantic phenomena.

Jonathan Ginzburg is a Senior Lecturer in the Department of Computer Science at King’s College London. He has previously held posts in Edinburgh and Jerusalem. He is one of the managing editors of the journal Dialogue and Discourse. He has published widely on formal semantics and dialogue. His monograph The Interactive Stance: Meaning for Conversation was published in 2009.

John A. Goldsmith is Edward Carson Waller Distinguished Service Professor in the Departments of Linguistics and Computer Science at the University of Chicago, where he has been since 1984. He received his PhD in Linguistics in 1976 from MIT, and taught from 1976 to 1984 at Indiana University. His primary interests are computational learning of natural language, phonological theory, and the history of linguistics.

Ralph Grishman is Professor of Computer Science at New York University. He has been involved in research in natural language processing since 1969, and since 1985 has directed the Proteus Project, with funding from DARPA, NSF, and other government agencies. The Proteus Project has conducted research in natural language text analysis, with a focus on information extraction, and has been involved in the creation of a number of major lexical and syntactic resources, including Comlex, Nomlex, and NomBank. He is a past President of the Association for Computational Linguistics and the author of the text Computational Linguistics: An Introduction.

Thomas Hain holds the degree Dipl.-Ing. with honors from the University of Technology, Vienna, and a PhD from Cambridge University. In 1994 he joined Philips Speech Processing, which he left as Senior Technologist in 1997. He took up a position as Research Associate at the Speech, Vision and Robotics Group and Machine Intelligence Lab at the Cambridge University Engineering Department, where he also received an appointment as Lecturer in 2001. In 2004 he joined the Department of Computer Science at the University of Sheffield, where he is now a Senior Lecturer. Thomas Hain has a well-established track record in automatic speech recognition, in particular involvement in best-performing ASR systems for participation in NIST evaluations. His main research interests are in speech recognition, speech and audio processing, machine learning, optimisation of large-scale statistical systems, and modeling of machine/machine interfaces. He is a member of the IEEE Speech and Language Technical Committee.

James B. Henderson is an MER (Research Professor) in the Department of Computer Science of the University of Geneva, where he is co-head of the interdisciplinary research group Computational Learning and Computational Linguistics. His research bridges the topics of machine learning methods for structure-prediction tasks and the modeling and exploitation of such tasks in NLP, particularly syntactic and semantic parsing. In machine learning his current interests focus on latent variable models inspired by neural networks. Previously, Dr Henderson was a Research Fellow in ICCS at the University of Edinburgh, and a Lecturer in CS at the University of Exeter, UK. Dr Henderson received his PhD and MSc from the University of Pennsylvania, and his BSc from the Massachusetts Institute of Technology, USA.

Shalom Lappin is Professor of Computational Linguistics at King’s College London. He does research in computational semantics, and in the application of machine learning to issues in natural language processing and the cognitive basis of language acquisition. He has taught at SOAS, Tel Aviv University, the University of Haifa, the University of Ottawa, and Ben Gurion University of the Negev. He was also a Research Staff member in the Natural Language group of the Computer Science Department at IBM T.J. Watson Research Center. He edited the Handbook of Contemporary Semantic Theory (1996, Blackwell), and, with Chris Fox, he co-authored Foundations of Intensional Semantics (2005, Blackwell). His most recent book, Linguistic Nativism and the Poverty of the Stimulus, co-authored with Alexander Clark, is being published by Wiley-Blackwell in 2010.

Jimmy Lin is an Associate Professor in the iSchool at the University of Maryland, affiliated with the Department of Computer Science and the Institute for Advanced Computer Studies. He graduated with a PhD in Computer Science from MIT in 2004. Lin’s research lies at the intersection of information retrieval and natural language processing, and he has done work in a variety of areas, including question answering, medical informatics, bioinformatics, evaluation metrics, and knowledge-based retrieval techniques. Lin’s current research focuses on “cloud computing,” in particular, massively distributed text processing in cluster-based environments.

Robert Malouf is an Associate Professor in the Department of Linguistics and Asian/Middle Eastern Languages at San Diego State University. Before coming to SDSU, Robert held a postdoctoral fellowship in the Humanities Computing Department, University of Groningen (1999–2002). He received a PhD in Linguistics from Stanford University (1998) and a BA in linguistics and computer science from SUNY Buffalo (1992). His research focuses on the application of computational techniques to understanding how language works, particularly in the domains of morphology and syntax. He is currently investigating the use of evolutionary simulation for explaining linguistic universals.

Prof. Ruslan Mitkov has been working in (applied) natural language processing, computational linguistics, corpus linguistics, machine translation, translation technology, and related areas since the early 1980s. His extensively cited research covers areas such as anaphora resolution, automatic generation of multiple-choice tests, machine translation, natural language generation, automatic summarization, computer-aided language processing, centering, translation memory, evaluation, corpus annotation, bilingual term extraction, question answering, automatic identification of cognates and false friends, and an NLP-driven corpus-based study of translation universals.

Mitkov is author of the monograph Anaphora Resolution (2002, Longman) and sole editor of The Oxford Handbook of Computational Linguistics (2005, Oxford University Press). He currently serves as Executive Editor of the Journal of Natural Language Engineering (Cambridge University Press) and Editor-in-Chief of the Natural Language Processing book series (John Benjamins Publishing). Ruslan Mitkov received his MSc from the Humboldt University in Berlin and his PhD from the Technical University in Dresden, and he worked as a Research Professor at the Institute of Mathematics, Bulgarian Academy of Sciences, Sofia. Prof. Mitkov is Professor of Computational Linguistics and Language Engineering at the School of Humanities, Languages and Social Sciences at the University of Wolverhampton, which he joined in 1995 and where he set up the Research Group in Computational Linguistics. In addition to being Head of the Research Group in Computational Linguistics, Prof. Mitkov is also Director of the Research Institute in Information and Language Processing.

Dr Mark-Jan Nederhof is a Lecturer in the School of Computer Science at the University of St Andrews. He holds a PhD (1994) and MSc (1990) in computer science from the University of Nijmegen. Before coming to St Andrews in 2006, he was Senior Researcher at DFKI in Saarbrücken and Lecturer in the Faculty of Arts at the University of Groningen. He has served on the editorial board of Computational Linguistics and has been a member of the programme committees of EACL, HLT/EMNLP, and COLING-ACL.

His research covers areas of computational linguistics and computer languages, with an emphasis on formal language theory and computational complexity. He is also developing tools for use in philological research, and especially the study of Ancient Egyptian.

Martha Palmer is an Associate Professor in the Linguistics Department and the Computer Science Department of the University of Colorado at Boulder, as well as a Faculty Fellow of the Institute of Cognitive Science. She was formerly an Associate Professor in Computer and Information Sciences at the University of Pennsylvania. She has been actively involved in research in natural language processing and knowledge representation for 30 years and did her PhD in Artificial Intelligence at the University of Edinburgh in Scotland. She has a life-long interest in the use of semantic representations in natural language processing and is dedicated to the development of community-wide resources. She was the leader of the English, Chinese, and Korean PropBanks and the Pilot Arabic PropBank. She is now the PI for the Hindi/Urdu Treebank Project and is leading the English, Chinese, and Arabic sense-tagging and PropBanking efforts for the DARPA-GALE OntoNotes project. In addition to building state-of-the-art word-sense taggers and semantic role labelers, she and her students have also developed VerbNet, a public-domain rich lexical resource that can be used in conjunction with WordNet, and SemLink, a mapping from the PropBank generic arguments to the more fine-grained VerbNet semantic roles as well as to FrameNet Frame Elements. She is a past President of the Association for Computational Linguistics, and a past Chair of SIGHAN and SIGLEX, where she was instrumental in getting the Senseval/Semeval evaluations under way.

Ian Pratt-Hartmann studied Mathematics and Philosophy at Brasenose College, Oxford, and Philosophy at Princeton and Stanford Universities, gaining his PhD from Princeton in 1987. He is currently Senior Lecturer in the Department of Computer Science at the University of Manchester.

Ehud Reiter is a Reader in Computer Science at the University of Aberdeen in Scotland. He completed a PhD in natural language generation at Harvard in 1990 and worked at the University of Edinburgh and at CoGenTex (a small US NLG company) before coming to Aberdeen in 1995. He has published over 100 papers, most of which deal with natural language generation, including the first book ever written on applied NLG. In recent years he has focused on data-to-text systems and related “language and the world” research challenges.

Steve Renals received a BSc in Chemistry from the University of Sheffield in 1986, an MSc in Artificial Intelligence in 1987, and a PhD in Speech Recognition and Neural Networks in 1990, both from the University of Edinburgh. He is a Professor in the School of Informatics, University of Edinburgh, where he is the Director of the Centre for Speech Technology Research. From 1991 to 1992, he was a Postdoctoral Fellow at the International Computer Science Institute, Berkeley, CA, and was then an EPSRC Postdoctoral Fellow in Information Engineering at the University of Cambridge (1992–4). From 1994 to 2003, he was a Lecturer then Reader at the University of Sheffield, moving to the University of Edinburgh in 2003. His research interests are in the area of signal-based approaches to human communication, in particular speech recognition and machine learning approaches to modeling multi-modal data. He has over 150 publications in these areas.

Philip Resnik is an Associate Professor at the University of Maryland, College Park, with joint appointments in the Department of Linguistics and the Institute for Advanced Computer Studies. He completed his PhD in Computer and Information Science at the University of Pennsylvania in 1993. His research focuses on the integration of linguistic knowledge with data-driven statistical modeling, and he has done work in a variety of areas, including computational psycholinguistics, word-sense disambiguation, cross-language information retrieval, machine translation, and sentiment analysis.

Giorgio Satta received a PhD in Computer Science in 1990 from the University of Padua, Italy. He is currently a Full Professor at the Department of Information Engineering, University of Padua. His main research interests are in computational linguistics, mathematics of language, and formal language theory.

For the years 2009–10 he is serving as Chair of the European Chapter of the Association for Computational Linguistics (EACL). He has joined the standing committee of the Formal Grammar conference (FG) and the editorial boards of the journals Computational Linguistics, Grammars, and Research on Language and Computation. He has also served as Program Committee Chair for the Annual Meeting of the Association for Computational Linguistics (ACL) and for the International Workshop on Parsing Technologies (IWPT).

Helmut Schmid works as a Senior Scientist at the Institute for Natural Language Processing in Stuttgart with a focus on statistical methods for NLP. He developed a range of tools for tokenization, POS tagging, parsing, computational morphology, and statistical clustering, and he frequently used decision trees in his work.

Antal van den Bosch (MA, Tilburg University, The Netherlands, 1992; PhD, Computer Science, Universiteit Maastricht, The Netherlands, 1997) held Research Assistant positions at the experimental psychology labs of Tilburg University and the Université Libre de Bruxelles (Belgium) in 1993 and 1994. After his PhD project at the Universiteit Maastricht (1994–7), he returned to Tilburg University in 1997 as a postdoctoral researcher. In 1999 he was awarded a Royal Dutch Academy of Arts and Sciences fellowship, followed in 2001 and 2006 by two consecutively awarded Innovational Research funds of the Netherlands Organisation for Scientific Research. Tilburg University appointed him as Assistant Professor (2001), Associate Professor (2006), and Full Professor in Computational Linguistics and AI (2008). He is also a Guest Professor at the University of Antwerp (Belgium). He currently supervises five PhD students, and has graduated seven PhD students as co-supervisor. His research interests include memory-based natural language processing and modeling, machine translation, and proofing tools.

Prof. Andy Way obtained his BSc (Hons) in 1986, MSc in 1989, and PhD in 2001 from the University of Essex, Colchester, UK. From 1988 to 1991 he worked at the University of Essex, UK, on the Eurotra Machine Translation project. He joined Dublin City University (DCU) as a Lecturer in 1991 and was promoted to Senior Lecturer in 2001 and Associate Professor in 2006. He was a DCU Senior Albert College Fellow from 2002 to 2003, and has been an IBM Centers for Advanced Studies Scientist since 2003, and a Science Foundation Ireland Fellow since 2005. He has published over 160 peer-reviewed papers. He has been awarded grants totaling over €6.15 million since 2000, and over €6.6 million in total. He is the Centre for Next Generation Localisation co-ordinator for Integrated Language Technologies (ILT). He currently supervises eight students on PhD programs of study, all of whom are externally funded, and has in addition graduated 10 PhD and 11 MSc students. He is currently the Editor of the journal Machine Translation, President of the European Association for Machine Translation, and President-Elect of the International Association for Machine Translation.

Nick Webb is a Senior Research Scientist in the Institute for Informatics, Logics and Security Studies, at the University at Albany, SUNY, USA. Previously he was a Research Fellow in the Natural Language Processing Group at the University of Sheffield, UK, and a Research Officer at the University of Essex, UK, where he obtained a BSc in Computer Science (with a focus on Artificial Intelligence) and an MSc (in Computational Linguistics). His PhD from Sheffield concerns the analysis of dialogue corpora to build computational models of dialogue-act classification, and his research interests concern intelligent information access, including interactive question answering and dialogue systems.

Bonnie Webber was a Researcher at Bolt Beranek and Newman while working on the PhD she received from Harvard University in 1978. She then taught in the Department of Computer and Information Science at the University of Pennsylvania for 20 years before joining the School of Informatics at the University of Edinburgh. Known for research on discourse and on question answering, she is a Past President of the Association for Computational Linguistics, co-developer (with Aravind Joshi, Rashmi Prasad, Alan Lee, and Eleni Miltsakaki) of the Penn Discourse TreeBank, and co-editor (with Annie Zaenen and Martha Palmer) of the new electronic journal, Linguistic Issues in Language Technology.

Shuly Wintner is a Senior Lecturer at the Department of Computer Science, University of Haifa, Israel. His research spans various areas in computational linguistics, including formal grammars, morphology, syntax, development of language resources, and machine translation, with a focus on Semitic languages. He has published over 60 scientific papers in computational linguistics. Dr Wintner is the Editor-in-Chief of the journal Research in Language and Computation.

Nianwen Xue is an Assistant Professor of Languages & Linguistics and Computer Science at Brandeis University. His research interests include syntactic and semantic parsing, machine translation, temporal representation and inference, Chinese-language processing, and linguistic annotation (Chinese Treebank, Chinese Proposition Bank, OntoNotes). He serves on the ACL SIGANN committee and co-organized the Linguistic Annotation Workshops (LAW II and LAW III) and the 2009 CoNLL Shared Task on Syntactic and Semantic Dependencies in Multiple Languages. He received his PhD in linguistics from the University of Delaware.

Preface

We started work on this handbook three years ago and, while bringing it to fruition has involved a great deal of work, we have enjoyed the process. We are grateful to our colleagues who have contributed chapters to the volume. Its quality is due to their labor and commitment. We appreciate the considerable time and effort that they have invested in making this venture a success. It has been a pleasure working with them.

We owe a debt of gratitude to our editors at Wiley-Blackwell, Danielle Descoteaux and Julia Kirk, for their unstinting support and encouragement throughout this project. We wish that all scientific-publishing projects were blessed with publishers of their professionalism and good nature.

Finally, we must thank our families for enduring the long period of time that we have been engaged in working on this volume. Their patience and goodwill have been a necessary ingredient for its completion.

The best part of compiling this handbook has been the opportunity that it has given each of us to observe in detail and in perspective the wonderful burst of creativity that has taken hold of our field in recent years.

Alexander Clark, Chris Fox, and Shalom Lappin
London and Wivenhoe
September 2009

Introduction

The field of computational linguistics (CL), together with its engineering domain of natural language processing (NLP), has exploded in recent years. It has developed rapidly from a relatively obscure adjunct of both AI and formal linguistics into a thriving scientific discipline. It has also become an important area of industrial development. The focus of research in CL and NLP has shifted over the past three decades from the study of small prototypes and theoretical models to robust learning and processing systems applied to large corpora. This handbook is intended to provide an introduction to the main areas of CL and NLP, and an overview of current work in these areas. It is designed as a reference and source text for graduate students and researchers from computer science, linguistics, psychology, philosophy, and mathematics who are interested in this area.

The volume is divided into four main parts. Part I contains chapters on the formal foundations of the discipline. Part II introduces the current methods that are employed in CL and NLP, and it divides into three sections. The first section describes several influential approaches to Machine Learning (ML) and their application to NLP tasks. The second section presents work in the annotation of corpora. The last section addresses the problem of evaluating the performance of NLP systems. Part III of the handbook takes up the use of CL and NLP procedures within particular linguistic domains. Finally, Part IV discusses several leading engineering tasks to which these procedures are applied.

In Chapter 1 Shuly Wintner gives a detailed introductory account of the main concepts of formal language theory. This subdiscipline is one of the primary formal pillars of computational linguistics, and its results continue to shape theoretical and applied work. Wintner offers a remarkably clear guide through the classical language classes of the Chomsky hierarchy, and he exhibits the relations between these classes and the automata or grammars that generate (recognize) their members.

While formal language theory identifies classes of languages and their decidability (or lack thereof), complexity theory studies the computational resources in time and space required to compute the elements of these classes. Ian Pratt-Hartmann introduces this central area of computer science in Chapter 2, and he takes up its significance for CL and NLP. He describes a series of important complexity results for several prominent language classes and NLP tasks. He also extends the treatment of complexity in CL/NLP from classical problems, like syntactic parsing, to the relatively unexplored area of computing sentence meaning and logical relations among sentences.

Statistical modeling has become one of the primary tools in CL and NLP for representing natural language properties and processes. In Chapter 3 Ciprian Chelba offers a clear and concise account of the basic concepts involved in the construction of statistical language models. He reviews probabilistic n-gram models and their relation to Markov systems. He defines and clarifies the notions of perplexity and entropy in terms of which the predictive power of a language model can be measured. Chelba compares n-gram models with structured language models generated by probabilistic context-free grammars, and he discusses their applications in several NLP tasks.