The Hindi/Urdu Treebank: New Frontiers in Hindi and Urdu Natural Language Processing
COLING 2012, Mumbai, India

The Hindi/Urdu Treebank (HUTB) (Bhatt et al., 2009) has been under development since 2008. It is an innovative resource for research on natural language processing (NLP) for Hindi and Urdu: like all treebanks, it contains morphological and syntactic levels of annotation. However, unlike other treebanks, it is being annotated simultaneously for lexical predicate-argument structure (PB, in the PropBank style), and the syntactic annotations are available in both a dependency representation (DS) and a phrase structure representation (PS). All three levels of representation are independently motivated and independently described. The treebank comprises 400,000 words of Hindi and 150,000 words of Urdu.

The goal of this tutorial is to give an in-depth introduction to the HUTB and to enable participants to start using it productively. Because the structure of the treebank is novel, researchers in Hindi and Urdu NLP are encouraged to fully understand its contents, rather than treating it merely as raw data for machine learning approaches.

For this tutorial, we will release 218 sentences in Hindi, which illustrate specific linguistic phenomena in the language. This dataset includes semantic annotation on the syntactic dependency structure, and its automatically derived phrase structure counterpart. Hence, we can see the three different levels of linguistic analysis: Dependency Structure, Predicate Argument Structure and Phrase Structure for a given sentence.

The data is available in two formats: Shakti Standard Format (with chunk information) and the CONLL format.
We have also included a PDF version of these sentences, which shows all three layers in a graphical format. Each of these formats is available in UTF-8 or Roman encoding. The data package also includes the documentation.
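For participants who want to start working with the CONLL files, the following is a minimal sketch of how such a file could be read. It assumes that the files follow the tab-separated, ten-column CoNLL-X layout with a blank line between sentences; this is an assumption, so please check the documentation in the data package for the actual column inventory. The function names and the toy sentence (with its tags and labels) are invented for illustration and are not taken from the release.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Column names of the CoNLL-X layout (assumed here, not confirmed for this release).
CONLL_COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel"]

# Invented toy sentence in Roman transliteration; NOT taken from the released data.
SAMPLE = (
    "1\traama\traama\tn\tNNP\t_\t4\tk1\t_\t_\n"
    "2\tne\tne\tpsp\tPSP\t_\t1\tlwg__psp\t_\t_\n"
    "3\tseba\tseba\tn\tNN\t_\t4\tk2\t_\t_\n"
    "4\tkhaayaa\tkhaa\tv\tVM\t_\t0\tmain\t_\t_\n"
    "\n"
)


def read_conll(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(CONLL_COLUMNS, fields)))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


def dependency_edges(sentence: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (dependent, relation, head) triples; head id 0 is the artificial root."""
    forms = {tok["id"]: tok["form"] for tok in sentence}
    forms["0"] = "ROOT"
    return [(tok["form"], tok["deprel"], forms[tok["head"]]) for tok in sentence]


if __name__ == "__main__":
    for sent in read_conll(SAMPLE.splitlines()):
        for dependent, relation, head in dependency_edges(sent):
            print(f"{dependent} --{relation}--> {head}")
```

To read a file from the package instead of the toy string, pass an open file handle to `read_conll`, for example `open(path, encoding="utf-8")` for the UTF-8 version.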

To access only the latest version of the Dependency Treebank, please see the shared task website.


The structure of the tutorial will be as follows. (Total time is 3 hours 45 minutes, excluding a coffee break.)

Introduction to the nature of syntactic representations. (Rambow, 15 minutes slides)

Introduction to the morphology, syntax, and lexical semantics of Hindi and Urdu. (Sharma, 40 minutes slides)

The morphological representation for Hindi and Urdu, including encoding issues, tokenization, part-of-speech tags, and morphological representation. (Rambow and Sharma, 20 minutes slides)

The dependency representation (DS) for Hindi and Urdu syntax: principles, representation, and examples. (Sharma, 25 minutes slides)

The lexical semantic representation (PB) for Hindi and Urdu: principles, representation, and examples. (Vaidya, 25 minutes slides)

The phrase structure representation (PS) for Hindi and Urdu syntax: principles, representation, and examples. (Rambow, 25 minutes slides)

Sample initial experiments in Hindi and Urdu NLP using the HUTB. (Sharma, Rambow, and Vaidya, 15 minutes slides)

Columbia University: Owen Rambow

International Institute of Information Technology: Dipti Misra Sharma

University of Colorado, Boulder: Ashwini Vaidya

Computational Language and Education Research, University of Colorado Boulder