Term Project Ideas - Intention

The goal of this assignment is to have you become familiar with a specific research area and take a stab at moving the state of the art forward. The project could be primarily linguistic analysis, primarily programming, or a combination of the two. You are encouraged to work in teams of 2, 3 or even 4 persons. If you do work in a team, it should be interdisciplinary: it should include students from at least two different departments. You should assume that you will have to read something like 3 or 4 papers over and above the class required readings to ground yourself in the research area. You will then define an experiment, a set of analyses, or a system that you will run, perform or implement, respectively, to explore some aspect of your research area. You are expected to turn in a 5-page, single-spaced paper describing your project, and give a 10-15 minute presentation on it. You could also do a serious comparison of two or three approaches at a detailed level, in which case you would turn in a longer paper of at least 10 pages.

In addition to your own Term Project, you are expected to be a discussant on another project. The Discussant assignments are on the class web page. Being a discussant involves reading the project background paper(s), asking constructive questions during the project presentation, and turning in this questionnaire.

Past Projects

A synopsis of IBM's Watson Q/A system - KA

An Empirically Validated Thematic Role Hierarchy - LG & JI

In Trento, Sara Tonelli and Irina Sergienya (a Ph.D. student) are working on automatically extracting hierarchical relations between PropBank arguments, VerbNet thematic roles and FrameNet Frame Elements from the annotations in PropBank I that are available from Semlink. This project would take the output of their automatic extraction and evaluate it, so that the results can be published. It would therefore require at least some computing skills, but primarily linguistic judgements. A useful reference for this evaluation, in addition to the PB, VN and FN papers already posted, would be

A Critical Analysis of BabelNet - LE, JE, GR

See this link to be introduced to an "encyclopedic dictionary" and a multilingual ontology created by mapping the largest multilingual Web encyclopedia - i.e., Wikipedia - to the most popular computational lexicon of English - i.e., WordNet. The integration is performed via an automatic linking algorithm and by filling in lexical gaps with the aid of Machine Translation. The result is an "encyclopedic dictionary" that provides babel synsets, i.e., concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations. For the paper presentations, choose one or two papers from the website, but also one or two that explain the technical approaches in more detail.

This project could include a detailed critique of the strengths and limitations of BabelNet as currently implemented, and a pilot implementation of an alternative approach.

CPA and FrameNet - DP, KS, TO, JP (see induction of SR as well)

How well does CPA correspond to distinct FrameNet Frames? In concert with Octavian Popescu and Sara Tonelli (via Skype), investigate the relationships that exist between the corpus patterns and FrameNet. The topic may be investigated from both a theoretical and an empirical point of view. Trento will make available corpora with examples of corpus patterns, both manually annotated (CPA - extracted from BNC), and automatically extracted (from BNC, but we could also use the OntoNotes/SemLink text resources) in Italian and English, in order to see if a sufficiently accurate mapping could be generated between corpus patterns and frames, maybe via ontological attributes (SUMO). Tools involving corpus patterns (preprocessing, extraction, learning, recognition) could also be provided. The tool for learning and recognition is described in the IWSC 2013 paper. How does Octavian's approach contrast with the approach outlined in the Lexical Substitutability paper?

Event Detection - MB (see also WSD)

There are several possible projects on this topic. For instance, could the ideas on inducing semantic relations (see below) be extended to the notion of event detection? How would that differ from the approach outlined in the following papers?

How would any of these differ from the following:

Syntactic Parsing

We have a syntactic parser that has been trained on DARPA data. We have retrained it on new fragmentary medical data, and we would like to know if it is handling the sentence fragments correctly. This requires identifying the fragmentary sentences in the corpus, comparing the performance before and after retraining on the fragmentary training data, and doing error analysis on the sentence fragments that are not parsed correctly. This would be an excellent interdisciplinary project, with linguists and CS students.
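As a rough starting point, the before/after comparison on fragments could be sketched as below. Everything here is an assumption made for illustration: the `ParsedSentence` record, its field names, the `is_fragment` flag (which someone would have to annotate or derive), and the use of whole-tree exact match as the score. A real evaluation would more likely use labeled bracket precision and recall, and the actual parser output format is not specified above.

```python
from dataclasses import dataclass


@dataclass
class ParsedSentence:
    """Hypothetical record: one sentence with gold and system parses as bracketed strings."""
    text: str
    is_fragment: bool  # assumed to come from manual or heuristic fragment identification
    gold: str          # gold-standard parse
    before: str        # parse from the parser before retraining
    after: str         # parse from the parser after retraining


def exact_match_rate(sentences, system):
    """Fraction of sentences whose `system` parse ("before" or "after") equals the gold parse."""
    if not sentences:
        return 0.0
    correct = sum(1 for s in sentences if getattr(s, system) == s.gold)
    return correct / len(sentences)


def fragment_report(sentences):
    """Compare both parsers on fragments only and list fragments still misparsed after retraining."""
    fragments = [s for s in sentences if s.is_fragment]
    return {
        "n_fragments": len(fragments),
        "before": exact_match_rate(fragments, "before"),
        "after": exact_match_rate(fragments, "after"),
        # Candidates for manual error analysis.
        "still_wrong": [s.text for s in fragments if s.after != s.gold],
    }
```

The `still_wrong` list is the part that feeds the error analysis: it gives the linguists on the team a concrete set of fragments to inspect by hand.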

Project Ideas based on Papers such as the following: - MP, Sentiment

Acquiring Semantic Class Preferences from Corpora - MG, MO

We have just parsed and done SRL on Gigaword. We can add VN class tags, so we have the data to investigate how well some of these techniques work, and what their strengths and weaknesses are. Again, something that would really benefit from a team of linguists and computer scientists. It would also be interesting to compare these approaches with the Lexical Substitutability paper.

Exploring the Argument mismatches between PropBank and VerbNet - YA, Arabic PB and VN

We have a database that maps between PropBank Frame File entries and VerbNet thematic grids. Sometimes an argument on one side will have no mapping on the other side. Are these usually adjuncts, or is there another explanation? This could be for a language other than English, such as Arabic, if there is already an Arabic PropBank and an Arabic VerbNet.
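A first pass at finding the mismatches could look like the sketch below. The entry layout (`pb_args`, `vn_roles`, `mapping`) and the roleset name `toyverb.01` are hypothetical and chosen only for illustration; the actual schema of the mapping database is not described here, so the loading step is left out.

```python
def unmapped_arguments(entry):
    """Return (PropBank args with no VerbNet role, VerbNet roles with no PropBank arg).

    `entry` is a hypothetical dict with:
      - "pb_args":  list of PropBank argument labels for one roleset
      - "vn_roles": list of VerbNet thematic roles for the linked class
      - "mapping":  dict from PropBank argument label to VerbNet role
    """
    mapping = entry["mapping"]
    pb_unmapped = [arg for arg in entry["pb_args"] if arg not in mapping]
    mapped_roles = set(mapping.values())
    vn_unmapped = [role for role in entry["vn_roles"] if role not in mapped_roles]
    return pb_unmapped, vn_unmapped


# Toy entry, invented for illustration; not taken from the real database.
toy_entry = {
    "pb_args": ["ARG0", "ARG1", "ARG2"],
    "vn_roles": ["Agent", "Theme", "Location"],
    "mapping": {"ARG0": "Agent", "ARG1": "Theme"},
}

pb_only, vn_only = unmapped_arguments(toy_entry)
print(pb_only, vn_only)
```

Running this over every entry and tallying the unmapped labels would give a quick picture of whether the gaps cluster on adjunct-like arguments or point to some other explanation.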

Evaluating the contribution SRL makes to an application/Extensions to SRL - TL, Chinese VerbNet

This could include evaluating the contribution SRL makes to any NLP application, or developing PropBanks or VerbNets or FrameNets for other languages such as Chinese or Arabic.

Word Sense Disambiguation, extended to include NER - JG & AJ (Clinical NER), MB

A topic in this area could range from an implementation to an in-depth analysis of different approaches and their strengths and weaknesses.

Topic Modeling - TS w/ Jim Martin and EPIC group

Extensions to

Induction of Semantic Relations - DB, KS, TO, JP

Just about anything related to one of these papers or to NELL.

Topic of Your Choice - SL, rich dependency labels for Korean Dependency Structure parses

I'm open to suggestion. Find a couple of papers you are interested in and we can talk about how to turn that into a project.