Assignment 3:  Information Extraction

In this assignment you are to implement an HMM-based approach to named entity recognition.  In this approach, we can cast the problem of finding named entities as a tagging task using IOB tags. The framework for the HMM-based solution is identical to the one used for the POS tagging assignment.  The particular NER task we’ll be tackling is to find all the references to genes in a set of biomedical journal article abstracts. 

Sample GENE tags

Structure	O
,	O
promoter	O
analysis	O
and	O
chromosomal	O
assignment	O
of	O
the	O
human	B
APEX	I
gene	I
.	O

The training material consists of around 13,000 sentences with gene references tagged with IOB tags.  Since we're dealing with only one kind of entity here, there are just three tag types in the data: B, I, and O. The format of the data is identical to the POS tagging homework: one token per line, tab-separated from its tag.  An example is shown in the sidebar. In this example there is one gene mention, "human APEX gene", with the corresponding tag sequence B I I.
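A minimal sketch of reading this format into sentences, assuming (as in the POS tagging data) that blank lines separate sentences; the function name is our own choice:

```python
def read_tagged_file(path):
    """Read one-token-per-line, tab-separated token/tag data into
    a list of sentences, each a list of (token, tag) pairs."""
    sentences, current = [], []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                     # blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                token, tag = line.split("\t")
                current.append((token, tag))
    if current:                              # last sentence may lack a blank line
        sentences.append(current)
    return sentences
```

The same reader works for the training, dev, and test files, since they all share the format.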

Although the structure of this problem is the same as POS tagging, the characteristics of the problem are quite different.  In particular, there are far fewer parameters to learn for transition probabilities since there are only three tags. However, the vocabulary is much larger than the BERP domain and unknown words will be far more prevalent.  Both of these considerations may lead you to different strategies from those you used in Assignment 2.

Evaluation

As noted in the book, evaluation of these kinds of systems is not based on per-tag accuracy (you can do pretty well on that basis just by tagging everything O). What we really want to optimize is recall, precision, and F-measure at the gene level.  Remember that precision is the ratio of genes correctly identified by the system to the total number of genes your system found, and recall is the ratio of correctly identified genes to the total number you should have found.  The F1 measure given in the book is just the harmonic mean of these two. We will use F1 to evaluate your systems on the withheld test data.  You should create a training and dev set from the data being provided for use in developing your system.
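Gene-level scoring means comparing entity spans, not individual tags. A sketch of one way to do this for a single sentence, assuming a gene counts as correct only when its span matches the gold span exactly:

```python
def iob_spans(tags):
    """Return the set of (start, end) entity spans in an IOB tag sequence,
    with end exclusive."""
    spans, start = set(), None
    for i, t in enumerate(tags):
        if t == "B":
            if start is not None:            # close the previous span
                spans.add((start, i))
            start = i
        elif t == "O":
            if start is not None:
                spans.add((start, i))
                start = None
        elif t == "I" and start is None:
            start = i                        # treat a stray I as starting a span
    if start is not None:
        spans.add((start, len(tags)))
    return spans

def precision_recall_f1(gold_tags, pred_tags):
    """Span-level precision, recall, and F1 for one tag sequence pair."""
    gold, pred = iob_spans(gold_tags), iob_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For the whole dev set you would accumulate true positives and span counts across all sentences before dividing, rather than averaging per-sentence scores.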


Options

You have been assigned to pairs for this assignment; each pair is responsible for a single program submission. You have a choice of two ways to proceed. Option 1 has just been described. Option 2 is the same as Option 1 but uses skeleton code supplied in labs and has a mandatory Part II. Part II consists of defining and explaining in detail five linguistic features for use in POS tagging (not implemented, just a written description). The skeleton code and one or two associated labs describing how to use it will be available the week after Spring Break. Part II can also be combined with Option 1 (which does not use the skeleton code) for extra credit on this assignment.


What to hand in

As before, you will need to turn in your code, along with a short report describing what you did and all the choices you made.  You should include in your report how you assessed your system (training/dev) and how well you believe it works.  In addition, we will provide a test set for you to run your system on shortly before the due date.  Include the output of your system on this test data in the same form as the training data.

Jack will call your code using the following format:

python {Partner1LastName}_{Partner2LastName}_IOBTagger.py "PathToInputTrainingData" "PathToInputTestData" "PathToOutputFile"

An example of a call might look like this:

python Palmer_Brown_IOBTagger.py "C:/User/Jack/trainingData.txt" "C:/User/Jack/testData.txt" "C:/User/Jack/Palmer_Brown_output.txt"

Here are a few other things to note:

  • Please do not hard-code these paths into your code!
  • The output file should have the same number of lines as the input test file.
  • Do not assume that only one test file will be used; Jack may use an additional test file to evaluate your code as well.
  • Please do not put print statements inside long loops! This slows down your code significantly and does not help us evaluate your program. In general your code does not need any print statements, though you may add a few to indicate which step the program is on if you like.
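A sketch of an entry point that follows the required call format, taking all three paths from the command line; the train/tag helpers are placeholders for your own code:

```python
import sys

def main():
    # All paths come from the command line; nothing is hard-coded.
    if len(sys.argv) != 4:
        sys.exit("usage: python Lastname1_Lastname2_IOBTagger.py "
                 "TRAIN_PATH TEST_PATH OUTPUT_PATH")
    train_path, test_path, output_path = sys.argv[1:4]
    # model = train(train_path)                        # your training code
    # write_predictions(model, test_path, output_path) # your tagging code

if __name__ == "__main__":
    main()
```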


How to hand it in

1. Go to learn.colorado.edu and log in with your identikey
2. Choose the Natural Lang Processing course
3. Click the Dropbox link from the Assessments dropdown menu.
4. Click on the "Programming Assignment 3 Submission" link
5. Once you've added both your code and report files click the submit button.
Note: You may submit as many times as you want, but only your last submission will be graded. If you update just your report or your code, make sure to upload both files in your last submission.

This is due by 11:55 PM on APRIL 24, 2015.