
Assignment 2

In this assignment you will implement an HMM-based approach to POS tagging: specifically, the Viterbi algorithm using a bigram tag/state model.  As training data, I am providing a POS-tagged section of the BERP corpus. Your system will be evaluated against an unseen test set drawn from the same corpus.

Sample sentence

i       PRP

'd      MD

like    VB

french  JJ

food    NN

.       .


Training

The training data consists of around 15,000 POS-tagged sentences from the BERP corpus. The sentences are arranged as one word-tag pair per line, with words and tags separated by a tab and a blank line between sentences. Contractions are split out into separate tokens with separate tags.  An example is shown above.

You should assume that the tags that appear in the training data constitute all the tags that exist (no new tags will appear in testing).  On the other hand, new words may appear in the test set.


Decoding

For decoding, your system will read sentences from a file in the same format minus the tags: one word per line, with each sentence ending in a period and followed by a blank line. As output, you should emit an appropriate tag for each word in the same format as the training data.

Evaluation

To know whether you're making progress on improving your system during development, you need an evaluation metric.  I'm providing an evaluation script that reports a simple aggregate accuracy score.  (The evaluation script is being updated - stay tuned....)  To use the script from the command line on a Unix system, type something like this:

    evalPOS.py  berp-key.txt berp-out.txt

where berp-key.txt contains the word <tab> tag pairs with the gold-standard tags and berp-out.txt contains the system output to be evaluated.

Assignment

To complete the assignment, you'll need to address the following problems:

  1. Extract the required counts from the training data for the various probability estimates that are needed.
  2. Deal with unknown words in some systematic way.
  3. Do some form of smoothing (for the bigram transition probabilities).
  4. Implement the Viterbi algorithm (Figure 5.17).
  5. Tune your system in some sensible manner.  This may involve writing a smarter evaluation script than the one I'm giving you.  In particular, an evaluation script that produces a per-class error rate, or even better, a complete confusion matrix would be useful.
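To make steps 1-4 concrete, here is one possible sketch that combines counting, add-one smoothing of the transition probabilities, a crude uniform treatment of unknown words, and bigram Viterbi decoding. The specific smoothing and unknown-word choices here are illustrative defaults, not the required ones; you are expected to pick and justify your own:

```python
import math
from collections import defaultdict

def train(sentences):
    """Collect bigram transition and emission counts from tagged
    sentences, using pseudo-states <s> and </s> for boundaries."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sent in sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
        trans[prev]["</s>"] += 1
    return trans, emit, vocab

def viterbi(words, trans, emit, vocab):
    """Return the most likely tag sequence for `words` under a
    bigram HMM, working in log space to avoid underflow."""
    tags = list(emit)
    V = len(tags)

    def tp(prev, tag):
        # add-one smoothed transition probability; the +1 in the
        # denominator counts </s> as a possible next state
        return (trans[prev][tag] + 1) / (sum(trans[prev].values()) + V + 1)

    def ep(tag, word):
        # unsmoothed emission; unknown words get a uniform small prob
        if word not in vocab:
            return 1.0 / V
        total = sum(emit[tag].values())
        return emit[tag][word] / total if total else 0.0

    best = [{} for _ in words]   # best[i][t]: best log-prob ending in t
    back = [{} for _ in words]   # back[i][t]: argmax previous tag
    for t in tags:
        p = tp("<s>", t) * ep(t, words[0])
        best[0][t] = math.log(p) if p > 0 else float("-inf")
    for i in range(1, len(words)):
        for t in tags:
            e = ep(t, words[i])
            if e == 0:
                best[i][t] = float("-inf")
                back[i][t] = tags[0]
                continue
            b, arg = max((best[i - 1][p] + math.log(tp(p, t)), p)
                         for p in tags)
            best[i][t] = b + math.log(e)
            back[i][t] = arg
    # terminate with a transition to </s>, then follow backpointers
    _, last = max((best[-1][t] + math.log(tp(t, "</s>")), t) for t in tags)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

Note that smoothing the emission probabilities as well, or handling unknown words via suffix features, would likely improve on this baseline.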


What to hand in

You will need to turn in your code, along with a short report describing what you did and all the choices you made.  You should include in your report how you assessed your system (training/dev) and how well you believe it works.  In addition, I will provide a test set for you to run your system on shortly before the due date.  Include the output of your system on this test data in the same form as the training data.


How to hand it in

1. Go to learn.colorado.edu and log in with your IdentiKey
2. Choose the Natural Lang Processing course
3. Click the Dropbox link from the Assessments dropdown menu.
4. Click on the "Programming Assignment 2 Submission" link
5. Once you've added both your code and report files click the submit button.
Note: You may submit as many times as you want, but only your last submission will be graded. If you update just your report or code make sure to upload both for the last submission.

This is due by 11:55 PM on the EXTENDED DEADLINE - MARCH 5!