Homework5
DUE: Wednesday, April 24, by 1pm
For this homework, you will complete a sentiment analysis system for tweets by implementing features that will be used by supervised machine learners. Unlike previous homeworks, you will not be given much code to start with -- just a mostly empty repository and some requirements for how command-line arguments should produce particular outputs.
Notes:
- You should read all of the problems before starting.
- It will be best to solve the problems in order since each one builds on the previous problem.
- If you run into what you think might be a bug with Nak or Chalk, please let me know right away so I can fix it if it indeed is a bug.
- Feel free to look at the sentiment analysis homework on which this is based. This homework cuts some aspects of that homework, like doing subjectivity analysis separately in a two-stage classification setup.
- Tip: Check out Bo Pang and Lillian Lee's book: Opinion Mining and Sentiment Analysis (free online!)
- If you have any questions or problems with any of the materials, don't hesitate to ask!
If you don’t absolutely need to have words on paper, don’t print this homework description out -- it has a lot of example output that makes it lengthy, and it will be easier to do some cutting and pasting etc right from the web page.
As usual, you may work on this homework in teams. Make sure to include the names of the team members in the write-up. Only one submission is required per team. Teams are responsible for self-policing to ensure that everyone is helping get the work done. However, please get in touch with me if you feel someone is not working and you need help resolving the situation.
Fork the gpp repository. As you will notice, it is mostly empty.
Note for anyone who isn't a student in the class: If you are trying this homework out after it is due, you may find that the contents of the repository have changed. In that case, you can download version 0.1.0. (Also available as tag v0.1.0 in the repo.)
Your implemented solution will be a tagged version of your gpp fork and a PDF containing your written responses. Use the tag "ANLP-HW5-SUBMISSION".
You will work with Twitter polarity classification datasets that are located in gpp/data. Go to that directory and have a look at it. Note that all of the commands given in the problems assume that you are running them in the top-level gpp directory.
IMPORTANT: It should be possible for me to clone your repository, compile and run the commands given below without any trouble. Make sure to test it out on a different machine from the one you developed on.
Submission: Submit your written answers on Blackboard as <lastname>_<firstname>_hw5_answers.pdf. Make sure to include the link to your fork of gpp at the top of the file.
We now turn to the sentiment analysis task: predicting the polarity of tweets. There are three datasets: the Debate08 (Obama-McCain) dataset, Health Care Reform (HCR) dataset, and the Stanford Twitter Sentiment dataset.
The Debate08 dataset comes from the following papers:
- David A. Shamma; Lyndon Kennedy; Elizabeth F. Churchill. 2009. Tweet the Debates: Understanding Community Annotation of Uncollected Sources. ACM Multimedia, ACM.
- Nicholas A. Diakopoulos; David A. Shamma. 2010. Characterizing Debate Performance via Aggregated Twitter Sentiment. CHI 2010, ACM.
This dataset can be found in data/debate08. It has been split into train/dev/test XML files that you'll be using for the obvious purposes. See the script data/debate08/orig/create_debate_data_splits.py if you want to see the details of how the raw annotations were processed to create the files you'll be using.
The HCR dataset comes from the following paper:
- Michael Speriosu, Nikita Sudan, Sid Upadhyay, and Jason Baldridge. 2011. Twitter Polarity Classification with Label Propagation over Lexical Links and the Follower Graph. In Proceedings of the First Workshop on Unsupervised Methods in NLP. Edinburgh, Scotland.
As before, there is a train/dev/test split, with each split in its own XML file. [Note: I'm working on getting an updated version of this dataset with significantly more annotations, and more targets. I'll email the class when that is prepared.]
The Stanford Twitter Sentiment dataset comes from the following paper:
- Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. Unpublished manuscript. Stanford University, 2009.
This dataset is much smaller, so it will be used only for evaluating models after they have been trained on the materials from the other datasets.
Both Debate08 and HCR are in a common XML format, while the Stanford set is in its own native format. You'll need to provide appropriate readers for each data format.
One of the most important things to do when working on empirical natural language processing is to compare your results to reasonable baselines to ensure that the effort you are putting into some fancy model is better than a super simple approach. This is the "do the dumb thing first" rule, so let's do that.
If you have training data, then the easiest rule to follow is to find the majority class label and use that for labeling new instances. Using standard Unix tools, you can find this out for data/debate08/train.xml as follows:
$ grep 'label="' data/debate08/train.xml | cut -d ' ' -f4 | sort | uniq -c
    369 label="negative"
    143 label="neutral"
    283 label="positive"
So, negative is the majority label. However, you need to compute this in Scala code based on a sequence of labels. You need to write Scala code that will correctly parse command line arguments and be accessible by calling the "exp" target (which you need to add) of the bin/gpp bash script. (See the full help message at the bottom of this description for the options you should support.)
$ bin/gpp exp --train data/debate08/train.xml --eval data/debate08/dev.xml --method majority
################################################################################
Evaluating data/debate08/dev.xml
--------------------------------------------------------------------------------
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
--------------------------------------------------------------------------------
454 0 0 | 454 negative
141 0 0 | 141 neutral
200 0 0 | 200 positive
--------------------------
795 0 0
negative neutral positive
--------------------------------------------------------------------------------
57.11 Overall accuracy
--------------------------------------------------------------------------------
P R F
57.11 100.00 72.70 negative
0.00 0.00 0.00 neutral
0.00 0.00 0.00 positive
...................................
19.04 33.33 24.23 Average
The output shows evaluation for full polarity classification (positive, negative, neutral) with respect to both the raw confusion matrix and precision (P), recall (R) and F-score (F) for each category. Right now the detailed results aren't very interesting because we're not predicting more than a single label, so we'll discuss what these mean more in the next problem. Overall accuracy is computed simply as the number of tweets that received the correct label, divided by the total number of tweets.
Tip: Use the class nak.util.ConfusionMatrix to easily get the above output.
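If it helps to have a concrete starting point, here is a minimal sketch of the majority-label computation itself (the toy label sequence stands in for labels parsed from the training XML; everything else -- argument parsing, XML reading, evaluation -- is up to you):

```scala
object MajoritySketch extends App {

  // Pick the most frequent label from a sequence of gold training labels.
  def majorityLabel(labels: Seq[String]): String =
    labels.groupBy(identity).maxBy(_._2.size)._1

  // Toy input; in your system these come from the parsed training XML.
  val trainLabels = Seq("negative", "negative", "positive", "neutral", "negative")

  // Every evaluation tweet gets this single label as its prediction.
  println(majorityLabel(trainLabels)) // prints "negative"
}
```

Pair the resulting constant predictions with the gold labels (and tweet texts) and hand them to nak.util.ConfusionMatrix to get output like the above; check its scaladoc for the exact constructor arguments.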
Of course, predicting the same label for everything is pretty useless for the kinds of later processing and analysis one might want to do based on sentiment analysis. That means it is very important to not just consider the overall accuracy, but also to pay attention to the performance for each label. The lexicon-based method we consider next enables prediction of any label, and thus allows us to start considering label-level performance and trying to find models that not only have good overall accuracy, but which also do a good job at finding instances of each label.
Note: This is a pretty simple thing to compute, so most of the work you'll need to do here is just to get all the code set up for ingesting data, evaluating, etc. Once you have that in place, you'll be pretty much set for easily adding other classifiers and such. This is more free-form than the past homeworks, so some of you may actually find yourselves not knowing how to proceed---and my goal here is to make sure you can do this sort of thing since you don't get stub code in the real world. The resources you have at your disposal include looking at past examples, looking at the APIs, and asking for help from me and the other members of the class.
Another reasonable baseline is to use a polarity lexicon to find the number of positive tokens and negative tokens and pick the label that has more tokens. You've already done this in Project Phase Two, where we used Bing Liu's Opinion Lexicon.
For this problem, you must create a lexicon-based classifier that can be run using the --method lexicon flag, e.g. it should be possible to get results from the command line like this:
$ bin/gpp exp --eval data/debate08/dev.xml --method lexicon
Note that no training material is needed because this is only based on the words that are present and the polarity lexicon entries.
Tips:
- Tokenization matters a great deal for this. Consider using chalk.lang.eng.Twokenize.
- You are free to use Liu's lexicon as before, but you should also look at the MPQA lexicon and SentiWordNet. (You can even try using all of them.)
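As a rough illustration of the ratio idea, here is a sketch with placeholder lexicon entries and naive whitespace tokenization (swap in the real lexicon files and Twokenize):

```scala
object LexiconRatioSketch {

  // Placeholder entries; in practice, load Liu's opinion lexicon (and/or MPQA,
  // SentiWordNet) from disk.
  val positiveWords = Set("good", "great", "love")
  val negativeWords = Set("bad", "awful", "hate")

  // Count positive and negative tokens and pick the label with more hits;
  // ties (including zero hits on both sides) go to neutral.
  def classify(tweet: String): String = {
    val tokens = tweet.toLowerCase.split("\\s+").toSeq // use Twokenize instead
    val pos = tokens.count(positiveWords)
    val neg = tokens.count(negativeWords)
    if (pos > neg) "positive"
    else if (neg > pos) "negative"
    else "neutral"
  }
}
```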
My solution gets 44.28% accuracy. You should be able to get near that, but don't stress out about matching or beating it. Here's the full evaluation output.
$ bin/gpp exp --eval data/debate08/dev.xml --method lexicon
--------------------------------------------------------------------------------
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
--------------------------------------------------------------------------------
154 223 77 | 454 negative
19 102 20 | 141 neutral
26 78 96 | 200 positive
-----------------------------
199 403 193
negative neutral positive
--------------------------------------------------------------------------------
44.28 Overall accuracy
--------------------------------------------------------------------------------
P R F
77.39 33.92 47.17 negative
25.31 72.34 37.50 neutral
49.74 48.00 48.85 positive
...................................
50.81 51.42 44.51 Average
At this point, let's stop and look at the results in more detail. The overall accuracy is lower than what we get for the majority class baseline. However, the lexicon ratio method can predict any of the labels, which leads to more interesting patterns. Note the following:
- P, R and F stand for Precision, Recall and F-score, as standardly defined.
- For each evaluation type, an average of P/R/F is provided.
- The values we'll care about the most in final evaluations are the F-score average and the overall accuracy. However, it is important to consider precision and recall individually for different kinds of tasks.
- Even though the overall accuracy is a lot lower than the majority class baseline, the output is far more meaningful; this shows in the label-level results, and the P/R/F averages, which are much higher than for the majority class baseline.
Overall, this is clearly a poor set of results, but that is okay -- it's just a baseline! Let's do better with models acquired from the training data.
Part (a). Implementation. Now that we have a couple of simple baselines that we should expect to do at least as well as (and hopefully much better than), we can turn to machine learning from labeled training examples. You should enable the liblinear solvers that are available in Nak to train supervised models (logistic regression, support vector machines). Minimally, you should support the use of L2-regularized logistic regression (L2R_LR), such that it can be run as follows.
$ bin/gpp exp --train data/debate08/train.xml --eval data/debate08/dev.xml --method L2R_LR --cost .9
--------------------------------------------------------------------------------
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
--------------------------------------------------------------------------------
350 41 63 | 454 negative
85 40 16 | 141 neutral
90 14 96 | 200 positive
----------------------------
525 95 175
negative neutral positive
--------------------------------------------------------------------------------
61.13 Overall accuracy
--------------------------------------------------------------------------------
P R F
66.67 77.09 71.50 negative
42.11 28.37 33.90 neutral
54.86 48.00 51.20 positive
...................................
54.54 51.15 52.20 Average
These results are obtained using just bag-of-words features. It already looks much better than the baselines! For this problem, you'll improve the extraction of features and determine a good value for the cost parameter (which should default to 1.0).
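If you are unsure how to wire this up, here is a rough sketch of the training side. The Nak names used (NakContext.trainClassifier, LiblinearConfig, Featurizer, FeatureObservation, Example) are what I expect from the Nak classify API; treat them as assumptions and confirm them against the scaladocs for the Nak version in the build:

```scala
// Sketch only: verify the Nak class/method names against the version in build.sbt.
import nak.NakContext.trainClassifier
import nak.data.{Example, Featurizer, FeatureObservation}
import nak.liblinear.LiblinearConfig

object L2rLrSketch {

  // Bag-of-words featurizer: one "word=..." feature per whitespace token.
  val bowFeaturizer = new Featurizer[String, String] {
    def apply(tweet: String) =
      tweet.split("\\s+").toSeq.map(tok => FeatureObservation("word=" + tok))
  }

  // trainItems: (goldLabel, tweetText) pairs parsed from the training XML.
  def train(trainItems: Seq[(String, String)], cost: Double = 1.0) = {
    val examples = trainItems.map { case (label, text) => Example(label, text) }
    // cost is liblinear's C parameter; I'm assuming L2R_LR is the default solver
    // in LiblinearConfig -- set the solver type explicitly if your version differs.
    trainClassifier(LiblinearConfig(cost = cost), bowFeaturizer, examples)
  }
}
```

The returned classifier can then be applied to the evaluation tweets and its predictions fed to ConfusionMatrix, as before.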
Part (b). Written answer. Find a better cost value than the default (1.0) for L2R_LR on both the Debate08 and HCR datasets. Write down your best value for each dataset and include the output for both. You should find a good balance between the overall accuracy and the average F-score.
Part (c). Implementation. Improve and extend the features used by the classifiers by modifying how tweets are featurized. Some things you can do (a rough sketch combining several of these follows the list):
- lower casing all tokens
- using stems (see the PorterStemmer)
- excluding stop words from being features in unigrams
- consider bigrams and trigrams (possibly over raw tokens and/or stems)
- using the polarity lexicon (e.g. output a feature polarity=POSITIVE for every word that is in the positive lexicon)
- regular expressions that detect patterns like "loooove", "LOVE", "love", presence of emoticons, etc.
- consider using the TweetNLP tools -- which include part-of-speech taggers for Twitter.
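Here is the rough sketch promised above, combining several of these ideas as plain attribute=value feature strings (the stopword list and mini-lexicons are hypothetical placeholders; wrap the strings in whatever form your featurizer expects):

```scala
object ExtendedFeaturesSketch {

  // Placeholders -- use a real stopword list and the polarity lexicon(s) you chose.
  val stopwords   = Set("the", "a", "of", "and", "to")
  val positiveLex = Set("love", "great")
  val negativeLex = Set("hate", "awful")

  val elongated = ".*([a-z])\\1\\1.*".r   // e.g. "loooove"
  val emoticon  = """[:;]-?[)(DP]""".r    // e.g. ":)", ";-(", ":D"

  def features(tweet: String): Seq[String] = {
    val tokens  = tweet.split("\\s+").toSeq
    val lowered = tokens.map(_.toLowerCase)

    val unigrams = lowered.filterNot(stopwords).map("word=" + _)
    val bigrams  = lowered.sliding(2).map(_.mkString("bigram=", "_", "")).toSeq
    val polarity = lowered.collect {
      case w if positiveLex(w) => "polarity=POSITIVE"
      case w if negativeLex(w) => "polarity=NEGATIVE"
    }
    val allCaps   = tokens.filter(t => t.length > 1 && t.exists(_.isLetter) && t == t.toUpperCase)
                          .map(_ => "allcaps=true")
    val stretched = lowered.collect { case elongated(_) => "elongated=true" }
    val smileys   = emoticon.findAllIn(tweet).map("emoticon=" + _).toSeq

    unigrams ++ bigrams ++ polarity ++ allCaps ++ stretched ++ smileys
  }
}
```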
It should be possible to use the --extended (-x) flag in your command line options, e.g.:
$ bin/gpp exp --train data/debate08/train.xml --eval data/debate08/dev.xml --method L2R_LR --extended
You'll want to consider different values for the cost than what you had before.
Part (d). Written answer. Describe the features that you used and why, giving examples from the training set. Include the output from your best model.
For comparison, here are my best results on data/debate08/dev.xml (when training only on debate08/train.xml):
--------------------------------------------------------------------------------
Confusion matrix.
Columns give predicted counts. Rows give gold counts.
--------------------------------------------------------------------------------
406 30 18 | 454 negative
76 61 4 | 141 neutral
81 23 96 | 200 positive
----------------------------
563 114 118
negative neutral positive
--------------------------------------------------------------------------------
70.82 Overall accuracy
--------------------------------------------------------------------------------
P R F
72.11 89.43 79.84 negative
53.51 43.26 47.84 neutral
81.36 48.00 60.38 positive
...................................
68.99 60.23 62.69 Average
Hopefully some of you will beat this!
This problem involves predicting the polarity of tweets in the Stanford Twitter Sentiment dataset.
Part (a). Implementation. Convert the Stanford Twitter Sentiment corpus so that it is in the XML format of the other datasets. You should be able to run it as follows:
$ bin/gpp convert-stanford data/stanford/orig/testdata.manual.2009.05.25 > data/stanford/dev.xml
To do this conversion, look at the original data file and at the paper referenced above. Everything you need to create the XML elements is in there.
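If it helps to see the overall shape, here is a rough sketch of such a converter. Both the field layout assumed for the Stanford file (separator, column order, polarity codes) and the XML element/attribute names are placeholders -- get the real ones from the data file itself and from data/debate08/train.xml:

```scala
import scala.xml.Utility

object StanfordConvertSketch {

  // Hypothetical mapping from numeric polarity codes to labels; confirm the codes
  // against the Go et al. paper and the data file.
  def toLabel(code: String): String = code match {
    case "0" => "negative"
    case "2" => "neutral"
    case "4" => "positive"
    case other => sys.error("unexpected polarity code: " + other)
  }

  def convert(lines: Iterator[String]): String = {
    val items = lines.map { line =>
      val fields = line.split(";;")            // assumption: check the real separator
      val (polarity, tweet) = (fields(0), fields.last)
      // Placeholder element names -- mirror the real schema of the debate08/HCR files.
      s"""  <item label="${toLabel(polarity)}"><content>${Utility.escape(tweet)}</content></item>"""
    }
    items.mkString("<dataset>\n", "\n", "\n</dataset>")
  }
}
```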
Now, use it as an evaluation set. Do so for the lexicon ratio classifier and for L2R_LR trained on the Debate08 and HCR training sets individually and on both together. For the latter, note that the --train option should be able to take a list of file names, e.g.:
$ bin/gpp exp --train data/debate08/train.xml data/hcr/train.xml -e data/stanford/dev.xml --extended
For the lexicon classifier, it is just like you did for the other datasets, since no training is involved.
Part (b). Written answer. In a paragraph or two, describe what happened, and why you think it did.
Look in data/emoticon -- you'll see:
- happy.txt: 2000 tweets that have a smiley emoticon
- sad.txt: 2000 tweets that have a frowny emoticon
- neutral.txt: 2000 tweets that don't have smilies or frownies or certain subjective terms (it's noisy, so it is very much neutral-ISH)
Part (a). Write code that produces a training XML file from the above files in the format of the others (you can put in a dummy target, e.g. "unknown"). The tweets in happy.txt, sad.txt, and neutral.txt should be labeled positive, negative, and neutral, respectively. So, this is clearly an attempt to get annotations for free -- and there are indications that it should work, e.g. see the technical report by Go et al. 2009: Twitter Sentiment Classification using Distant Supervision. Speriosu et al. also take advantage of emoticon training, but do so via label propagation rather than direct training of models.
Make your implementation accessible as the convert-emoticon target so that it can be run as follows:
$ bin/gpp convert-emoticon data/emoticon > data/emoticon/train.xml
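A sketch of this converter in the same hedged spirit (again, mirror the real XML schema from the debate08/HCR files, and check whether the emoticon files carry extra fields per line beyond the tweet text):

```scala
import scala.io.Source
import scala.xml.Utility

object EmoticonConvertSketch {

  // File-to-label mapping as described above.
  val fileToLabel =
    Seq("happy.txt" -> "positive", "sad.txt" -> "negative", "neutral.txt" -> "neutral")

  def convert(dir: String): String = {
    val items = for {
      (file, label) <- fileToLabel
      tweet <- Source.fromFile(s"$dir/$file", "UTF-8").getLines().toSeq
    } yield {
      // Placeholder element names; the target is the dummy value "unknown".
      s"""  <item label="$label" target="unknown"><content>${Utility.escape(tweet)}</content></item>"""
    }
    items.mkString("<dataset>\n", "\n", "\n</dataset>")
  }
}
```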
Part (b). Writing. Use emoticon/train.xml as a training source and evaluate on debate08/dev.xml and hcr/dev.xml. Discuss the results, with reference to both datasets. Does it at least beat a random baseline or a lexicon-based classifier? How close does it come to the supervised models? Does the effectiveness of the noisy labels vary with respect to the model? Pay particular attention to the label-level precision and recall values. Are they more balanced or less balanced than the results from models trained on human annotated examples? If there are differences, can you think of reasons why?
Part (c). Writing. You are likely to find that the results aren't as good as with the models trained on human annotated examples. So, perhaps there is a way to take advantage of both the human annotations and this larger set of noisily labeled examples. Actually, there are many ways of doing this -- here you'll do the very simple strategy of just concatenating the two training sets as we did for the previous problem.
$ bin/gpp exp --train data/debate08/train.xml data/hcr/train.xml --eval data/debate08/dev.xml
You'll probably find that you need to adjust the cost value to get better results. Try this strategy for both Debate08 and HCR, using data/emoticon/train.xml, and discuss what comes of that. What do you think you could do to improve things?
We'll wrap up with a summary of your best results and a look at the output of your best model.
Part (a). Written answer. For debate08/dev.xml, hcr/dev.xml, and stanford/dev.xml, state your best score for each model and configuration, including which training set (d, h, e -- for Debate08, HCR, and emoticon), feature set, and parameters were involved. Do it as a table, e.g.:
| Model        | Training | Features | Cost Value | Overall F-score | Negative F-score | Neutral F-score | Positive F-score | Average F-score |
|--------------|----------|----------|------------|-----------------|------------------|-----------------|------------------|-----------------|
| LexiconRatio |          |          |            |                 |                  |                 |                  |                 |
| L2R_LR       |          | Basic    |            |                 |                  |                 |                  |                 |
| L2R_LR       |          | Extended |            |                 |                  |                 |                  |                 |
Feel free to add other solver types if you have enabled them and are interested in trying them out.
Part (b). Written answer. Run all of the above models/configurations with these same parameters on debate08/test.xml and hcr/test.xml and produce a table of those results. Did the models maintain the same relative performance as they had on the development set? Are the differences in performance about the same, or have they increased?
Part (c). Written. The option --detailed (-d) should output the correctly resolved tweets and the incorrectly resolved ones. Provide the output from ConfusionMatrix.detailedOutput for this.
Obtain the detailed output for your best system on data/hcr/dev.xml. Look at at least 20 of the incorrect ones and discuss the following:
- Which tweets, if any, do you think have the wrong gold label?
- Which tweets, if any, are too complex or subtle for the simple positive/negative/neutral distinction? (For example, they are positive toward one thing and negative toward another, even in the same breath.)
- Which tweets, if any, have a clear sentiment value, but are linguistically too complex for the kind of classifiers you are using here?
- Which tweets, if any, do you think the system should have gotten? Are there any additional features you could easily add that could fix them (provided sufficient training data)?
For each of these, paste the relevant tweets you discuss into your response file.
Part (d) Written answer. Based on your experience creating features with the resources made available to you and having looked at the errors in detail, describe 2-3 additional strategies you think would help in this context, such as other forms of machine learning, additional linguistic processing, etc. Feel free to look up papers in the literature (and the opinion mining book by Pang and Lee) and use their findings as support/evidence.
There are various things you could do:
- Improve the modeling, e.g. using ensembles, label propagation, bootstrapping, etc.
- Try out another sentiment dataset and work with it. (I'll post some links later, but email me if you want them and I haven't yet done so.)
- Pull more emoticon data from Twitter and train models with a much larger set than what was provided in data/emoticon to see whether that improves that strategy. You might check out this relevant paper: Lin and Kolcz (2012).
- Train models from all the data provided here and create an application that uses a saved polarity classification model to classify tweets for analysis or display. (This will likely tie into some of your class projects.)
- Pull some tweets (like 20-30), apply your best model to them, and score them. Discuss the quality of your model's output. Consider comparing it to other online sentiment analysis tools.