-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathTODO
22 lines (22 loc) · 1.19 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
* Play with tf-idf stuff
* Stop words good or bad?
* do bi-grams help? trigrams?
* PCA -> is it necessary? Are there better sizes (eg, try 10, 20, 30, ... 100?)
* Try to normalize the ner_XXX and pos_XXX vectors by num_words. ---> Not too helpful
* Try alternate functional forms, eg num_chars^2, log(num_chars), num_chars^(1/2)
* Better spell checker
* Presence of Keywords in prompt (compare tf-idf of prompt X among prompts with tf-idf of that essay or something???)
* Check for presence of verbs in all sentences?
* try turning things into flags?
* Try doing RFM classification. Several people have found it works better, based on online comments, but hasn't so far for me. Try properly.
* Investiage GBM classification v GBM regression
* Use caret to select features, tune parameters, including shrinkage, n.minobsinnode (5, 10, 15, 20), interaction depth (3 is better than 1 or 2; do 4-8 improve further?)
* Features:
* Sentence length flags (over 60 words, under XXX words)
* Sentences beginning with coordinating conjunctions (and, but, hopefully, etc)
* successive nouns (3+)
* successive prepositional phrases (3+)
* use of first person
* split infinitives
* double negatives
* oxford comma