added hw3 markdown

USCbiostats · Oct 23, 2024 · 45d6514
1 parent cac8115
commit 45d6514

Showing 2 changed files with 25 additions and 0 deletions.
diff --git a/website/content/assignment/09-hw3.Rmd b/website/content/assignment/09-hw3.Rmd
@@ -2,6 +2,7 @@
 title: "Assignment 03 - Text Mining"
 output: tufte::tufte_html
 link-citations: yes
+date: 2024-10-23
 ---
 
 ## Due Date

diff --git a/website/content/assignment/09-hw3.md b/website/content/assignment/09-hw3.md
@@ -0,0 +1,24 @@
+---
+title: "Assignment 03 - Text Mining"
+output: tufte::tufte_html
+link-citations: yes
+date: 2024-10-23
+---
+
+## Due Date
+
+This assignment is due by 11:59pm Pacific Time, November 8th, 2024. 
+
+## Text Mining
+
+A new dataset has been added to the data science data repository <https://github.com/USCbiostats/data-science-data/tree/master/03_pubmed>. The dataset contains 3,241 abstracts from articles collected via 5 PubMed searches. The search terms are listed in the second column, `term`, and these will serve as the "documents." Your job is to analyse these abstracts to find interesting insights.
+
+1.  Tokenize the abstracts and count the number of occurrences of each token. Do you see anything interesting? Does removing stop words change which tokens appear as the most frequent? What are the 5 most common tokens for each search term after removing stop words?
+2.  Tokenize the abstracts into bigrams. Find the 10 most common bigrams and visualize them with ggplot2.
+3.  Calculate the TF-IDF value for each word-search term combination (here you want the search term to be the "document"). What are the 5 tokens from each search term with the highest TF-IDF value? How are the results different from the answers you got in question 1?
+
+## Sentiment Analysis
+
+1.  Perform a sentiment analysis using the NRC lexicon. What is the most common sentiment for each search term? What if you remove `"positive"` and `"negative"` from the list?
+2.  Now perform a sentiment analysis using the AFINN lexicon to get an average positivity score for each abstract (hint: you may want to create a variable that indexes, or counts, the abstracts). Create a visualization that shows these scores grouped by search term. Are any search terms noticeably different from the others?
+