- TA: 洪郁雯 (Graduate Institute of Journalism), r05342005[@]ntu.edu.tw (please remove the brackets)
- https://r4css.github.io/R1062/
- Join the Facebook page R1062 to ask for help and to meet up.
- A cheatsheet to look up commands
- Download the whole project before lessons
- YouTube videos for review
# Alternative locale settings: run only the one that works on your system.
Sys.setlocale(category = "LC_ALL", locale = "UTF-8")
Sys.setlocale(category = "LC_ALL", locale = "C")
Sys.setlocale(category = "LC_ALL", locale = "cht") # for Windows
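
The locale calls above are alternatives, so one way to avoid picking by hand is to try them in order. The following is a minimal sketch, not part of the course materials: `set_course_locale()` is a hypothetical helper name, and the choice of which locale names to try on which platform is an assumption based on the `# for win` comment above.

```r
# Hypothetical helper (not part of the course materials): try locale names in
# order and keep the first one the operating system accepts.
set_course_locale <- function() {
  # Assumption: "cht" is tried on Windows, "UTF-8" elsewhere, "C" as a fallback.
  candidates <- if (.Platform$OS.type == "windows") c("cht", "C") else c("UTF-8", "C")
  for (loc in candidates) {
    # Sys.setlocale() warns and returns "" when the request cannot be honoured.
    result <- suppressWarnings(Sys.setlocale(category = "LC_ALL", locale = loc))
    if (!is.null(result) && nzchar(result)) {
      return(invisible(loc))
    }
  }
  invisible(NA_character_)
}

chosen <- set_course_locale()
message("Requested locale: ", chosen)
message("Locale now in use: ", Sys.getlocale())
```

If Chinese text still displays incorrectly after this, the locale name accepted by your system may differ from the ones tried here.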
- Slide10: Random Forest
- Be sure to have 9_3_decision_tree_titanic.Rmd and TM08_stock_random_forest.Rmd
- Slide09: SVM
- Slide08-1: PCA
- [PCA iris]
- [PCA Marriage Equality]
- Pre-report and review of topic modeling
- Slide07. Processing Chinese text and topic modeling
- Paste the link to your slides for next week's report here before class.
- Slide06. dplyr_trump's tweet
- AS#9 Use your own data (in English or Chinese) to practice text processing skills (at least 3 charts).
- Slide04. HTML Parser
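- To get comfortable with this chapter, you will need to spend extra time on the following questions: What are HTML, CSS, and XPath? In HTML, what are id and class for, and what are their properties? What is an HTML element? What is an attribute of an HTML element?
- You can skim the explanations provided by w3school, in either the English or the Chinese version.
- Learning HTML: covers the sections Introduction, Basic, Elements, Attributes, Headings, CSS, Links, Blocks, Images, Tables, Lists, Classes, and Id.
- CSS syntax.
- CSS combinators.
- XPath introduction.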
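
The questions above (element, attribute, id, class, XPath) can be illustrated with a short sketch. It assumes the `xml2` package is installed; the HTML snippet and the URL in it are made up for illustration and do not come from the course materials.

```r
library(xml2)  # assumption: the xml2 package is installed

# A made-up page, used only for illustration.
html <- '
<html><body>
  <div id="main">
    <h1>News</h1>
    <p class="lead">First story</p>
    <p class="lead">Second story</p>
    <p>Footer note</p>
    <a href="https://example.com/page2">Next page</a>
  </div>
</body></html>'

doc <- read_html(html)

# id: selects the single <div> whose id is "main", then its <p> children.
# (Equivalent CSS selector: "#main > p")
all_p <- xml_find_all(doc, "//div[@id='main']/p")

# class: selects only the <p> elements that share the class "lead".
# (Equivalent CSS selector: "p.lead")
lead_p <- xml_find_all(doc, "//p[@class='lead']")
lead_text <- xml_text(lead_p)

# attribute: reads the value of href from the first <a> element.
next_url <- xml_attr(xml_find_first(doc, "//a"), "href")

print(length(all_p))
print(lead_text)
print(next_url)
```

An id identifies one element on a page, while a class can be shared by many, which is why the class query returns several nodes and the id query anchors on exactly one.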
- AS#5 Crawling news media search page. No later than 4/23 23:59.
- Slide03. crawler design
- AS#4. Plotting Air Quality Index on Map. No later than 4/16, 23:59.
- Slide02. Read csv and json
- AS#3. Reading open data. No later than 04/09 23:59.
- Slide02. Read csv and json - Video02-1 Paid Maternity Leave.
- AS#2 writing code with RMD. No later than Mar 26 (MON) 23:59 (Remember our assignment policy).
- AS#1 Learning with datacamp announced. Submitting to ceiba AS#1 no later than 2018/3/13 23:59.
- Reading Chapter 4 of R for data science to understand basic data types, assignments, variables, character vs. numeric variables, and functions.
- R00. Install
- R01. R Basic
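- Add/drop policy: Apart from students of the Graduate Institute of Journalism, priority for adding this course goes first to students who have taken the Institute's 「新聞與數位創作」 (News and Digital Creation) program, then to students of the College of Social Sciences, and then to students of Bio-Industry Communication and Development, the College of Liberal Arts, and the College of Management. Because science and engineering students have many programming resources available to them, they are advised to take the CS+X course series that NTU offers to build students' computing skills.
- Intended audience: This course is designed for students with no programming experience who are interested in data journalism. Students with programming experience are advised not to enroll, and auditing is not permitted.
- Credit policy: Students of the Graduate Institute of Journalism may take this course, but because its content overlaps with 「新聞資料分析」 (News Data Analysis), only one of the two courses can count toward graduation credits.
- Because the course targets students without programming experience, the instructor may adjust the content according to students' progress, but it will cover at least the following: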
- R Programming basics
- Reading files including CSV and JSON formats
- Processing data with the apply() family and the dplyr package
- Exploratory Data Analysis with ggplot()
- Web crawling skills: getting data with GET and POST requests
- Web APIs, e.g., Google Maps, Facebook, and Twitter
- Text processing packages including tidytext and jiebaR
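- 0% Quiz: The course includes two ungraded in-class quizzes, one to gauge students' progress early in the term and one to assess learning outcomes at the end.
- 3% absence: Students who cannot attend must request leave through the university's official procedure. Each absence found without approved leave deducts three points from the final semester grade.
- 40% Assignments: All assignments are due within five days of being announced (Mon 23:59), so that the TA can grade them and help students review the content. Late submissions are accepted within seven days of the announcement but are graded at 80%; they are no longer accepted once the following week's class begins (Wed 11:59). For submission problems, contact the TA (郁雯). Assignments must be submitted in the specified format (from the third assignment on, they must be written in RMarkdown or R Notebook, and both the .rmd file and its generated HTML file must be submitted). If the format is wrong and the student, after being contacted by the TA, does not resubmit by the day of the following week's class, that assignment receives no credit.
- 30% Digital News Project: Develop a data news story from government open data or a designated dataset, focusing on data aggregation, data cleaning, and visual storytelling. The story must be submitted to 台大新聞e論壇 or another news outlet.
- 30% Data Science Project: Crawl unstructured text data yourself and perform text mining on it, focusing on text analysis, model application, and post hoc interpretation.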
- R for data science
- Text mining using R
- Good jiebaR introduction
- http://www.rdatamining.com/
- Learning R in Y minutes
- Datacamp for R
Week | Date | Topic | Activities |
---|---|---|---|
W02 | 0307 | R Basic; data types, import/export data | AS#1 |
W03 | 0314 | Reading sheet data: csv, excel | AS#2 |
W04 | 0321 | Reading hierarchical data: json and xml | AS#3 |
W05 | 0328 | Getting data by web API: Facebook/Google Maps as an example | AS#4 |
W06 | 0404 | Spring break | |
W07 | 0411 | Visualization and ggplot: President Tsai’s FB page activities | AS#5 |
W08 | 0418 | Crawler designs | AS#6, Submit mini proposal |
W09 | 0425 | HTML Parser | AS#7 |
W10 | 0502 | dplyr and ggplot: Analyzing Trump’s tweets | AS#8 |
W11 | 0509 | Unsupervised learning and dimensionality reduction: K-means clustering, PCA, SVD, t-SNE | AS#9 |
W12 | 0516 | Project I presentation (5 mins) | AS#10, Submit project 1 |
W13 | 0523 | Supervised learning, an overview: Regression, Decision tree, Random forest, SVM, Naive Bayes | AS#11 (Submit proposal 2) |
W14 | 0530 | Project 2 proposal presentation (5 mins); Regression | AS#12 |
W15 | 0606 | Supervised learning for text: Stock and epidemic predictions | |
W16 | 0613 | Topic modeling | |
W17 | 0620 | More: Deep learning for text | |
W18 | 0627 | Final presentation | Submit final project |
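- Use government open data, or data collected with a web crawler, as the subject of analysis, and analyze and visualize it in R. Work is done in groups, and all group members must take part in the presentation.
- (-3%) R code and raw data: The work must be written in RMD and submitted in both RMD and HTML formats. Missing any one of the raw data, RMD, or HTML deducts 3%.
- (15%) Project presentation (10 slides). Grading criteria include the problem, innovation, and literature and background.
- (15%) Written news feature (800 to 1,000 Chinese characters, written in Word). Grading criteria include background and problem awareness, literature review and interview results, and interpretation of results.
- (-3%) The news story must be edited and uploaded to Medium.com.
- (Optional 5%) The results include a profile interview.
- Use government open data, Kaggle, Facebook, Twitter, or data collected with a web crawler as the subject of analysis, and analyze and visualize it in R. Work is done in groups, and all group members must take part in the proposal and the final presentation. Anyone who is absent when required to attend and has no leave on record loses three points from the total grade.
- (-3%) R code and raw data: The work must be written in RMD and submitted in both RMD and HTML formats. Missing any one of the raw data, RMD, or HTML deducts 3%. The news story must be edited and uploaded to Medium.com.
- (30%) Project presentation (15 slides). Grading covers innovation, presentation, and problem background and literature, with live peer evaluation.
- (Optional 5%) Derived news story. Graded by Professor 林照真, editor-in-chief of 台大新聞e論壇.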