<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Scraping with R</title>
<meta name="description" content="">
<meta name="author" content="Eugene Pyatigorsky">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<link rel="stylesheet" href="libraries/frameworks/revealjs/css/reveal.min.css">
<link rel="stylesheet" href="libraries/frameworks/revealjs/css/theme/night.css" id="theme">
<link rel="stylesheet" href="libraries/highlighters/highlight.js/css/Github.css" id="theme">
<!--[if lt IE 9]>
<script src="lib/js/html5shiv.js"></script>
<![endif]--> <link rel="stylesheet" href = "assets/css/Icon">
<link rel="stylesheet" href = "assets/css/mystyle.css">
<link rel="stylesheet" href = "assets/css/pdf.css">
</head>
<body>
<div class="reveal">
<div class="slides">
<section class='' data-state='' id='slide-1'>
<h1>Scraping with <strong>R</strong></h1>
<p><br><br><br><br>
<footer>
Cincinnati R Users Group<br>
June 21, 2016 <br><br>
<strong>Eugene Pyatigorsky</strong>
<br><br>
This presentation and supporting materials available at:<br>
<a href="https://github.com/epspi/Rscraping">https://github.com/epspi/Rscraping</a>
</footer></p>
</section>
<section class='' data-state='' id='slide-2'>
<h2>Agenda</h2>
<ul>
<li>Overview of packages</li>
<li>A look at how to scrape</li>
<li>Working example</li>
<li>Best practices</li>
</ul>
</section>
<section class='chapter' data-state='' id='slide-3'>
<h1>Overview of packages</h1>
</section>
<section>
<section class='' data-state=''>
<h2>rvest</h2>
<p>Most of the work will be done by Hadley Wickham's package <code>rvest</code></p>
<ul>
<li>Inspired by Python's <code>Beautiful Soup</code></li>
<li>Extracts elements from the DOM using CSS selectors or XPath</li>
</ul>
<p class='fragment'><strong>e.g.<br><code>rvest::html_table()</code></strong></p>
<aside class='notes'>
</aside>
</section>
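<section class='' data-state=''>
<h2>rvest: a minimal sketch</h2>
<p>To make the CSS/XPath point concrete, here is a small sketch that runs against an inline HTML string rather than a live site. The page content and variable names are made up for illustration.</p>
<pre><code class="r">library(rvest)

# Inline HTML stands in for a downloaded page (hypothetical content)
page &lt;- read_html('
  &lt;h2&gt;&lt;a href="/a"&gt;First&lt;/a&gt;&lt;/h2&gt;
  &lt;table&gt;
    &lt;tr&gt;&lt;th&gt;city&lt;/th&gt;&lt;th&gt;members&lt;/th&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Cincinnati&lt;/td&gt;&lt;td&gt;120&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Dayton&lt;/td&gt;&lt;td&gt;45&lt;/td&gt;&lt;/tr&gt;
  &lt;/table&gt;')

# The same element selected two ways: CSS selector, then XPath
css_title   &lt;- page %&gt;% html_nodes("h2 a") %&gt;% html_text()
xpath_title &lt;- page %&gt;% html_nodes(xpath = "//h2/a") %&gt;% html_text()

# html_table() converts a table node into a data frame
groups &lt;- page %&gt;% html_node("table") %&gt;% html_table()
</code></pre>
<p>Both selectors should return the same link text; which one to use is mostly a matter of taste, though XPath can express some queries that CSS cannot.</p>
<aside class='notes'>
</aside>
</section>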
<section class='' data-state=''>
<h2>httr</h2>
<p>This is Hadley Wickham's wrapper for <code>curl</code></p>
<ul>
<li>Really useful for making customized calls to APIs</li>
<li>Can also be used for writing your own API client packages</li>
</ul>
<p class='fragment'><strong>e.g.<br><code>httr::GET("some_endpoint", config)</code></strong></p>
<aside class='notes'>
</aside>
</section>
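<section class='' data-state=''>
<h2>httr: a minimal sketch</h2>
<p>As a rough illustration of a customized call, the sketch below builds a query URL and wraps <code>GET()</code> in a small helper. The helper name <code>fetch_page</code> and the user agent string are hypothetical; the actual request is left commented out because it needs network access.</p>
<pre><code class="r">library(httr)

# Build the query from named parts; httr takes care of URL encoding
search_url &lt;- modify_url("https://www.bing.com/search",
                         query = list(q = "Cincinnati R users group"))

# Hypothetical helper: send the request with a timeout and an identifying user agent
fetch_page &lt;- function(url) {
  resp &lt;- GET(url, user_agent("cincy-r-demo"), timeout(10))
  stop_for_status(resp)   # turn HTTP errors into R errors
  content(resp, as = "text", encoding = "UTF-8")
}

# html &lt;- fetch_page(search_url)   # needs network access
</code></pre>
<aside class='notes'>
</aside>
</section>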
</section>
<section class='chapter' data-state='' id='slide-5'>
<h1>How to Scrape: An Example</h1>
</section>
<section>
<section class='' data-state=''>
<h2>Let's ask Bing about the R Users Group</h2>
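<pre><code class="r">library(rvest)  # provides read_html(), html_nodes(), html_text() and %&gt;%

lnk &lt;- 'http://www.bing.com/search?q=Cincinnati+R+users+group&amp;go=Submit&amp;qs=n&amp;form=QBLH&amp;pq=cincinnati+r+users+g&amp;sc=0-20&amp;sp=-1&amp;sk=&amp;cvid=4A13A7CB066B419B9F7BD75777D68F09'

# Result titles are the links inside the h2 headings
read_html(lnk) %&gt;%
  html_nodes("h2 a") %&gt;%
  html_text()
</code></pre>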
<pre><code>## [1] "Cincinnati UC Users Group (Cincinnati, OH) - Meetup"
## [2] "Local R User Group Directory - Revolutions"
## [3] "New R User Group in Cincinnati / Dayton - Revolutions"
## [4] "Cincinnati Sharepoint User Group - Facebook"
## [5] "Cincinnati .Net Users Group"
## [6] "CincyPowerShell | PowerShell Community Groups"
## [7] "Reinaldo R. - Cincinnati UC Users Group (Cincinnati, OH ..."
## [8] "Group: Cincinnati |Tableau Support Community"
</code></pre>
<aside class='notes'>
</aside>
</section>
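<section class='' data-state=''>
<h2>Titles and links together</h2>
<p>The same <code>"h2 a"</code> selector can also yield the link targets. This sketch assumes a tiny inline page with made-up results, so it runs without contacting Bing; <code>html_attr()</code> reads an attribute from each matched node.</p>
<pre><code class="r">library(rvest)

# Hypothetical search results, inlined so the example runs offline
results &lt;- read_html('
  &lt;h2&gt;&lt;a href="http://example.com/a"&gt;Result A&lt;/a&gt;&lt;/h2&gt;
  &lt;h2&gt;&lt;a href="http://example.com/b"&gt;Result B&lt;/a&gt;&lt;/h2&gt;')

links &lt;- results %&gt;% html_nodes("h2 a")

# One row per result: visible text plus the href attribute
hits &lt;- data.frame(title = html_text(links),
                   url   = html_attr(links, "href"),
                   stringsAsFactors = FALSE)
</code></pre>
<aside class='notes'>
</aside>
</section>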
<section class='' data-state=''>
<h2>Common CSS Selectors</h2>
<ul>
<li><code>#</code> for "id="</li>
<li><code>.</code> for "class="</li>
</ul>
<p class='fragment'><strong>OR you can use SelectorGadget for Chrome</strong>
<br>
<a href="https://chrome.google.com/webstore/detail/selectorgadget/">https://chrome.google.com/webstore/detail/selectorgadget/</a></p>
<aside class='notes'>
</aside>
</section>
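<section class='' data-state=''>
<h2>Selectors in action</h2>
<p>A short sketch of the two selectors above, run against a made-up inline snippet. The ids and class names here are assumptions chosen for the example.</p>
<pre><code class="r">library(rvest)

# Hypothetical markup: one container with an id, paragraphs with classes
page &lt;- read_html('
  &lt;div id="main"&gt;
    &lt;p class="price"&gt;$100&lt;/p&gt;
    &lt;p class="price"&gt;$250&lt;/p&gt;
    &lt;p class="note"&gt;sold&lt;/p&gt;
  &lt;/div&gt;')

# "." matches on class: every element with class="price"
prices &lt;- page %&gt;% html_nodes(".price") %&gt;% html_text()

# "#" matches on id; selectors can be combined to narrow the search
note &lt;- page %&gt;% html_nodes("#main p.note") %&gt;% html_text()
</code></pre>
<aside class='notes'>
</aside>
</section>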
</section>
<section class='chapter' data-state='' id='slide-7'>
<h1>A Working Site</h1>
</section>
<section>
<section class='' data-state=''>
<p><a href="http://cincyreal.followthenumbers.com">Cincinnati Foreclosures - A Real Estate Scraper</a></p>
<aside class='notes'>
</aside>
</section>
</section>
<section class='chapter' data-state='' id='slide-9'>
<h1>Best Practices</h1>
</section>
<section>
<section class='' data-state=''>
<h2>Authentication</h2>
<p>Use APIs instead of scraping whenever possible. Documentation for <code>rvest</code> is sparse, and cookie-based authentication can be tricky.</p>
<aside class='notes'>
</aside>
</section>
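<section class='' data-state=''>
<h2>Calling an authenticated API</h2>
<p>When a site does offer an API, a token in a request header is often simpler than managing login cookies. The sketch below assumes a bearer-token scheme, a hypothetical endpoint, and an environment variable named <code>API_TOKEN</code>; a real API may expect something different, so check its documentation.</p>
<pre><code class="r">library(httr)

# Hypothetical helper: call a token-protected JSON endpoint.
# The bearer scheme and the API_TOKEN variable name are assumptions.
get_json &lt;- function(endpoint, token = Sys.getenv("API_TOKEN")) {
  resp &lt;- GET(endpoint,
              add_headers(Authorization = paste("Bearer", token)),
              timeout(10))
  stop_for_status(resp)   # fail loudly on 401/403 instead of parsing an error page
  content(resp, as = "parsed", type = "application/json")
}

# listings &lt;- get_json("https://api.example.com/listings")   # needs network and a valid token
</code></pre>
<p>Reading the token from the environment keeps it out of the script and out of version control.</p>
<aside class='notes'>
</aside>
</section>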
<section class='' data-state=''>
<h2>Automation</h2>
<ul>
<li>The real power of <code>R</code> and <code>rvest</code> shines when used with <code>shiny</code> (no pun intended).</li>
<li>Put your scraping code in a standalone R script and automate it with <code>cron</code>.</li>
</ul>
<aside class='notes'>
</aside>
</section>
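<section class='' data-state=''>
<h2>A script for cron</h2>
<p>One way the standalone-script idea could look. The file name <code>scrape.R</code>, the output location, and the crontab paths are all assumptions; the inline HTML is a stand-in so the sketch runs offline and should be replaced with a real URL.</p>
<pre><code class="r">#!/usr/bin/env Rscript
# scrape.R (hypothetical file name): one self-contained run per invocation
library(rvest)

# Inline HTML keeps the sketch runnable offline; swap in a real URL
src &lt;- '&lt;h2&gt;&lt;a href="/a"&gt;Listing A&lt;/a&gt;&lt;/h2&gt;&lt;h2&gt;&lt;a href="/b"&gt;Listing B&lt;/a&gt;&lt;/h2&gt;'

titles &lt;- read_html(src) %&gt;% html_nodes("h2 a") %&gt;% html_text()

# Stamp each run so repeated cron runs can be told apart
out &lt;- data.frame(scraped_at = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
                  title = titles, stringsAsFactors = FALSE)

# Append to one CSV, writing the header only on the first run
out_file &lt;- file.path(tempdir(), "listings.csv")   # hypothetical output location
write.table(out, out_file, sep = ",", row.names = FALSE,
            col.names = !file.exists(out_file), append = file.exists(out_file))

# Example crontab entry (paths are assumptions): run every day at 06:00
# 0 6 * * * Rscript /path/to/scrape.R &gt;&gt; /path/to/scrape.log 2&gt;&amp;1
</code></pre>
<p>For real use, point <code>out_file</code> at a persistent directory, since <code>tempdir()</code> is cleared when the R session ends.</p>
<aside class='notes'>
</aside>
</section>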
<section class='' data-state=''>
<h2>End</h2>
<aside class='notes'>
</aside>
</section>
</section>
</div>
</div>
</body>
<script src="libraries/frameworks/revealjs/lib/js/head.min.js"></script>
<script src="libraries/frameworks/revealjs/js/reveal.min.js"></script>
<script>
// Full list of configuration options available here:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: true,
progress: true,
history: true,
center: true,
theme: Reveal.getQueryHash().theme || 'night',
transition: Reveal.getQueryHash().transition || 'default',
dependencies: [
// Cross-browser shim that fully implements classList -
// https://github.com/eligrey/classList.js/
{ src: 'libraries/frameworks/revealjs/lib/js/classList.js', condition: function() { return !document.body.classList;}},
// Zoom in and out with Alt+click
{ src: 'libraries/frameworks/revealjs/plugin/zoom-js/zoom.js', async: true, condition: function() { return !!document.body.classList; } },
// Speaker notes
{ src: 'libraries/frameworks/revealjs/plugin/notes/notes.js', async: true, condition: function() { return !!document.body.classList; } },
// Remote control your reveal.js presentation using a touch device
//{ src: 'libraries/frameworks/revealjs/plugin/remotes/remotes.js', async: true, condition: function() { return !!document.body.classList; } }
]
});
</script> <!-- LOAD HIGHLIGHTER JS FILES -->
<script src="libraries/highlighters/highlight.js/highlight.pack.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<!-- DONE LOADING HIGHLIGHTER JS FILES -->
</html>