---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
googlenlp
---

[![Travis-CI Build Status](https://travis-ci.org/BrianWeinstein/googlenlp.svg?branch=master)](https://travis-ci.org/BrianWeinstein/googlenlp)

---
The googlenlp package provides an R interface to Google's [Cloud Natural Language API](https://cloud.google.com/natural-language/).
"Google Cloud Natural Language API reveals the structure and meaning of text by offering powerful machine learning models in an easy to use REST API. You can use it to **extract information** about people, places, events and much more, mentioned in text documents, news articles or blog posts. You can use it to **understand sentiment** about your product on social media or **parse intent** from customer conversations happening in a call center or a messaging app." [[source](https://cloud.google.com/natural-language/)]
There are four main features of the API, all of which are available through this R package [[source](https://cloud.google.com/natural-language/)]:
* **Syntax Analysis:** "Extract tokens and sentences, identify parts of speech (PoS) and create dependency parse trees for each sentence."
* **Entity Analysis:** "Identify entities and label by types such as person, organization, location, events, products and media."
* **Sentiment Analysis:** "Understand the overall sentiment expressed in a block of text."
* **Multi-Language:** "Enables you to easily analyze text in multiple languages including English, Spanish and Japanese."
### Resources
* [API Documentation](https://cloud.google.com/natural-language/docs/)
* [Natural Language API Basics](https://cloud.google.com/natural-language/docs/basics)
* [Morphology & Dependency Trees](https://cloud.google.com/natural-language/docs/morphology)
### Installation
You can install the development version from GitHub:
```{r eval = FALSE}
devtools::install_github("BrianWeinstein/googlenlp")
```
### Authentication
To use the API, you'll first need to [create a Google Cloud project and enable billing](https://cloud.google.com/natural-language/docs/getting-started), then get an [API key](https://cloud.google.com/natural-language/docs/common/auth).
### Getting started
Load the package and set your API key.
```{r eval = FALSE}
library(googlenlp)
set_api_key("MY_API_KEY") # replace this with your API key
```
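Hardcoding a key in a script makes it easy to commit by accident. A minimal sketch of one alternative, assuming the key is stored in an environment variable. The variable name `GOOGLE_NLP_API_KEY` and the helper `read_api_key()` are hypothetical and not part of googlenlp:

```r
# Hypothetical helper (not part of googlenlp): read the key from an
# environment variable and fail early if it is missing.
read_api_key <- function(var = "GOOGLE_NLP_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Usage (requires googlenlp to be installed and the variable to be set):
# library(googlenlp)
# set_api_key(read_api_key())
```

The variable could be set in `~/.Renviron` so that the key stays out of version control.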
```{r eval = TRUE, include = FALSE}
library(googlenlp)
library(dplyr)
set_api_key(readLines("tests/testthat/api_key.txt"))
```
Define the text you'd like to analyze.
```{r eval = TRUE}
text <- "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.
Sundar Pichai said in his keynote that users love their new Android phones."
```
The `annotate_text` function analyzes the text's syntax (sentences and tokens), entities, sentiment, and language, and returns the result as a five-element list.
```{r eval = TRUE}
analyzed <- annotate_text(text_body = text)
str(analyzed, max.level = 1)
```
#### Sentences
"Sentence extraction breaks up the stream of text into a series of sentences." [[API Documentation](https://cloud.google.com/natural-language/docs/basics#sentence-extraction)]
* `beginOffset` indicates the (zero-based) character index at which the sentence begins (with UTF-8 encoding).
* The `magnitude` and `score` fields quantify each sentence's sentiment; see the [Document Sentiment](#document-sentiment) section for more details.
```{r eval = FALSE}
analyzed$sentences
```
```{r eval = TRUE, echo = FALSE}
knitr::kable(analyzed$sentences, format = "markdown")
```
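Because `beginOffset` is zero-based while R strings are indexed from one, the offset needs a `+ 1` before it is passed to `substr()`. A small sketch of that conversion, assuming ASCII-only text (so that byte and character positions coincide). The helper `extract_at_offset()` and the sample string are made up for illustration:

```r
# Hypothetical helper: extract n_chars characters starting at a
# zero-based offset, converting to R's one-based indexing.
extract_at_offset <- function(text, begin_offset, n_chars) {
  start <- begin_offset + 1
  substr(text, start, start + n_chars - 1)
}

sample_text <- "Hello world. Second sentence."
extract_at_offset(sample_text, begin_offset = 13, n_chars = 16)
#> [1] "Second sentence."
```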
#### Tokens
"Tokenization breaks the stream of text up into a series of tokens, with each token usually corresponding to a single word.
The Natural Language API then processes the tokens and, using their locations within sentences, adds syntactic information to the tokens." [[API Documentation](https://cloud.google.com/natural-language/docs/basics#tokenization)]
* `lemma` indicates the token's "root" word, and can be useful in standardizing the word within the text.
* `tag` indicates the token's part of speech.
* Additional column definitions are outlined [here](https://cloud.google.com/natural-language/docs/basics#tokenization) and [here](https://cloud.google.com/natural-language/docs/morphology#parts_of_speech).
```{r eval = FALSE}
analyzed$tokens
```
```{r eval = TRUE, echo = FALSE}
knitr::kable(analyzed$tokens, format = "markdown")
```
<!---
```{r eval = TRUE, echo = FALSE}
options(width=400)
analyzed$tokens
```
--->
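Since the tokens are tabular, `lemma` and `tag` can be combined to, for example, count how often each noun lemma occurs. The sketch below uses a small hand-built data frame as a stand-in so that it runs without an API key. The `content` column name and all rows are illustrative assumptions, not real API output:

```r
# Hand-built stand-in for analyzed$tokens (illustrative rows only).
tokens <- data.frame(
  content = c("users", "love", "their", "new", "phones", "phone"),
  lemma   = c("user", "love", "their", "new", "phone", "phone"),
  tag     = c("NOUN", "VERB", "PRON", "ADJ", "NOUN", "NOUN"),
  stringsAsFactors = FALSE
)

# Keep nouns only, then count how often each lemma occurs.
nouns <- tokens[tokens$tag == "NOUN", ]
noun_counts <- sort(table(nouns$lemma), decreasing = TRUE)
noun_counts
```

Counting by `lemma` rather than by the surface form groups "phone" and "phones" together.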
#### Entities
"Entity Analysis provides information about entities in the text, which generally refer to named 'things' such as famous individuals, landmarks, common objects, etc... A good general practice to follow is that if something is a noun, it qualifies as an 'entity.'" [[API Documentation](https://cloud.google.com/natural-language/docs/basics#entity_analysis)]
* `entity_type` indicates the type of entity (i.e., it classifies the entity as a person, location, consumer good, etc.).
* `mid` provides a "machine-generated identifier" corresponding to the entity's [Google Knowledge Graph](https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html) entry.
* `wikipedia_url` provides the entity's [Wikipedia](https://www.wikipedia.org/) URL.
* `salience` indicates the entity's importance to the entire text. Scores range from 0.0 (less important) to 1.0 (highly important).
* Additional column definitions are outlined [here](https://cloud.google.com/natural-language/docs/basics#entity_analysis_response_fields).
```{r eval = FALSE}
analyzed$entities
```
```{r eval = TRUE, echo = FALSE}
knitr::kable(analyzed$entities, format = "markdown")
```
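One common use of `salience` is to rank entities and keep only the most prominent ones. A minimal sketch with a hand-built stand-in data frame. The `name` column, the salience values, and the 0.2 cutoff are all illustrative assumptions, not real API output:

```r
# Hand-built stand-in for analyzed$entities (illustrative values only).
entities <- data.frame(
  name        = c("Google", "Mountain View", "Android", "Sundar Pichai"),
  entity_type = c("ORGANIZATION", "LOCATION", "CONSUMER_GOOD", "PERSON"),
  salience    = c(0.45, 0.10, 0.25, 0.20),
  stringsAsFactors = FALSE
)

# Order by salience (most important first) and keep entities above a cutoff.
ranked <- entities[order(entities$salience, decreasing = TRUE), ]
salient <- ranked[ranked$salience >= 0.2, ]
salient$name
#> [1] "Google"        "Android"       "Sundar Pichai"
```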
#### Document sentiment {#document-sentiment}
"Sentiment analysis attempts to determine the overall attitude (positive or negative) expressed within the text. Sentiment is represented by numerical `score` and `magnitude` values." [[API Documentation](https://cloud.google.com/natural-language/docs/basics#sentiment_analysis)]
* `score` ranges from -1.0 (negative) to 1.0 (positive), and indicates the "overall emotional leaning of the text".
* `magnitude` "indicates the overall strength of emotion (both positive and negative) within the given text, between 0.0 and +inf. Unlike score, magnitude is not normalized; each expression of emotion within the text (both positive and negative) contributes to the text's magnitude (so longer text blocks may have greater magnitudes)."
A note on how to interpret these sentiment values is posted [here](https://cloud.google.com/natural-language/docs/basics#interpreting_sentiment_analysis_values).
```{r eval = FALSE}
analyzed$documentSentiment
```
```{r eval = TRUE, echo = FALSE}
knitr::kable(analyzed$documentSentiment, format = "markdown")
```
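The two values are easiest to read together: a score near zero can mean either little emotion (low magnitude) or a mix of positive and negative statements (high magnitude). A rough sketch of that reading follows. The helper `label_sentiment()` is hypothetical, and its cutoffs are arbitrary assumptions that would need tuning for real data:

```r
# Hypothetical helper: map score and magnitude to a coarse label.
# The cutoffs are arbitrary assumptions; tune them for your own data.
label_sentiment <- function(score, magnitude,
                            score_cutoff = 0.25, magnitude_cutoff = 1) {
  if (score >= score_cutoff) {
    "positive"
  } else if (score <= -score_cutoff) {
    "negative"
  } else if (magnitude >= magnitude_cutoff) {
    "mixed"    # near-zero score but strong emotion in both directions
  } else {
    "neutral"  # near-zero score and little emotion overall
  }
}

label_sentiment(score = 0.8, magnitude = 3.0)
#> [1] "positive"
label_sentiment(score = 0.0, magnitude = 4.0)
#> [1] "mixed"
```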
#### Language
`language` indicates the detected language of the document. Only English ("en"), Spanish ("es") and Japanese ("ja") are currently supported by the API.
```{r eval = TRUE}
analyzed$language
```
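The detected language can be checked before relying on the other results. A small sketch, assuming the three language codes listed above. The helper `is_supported_language()` is hypothetical and not part of googlenlp:

```r
# Language codes listed above as supported; update if API support changes.
supported_languages <- c("en", "es", "ja")

# Hypothetical helper: TRUE if the detected language code is in the list.
is_supported_language <- function(language) {
  language %in% supported_languages
}

is_supported_language("en")
#> [1] TRUE
is_supported_language("fr")
#> [1] FALSE
```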