An Interface to Google's Cloud Natural Language API

yl3394/googlenlp

 
 

googlenlp


The googlenlp package provides an R interface to Google's Cloud Natural Language API.

"Google Cloud Natural Language API reveals the structure and meaning of text by offering powerful machine learning models in an easy to use REST API. You can use it to extract information about people, places, events and much more, mentioned in text documents, news articles or blog posts. You can use it to understand sentiment about your product on social media or parse intent from customer conversations happening in a call center or a messaging app." [source]

There are four main features of the API, all of which are available through this R package [source]:

  • Syntax Analysis: "Extract tokens and sentences, identify parts of speech (PoS) and create dependency parse trees for each sentence."
  • Entity Analysis: "Identify entities and label by types such as person, organization, location, events, products and media."
  • Sentiment Analysis: "Understand the overall sentiment expressed in a block of text."
  • Multi-Language: "Enables you to easily analyze text in multiple languages including English, Spanish and Japanese."

Resources

Installation

You can install the development version from GitHub:

devtools::install_github("BrianWeinstein/googlenlp")

Authentication

To use the API, you'll first need to create a Google Cloud project, enable billing, and get an API key.

Getting started

Load the package and set your API key.

library(googlenlp)

set_api_key("MY_API_KEY") # replace this with your API key
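
Hard-coding the key in a script makes it easy to leak through version control. A minimal sketch of one alternative, assuming the key is kept in an environment variable: the variable name GOOGLE_NLP_API_KEY and the helper read_api_key are hypothetical and not part of googlenlp; only set_api_key comes from the package.

```r
# Hypothetical helper (not part of googlenlp): read the key from an
# environment variable and fail early if it is missing.
read_api_key <- function(var = "GOOGLE_NLP_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Usage, once the variable is defined (for example in ~/.Renviron):
# set_api_key(read_api_key())
```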

Define the text you'd like to analyze.

text <- "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.
         Sundar Pichai said in his keynote that users love their new Android phones."

The annotate_text function analyzes the text's syntax (sentences and tokens), entities, sentiment, and language, and returns the result as a five-element list.

analyzed <- annotate_text(text_body = text)

str(analyzed, max.level = 1)
#> List of 5
#>  $ sentences        :Classes 'rowwise_df', 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  4 variables:
#>  $ tokens           :Classes 'tbl_df', 'tbl' and 'data.frame':   32 obs. of  17 variables:
#>  $ entities         :Classes 'tbl_df', 'tbl' and 'data.frame':   10 obs. of  8 variables:
#>  $ documentSentiment:'data.frame':   1 obs. of  2 variables:
#>  $ language         : chr "en"

Sentences

"Sentence extraction breaks up the stream of text into a series of sentences." [API Documentation]

  • beginOffset indicates the (zero-based) character index of where the sentence begins (with UTF-8 encoding).
  • The magnitude and score fields quantify each sentence's sentiment; see the Document Sentiment section for more details.

analyzed$sentences

| content | beginOffset | magnitude | score |
|---|---|---|---|
| Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show. | 0 | 0.2 | 0.2 |
| Sundar Pichai said in his keynote that users love their new Android phones. | 113 | 0.6 | 0.6 |
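
Because beginOffset is zero-based and R strings are one-based, the offset needs a shift of one before it can be used for indexing. The sketch below rebuilds the example text and a stand-in for analyzed$sentences from the values shown above, so it runs without an API call. It assumes ASCII text, where character and byte positions coincide; the variable names are illustrative only.

```r
# Rebuild the example text: first sentence, a newline, nine spaces of
# indentation, then the second sentence.
first  <- "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show."
second <- "Sundar Pichai said in his keynote that users love their new Android phones."
text   <- paste0(first, "\n", strrep(" ", 9), second)

# Stand-in for analyzed$sentences, using the values shown above.
sentences <- data.frame(
  content     = c(first, second),
  beginOffset = c(0, 113),
  stringsAsFactors = FALSE
)

# Shift each zero-based offset by one to get R's one-based start position.
# Assumption: ASCII-only text, so offsets can be treated as character counts.
start     <- sentences$beginOffset + 1
stop_pos  <- start + nchar(sentences$content) - 1
extracted <- substring(text, start, stop_pos)

# Each slice of the original text should match the reported sentence.
identical(extracted, sentences$content)
```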

Tokens

"Tokenization breaks the stream of text up into a series of tokens, with each token usually corresponding to a single word. The Natural Language API then processes the tokens and, using their locations within sentences, adds syntactic information to the tokens." [API Documentation]

  • lemma indicates the token's "root" word, and can be useful in standardizing the word within the text.
  • tag indicates the token's part of speech.
  • Additional column definitions are outlined here and here.

analyzed$tokens

| content | beginOffset | lemma | tag | aspect | case | form | gender | mood | number | person | proper | reciprocity | tense | voice | dependencyEdge_headTokenIndex | dependencyEdge_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Google | 0 | Google | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 7 | NSUBJ |
| , | 6 | , | PUNCT | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 0 | P |
| headquartered | 8 | headquarter | VERB | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | PAST | VOICE_UNKNOWN | 0 | VMOD |
| in | 22 | in | ADP | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 2 | PREP |
| Mountain | 25 | Mountain | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 5 | NN |
| View | 34 | View | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 3 | POBJ |
| , | 38 | , | PUNCT | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 0 | P |
| unveiled | 40 | unveil | VERB | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | INDICATIVE | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | PAST | VOICE_UNKNOWN | 7 | ROOT |
| the | 49 | the | DET | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 11 | DET |
| new | 53 | new | ADJ | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 11 | AMOD |
| Android | 57 | Android | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 11 | NN |
| phone | 65 | phone | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 7 | DOBJ |
| at | 71 | at | ADP | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 7 | PREP |
| the | 74 | the | DET | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 16 | DET |
| Consumer | 78 | Consumer | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 16 | NN |
| Electronic | 87 | Electronic | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 16 | NN |
| Show | 98 | Show | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 12 | POBJ |
| . | 102 | . | PUNCT | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 7 | P |
| Sundar | 113 | Sundar | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 19 | NN |
| Pichai | 120 | Pichai | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 20 | NSUBJ |
| said | 127 | say | VERB | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | INDICATIVE | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | PAST | VOICE_UNKNOWN | 20 | ROOT |
| in | 132 | in | ADP | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 20 | PREP |
| his | 135 | his | PRON | ASPECT_UNKNOWN | GENITIVE | FORM_UNKNOWN | MASCULINE | MOOD_UNKNOWN | SINGULAR | THIRD | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 23 | POSS |
| keynote | 139 | keynote | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 21 | POBJ |
| that | 147 | that | ADP | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 26 | MARK |
| users | 152 | user | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | PLURAL | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 26 | NSUBJ |
| love | 158 | love | VERB | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | INDICATIVE | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | PRESENT | VOICE_UNKNOWN | 20 | CCOMP |
| their | 163 | their | PRON | ASPECT_UNKNOWN | GENITIVE | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | PLURAL | THIRD | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 30 | POSS |
| new | 169 | new | ADJ | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 30 | AMOD |
| Android | 173 | Android | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | SINGULAR | PERSON_UNKNOWN | PROPER | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 30 | NN |
| phones | 181 | phone | NOUN | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | PLURAL | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 26 | DOBJ |
| . | 187 | . | PUNCT | ASPECT_UNKNOWN | CASE_UNKNOWN | FORM_UNKNOWN | GENDER_UNKNOWN | MOOD_UNKNOWN | NUMBER_UNKNOWN | PERSON_UNKNOWN | PROPER_UNKNOWN | RECIPROCITY_UNKNOWN | TENSE_UNKNOWN | VOICE_UNKNOWN | 20 | P |
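
The token table lends itself to simple filtering and lookups. The sketch below uses a hand-built stand-in for the first eight rows of analyzed$tokens (values copied from the output above) to pull out verb lemmas and to resolve each token's dependency head. It assumes dependencyEdge_headTokenIndex is a zero-based row index into the token table, which is what the output suggests: the root verb "unveiled" sits at index 7 and points to itself.

```r
# Stand-in for the first eight rows of analyzed$tokens (subset of columns).
tokens <- data.frame(
  content = c("Google", ",", "headquartered", "in",
              "Mountain", "View", ",", "unveiled"),
  lemma   = c("Google", ",", "headquarter", "in",
              "Mountain", "View", ",", "unveil"),
  tag     = c("NOUN", "PUNCT", "VERB", "ADP",
              "NOUN", "NOUN", "PUNCT", "VERB"),
  dependencyEdge_headTokenIndex = c(7, 0, 0, 2, 5, 3, 0, 7),
  dependencyEdge_label = c("NSUBJ", "P", "VMOD", "PREP",
                           "NN", "POBJ", "P", "ROOT"),
  stringsAsFactors = FALSE
)

# Lemmas give a normalized form of each word; here, the verbs only.
verb_lemmas <- tokens$lemma[tokens$tag == "VERB"]

# Resolve each token's head. Assumption: the index is zero-based, so add
# one to turn it into an R row number.
tokens$head <- tokens$content[tokens$dependencyEdge_headTokenIndex + 1]

# Show each token next to its relation and the word it attaches to.
tokens[, c("content", "dependencyEdge_label", "head")]
```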

Entities

"Entity Analysis provides information about entities in the text, which generally refer to named 'things' such as famous individuals, landmarks, common objects, etc... A good general practice to follow is that if something is a noun, it qualifies as an 'entity.'" [API Documentation]

  • entity_type indicates the type of entity (i.e., it classifies the entity as a person, location, consumer good, etc.).
  • mid provides a "machine-generated identifier" corresponding to the entity's Google Knowledge Graph entry.
  • wikipedia_url provides the entity's Wikipedia URL.
  • salience indicates the entity's importance to the entire text. Scores range from 0.0 (less important) to 1.0 (highly important).
  • Additional column definitions are outlined here.

analyzed$entities

| name | entity_type | mid | wikipedia_url | salience | content | beginOffset | mentions_type |
|---|---|---|---|---|---|---|---|
| Google | ORGANIZATION | /m/045c7b | http://en.wikipedia.org/wiki/Google | 0.2559538 | Google | 0 | PROPER |
| phone | CONSUMER_GOOD | NA | NA | 0.1384906 | phone | 65 | COMMON |
| Android | CONSUMER_GOOD | /m/02wxtgw | http://en.wikipedia.org/wiki/Android_(operating_system) | 0.1294144 | Android | 57 | PROPER |
| Android | CONSUMER_GOOD | /m/02wxtgw | http://en.wikipedia.org/wiki/Android_(operating_system) | 0.1294144 | Android | 173 | PROPER |
| users | PERSON | NA | NA | 0.1198345 | users | 152 | COMMON |
| Sundar Pichai | PERSON | /m/09gds74 | http://en.wikipedia.org/wiki/Sundar_Pichai | 0.1123451 | Sundar Pichai | 113 | PROPER |
| Mountain View | LOCATION | /m/0r6c4 | http://en.wikipedia.org/wiki/Mountain_View,_California | 0.1103145 | Mountain View | 25 | PROPER |
| Consumer Electronic Show | EVENT | /m/01p15w | http://en.wikipedia.org/wiki/Consumer_Electronics_Show | 0.0781073 | Consumer Electronic Show | 78 | PROPER |
| phones | CONSUMER_GOOD | NA | NA | 0.0336798 | phones | 181 | COMMON |
| keynote | OTHER | NA | NA | 0.0218599 | keynote | 139 | COMMON |
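
The entity table has one row per mention, so an entity mentioned twice (Android, above) appears twice with the same salience. A minimal sketch of collapsing mentions to one row per entity and ranking by salience, using a hand-built stand-in for the first five rows of analyzed$entities; the variable names are illustrative only.

```r
# Stand-in for the first five rows of analyzed$entities (subset of columns).
entities <- data.frame(
  name          = c("Google", "phone", "Android", "Android", "users"),
  entity_type   = c("ORGANIZATION", "CONSUMER_GOOD", "CONSUMER_GOOD",
                    "CONSUMER_GOOD", "PERSON"),
  wikipedia_url = c("http://en.wikipedia.org/wiki/Google", NA,
                    "http://en.wikipedia.org/wiki/Android_(operating_system)",
                    "http://en.wikipedia.org/wiki/Android_(operating_system)",
                    NA),
  salience      = c(0.2559538, 0.1384906, 0.1294144, 0.1294144, 0.1198345),
  beginOffset   = c(0, 65, 57, 173, 152),
  stringsAsFactors = FALSE
)

# Keep the first mention of each entity, dropping mention-level columns.
keep <- c("name", "entity_type", "wikipedia_url", "salience")
unique_entities <- entities[!duplicated(entities$name), keep]

# Rank entities from most to least salient.
ranked <- unique_entities[order(-unique_entities$salience), ]

# Entities with a Wikipedia URL; the others have NA in that column.
linked <- ranked$name[!is.na(ranked$wikipedia_url)]
```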

Document sentiment

"Sentiment analysis attempts to determine the overall attitude (positive or negative) expressed within the text. Sentiment is represented by numerical score and magnitude values." [API Documentation]

  • score ranges from -1.0 (negative) to 1.0 (positive), and indicates the "overall emotional leaning of the text".
  • magnitude "indicates the overall strength of emotion (both positive and negative) within the given text, between 0.0 and +inf. Unlike score, magnitude is not normalized; each expression of emotion within the text (both positive and negative) contributes to the text's magnitude (so longer text blocks may have greater magnitudes)."

A note on how to interpret these sentiment values is posted here.

analyzed$documentSentiment

| magnitude | score |
|---|---|
| 0.9 | 0.4 |
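
Score and magnitude are easiest to read together: a score near zero can mean either neutral text (low magnitude) or mixed text (high magnitude). The sketch below turns the two numbers into a coarse label. The function describe_sentiment and both cutoffs are hypothetical illustrations, not values taken from the API documentation or from googlenlp; tune them for your own data.

```r
# Hypothetical helper: map score and magnitude to a coarse label.
# The cutoffs are illustrative assumptions, not documented thresholds.
describe_sentiment <- function(score, magnitude,
                               score_cutoff = 0.25,
                               magnitude_cutoff = 1.0) {
  if (score >= score_cutoff) {
    "positive"
  } else if (score <= -score_cutoff) {
    "negative"
  } else if (magnitude >= magnitude_cutoff) {
    # Little net leaning but a lot of emotion: opposing sentiments.
    "mixed"
  } else {
    "neutral"
  }
}

# Values from the documentSentiment output above.
describe_sentiment(score = 0.4, magnitude = 0.9)
```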

Language

language indicates the detected language of the document. Only English ("en"), Spanish ("es") and Japanese ("ja") are currently supported by the API.

analyzed$language
#> [1] "en"
