## [v8.4]() (2018-08-24)
**Added**
- added `EventRegistry.getUsageInfo()` method, which returns the number of used tokens and the total number of available tokens for the given user. The existing methods `EventRegistry.getRemainingAvailableRequests()` and `EventRegistry.getDailyAvailableRequests()` are still there, but their values are only valid after making at least one request.
- added searching of articles and events based on article authors. You can now provide `authorUri` parameter when creating the `QueryArticles` and `QueryEvents` instances.
- added author related methods to `EventRegistry` class: `EventRegistry.suggestAuthors()` to obtain uris of authors for given (partial) name and `EventRegistry.getAuthorUri()` to obtain a single author uri for the given (partial) name.
- added ability to search articles and events by authors. `QueryArticles` and `QueryEvents` constructors now also accept `authorUri` parameter that can be used to limit the results to articles/events by those authors. Use `QueryOper.AND()` or `QueryOper.OR()` to specify multiple authors in the same query.
- BETA: added a filter for returning only articles that are written by sources that have a certain ranking. The filter can be specified by setting the parameters `startSourceRankPercentile` and `endSourceRankPercentile` when creating the `QueryArticles` instance. The default value for `startSourceRankPercentile` is 0 and for `endSourceRankPercentile` is 100. The values cannot be arbitrary numbers between 0 and 100; each has to be a number divisible by 10. By setting `startSourceRankPercentile` to 0 and `endSourceRankPercentile` to 20 you would get only articles from top ranked news sources (according to [Alexa site ranking](https://www.alexa.com/siteinfo)) that would amount to *approximately 20%* of all matching content. Note: 20 percentiles do not represent 20% of all top sources. The value is used to identify the subset of news sources that generate approximately 20% of our collected news content. The reason for this choice is that the top ranked 10% of news sources write about 30% of all news content and our choice normalizes this effect. This feature could potentially change in the future.
- `QueryEventArticlesIter` is now able to return only a subset of articles assigned to an event. You can use the same filters as with the `QueryArticles` constructor and you can specify them when constructing the instance of `QueryEventArticlesIter`. The same kind of filtering is also possible if you want to use the `RequestEventArticles()` class instead.
- added some parameters and changed default values in some of the result types to reflect the backend changes.
- added optional parameter `proxyUrl` to `Analytics.extractArticleInfo()`. It can be used to download article info through a proxy that you provide (to avoid potential GDPR issues). The `proxyUrl` should be in format `{schema}://{username}:{pass}@{proxy url/ip}`.
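
The constraint on the BETA source-rank filter (values from 0 to 100, divisible by 10) can be sketched as a small check. This is only an illustration, not code from the library: `check_source_rank_percentiles` is a hypothetical helper, and rejecting a start value that is not lower than the end value is an assumption that the notes above do not state.

```python
def check_source_rank_percentiles(start=0, end=100):
    """Validate a (start, end) percentile pair as described in the release notes.

    Hypothetical helper for illustration; not part of the eventregistry package.
    """
    for name, value in (("startSourceRankPercentile", start),
                        ("endSourceRankPercentile", end)):
        # bool is a subclass of int, so exclude it explicitly
        if not isinstance(value, int) or isinstance(value, bool):
            raise ValueError("%s must be an integer" % name)
        if value < 0 or value > 100:
            raise ValueError("%s must be between 0 and 100" % name)
        if value % 10 != 0:
            raise ValueError("%s must be divisible by 10" % name)
    # Assumption (not stated in the notes): an empty or reversed range is an error.
    if start >= end:
        raise ValueError("startSourceRankPercentile must be lower than endSourceRankPercentile")
    return {"startSourceRankPercentile": start, "endSourceRankPercentile": end}
```

The returned dict uses the same parameter names that the notes give for the `QueryArticles` constructor, so it could plausibly be passed on as keyword arguments.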
gregorleban committed Aug 24, 2018
1 parent 30e3bad commit e09a66c
Showing 12 changed files with 486 additions and 105 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,17 @@
# Change Log

## [v8.4]() (2018-08-24)

**Added**
- added `EventRegistry.getUsageInfo()` method, which returns the number of used tokens and the total number of available tokens for the given user. The existing methods `EventRegistry.getRemainingAvailableRequests()` and `EventRegistry.getDailyAvailableRequests()` are still there, but their values are only valid after making at least one request.
- added searching of articles and events based on article authors. You can now provide `authorUri` parameter when creating the `QueryArticles` and `QueryEvents` instances.
- added author related methods to `EventRegistry` class: `EventRegistry.suggestAuthors()` to obtain uris of authors for given (partial) name and `EventRegistry.getAuthorUri()` to obtain a single author uri for the given (partial) name.
- added ability to search articles and events by authors. `QueryArticles` and `QueryEvents` constructors now also accept `authorUri` parameter that can be used to limit the results to articles/events by those authors. Use `QueryOper.AND()` or `QueryOper.OR()` to specify multiple authors in the same query.
- BETA: added a filter for returning only articles that are written by sources that have a certain ranking. The filter can be specified by setting the parameters `startSourceRankPercentile` and `endSourceRankPercentile` when creating the `QueryArticles` instance. The default value for `startSourceRankPercentile` is 0 and for `endSourceRankPercentile` is 100. The values cannot be arbitrary numbers between 0 and 100; each has to be a number divisible by 10. By setting `startSourceRankPercentile` to 0 and `endSourceRankPercentile` to 20 you would get only articles from top ranked news sources (according to [Alexa site ranking](https://www.alexa.com/siteinfo)) that would amount to *approximately 20%* of all matching content. Note: 20 percentiles do not represent 20% of all top sources. The value is used to identify the subset of news sources that generate approximately 20% of our collected news content. The reason for this choice is that the top ranked 10% of news sources write about 30% of all news content and our choice normalizes this effect. This feature could potentially change in the future.
- `QueryEventArticlesIter` is now able to return only a subset of articles assigned to an event. You can use the same filters as with the `QueryArticles` constructor and you can specify them when constructing the instance of `QueryEventArticlesIter`. The same kind of filtering is also possible if you want to use the `RequestEventArticles()` class instead.
- added some parameters and changed default values in some of the result types to reflect the backend changes.
- added optional parameter `proxyUrl` to `Analytics.extractArticleInfo()`. It can be used to download article info through a proxy that you provide (to avoid potential GDPR issues). The `proxyUrl` should be in format `{schema}://{username}:{pass}@{proxy url/ip}`.

## [v8.3.1]() (2018-08-12)

**Updated**
9 changes: 7 additions & 2 deletions eventregistry/Analytics.py
@@ -75,13 +75,18 @@ def detectLanguage(self, text):
return self._er.jsonRequestAnalytics("/api/v1/detectLanguage", { "text": text })


def extractArticleInfo(self, url):
def extractArticleInfo(self, url, proxyUrl = None):
"""
extract all available information about an article available at url `url`. Returned information will include
article title, body, authors, links in the articles, ...
@param url: article url to extract article information from
@param proxyUrl: proxy that should be used for downloading article information. format: {schema}://{username}:{pass}@{proxy url/ip}
@returns: dict
"""
return self._er.jsonRequestAnalytics("/api/v1/extractArticleInfo", { "url": url })
params = { "url": url }
if proxyUrl:
params["proxyUrl"] = proxyUrl
return self._er.jsonRequestAnalytics("/api/v1/extractArticleInfo", params)


def ner(self, text):
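
The `proxyUrl` handling in this hunk can be tried out in isolation. The sketch below mirrors the parameter-building logic of the new `extractArticleInfo()` and assembles a proxy string in the documented `{schema}://{username}:{pass}@{proxy url/ip}` format. Both function names are hypothetical, and the credentials are inserted verbatim; whether the service expects special characters to be percent-encoded is not stated in the commit.

```python
def build_proxy_url(schema, username, password, host):
    """Assemble a proxy string as {schema}://{username}:{pass}@{proxy url/ip}.

    Hypothetical helper; values are inserted as-is, without any encoding.
    """
    return "%s://%s:%s@%s" % (schema, username, password, host)


def build_extract_params(url, proxy_url=None):
    """Build the request parameters the same way the patched method does."""
    params = {"url": url}
    # proxyUrl is only sent when the caller provides a non-empty value
    if proxy_url:
        params["proxyUrl"] = proxy_url
    return params


# Example values below are made up for illustration.
proxy = build_proxy_url("http", "user", "secret", "10.0.0.1:8080")
params = build_extract_params("https://example.com/article", proxy)
```

Because the check is `if proxyUrl:`, passing `None` or an empty string leaves the request identical to the pre-8.4 behaviour.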
46 changes: 24 additions & 22 deletions eventregistry/Base.py
@@ -196,27 +196,6 @@ def _getQueryParams(self):
return dict(self.queryParams)



class Query(QueryParamsBase):
def __init__(self):
QueryParamsBase.__init__(self)
self.resultTypeList = []


def _getQueryParams(self):
"""encode the request."""
allParams = {}
if len(self.resultTypeList) == 0:
raise ValueError("The query does not have any result type specified. No sense in performing such a query")
allParams.update(self.queryParams)
for request in self.resultTypeList:
allParams.update(request.__dict__)
# all requests in resultTypeList have "resultType" so each call to .update() overrides the previous one
# since we want to store them all we have to add them here:
allParams["resultType"] = [request.__dict__["resultType"] for request in self.resultTypeList]
return allParams


def _setQueryArrVal(self, value, propName, propOperName, defaultOperName):
"""
parse the value "value" and use it to set the property propName and the operator with name propOperName
@@ -251,4 +230,27 @@ def _setQueryArrVal(self, value, propName, propOperName, defaultOperName):

# there should be no other valid types
else:
assert False, "Parameter '%s' was of unsupported type. It should either be None, a string or an instance of QueryItems" % (propName)
assert False, "Parameter '%s' was of unsupported type. It should either be None, a string or an instance of QueryItems" % (propName)



class Query(QueryParamsBase):
def __init__(self):
QueryParamsBase.__init__(self)
self.resultTypeList = []


def _getQueryParams(self):
"""encode the request."""
allParams = {}
if len(self.resultTypeList) == 0:
raise ValueError("The query does not have any result type specified. No sense in performing such a query")
allParams.update(self.queryParams)
for request in self.resultTypeList:
allParams.update(request.__dict__)
# all requests in resultTypeList have "resultType" so each call to .update() overrides the previous one
# since we want to store them all we have to add them here:
allParams["resultType"] = [request.__dict__["resultType"] for request in self.resultTypeList]
return allParams
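
The comment in `_getQueryParams()` about `resultType` being overridden by each `.update()` call is easier to see with a standalone sketch. The code below reproduces the merging logic outside the class; `FakeRequest` and `merge_query_params` are hypothetical stand-ins, and the parameter names used in the example are illustrative only.

```python
class FakeRequest(object):
    """Stand-in for a result-type request object (hypothetical)."""
    def __init__(self, resultType, **params):
        self.resultType = resultType
        self.__dict__.update(params)


def merge_query_params(query_params, result_type_list):
    """Merge query parameters with the parameters of every requested result type."""
    if len(result_type_list) == 0:
        raise ValueError("The query does not have any result type specified.")
    all_params = {}
    all_params.update(query_params)
    for request in result_type_list:
        # each update overwrites "resultType" with the latest request's value
        all_params.update(request.__dict__)
    # restore the full list so that no requested result type is lost
    all_params["resultType"] = [r.__dict__["resultType"] for r in result_type_list]
    return all_params


merged = merge_query_params(
    {"keyword": "apple"},
    [FakeRequest("articles", articlesCount=10),
     FakeRequest("uriWgtList", uriWgtListCount=100)])
```

Without the final assignment, `merged["resultType"]` would hold only the value of the last request in the list.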


56 changes: 47 additions & 9 deletions eventregistry/EventRegistry.py
@@ -141,15 +141,20 @@ def printConsole(self, text):


def getRemainingAvailableRequests(self):
"""get the number of requests that are still available for the user today"""
"""get the number of requests that are still available for the user today. Information is only accessible after you make some query."""
return self._remainingAvailableRequests


def getDailyAvailableRequests(self):
"""get the total number of requests that the user can make in a day"""
"""get the total number of requests that the user can make in a day. Information is only accessible after you make some query."""
return self._dailyAvailableRequests


def getUsageInfo(self):
"""return the number of used and total available tokens. Can be used at any time (also before making queries)"""
return self.jsonRequest("/api/v1/usage", { "apiKey": self._apiKey })


def getUrl(self, query):
"""
return the url that can be used to get the content that matches the query
@@ -349,7 +354,7 @@ def suggestConcepts(self, prefix, sources = ["concepts"], lang = "eng", conceptL
params = { "prefix": prefix, "source": sources, "lang": lang, "conceptLang": conceptLang, "page": page, "count": count}
params.update(returnInfo.getParams())
params.update(kwargs)
return self.jsonRequest("/json/suggestConcepts", params)
return self.jsonRequest("/json/suggestConceptsFast", params)


def suggestCategories(self, prefix, page = 1, count = 20, returnInfo = ReturnInfo(), **kwargs):
@@ -364,7 +369,7 @@ def suggestCategories(self, prefix, page = 1, count = 20, returnInfo = ReturnInf
params = { "prefix": prefix, "page": page, "count": count }
params.update(returnInfo.getParams())
params.update(kwargs)
return self.jsonRequest("/json/suggestCategories", params)
return self.jsonRequest("/json/suggestCategoriesFast", params)


def suggestNewsSources(self, prefix, dataType = ["news", "pr", "blog"], page = 1, count = 20, **kwargs):
@@ -378,7 +383,7 @@ def suggestNewsSources(self, prefix, dataType = ["news", "pr", "blog"], page = 1
assert page > 0, "page parameter should be above 0"
params = {"prefix": prefix, "dataType": dataType, "page": page, "count": count}
params.update(kwargs)
return self.jsonRequest("/json/suggestSources", params)
return self.jsonRequest("/json/suggestSourcesFast", params)


def suggestSourceGroups(self, prefix, page = 1, count = 20, **kwargs):
@@ -413,7 +418,7 @@ def suggestLocations(self, prefix, sources = ["place", "country"], lang = "eng",
assert len(sortByDistanceTo) == 2, "The sortByDistanceTo should contain two float numbers"
params["closeToLat"] = sortByDistanceTo[0]
params["closeToLon"] = sortByDistanceTo[1]
return self.jsonRequest("/json/suggestLocations", params)
return self.jsonRequest("/json/suggestLocationsFast", params)


def suggestLocationsAtCoordinate(self, latitude, longitude, radiusKm, limitToCities = False, lang = "eng", count = 20, ignoreNonWiki = True, returnInfo = ReturnInfo(), **kwargs):
@@ -433,7 +438,7 @@ def suggestLocationsAtCoordinate(self, latitude, longitude, radiusKm, limitToCit
params = { "action": "getLocationsAtCoordinate", "lat": latitude, "lon": longitude, "radius": radiusKm, "limitToCities": limitToCities, "count": count, "lang": lang }
params.update(returnInfo.getParams())
params.update(kwargs)
return self.jsonRequest("/json/suggestLocations", params)
return self.jsonRequest("/json/suggestLocationsFast", params)


def suggestSourcesAtCoordinate(self, latitude, longitude, radiusKm, count = 20, **kwargs):
@@ -448,7 +453,7 @@ def suggestSourcesAtCoordinate(self, latitude, longitude, radiusKm, count = 20,
assert isinstance(longitude, (int, float)), "The 'longitude' should be a number"
params = {"action": "getSourcesAtCoordinate", "lat": latitude, "lon": longitude, "radius": radiusKm, "count": count}
params.update(kwargs)
return self.jsonRequest("/json/suggestSources", params)
return self.jsonRequest("/json/suggestSourcesFast", params)


def suggestSourcesAtPlace(self, conceptUri, dataType = "news", page = 1, count = 20, **kwargs):
@@ -461,7 +466,21 @@ def suggestSourcesAtPlace(self, conceptUri, dataType = "news", page = 1, count =
"""
params = {"action": "getSourcesAtPlace", "conceptUri": conceptUri, "page": page, "count": count, "dataType": dataType}
params.update(kwargs)
return self.jsonRequest("/json/suggestSources", params)
return self.jsonRequest("/json/suggestSourcesFast", params)


def suggestAuthors(self, prefix, page = 1, count = 20, **kwargs):
"""
return a list of authors that match the prefix
@param prefix: input text that should be contained in the author name and source url
@param page: page of results
@param count: number of returned suggestions
"""
assert page > 0, "page parameter should be above 0"
params = {"prefix": prefix, "page": page, "count": count}
params.update(kwargs)
return self.jsonRequest("/json/suggestAuthorsFast", params)



def suggestConceptClasses(self, prefix, lang = "eng", conceptLang = "eng", source = ["dbpedia", "custom"], page = 1, count = 20, returnInfo = ReturnInfo(), **kwargs):
@@ -552,6 +571,13 @@ def getNewsSourceUri(self, sourceName, dataType = ["news", "pr", "blog"]):
return None


def getSourceUri(self, sourceName, dataType=["news", "pr", "blog"]):
"""
alternative (shorter) name for the method getNewsSourceUri()
"""
return self.getNewsSourceUri(sourceName, dataType)


def getSourceGroupUri(self, sourceGroupName):
"""
return the URI of the source group that best matches the name
@@ -600,6 +626,18 @@ def getCustomConceptUri(self, label, lang = "eng"):
return None


def getAuthorUri(self, authorName):
"""
return author uri that is the best match for the given author name (and potentially source url)
if there are multiple matches for the given author name, they are sorted based on the number of articles they have written (from most to least frequent)
@param authorName: partial or full name of the author, potentially also containing the source url (e.g. "george brown nytimes")
"""
matches = self.suggestAuthors(authorName)
if matches is not None and isinstance(matches, list) and len(matches) > 0 and "uri" in matches[0]:
return matches[0]["uri"]
return None


@staticmethod
def getUriFromUriWgt(uriWgtList):
"""
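
The selection rule used by the new `getAuthorUri()` (take the `uri` of the first suggestion, or return `None`) can be checked without calling the service. In the sketch below, `pick_author_uri` is a hypothetical standalone version of that rule, and the sample suggestion data is made up; the real response may carry different fields.

```python
def pick_author_uri(matches):
    """Return the uri of the first suggestion, or None if there is no usable match.

    Standalone version of the check in getAuthorUri(); hypothetical helper.
    """
    if isinstance(matches, list) and len(matches) > 0 and "uri" in matches[0]:
        return matches[0]["uri"]
    return None


# Made-up suggestions, ordered from most to least frequent author.
sample = [
    {"uri": "george_brown@nytimes.com", "name": "George Brown"},
    {"uri": "george_brown@example.org", "name": "George Brown"},
]
best = pick_author_uri(sample)
```

Anything that is not a non-empty list, such as `None` or an error object returned as a dict, yields `None`.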
18 changes: 11 additions & 7 deletions eventregistry/Query.py
@@ -33,22 +33,24 @@ def __init__(self,
dateEnd = None,
dateMention = None,
sourceLocationUri = None,
sourceGroupUri = None,
sourceGroupUri=None,
authorUri = None,
keywordLoc = "body",
minMaxArticlesInEvent = None,
exclude = None):
"""
@param keyword: keyword(s) to query. Either None, string or QueryItems
@param conceptUri: concept(s) to query. Either None, string or QueryItems
@param sourceUri: source(s) to query. Either None, string or QueryItems
@param locationUri: location(s) to query. Either None, string or QueryItems
@param categoryUri: categories to query. Either None, string or QueryItems
@param lang: language(s) to query. Either None, string or QueryItems
@param keyword: keyword(s) to query. Either None, string or QueryItems instance
@param conceptUri: concept(s) to query. Either None, string or QueryItems instance
@param sourceUri: source(s) to query. Either None, string or QueryItems instance
@param locationUri: location(s) to query. Either None, string or QueryItems instance
@param categoryUri: categories to query. Either None, string or QueryItems instance
@param lang: language(s) to query. Either None, string or QueryItems instance
@param dateStart: starting date. Either None, string or date or datetime
@param dateEnd: ending date. Either None, string or date or datetime
@param dateMention: search by mentioned dates - Either None, string or date or datetime or a list of these types
@param sourceLocationUri: find content generated by news sources at the specified geographic location - can be a city URI or a country URI. Multiple items can be provided using a list
@param sourceGroupUri: a single or multiple source group URIs. A source group is a group of news sources, commonly defined based on common topic or importance
@param authorUri: author(s) to query. Either None, string or QueryItems instance
@param keywordLoc: where should we look when searching using the keywords provided by "keyword" parameter. "body" (default), "title", or "body,title"
@param minMaxArticlesInEvent: a tuple containing the minimum and maximum number of articles that should be in the resulting events. Parameter relevant only if querying events
@param exclude: an instance of BaseQuery, CombinedQuery or None. Used to filter out results matching the other criteria specified in this query
@@ -78,6 +80,8 @@ def __init__(self,

self._setQueryArrVal("sourceLocationUri", sourceLocationUri)
self._setQueryArrVal("sourceGroupUri", sourceGroupUri)
self._setQueryArrVal("authorUri", authorUri)

if keywordLoc != "body":
self._queryObj["keywordLoc"] = keywordLoc
