Source filtering
A very common use case is limiting the results to certain subsets of news sources. In Event Registry, this kind of filtering can be done in several ways, which are described here.
Overall, Event Registry separates news sources into three types: news, blogs, and PR content. News sources are those that regularly publish content that is typically considered newsworthy. Blogs are sources typically published by individuals, such as content on medium.com, blogger.com, wordpress.com, and similar sites. PR content comes from sources that release PR content, such as marketwatch.com, globenewswire.com, and others.
To select which data types to include, you can provide the dataType parameter when creating the QueryArticles instance. The value of the parameter can be news, blog, or pr (you can also provide an array with any combination of those values). When searching for events, you do not need to provide the dataType parameter, since events are computed only on news content and the parameter is therefore redundant.
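As a rough sketch of the allowed values, the helper below checks a dataType value before it is passed to a query. The name normalize_data_type is hypothetical and not part of the eventregistry package; only the three values news, blog, and pr come from the description above.

```python
# Hypothetical helper (not part of the eventregistry package): checks that a
# dataType value uses only the three source types described above.
ALLOWED_DATA_TYPES = ("news", "blog", "pr")


def normalize_data_type(value):
    """Return a single data type as a string, or several as a list."""
    values = [value] if isinstance(value, str) else list(value)
    unknown = [v for v in values if v not in ALLOWED_DATA_TYPES]
    if not values or unknown:
        raise ValueError(
            f"dataType must use only {ALLOWED_DATA_TYPES}, got {value!r}"
        )
    return values[0] if len(values) == 1 else values


data_type = normalize_data_type(["news", "pr"])
# The result could then be passed on, for example:
# QueryArticles(keywords="iphone", dataType=data_type)
```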
A very typical way to limit the results to specific news sources is to simply provide the list of sources to include in the query. To provide the list of sources, you can use the sourceUri parameter in the QueryArticles and QueryEvents constructors. The parameter value should contain the source URI, which is the normalized source URL (the URL in lowercase without the http or www prefixes, e.g. bbc.co.uk). You can get the source URI for a given source by calling the EventRegistry.getNewsSourceUri() method, to which you can pass the source name or the source domain name. If you want to limit the results to multiple news sources, you can use the QueryItems.OR() method, to which you can provide multiple sources. See some examples below:
from eventregistry import *
er = EventRegistry(apiKey = YOUR_API_KEY)
# find articles published by a single source
q = QueryArticles(
    sourceUri = "bbc.co.uk"
)
# find articles published by either of two sources
q = QueryArticles(
    sourceUri = QueryItems.OR([
        er.getNewsSourceUri("seattletimes"),
        er.getNewsSourceUri("China News Service")
    ])
)
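The normalization described above (lowercase, no http or www prefix) can be approximated locally. The sketch below is only an illustration of that description; normalize_source_uri is a hypothetical helper, and the URI actually used by Event Registry should still be obtained from EventRegistry.getNewsSourceUri(), since the service may map some sources differently.

```python
from urllib.parse import urlsplit


def normalize_source_uri(url: str) -> str:
    """Approximate a source URI: lowercase host without scheme or 'www.'.

    Hypothetical helper, not part of the eventregistry package.
    """
    url = url.strip().lower()
    if "://" not in url:
        # Without a scheme, urlsplit would treat the host as a path.
        url = "//" + url
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(normalize_source_uri("http://www.BBC.co.uk/news"))
print(normalize_source_uri("seattletimes.com"))
```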
Sources are located in various parts of the world, and some use cases require limiting the results to news sources located in a particular city or country. In such cases, you can set the sourceLocationUri parameter in the QueryArticles and QueryEvents constructors. The parameter values are location URIs, which you can obtain by calling the utility method EventRegistry.getLocationUri() with a partial or complete location name. Calling er.getLocationUri("new york city") would, for example, return http://en.wikipedia.org/wiki/New_York_City. If you want to limit the search results to news sources from multiple geographical locations, you can use the QueryItems.OR() method to provide multiple location URIs. See some examples below:
from eventregistry import *
er = EventRegistry(apiKey = YOUR_API_KEY)
# find articles by news sources located in Germany
q = QueryArticles(
    sourceLocationUri = "http://en.wikipedia.org/wiki/Germany"
)
# find articles by sources located in London, UK or Washington D.C.
q = QueryArticles(
    sourceLocationUri = QueryItems.OR([
        er.getLocationUri("London"),
        "http://en.wikipedia.org/wiki/Washington,_D.C."
    ])
)
News sources can also be grouped by the topics they write about (sports, politics, business, ...). To make this easier for our users, we have predefined some groups of news sources for different topics. To view the list of predefined groups, you can visit the documentation page, where you can also see which news sources are in each group. To use a group of sources in your searches, you can specify the sourceGroupUri parameter in the QueryArticles and QueryEvents constructors. To get the URI for a source group, you can use the helper method EventRegistry.getSourceGroupUri() and provide the (partial) name of the source group. See some examples below:
from eventregistry import *
er = EventRegistry(apiKey = YOUR_API_KEY)
# find articles by news sources that are assigned to be among 50 top sources
q = QueryArticles(
    sourceGroupUri = "general/ERtop50"
)
# find articles by sources that are assigned to be among 15 top science sources or among 25 top business sources
q = QueryArticles(
    sourceGroupUri = QueryItems.OR([
        er.getSourceGroupUri("science top 15"),
        er.getSourceGroupUri("business top 25"),
    ])
)
You will sometimes want to combine different types of source filtering options in the same query. You can do that easily, but it is important to know what the effect will be once you provide multiple source filters in the same query.
With the other filters that you provide in QueryArticles and QueryEvents, the effect of multiple conditions is that you get an intersection of matching results. For example, if you do
q = QueryArticles(keywords = "iphone",
    sourceUri = "nytimes.com")
you would get as a result the list of articles that were published by The New York Times and that mention the keyword iphone.
When you provide sourceUri, sourceLocationUri, and sourceGroupUri, however, we compute a union of the news sources matched by these filters. For example, if you do
q = QueryArticles(keywords = "iphone",
    sourceUri = "nytimes.com",
    sourceLocationUri = er.getLocationUri("Germany"),
    sourceGroupUri = er.getSourceGroupUri("science top 15"))
we would first compute the union of all matching sources. In this case, the sources would be nytimes.com, all sources from Germany, and the top 15 science-related sources. Among the articles from these sources, we would then find those that mention the keyword iphone and return them.
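The union-then-intersection behavior can be modeled with plain Python sets. This is a toy sketch on made-up data, not the Event Registry implementation: the article list, the function matching_articles, and the membership of the location and group sets are all assumptions chosen for illustration.

```python
# Toy data (assumed for illustration only).
ARTICLES = [
    {"source": "nytimes.com", "text": "New iPhone announced"},
    {"source": "spiegel.de", "text": "Das neue iPhone im Test"},
    {"source": "nature.com", "text": "Sensors in the iPhone used for research"},
    {"source": "bbc.co.uk", "text": "iPhone sales rise"},
    {"source": "spiegel.de", "text": "Bundestag debate"},
]


def matching_articles(articles, keyword, source_uris=(), location_sources=(),
                      group_sources=()):
    """Union the three source filters, then intersect with the keyword filter."""
    allowed_sources = set(source_uris) | set(location_sources) | set(group_sources)
    return [
        article for article in articles
        if article["source"] in allowed_sources
        and keyword.lower() in article["text"].lower()
    ]


result = matching_articles(
    ARTICLES,
    keyword="iphone",
    source_uris={"nytimes.com"},       # stands in for sourceUri
    location_sources={"spiegel.de"},   # stands in for sources located in Germany
    group_sources={"nature.com"},      # stands in for a science source group
)
print([article["source"] for article in result])
```

In this model, the bbc.co.uk article is dropped because its source is in none of the three source sets, and the second spiegel.de article is dropped because it does not mention the keyword.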
If you want to ignore some of the sources when using sourceLocationUri or sourceGroupUri, you can do that by using the ignoreSourceUri parameter.
The final way to restrict the returned results based on news sources is by considering their Alexa global site ranking. Alexa ranks websites based on their monthly number of visits. Based on this ranking, we can order the news sources from most to least frequently visited. The QueryArticles constructor provides two parameters, startSourceRankPercentile and endSourceRankPercentile, that can be used to restrict the list of news sources used in the search based on the Alexa ranking of the site.
The default values of the startSourceRankPercentile and endSourceRankPercentile parameters are 0 and 100, respectively. Valid values are between 0 and 100, must be divisible by 10, and startSourceRankPercentile must be smaller than endSourceRankPercentile.
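These constraints can be checked before a query is built. The sketch below assumes the valid range is 0 to 100, consistent with the default values of 0 and 100; validate_rank_percentiles is a hypothetical helper and not part of the eventregistry package.

```python
def validate_rank_percentiles(start=0, end=100):
    """Check the percentile constraints and return them as keyword arguments.

    Hypothetical helper; assumes the valid range is 0..100 in steps of 10.
    """
    for name, value in (("startSourceRankPercentile", start),
                        ("endSourceRankPercentile", end)):
        if not isinstance(value, int) or not 0 <= value <= 100:
            raise ValueError(f"{name} must be an integer between 0 and 100")
        if value % 10 != 0:
            raise ValueError(f"{name} must be divisible by 10")
    if start >= end:
        raise ValueError(
            "startSourceRankPercentile must be smaller than endSourceRankPercentile"
        )
    return {"startSourceRankPercentile": start, "endSourceRankPercentile": end}


rank_filter = validate_rank_percentiles(0, 50)
# The result could then be passed on, for example:
# QueryArticles(keywords="iphone", **rank_filter)
```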
By setting startSourceRankPercentile = 0 and endSourceRankPercentile = 30, you would limit the sources used in the search to the top-ranked news sources that would return approximately 30% of content compared to not using the percentile filter. In other words, 10 percentiles of the news sources don't necessarily represent 10% of all news sources in Event Registry. Instead, the boundaries between the percentiles are set so that the top 10 percentiles of news sources generate approximately the same amount of news content as the next 10 percentiles. The reason for this choice is that the top 10% of news sources generate about 30% of all collected content, while the worst-ranked 10% of news sources contribute only about 2% of news content.
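To make the idea of content-based percentile boundaries concrete, here is a toy sketch that cuts buckets by cumulative article volume rather than by number of sources. It is not Event Registry's actual algorithm; the function content_percentile_buckets and the article counts are assumptions made up for illustration.

```python
def content_percentile_buckets(article_counts, step=10):
    """Assign each source to a percentile bucket cut by cumulative content.

    article_counts: articles per source, ordered from best- to worst-ranked.
    Returns the upper bound (10, 20, ..., 100) of each source's bucket.
    """
    total = sum(article_counts)
    if total <= 0:
        raise ValueError("at least one source must have published content")
    buckets = []
    cumulative = 0
    for count in article_counts:
        cumulative += count
        # Integer form of ceil(100 * cumulative / total / step).
        index = -(-100 * cumulative // (step * total))
        buckets.append(step * max(1, index))
    return buckets


# Ten made-up sources, best-ranked first; the top source alone has 30% of content.
counts = [30, 20, 10, 10, 10, 5, 5, 4, 3, 3]
buckets = content_percentile_buckets(counts)

# Sources kept by an upper bound of 30, and their share of all content.
kept = [c for c, b in zip(counts, buckets) if b <= 30]
share = 100 * sum(kept) / sum(counts)
print(buckets, share)
```

In this toy data, an upper bound of 30 keeps only one of the ten sources, yet that source accounts for 30% of the content, which mirrors the point that percentiles track content volume and not source count.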
In the example below, the query would return articles that mention the keyword iphone and were published by the subset of German media sources that are ranked in the top 50 percentiles of global sources.
q = QueryArticles(keywords = "iphone",
    sourceLocationUri = er.getLocationUri("Germany"),
    startSourceRankPercentile = 0,
    endSourceRankPercentile = 50)