getArticleUris sometimes null sometimes works (based on order / amount of urls) #63
Comments
Another interesting example. Request:
{
  "articleUrl": [
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490",
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html"
  ],
  "includeAllVersions": true,
  "deep": true,
  "apiKey": "XXX"
}
Response:
{
  "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490": "7763040460",
  "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html": "7763074647"
}
VS the same two URLs in reverse order. Request:
{
  "articleUrl": [
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html",
    "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490"
  ],
  "includeAllVersions": true,
  "deep": true,
  "apiKey": "XXX"
}
Response:
{
  "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ito=1490&ns_campaign=1490": "7763040460"
}
|
There doesn't seem to be an error related to this API call. The article that we have in our DB is "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html?ns_mchannel=rss&ns_campaign=1490&ito=1490". When mapping a URL to a URI, we also create alternative versions of the URL and test those. One version is without the parameters; another is without the "www." prefix. The URI that you receive is the URI of the article that we have in our database. Regarding the first reported issue (i.e. not returning a URI when providing multiple URLs): in your case, it seems that you first made the call with a single URL (https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html) and later repeated the query with multiple URLs. I hope this explains the confusion. |
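The alternative-URL idea described in the reply above can be illustrated with a minimal sketch. This is only an assumption about what such variants might look like (the original URL, the URL without query parameters, and the same two without the "www." prefix); it is not the service's actual resolution logic, and `candidate_urls` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(url: str) -> list[str]:
    """Return the URL followed by simplified variants, without duplicates.

    Hypothetical illustration only: variants are built by optionally
    dropping the query string and optionally dropping a leading "www.".
    """
    parts = urlsplit(url)

    # Host as given, plus the host without "www." when that prefix is present.
    hosts = [parts.netloc]
    if parts.netloc.startswith("www."):
        hosts.append(parts.netloc[len("www."):])

    # Query as given, plus an empty query when there is one to drop.
    queries = [parts.query]
    if parts.query:
        queries.append("")

    variants: list[str] = []
    for host in hosts:
        for query in queries:
            candidate = urlunsplit(
                (parts.scheme, host, parts.path, query, parts.fragment)
            )
            if candidate not in variants:
                variants.append(candidate)
    return variants
```

A lookup could then try each candidate in order against whatever URLs are stored. Note that this sketch does not reorder query parameters, so a stored URL whose parameters appear in a different order would still need separate handling.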
Hi Greg, thanks for the answer. Unfortunately all of the example URLs from my original message now return What I would like to achieve is:
As you see, Since you identified As it stands, I am not sure how to use the API to reliably get back URIs. |
this call does not return null, since that is the url that we have in the db. What you would like to achieve is generally exactly what the article mapper is for. The only issue is that if you have a url that we don't have in the DB, then we cannot return it. If you provide a url that is not exactly the url that we have in the DB, then in some cases we can resolve the issue and in some not. If, on the other hand, we store url We cannot return you the URI for My suggestion is that you use the API for your articles. You then take the articles for which you get a valid URI and for the remaining ones you call the Do you have a particular reason why you need specifically the articleMapper? |
Yes, the reason for using articleMapper is that I source URLs from various places, and not just from newsapi.ai. Therefore the URLs might come with various extra query params attached that might not match the ones that newsapi.ai is storing. Out of curiosity, why would the article get deleted? |
Ok. Does the endpoint that I suggested for you (https://newsapi.ai/documentation?tab=extractArticleInfo) therefore work for your purposes?
The articles that get deleted are duplicated articles that come from the same source. So if we see that we imported the same article with a different url multiple times, we remove such duplicates since they bring no value to any user. |
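The two-step approach suggested in this thread (map URLs to URIs first, then fall back to article extraction for the URLs that come back as null) might be sketched as follows. The two callables `map_urls` and `extract_info` are hypothetical stand-ins for the article mapper and the extractArticleInfo endpoint; they are passed in as parameters so that no particular client signature is assumed.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def resolve_with_fallback(
    urls: Iterable[str],
    map_urls: Callable[[List[str]], Dict[str, Optional[str]]],
    extract_info: Callable[[str], dict],
) -> Tuple[Dict[str, str], Dict[str, dict]]:
    """Map URLs to URIs, extracting article info only for unmapped URLs.

    map_urls:     hypothetical wrapper around the article mapper; returns
                  {url: uri or None}.
    extract_info: hypothetical wrapper around extractArticleInfo; returns
                  the extracted article data for one URL.
    """
    url_list = list(urls)
    mapping = map_urls(url_list)

    # URLs the mapper resolved to a non-empty URI.
    mapped = {u: mapping[u] for u in url_list if mapping.get(u)}

    # Everything else (null or missing) goes through the fallback.
    extracted = {u: extract_info(u) for u in url_list if u not in mapped}
    return mapped, extracted
```

Keeping the fallback per URL means the more expensive call is only made for the URLs the mapper could not resolve.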
Yes, that endpoint returns the article content even for the URLs where ArticleMapper returns null (which is still something I don't understand the reason of -- why could not ArticleMapper use the same URL -> URI resolution logic?). However, it's ~9 times more expensive compared to ArticleMapper + GetArticle (to get 100 articles from given URLs with ExtractArticleInfo, I need 100 tokens, to get 100 articles from ArticleMapper + GetArticle, I need 11 tokens), so it won't be feasible for our usecase. |
Extract article info should use 0.05 tokens per url so 5 tokens per 100
articles. Article mapper cannot return you an article for a URL that it
hasn't seen or doesn't keep in our database. If you have articles from the
sources that we do cover, I don't think there will be many articles for
which the article mapper will return null.
…On Thu, Oct 12, 2023 at 1:52 PM Petr Pilař ***@***.***> wrote:
Yes, that endpoint returns the article content even for the URLs where
ArticleMapper returns null (which is still something I don't understand the
reason of -- why could not ArticleMapper use the same URL -> URI resolution
logic?). However, it's ~9 times more expensive compared to ArticleMapper +
GetArticle (to get 100 articles from given URLs with ExtractArticleInfo, I
need 100 tokens, to get 100 articles from ArticleMapper + GetArticle, I
need 11 tokens), so it won't be feasible for our usecase.
--
-------------------------------------------
Gregor Leban
Phone: +386-31-321-804
Skype: gregorleban
-------------------------------------------
|
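To make the token figures quoted in this exchange concrete, here is a small arithmetic sketch. The 0.05 tokens per URL and the 11 tokens for ArticleMapper + GetArticle are the numbers stated by the two participants above, not values checked against any pricing documentation; the function name is hypothetical.

```python
def extract_article_info_cost(n_urls: int, tokens_per_url: float = 0.05) -> float:
    """Token cost of extracting n_urls articles at the quoted per-URL rate."""
    return n_urls * tokens_per_url


n_urls = 100
extract_cost = extract_article_info_cost(n_urls)  # rate quoted in the reply
mapper_plus_get_cost = 11.0  # estimate quoted in the question, for 100 articles

print(f"ExtractArticleInfo:         {extract_cost:.2f} tokens")
print(f"ArticleMapper + GetArticle: {mapper_plus_get_cost:.2f} tokens")
```

Under these quoted figures, extraction for 100 URLs comes to about 5 tokens, which is below the 11-token estimate rather than roughly nine times above it.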
I see, that's great to know about ExtractArticleInfo token usage; it seems even better and easier than ArticleMapper + GetArticle. Based on this I consider my issue resolved. But I have to say the API is quite unintuitive in this regard. I see no reason why e.g. "https://www.dailymail.co.uk/money/saving/article-12583973/Half-Premium-Bonds-prizes-won-maximum-50-000-holding.html" would return null with ArticleMapper, yet ExtractArticleInfo has no problem finding the URL. So given that ExtractArticleInfo has the information, your system knows about the URL. |
Aha, sorry for the misunderstanding. The Extract article info endpoint is
actually not using our DB at all. It uses our information extraction
service to extract the article information directly from the URL: it
downloads the page and extracts the article information directly from the
page.
|
Example (this happens with both the Python and REST APIs, as the Python one just calls the REST API directly):
Multiple URLs (the dailymail will get null -- only if it's second, it works if it's first!):
Single URL (the dailymail will be mapped):