Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REGRESSION?] API end-point detection does not take care properly of command line argument #1935

Closed
kelson42 opened this issue Oct 29, 2023 · 12 comments · Fixed by #1939
Closed
Assignees
Labels
Milestone

Comments

@kelson42
Copy link
Collaborator

See:

$ mw --mwApiPath=/api.php --mwModulePath=/load.php --mwUrl=https://minecraft.wiki/ --mwWikiPath=/ --verbose

> [email protected] start
> ./node_modules/.bin/ts-node-esm ./src/cli.ts [email protected] --osTmpDir=/dev/shm --optimisationCacheUrl=http://s3.us-west-1.wasabisys.com/?bucketName=org-kiwix-dev-mwoffliner&keyId=6JE56A5IWY5Z3H3L3KH6&secretAccessKey=m2vQMjnneDjnrYztjfHrAUKneIf0SNGt75k9iITm --mwApiPath=/api.php --mwModulePath=/load.php --mwUrl=https://minecraft.wiki/ --mwWikiPath=/ --verbose

[log] [2023-10-29T12:33:14.063Z] Starting mwoffliner v1.13.0...
[info] [2023-10-29T12:33:14.066Z] Using custom flavour: no
[log] [2023-10-29T12:33:14.071Z] closing sanitize redis DB
[log] [2023-10-29T12:33:14.466Z] Successfully logged in S3
[log] [2023-10-29T12:33:14.468Z] Getting text direction...
[log] [2023-10-29T12:33:14.468Z] Getting site info...
[log] [2023-10-29T12:33:14.468Z] Getting sub-title...
[info] [2023-10-29T12:33:14.469Z] Downloading [https://minecraft.wiki/wiki/]
[info] [2023-10-29T12:33:14.471Z] Getting JSON from [https://minecraft.wiki/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2&siprop=general%7Cnamespaces%7Cstatistics%7Cvariables%7Ccategory%7Cwikidesc]
[info] [2023-10-29T12:33:14.472Z] Downloading [https://minecraft.wiki/wiki/]
[log] [2023-10-29T12:33:14.697Z] Text direction is [ltr]
[warn] [2023-10-29T12:33:14.700Z] Failed to get [https://minecraft.wiki/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2&siprop=general%7Cnamespaces%7Cstatistics%7Cvariables%7Ccategory%7Cwikidesc] [status=404]
[error] [2023-10-29T12:33:14.700Z] FATAL - Failed to get MediaWiki Metadata
[error] [2023-10-29T12:33:14.703Z] Failed to run mwoffliner after [2s]: {
	"message": "Request failed with status code 404",

The URL tested is https://minecraft.wiki/w/api.php and it should be https://minecraft.wiki/api.php given the command line argument! Pretty sure this is a regression we have introduced.

@VadimKovalenkoSNF
Copy link
Collaborator

This bug could be reproduced since apiUrlDirector was responsible for building the endpoint in downloader.query() here. The problem is that mwWikiPath wasn't taken into consideration while building requests.

Regarding parameter lists:

mwWikiPath: 'Mediawiki wiki base path (per default "/wiki/")',

That means that the default logic is to place /wiki/ to the URL automatically if no mwWikiPath is specified. This logic is missing in the main so this leads to the side-effect (at least for Wikimedia wikis) - no /wiki/ attached, but articles scraped because this param is not needed for them. And that's why this bug has been missing previously. In fact, while regular scraping of en.wikipedia.org you have to get an error while trying to get site info from this endpoint:
https://en.wikipedia.org/wiki/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2&siprop=general%7Cnamespaces%7Cstatistics%7Cvariables%7Ccategory%7Cwikidesc

Note the presence of /wiki/ and /w/api.php parts. /wiki/ should not be there but it is placed in this endpoint by default - I implemented this in the PR #1939. So now, for the right scrapping of Mediawiki wikis you need to explicitly specify mwWikiPath param like this --mwWikiPath=/.

Also, I noticed that mwWikiPath , mwModulePath, mwApiPath don't have test coverage at all, and this ticket brings this problem to light.

@benoit74
Copy link
Contributor

benoit74 commented Nov 7, 2023

Do you mean that /wiki/ (--mwWikiPath) will always be prepended in all URLs, even for APIs URLs?

If so, I'm not sure this is a good idea, see https://pokemon.fandom.com/:

@VadimKovalenkoSNF
Copy link
Collaborator

@benoit74 , If you try to run this command:
npm start -- --mwUrl=https://pokemon.fandom.com [email protected] --articleList=Volcarona
from #1939 you end up with an error in Mediawiki.getSiteInfo() because it will try to handle the response from this URL - https://pokemon.fandom.com/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2&siprop=general%7Cnamespaces%7Cstatistics%7Cvariables%7Ccategory%7Cwikidesc

But if you set --mwWikiPath='' and ----mwActionApiPath='api.php' (try this command):
npm start -- --mwUrl=https://pokemon.fandom.com [email protected] --articleList=Volcarona --mwWikiPath='' --mwActionApiPath='api.php'

You will get a valid Fandom wiki action api endpoint to get site info:
https://pokemon.fandom.com//api.php?action=query&meta=siteinfo&format=json&formatversion=2&siprop=general%7Cnamespaces%7Cstatistics%7Cvariables%7Ccategory%7Cwikidesc

To get site info from Minecraft wiki, you can try this command as well (only from #1939)
npm start -- --mwUrl=https://minecraft.wiki [email protected] --articleList=Enchanting --mwWikiPath='' --mwActionApiPath='api.php'

And get
https://minecraft.wiki/api.php?action=query&format=json&formatversion=2&meta=siteinfo&siprop=general%7Cnamespaces%7Cstatistics%7Cvariables%7Ccategory%7Cwikidesc

Note: scrape process will failed with error anyway because Mediawiki REST API (/rest.php) still not adapted in mwoffliner

@kelson42
Copy link
Collaborator Author

kelson42 commented Nov 8, 2023

I see clearly a problem which is that in Zimfarm, we have no way to make the differrence bezween an empty string argument and an undefined argument. Usually we use the value "/index.php" for the wikipath when needed, but sounds more like a workaround.

@VadimKovalenkoSNF
Copy link
Collaborator

I have no idea where the REST API is ^^

Fandom wiki supports it (check https://pokemon.fandom.com/wiki/Special:Version.

Example: https://pokemon.fandom.com/rest.php/v1/page/Volcarona/html

Another way to get article from Fandom wiki is by using Action API:
https://pokemon.fandom.com/api.php/?action=parse&format=json&page=Volcarona&prop=text%7Clinks%7Cimages%7Cexternallinks%7Csections%7Ciwlinks&formatversion=2

@benoit74
Copy link
Contributor

benoit74 commented Nov 8, 2023

Setting --mwWikiPath='' for pokemon.fandom.com is not a problem if all articles are in the /wiki path?

@VadimKovalenkoSNF
Copy link
Collaborator

Setting --mwWikiPath='' for pokemon.fandom.com is not a problem if all articles are in the /wiki path?

It will depend on the chosen renderer for the Fandom wiki, but probably no: both rest.php and action=parse don't need /wiki path.

@VadimKovalenkoSNF
Copy link
Collaborator

I see clearly a problem which is that in Zimfarm, we have no way to make the differrence bezween an empty string argument and an undefined argument.

@kelson42 This is a blocker because I need to omit wikiPath somehow during the concatenation process in URL builder in #1939 but at the same time, set default value equal to wiki/ as it was described in src/parameterList.ts. The only way to do this was to set it explicitly as empty string to prevent usage of the default wiki/ string defined in Mediawiki instance.

@kelson42
Copy link
Collaborator Author

kelson42 commented Nov 8, 2023

@VadimKovalenkoSNF then default value of mwWikiPath has to be made empty string. I see no other solution.

@benoit74
Copy link
Contributor

benoit74 commented Nov 9, 2023

@kelson42 isn't this something we should fix on the Zimfarm? I don't think this is a big effort to implement something which says that "" "magic" string value means that the parameter must be set to an empty string. I don't see a situation where this "magic" value would really be needed + we might have other scrapers which might need this.

I'm more concerned about the fact that once 1.4.0 is out and we decide to use it on the farm (probably immediately) we will have to adjust set of parameters: remove mwApiPath, add mwRestApiPath, add mwActionApiPatch, ... This is usual. What is a bit unusual is that we will also have to update all recipes and transfer the mwApiPath value to mwActionApiPath. I'm not sure we already did this before, @rgaudin do we have tooling for that? Not a big issue but something to keep in mind. And it also means that we won't be able to rollback to 1.3.0 easily and that we shouldn't let the user select this image tag in the list. It makes me realize that the fact that parameters are directly linked to the schedule is an approximation, it should be linked to a range of image versions (but this is probably a huge effort for a small occasional need).

@rgaudin
Copy link
Member

rgaudin commented Nov 9, 2023

@benoit74 we have no tools for that. We've had such migrations in the past and I would run a custom script that connected to the mongo DB.

I think tying offliner version and schema might bring more frustration given how organic the scrapers development is. History log of schedules modification seems more useful for instance in a similar vein.

I've long been a supporter of the farm remaining a tool for skilled farmers… with the addition of Wizards to help create/update recipes by non-skilled people.

@kelson42
Copy link
Collaborator Author

kelson42 commented Nov 9, 2023

@benoit74 I prefer to have default empty string for this specific parameter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment