Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Kingfisher Process integration #745

Merged
merged 37 commits into from
Aug 30, 2021
Merged

Conversation

jpmckinney
Copy link
Member

No description provided.

@jpmckinney
Copy link
Member Author

Dockerizing scrapyd makes it more complicated to deploy new spiders, so I will remove that part and deploy scrapyd normally.

@jpmckinney
Copy link
Member Author

I'm not sure why this is needed:

[deploy:kingfisher]
url = http://localhost:6800
project = kingfisher

Maybe it was needed for the Docker deployment? But we will not need it for the regular deployment.

@jpmckinney
Copy link
Member Author

jpmckinney commented Aug 29, 2021

Comment from #656:

The current extension also implements item_error and spider_error signals.

I think we decided not to continue this feature in the new version of Kingfisher Process. Analysts can instead check the Scrapy log to find these errors. See #531.

ideally, we would have coverage for the error scenarios. The existing tests provide a lot of inspiration for how to construct new tests.

This still needs to be addressed. Update: Done now.

The current extension also sends the file_name and url from the item, for which there are corresponding columns in the collection_file table. Also, for FileItem's, it sends the number.

Need to check how the new Process is implemented to see if this needs to be addressed or or not. Update: See #745

@jpmckinney
Copy link
Member Author

I'm not sure how the bug referenced in this commit could occur: 04135ea

- Disable KingfisherProcessAPI2 if DatabaseStore would be enabled
- Pass spider to helper, instead of setting it in one handler
- Increase logging level for some exceptions
- Use a single RABBIT_URL environment variable
- Put the Kingfisher Process API basic authentication in the URL
- Don't use set_value, since inc_value already uses 0 as default

Style changes:

- if-statements should have the error case in the else branch
- Follow style guide for logging messages
- Increase consistency of variable names
- Use consistent quote characters
@jpmckinney
Copy link
Member Author

I'm not sure how the bug referenced in this commit could occur: 04135ea

@jakubkrafka Do you remember how this error occurred? The FilesStore extension needs to be enabled for the Kingfisher integration to work (if it's disabled, no files are written to disk), and that extension will always add the files_store and path fields before the Kingfisher extension handles the item. And errors is a required field for FileError, so the item will not reach the Kingfisher extension if it's missing, because it will fail validation.

@jpmckinney
Copy link
Member Author

jpmckinney commented Aug 29, 2021

If I understand the integration correctly, then FILES_STORE must be an absolute directory. Otherwise, I'm not sure how Kingfisher Process can reliably read the file. Update: It is now guaranteed to be absolute.

@jpmckinney
Copy link
Member Author

jpmckinney commented Aug 30, 2021

The following were sent to the old Kingfisher Process, but are not sent to the new Kingfisher Process:

  • file_name: Kingfisher Process now uses the full path, which is sent.
  • data_type: Kingfisher Process now detects this using OCDS Kit's detect_format.
  • encoding: Kingfisher Process expects UTF-8. I opened an issue to ensure files are written in UTF-8: Ensure files are UTF-8 encoded #783
  • number: Always 0. Kingfisher Process now has a 1:1 relationship between File and FileItem. It simply loads each file written by Kingfisher Collect (which writes separate files for each FileItem).
  • FileItem data: The old Kingfisher Process would receive this via web request. The new Kingfisher Process reads it from a file, same as for File data.

@jpmckinney
Copy link
Member Author

I'm not sure how the bug referenced in this commit could occur: 04135ea

@jakubkrafka Do you remember how this error occurred? The FilesStore extension needs to be enabled for the Kingfisher integration to work (if it's disabled, no files are written to disk), and that extension will always add the files_store and path fields before the Kingfisher extension handles the item. And errors is a required field for FileError, so the item will not reach the Kingfisher extension if it's missing, because it will fail validation.

Got it: FileError does not have files_store or path. I can't reproduce a case where errors or url are missing though.

- Move ExpectedError to tests/__init__.py
- Configure KingfisherProcessAPI2 in spider_with_files_store
- Re-order test_item_scraped_plucked_item test
- Re-add yields in inlineCallbacks tests
- Ensure KingfisherProcessAPI2.channel is always defined
- Ensure sample is a boolean as expected by Kingfisher Process NG
- Use [] instead of get() to avoid shadowing errors
- Add more tests
- Add docstrings
- Open files as binary for ijson
- Move instance variables into __init__ method
- Use spider.logger instead of new logger instance
- In-line absolute_crawl_directory (which is not guaranteed to be absolute)
@jpmckinney jpmckinney merged commit 7de72bd into main Aug 30, 2021
@jpmckinney jpmckinney deleted the NG_kingfisher-process_integration branch August 30, 2021 05:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants