New Kingfisher Process integration #745

jpmckinney · 2021-06-22T15:08:39Z

No description provided.

jpmckinney · 2021-08-29T00:12:04Z

Dockerizing scrapyd makes it more complicated to deploy new spiders, so I will remove that part and deploy scrapyd normally.

jpmckinney · 2021-08-29T00:39:10Z

I'm not sure why this is needed:

[deploy:kingfisher]
url = http://localhost:6800
project = kingfisher

Maybe it was needed for the Docker deployment? But we will not need it for the regular deployment.

jpmckinney · 2021-08-29T17:59:15Z

Comment from #656:

The current extension also implements item_error and spider_error signals.

I think we decided not to continue this feature in the new version of Kingfisher Process. Analysts can instead check the Scrapy log to find these errors. See #531.

Ideally, we would have coverage for the error scenarios. The existing tests provide a lot of inspiration for how to construct new tests.

This still needs to be addressed. Update: Done now.

The current extension also sends the file_name and url from the item, for which there are corresponding columns in the collection_file table. Also, for FileItems, it sends the number.

Need to check how the new Process is implemented to see whether this needs to be addressed or not. Update: See #745

jpmckinney · 2021-08-29T18:50:52Z

I'm not sure how the bug referenced in this commit could occur: 04135ea

- Disable KingfisherProcessAPI2 if DatabaseStore would be enabled
- Pass spider to helper, instead of setting it in one handler
- Increase logging level for some exceptions
- Use a single RABBIT_URL environment variable
- Put the Kingfisher Process API basic authentication in the URL
- Don't use set_value, since inc_value already uses 0 as default

Style changes:

- if-statements should have the error case in the else branch
- Follow style guide for logging messages
- Increase consistency of variable names
- Use consistent quote characters
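
Two of the URL-related changes in this pull request (joining with urljoin, from commit 8b93e84, and carrying the basic authentication in the URL) can be illustrated with the standard library alone. This is a minimal sketch, not the extension's code: the helper names, the credentials, and the endpoint path are all hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def endpoint(base_url: str, path: str) -> str:
    """Join a configured base URL and an absolute path.

    Because the path is absolute, the result is the same whether or not
    the base URL ends with a slash.
    """
    return urljoin(base_url, path)


def credentials(base_url: str):
    """Return the (username, password) embedded in the URL, or (None, None)."""
    parts = urlsplit(base_url)
    return parts.username, parts.password


# Hypothetical values: the real setting and endpoint are not shown in this thread.
for base in ("http://user:secret@localhost:8000", "http://user:secret@localhost:8000/"):
    print(endpoint(base, "/api/v1/create_collection"), credentials(base))
```

Note that an absolute path replaces any path already present on the base URL, so this approach fits a base URL that names only the host and port.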

jpmckinney · 2021-08-29T23:10:03Z

I'm not sure how the bug referenced in this commit could occur: 04135ea

@jakubkrafka Do you remember how this error occurred? The FilesStore extension needs to be enabled for the Kingfisher integration to work (if it's disabled, no files are written to disk), and that extension will always add the files_store and path fields before the Kingfisher extension handles the item. And errors is a required field for FileError, so the item will not reach the Kingfisher extension if it's missing, because it will fail validation.

jpmckinney · 2021-08-29T23:24:28Z

If I understand the integration correctly, then FILES_STORE must be an absolute directory. Otherwise, I'm not sure how Kingfisher Process can reliably read the file. Update: It is now guaranteed to be absolute.
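
One way to guarantee an absolute FILES_STORE is to resolve the configured value once, before any path is sent to Kingfisher Process. A minimal sketch, assuming a hypothetical helper name (absolute_files_store); the actual change is in commit ac40973 and its code is not shown in this thread.

```python
import os


def absolute_files_store(files_store: str) -> str:
    """Return the configured directory as an absolute, normalized path.

    A relative value is resolved against the current working directory;
    an absolute value is only normalized.
    """
    return os.path.abspath(files_store)


# "data" is a hypothetical relative setting value.
resolved = absolute_files_store("data")
print(os.path.isabs(resolved))
```

Resolving at startup matters because a relative path is interpreted against the working directory of whichever process reads it, and the reader here is a different process from the writer.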

jpmckinney · 2021-08-30T03:14:45Z

The following were sent to the old Kingfisher Process, but are not sent to the new Kingfisher Process:

file_name: Kingfisher Process now uses the full path, which is sent.
data_type: Kingfisher Process now detects this using OCDS Kit's detect_format.
encoding: Kingfisher Process expects UTF-8. I opened an issue to ensure files are written in UTF-8: Ensure files are UTF-8 encoded #783
number: Always 0. Kingfisher Process now has a 1:1 relationship between File and FileItem. It simply loads each file written by Kingfisher Collect (which writes separate files for each FileItem).
FileItem data: The old Kingfisher Process would receive this via web request. The new Kingfisher Process reads it from a file, same as for File data.
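
Since encoding is no longer sent and Kingfisher Process expects UTF-8, a check like the following could be run on the bytes before they are written. This is a minimal sketch with a hypothetical helper name (is_utf8); it is not taken from Kingfisher Collect or from #783.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text is valid in one encoding and invalid in the other.
print(is_utf8("café".encode("utf-8")))
print(is_utf8("café".encode("iso-8859-1")))
```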

jpmckinney · 2021-08-30T03:43:14Z

I'm not sure how the bug referenced in this commit could occur: 04135ea

@jakubkrafka Do you remember how this error occurred? The FilesStore extension needs to be enabled for the Kingfisher integration to work (if it's disabled, no files are written to disk), and that extension will always add the files_store and path fields before the Kingfisher extension handles the item. And errors is a required field for FileError, so the item will not reach the Kingfisher extension if it's missing, because it will fail validation.

Got it: FileError does not have files_store or path. I can't reproduce a case where errors or url are missing though.

jakubkrafka added 8 commits June 20, 2021 11:12

added integration to extensions and settings

4aa7d34

isort

b739fef

dockerized

73aaf13

updagraded zope.interface

f6e6b10

python 3,6

902664e

shasum fix

3163fd3

added deployment of spiders

99754f7

added deployment

0856ccb

jpmckinney mentioned this pull request Jun 25, 2021

603 kingfisher process middleware ng #656

Closed

increased download limit

787053f

jpmckinney added 6 commits August 28, 2021 20:13

build: Remove Docker configuration

1b8ca33

test: Use lowercase filenames

7434180

chore(settings): Use same environment variables for Rabbit as other components

ebf2531

chore: Rename KingfisherProcessNGAPI to KingfisherProcessAPI2, to avoid "new new generation" in future

0fb05a8

chore(requirements): Sort

da471d7

build: pip-compile

aa7b99b

jpmckinney added 2 commits August 28, 2021 20:39

build: Remove deploy section of scrapy.cfg

8ba68db

chore(settings): Adjust KingfisherProcessAPI2 priority. Edit docs to distinguish versions.

46e70f5

jpmckinney mentioned this pull request Aug 29, 2021

Handle spider_closed signal, if reason is 'cancelled' or 'shutdown' #278

Closed

jpmckinney added 5 commits August 29, 2021 15:02

fix: Use urljoin to avoid errors in trailing slashes in environment variables

8b93e84

test: Fix tests broken by 8b93e84

aec0aeb

chore: Use routing_key instead of publish_key

ae161c3

test: Add tests for pluck and keep_collection_open branches

3dfe185

jpmckinney added 9 commits August 29, 2021 19:37

test: Skip some tests if KINGFISHER_COLLECT_DATABASE_URL isn't set

c59552d

test: Fix "yield tests were removed in pytest 4.0"

52937ec

uk_contracts_finder: Update encoding (was set in #102)

f06a6e1

chore: Fix indentation and trailing comma in base_spider

4114166

ci: Add RabbitMQ

29d1f94

chore: Split middlewares into downloadermiddlewares and spidermiddlewares, closes #579

5417d0a

australia: Use consistent class attribute order

6d34723

docs: Add docstrings to pipelines and add to documentation

34dfe09

docs: Re-organize BaseSpider docstring

41ac215

jpmckinney added 6 commits August 30, 2021 00:43

test: Tidy test_kingfisher_process_api.py

0ddb9d7

- Move ExpectedError to tests/__init__.py
- Configure KingfisherProcessAPI2 in spider_with_files_store
- Re-order test_item_scraped_plucked_item test
- Re-add yields in inlineCallbacks tests

feat: Update KingfisherProcessAPI2 after adding tests

25e329a

- Ensure KingfisherProcessAPI2.channel is always defined
- Ensure sample is a boolean as expected by Kingfisher Process NG
- Use [] instead of get() to avoid shadowing errors
- Add more tests
- Add docstrings

chore: Tidy DatabaseStore

a501a1d

- Open files as binary for ijson
- Move instance variables into __init__ method
- Use spider.logger instead of new logger instance
- In-line absolute_crawl_directory (which is not guaranteed to be absolute)

fix: Ensure path is absolute for Kingfisher Process NG

ac40973

test: Ensure skippable tests are always run in CI

977d622

test: Test KingfisherProcessAPI2 stats

ba820fc

jpmckinney merged commit 7de72bd into main Aug 30, 2021

jpmckinney deleted the NG_kingfisher-process_integration branch August 30, 2021 05:37
