CA Ventura Sheriff #42
base: dev
For the record, I'm trying to finish this scraper. A few things I know need to be worked on:
Some of this requires the visually identifiable content section to be properly extracted, and code-wise it's a mess, e.g., things I thought might be good indicators turn out to be the 18th incarnation of a particular class. This may work on index pages and subpages but needs to be tested.
newsroomdev suggests trying to cure BeautifulSoup vs. mypy problems, particularly with lines 214-215. https://github.com/biglocalnews/clean-scraper/blob/ca-51/clean/ca/sacramento_pd.py#L180-L224
9be0947 to da91b41
8bb2691 to 8ec8ffc
@irenecasado, I have ... destroyed all the history here. I am so, so very sorry, and so tired. Please make a copy of your local version ...
@newsroomdev @zstumgoren I believe this is ready for review if you want.
    self.cache.download(
        full_filename,
        target_url,
        force=False,  # Do NOT automatically rescrape subpages
Could you walk me through setting it to `False` here? On the backend side, once these are saved to a local cache, they're copied over to Azure Blob Storage with timestamps. Leaving it as `True` ensures we're catching changes to the page and its `asset_url`'s byte lengths.
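For anyone following along, here is a minimal sketch of what a `force` flag typically controls in a download cache. It is not the project's actual `cache.download` implementation: `download_to_cache`, its `fetch` parameter, and the example URL are all hypothetical, and the fetcher is faked so the snippet runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def download_to_cache(
    cache_dir: Path,
    filename: str,
    url: str,
    fetch: Callable[[str], bytes],
    force: bool = False,
) -> bool:
    """Write the body of `url` to cache_dir/filename.

    Hypothetical helper: returns True if a fetch happened, False if an
    existing cached copy was kept because force was False.
    """
    target = cache_dir / filename
    if target.exists() and not force:
        # Cached copy is kept as-is, so upstream changes go unnoticed.
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch(url))
    return True


# Fake fetcher standing in for an HTTP GET; each call returns a new "version".
calls: List[str] = []


def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return b"<html>v%d</html>" % len(calls)


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    page_url = "https://example.com/subpage"  # placeholder URL
    first = download_to_cache(cache_dir, "sub/page.html", page_url, fake_fetch)
    second = download_to_cache(cache_dir, "sub/page.html", page_url, fake_fetch, force=False)
    third = download_to_cache(cache_dir, "sub/page.html", page_url, fake_fetch, force=True)
    final_body = (cache_dir / "sub/page.html").read_bytes()
```

Under these assumptions, the second call is a no-op while the third overwrites the cached file, which is the trade-off being discussed: `force=False` saves requests, `force=True` picks up page changes.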
Whatever you'd like to do is fine by me. Looks like there were about 58 subpages and I was assuming there was no real reason to rescrape those. Index pages are of course forced to rescrape.
There's no info on file size here that I'm aware of, FYI. So unless you rescrape the asset itself, or alternatively send a HEAD request and get accurate, reliable results, rescraping the subpage would only show you any new assets, and I don't know if anything ever gets refreshed.
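A rough sketch of the HEAD idea, in case it helps: ask the server for headers only and compare the reported `Content-Length` against the size of the cached copy. `remote_size` and `needs_redownload` are hypothetical names, not part of this repo, and not every server reports `Content-Length` (or reports it reliably), so a missing value is treated as "unknown, redownload". The demo talks to a throwaway local server rather than the real site.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


def remote_size(
    url: str,
    opener: Optional[urllib.request.OpenerDirector] = None,
    timeout: float = 10.0,
) -> Optional[int]:
    """Return the Content-Length from a HEAD request, or None if not reported."""
    opener = opener or urllib.request.build_opener()
    request = urllib.request.Request(url, method="HEAD")
    with opener.open(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length and length.isdigit() else None


def needs_redownload(
    url: str,
    cached_size: Optional[int],
    opener: Optional[urllib.request.OpenerDirector] = None,
) -> bool:
    """Redownload when either size is unknown or the two sizes differ."""
    size = remote_size(url, opener=opener)
    return size is None or cached_size is None or size != cached_size


class _DemoHandler(BaseHTTPRequestHandler):
    """Local stand-in for an asset server; always reports 1234 bytes."""

    def do_HEAD(self) -> None:
        self.send_response(200)
        self.send_header("Content-Length", "1234")
        self.end_headers()

    def log_message(self, *args) -> None:  # keep the demo quiet
        pass


server = HTTPServer(("127.0.0.1", 0), _DemoHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Bypass any configured proxy for the local demo only.
direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))
demo_url = f"http://127.0.0.1:{server.server_address[1]}/asset.pdf"

size = remote_size(demo_url, opener=direct)
unchanged = needs_redownload(demo_url, 1234, opener=direct)
changed = needs_redownload(demo_url, 999, opener=direct)

server.shutdown()
server.server_close()
```

This only detects changes that alter the byte length; an asset replaced by one of identical size would slip through, so it is a cheap first check rather than a guarantee.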
2bba448 to 5d22458