CA Ventura Sheriff #42

Open
wants to merge 5 commits into
base: dev
Conversation

irenecasado
Collaborator

No description provided.

@newsroomdev newsroomdev requested a review from zstumgoren July 2, 2024 18:32
@newsroomdev newsroomdev linked an issue Aug 14, 2024 that may be closed by this pull request
3 tasks
@stucka
Contributor

stucka commented Aug 15, 2024

For the record, I'm trying to finish this scraper. A few things I know need to be worked on:

  • Include AB748 page
  • Ensure single-document cases are properly scraped from the index pages ... possibly need to split by the H2s to get the case IDs?
  • Ensure multi-document cases are properly scraped, as the URLs will look a lot like the generic nav links.
  • Ensure subpages are properly scraped, to verify that no index pages are ignored.

Some of this requires the visually identifiable content section to be properly extracted, and codewise it's a mess: things I thought might be good indicators turn out to be the 18th incarnation of a particular class. This may work on index and subpages but needs to be tested:

# Grab the main content div, then keep only the markup from the first <h2> onward
pageguts = soup.find("div", attrs={"class": "page-content"})
focusedguts = BeautifulSoup("<h2" + "<h2".join(pageguts.prettify().split("<h2")[1:]), "html.parser")
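
One way to read the "split by the H2s" idea without string-splitting the prettified markup is to walk each `<h2>` and collect the links that follow it, stopping at the next `<h2>`. The sketch below assumes the case IDs are the `<h2>` text and that headings and their links are siblings inside `div.page-content`; neither has been checked against the live Ventura pages, and `links_by_heading` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, Tag


def links_by_heading(html: str) -> dict[str, list[str]]:
    """Group link hrefs under the text of the <h2> that precedes them."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", attrs={"class": "page-content"})
    if not isinstance(content, Tag):
        return {}  # content div missing: layout changed or wrong page

    grouped: dict[str, list[str]] = {}
    for heading in content.find_all("h2"):
        case_id = heading.get_text(strip=True)  # assumption: heading text is the case ID
        hrefs: list[str] = []
        # True matches every tag sibling and skips bare text nodes
        for sibling in heading.find_next_siblings(True):
            if sibling.name == "h2":
                break  # reached the next case
            candidates = [sibling] if sibling.name == "a" else []
            candidates += sibling.find_all("a")
            hrefs += [a["href"] for a in candidates if a.has_attr("href")]
        grouped[case_id] = hrefs
    return grouped
```

A heading with a single link would then be a single-document case, and one with several links a multi-document case, provided the sibling assumption holds.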

@stucka
Contributor

stucka commented Aug 15, 2024

newsroomdev suggests trying to cure BeautifulSoup vs. mypy problems particularly with lines 214-215. https://github.com/biglocalnews/clean-scraper/blob/ca-51/clean/ca/sacramento_pd.py#L180-L224
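
The exact mypy errors are not quoted in this thread, so the following is only a guess at the usual pattern: `find()` is typed as possibly returning `None` or a non-`Tag` element, and attribute values can be typed as a string or a list, so mypy objects to chained calls on the result. Narrowing with `isinstance` before use generally satisfies it. `first_href` is a hypothetical helper for illustration, not code from the linked sacramento_pd scraper.

```python
from typing import Optional

from bs4 import BeautifulSoup, Tag


def first_href(html: str, css_class: str) -> Optional[str]:
    """Return the href of the first link inside div.<css_class>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=css_class)
    if not isinstance(container, Tag):  # narrows away None and text nodes
        return None
    link = container.find("a", href=True)
    if not isinstance(link, Tag):
        return None
    href = link.get("href")
    # Attribute values may be typed as str or list; keep only the str case.
    return href if isinstance(href, str) else None
```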

@stucka stucka force-pushed the ca_ventura_sheriff branch from 9be0947 to da91b41 Compare August 16, 2024 13:03
@stucka stucka force-pushed the ca_ventura_sheriff branch from 8bb2691 to 8ec8ffc Compare August 16, 2024 13:18
@stucka
Contributor

stucka commented Aug 16, 2024

@irenecasado , I have ... destroyed all the history here. I am so so very sorry, and so tired. Please make a copy of your local version ...

@stucka stucka requested a review from newsroomdev August 16, 2024 13:28
@stucka stucka marked this pull request as ready for review August 19, 2024 19:37
@stucka
Contributor

stucka commented Sep 9, 2024

@newsroomdev @zstumgoren I believe this is ready for review if you want.

@stucka stucka changed the title from "CA Ventura Sheriff - Draft" to "CA Ventura Sheriff" Sep 9, 2024
self.cache.download(
    full_filename,
    target_url,
    force=False,  # Do NOT automatically rescrape subpages
Member

Could you walk me through setting it to False here? On the backend side, once these are saved to a local cache, they're copied over to Azure Blob Storage with timestamps. Leaving it as True ensures we're catching changes to the page and its asset_url's byte lengths

Contributor

Whatever you'd like to do is fine by me. Looks like there were about 58 subpages and I was assuming there was no real reason to rescrape those. Index pages are of course forced to rescrape.
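
For readers skimming the thread, the behaviour under discussion is roughly the following. This is a stand-in written for illustration, not the actual `cache.download` implementation in clean-scraper; `cached_fetch` and its `fetch` callback are hypothetical.

```python
from pathlib import Path
from typing import Callable


def cached_fetch(path: Path, fetch: Callable[[], str], force: bool) -> str:
    """Return the cached text unless force is set or nothing is cached yet."""
    if path.exists() and not force:
        return path.read_text(encoding="utf-8")  # reuse the earlier scrape
    text = fetch()  # e.g. an HTTP GET in a real scraper
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text
```

Under this reading, index pages would be fetched with `force=True` on every run and the roughly 58 subpages with `force=False`. The trade-off is fewer requests against the chance of missing a subpage that changed after it was first cached.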

Contributor

There's no info on file size here that I'm aware of, FYI. So unless you rescrape the asset itself, or alternatively send a HEAD request and get accurate, reliable results, rescraping the subpage would only show you any new assets, and I don't know if anything ever gets refreshed.
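
The HEAD option mentioned above could look something like this minimal sketch, which uses only the standard library. It assumes the asset server answers HEAD requests and reports at least one of `Content-Length`, `Last-Modified`, or `ETag` honestly, which has not been verified for this site; `head_metadata` and `looks_changed` are hypothetical names.

```python
import urllib.request
from typing import Mapping, Optional


def head_metadata(url: str, timeout: float = 30.0) -> dict[str, Optional[str]]:
    """Send a HEAD request and return the headers useful for change detection."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        headers = response.headers
        return {
            "content_length": headers.get("Content-Length"),
            "last_modified": headers.get("Last-Modified"),
            "etag": headers.get("ETag"),
        }


def looks_changed(
    old: Mapping[str, Optional[str]], new: Mapping[str, Optional[str]]
) -> bool:
    """True if any header present in both snapshots differs."""
    shared = [k for k in old if old.get(k) is not None and new.get(k) is not None]
    if not shared:
        return True  # nothing comparable: treat as changed to be safe
    return any(old[k] != new[k] for k in shared)
```

If the server omits or misreports these headers, the comparison is not trustworthy and rescraping the asset itself remains the only reliable check.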

Development

Successfully merging this pull request may close these issues.

Create clean/ca/ventura_county_sheriff.py
3 participants