Fix Muckrock filename collisions #165

Open
stucka opened this issue Oct 30, 2024 · 4 comments

@stucka
Contributor

stucka commented Oct 30, 2024

Additional details

The Muckrock suite is working as planned, with one exception: some agencies send multiple files with the same filename, such as files.zip or images.png, in a single FOI package, just under different branches of the "communication" key in the Muckrock JSON response. Our scraper code then makes only one of those files available.

So a few tweaks of the API handling should get us there without too much of a lift:

Import the pathlib module for better filename handling

Build an empty dictionary to identify duplicates
Go through the communication['files'] entries
If a filename from the API is not yet in the dictionary, add it
Each time a filename appears, increase its count by 1
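
The tally described above could be sketched roughly as follows. This is a minimal sketch, not the scraper's actual code: the function name `tally_filenames` and the sample data are hypothetical, and it assumes each file entry carries its asset URL under the 'ffile' key mentioned later in this thread.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def tally_filenames(communications):
    """Count how often each bare filename appears across all communications."""
    counts = Counter()
    for communication in communications:
        for file_entry in communication.get("files", []):
            # Assumption: "ffile" holds the asset URL for this file entry.
            url_path = urlparse(file_entry["ffile"]).path
            counts[PurePosixPath(url_path).name] += 1
    return counts


# Hypothetical sample data shaped like the structure described above.
sample = [
    {"files": [{"ffile": "https://example.com/a/files.zip"}]},
    {
        "files": [
            {"ffile": "https://example.com/b/files.zip"},
            {"ffile": "https://example.com/b/images.png"},
        ]
    },
]
filename_counts = tally_filenames(sample)
```

A `Counter` handles the "add it if missing, then increment" steps in one operation, so a plain dictionary with an existence check is not strictly needed.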

In the existing parser, for communication in communications: should use enumerate to track which sequential entry each communication is.
Store that entry number within the CLEAN scraper properties.
If the filename appears in the duplicate dictionary with a count greater than 1, we need to append an identifier. For example, if entries 18 and 23 both had files.zip, the CLEAN filenames would become files_convo18.zip and files_convo23.zip using pathlib.
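
A rough sketch of that renaming step, assuming the duplicate tally already exists as a dictionary of filename counts. The helper name `clean_filename` and the sample data are hypothetical, and each communication is reduced to a plain list of filenames for brevity.

```python
from pathlib import PurePosixPath


def clean_filename(filename, entry_number, filename_counts):
    """Append the entry number only when the filename collides."""
    if filename_counts.get(filename, 0) <= 1:
        return filename
    path = PurePosixPath(filename)
    # stem and suffix split "files.zip" into "files" and ".zip".
    return f"{path.stem}_convo{entry_number}{path.suffix}"


# Hypothetical data: each communication is just a list of filenames here.
communications = [["files.zip"], ["report.pdf"], ["files.zip"]]
filename_counts = {"files.zip": 2, "report.pdf": 1}

clean_names = [
    clean_filename(name, entry_number, filename_counts)
    for entry_number, filenames in enumerate(communications)
    for name in filenames
]
```

Note that `enumerate` counts from 0 unless given a `start` argument, and that `suffix` only covers the last extension, so a name like archive.tar.gz would get the identifier inserted before .gz.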

We also need some accountability here, as @tarakc02 has identified:

confirm that len(set(asset_urls)) == len(set(local_filenames))

This approach should minimize the filename changes but allow everything to get saved on our end.

@stucka stucka self-assigned this Oct 30, 2024
@stucka
Contributor Author

stucka commented Oct 31, 2024

Proposed tweaks to this approach:
We've got the file uploaded datetime. I think it'd be better to rename these things like files_from_2022-09-30.zip if that works.
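
A minimal sketch of the date-based naming, assuming the upload datetime has already been parsed into a `datetime` object. The helper name `name_with_date` is hypothetical, and how the datetime is read from the API response is not shown.

```python
from datetime import datetime
from pathlib import PurePosixPath


def name_with_date(filename, uploaded):
    """Insert the upload date between the filename's stem and extension."""
    path = PurePosixPath(filename)
    return f"{path.stem}_from_{uploaded.date().isoformat()}{path.suffix}"


example = name_with_date("files.zip", datetime(2022, 9, 30, 14, 5))
```

Using only the date keeps names short, but two same-named files uploaded on the same day would still collide, which is the case the test below is meant to catch.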

I think we also want another test:
len(set(asset_urls)) == len(set(local_filenames)) == len(set(file_entities))
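
One way this check might look as a reusable function. This is a sketch under assumptions: the name `check_one_to_one` is hypothetical, and `file_entities` is assumed to be a list of hashable identifiers, one per file entry.

```python
def check_one_to_one(asset_urls, local_filenames, file_entities):
    """Raise ValueError unless all three collections have equally many distinct values."""
    sizes = {
        "asset_urls": len(set(asset_urls)),
        "local_filenames": len(set(local_filenames)),
        "file_entities": len(set(file_entities)),
    }
    if len(set(sizes.values())) != 1:
        raise ValueError(f"Distinct counts disagree: {sizes}")
    return sizes


# Hypothetical values for illustration only.
asset_urls = ["https://example.com/a/files.zip", "https://example.com/b/files.zip"]
local_filenames = ["files_from_2022-09-30.zip", "files_from_2022-10-04.zip"]
file_entities = [101, 102]
sizes = check_one_to_one(asset_urls, local_filenames, file_entities)
```

Raising an exception with all three counts in the message makes it easier to see which side of the comparison is off when a collision slips through.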

@tarakc02
Collaborator

tarakc02 commented Dec 4, 2024

^ yup i think that's right. within a datetime group, Path(asset_url).name is 1-1 with asset_url.
we sometimes have the same file (same asset url) given more than once within the same datetime group, but we don't need to download it twice
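
Skipping the repeated downloads could be sketched like this. The function name `unique_assets` and the sample entries are hypothetical, and the asset URL is assumed to live under the 'ffile' key of each file entry.

```python
def unique_assets(file_entries):
    """Keep the first file entry for each asset URL, preserving order."""
    seen = set()
    unique = []
    for entry in file_entries:
        url = entry["ffile"]  # Assumption: "ffile" is the asset URL.
        if url in seen:
            continue
        seen.add(url)
        unique.append(entry)
    return unique


# Hypothetical entries from a single datetime group.
entries = [
    {"ffile": "https://example.com/2023/05/30/photo.jpg"},
    {"ffile": "https://example.com/2023/05/30/video.mp4"},
    {"ffile": "https://example.com/2023/05/30/photo.jpg"},
]
to_download = unique_assets(entries)
```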

@stucka
Contributor Author

stucka commented Dec 4, 2024

Can you point me to an example of that repetition within a datetime group? That'd throw off my plans ...

@tarakc02
Collaborator

tarakc02 commented Dec 4, 2024

^ If you check out the Sutter County DA request, the communication from May 30 2023 contains 1,906 file entries. But it only has 1,884 distinct asset urls (the 'ffile' for each of the files). https://cdn.muckrock.com/foia_files/2023/05/30/IMG_6497.JPG is an example of an asset that shows up multiple times, as is https://cdn.muckrock.com/foia_files/2023/05/30/9_16_19_PURSUIT__2_VIDEO_990C.mp4
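
A quick way to reproduce that kind of count for one communication might look like the sketch below. The function name `repeated_assets` and the sample entries are hypothetical placeholders, not the real API response.

```python
from collections import Counter


def repeated_assets(file_entries):
    """Return each asset URL that appears more than once, with its count."""
    counts = Counter(entry["ffile"] for entry in file_entries)
    return {url: n for url, n in counts.items() if n > 1}


# Hypothetical file entries for one communication.
entries = [
    {"ffile": "https://example.com/2023/05/30/photo.jpg"},
    {"ffile": "https://example.com/2023/05/30/photo.jpg"},
    {"ffile": "https://example.com/2023/05/30/video.mp4"},
]
repeats = repeated_assets(entries)
total_entries = len(entries)
distinct_urls = len({entry["ffile"] for entry in entries})
```

Comparing `total_entries` with `distinct_urls` gives the same kind of gap described above between file entries and distinct asset URLs.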
