Standardizing `metadata` output #58

newsroomdev · 2024-08-02T19:46:50Z

newsroomdev
Aug 2, 2024
Maintainer

EDITED 08/13: case_num >> case_id for consistency

Hello!

As part of our ongoing efforts to maintain consistency and reliability with our data, we need to standardize the metadata dictionary that scrape_meta generates. Adopting a set specification with tests allows downstream consumers (i.e. scraper orchestration servers and data analysts) to store and retrieve assets from disparate websites.

This is an archival project in many ways. The following proposal aims to codify existing patterns while adapting to various websites contributors encounter.

Background

Goal: Streamline contributing and add more informative pull request checks. By adopting specific metadata key-value pairs, separate codebases can read the information and handle scraper orchestration and downloading.

The first step is to streamline contribution by creating a Site class with a public scrape_meta method. Please see https://github.com/biglocalnews/clean-scraper/blob/dev/docs/decisions/00-deprecate-scrape.md for more information. This method produces a Python list of dictionaries. Each dictionary contains shared, required information and additional distinctive information.

Proposal

Here is an example of expected JSON output going forward.

Required:

asset_url to download the respective assets
case_id to properly store the asset

Additional information, like parent_page is saved via a persistent storage server for bookkeeping. Please use details for further information to record.

[
    {
        "asset_url": "https://example.com/asset.wav",
        "case_id": "abc123",
        "name": "example_asset.wav",
        "parent_page": "agency/case_id/page.html",
        "title": "Example Title",
        "details": {
            "file_size": 9999,
            "date_modified": "2024-01-01T19:20:00+1:00"
            ...
        }
    },
]

Field Descriptions:

required asset_url (str): The URL to the asset. This is the most important field, ensuring we know what to download.
required case_id (str): A unique identifier for the case.
name (str): The asset's file name.
parent_page (str): The asset's parent page file path.
title (str): A title for the asset.
details (dict): An object containing additional details. Examples:
- file_size: The size of the file in bytes. (Useful for seeing if the file changed between runs)
- date_modified: The date and time when the file was last modified.

Why?

Standardizing helps in several ways:

Consistency: Ensures that all metadata objects are uniform, making it easier to process them programmatically.
Reliability: Reduces the chances of errors and inconsistencies in the data. We strive to
Efficiency: Simplifies integrating new data sources and maintaining existing ones.

Validation

If this proposal is adopted, pull request checks can test additional Site modules for the scrape_meta method and whether it returns a list of dictionaries that include asset_url and case_id. Additional options can be passed to these tests to avoid long-running GitHub Actions tasks while testing scrape_meta

Currently, contributors can check types before opening pull requests via pre-commits and/or VSCode plugins

VSCode plugins

Here are a few that I use. If adopted, we can merge in workplace settings with these plugins

code --install-extension ms-python.python && \
code --install-extension ms-python.black-formatter && \
code --install-extension ms-python.isort && \
code --install-extension ms-python.mypy-type-checker && \
code --install-extension ms-python.vscode-pylance

Feedback/Comments

Review the proposed schema and provide feedback.
Update your contributions to adopt this standard.
Use the provided validation tools to check code.

Thank you for your help in scaling up clean-scraper!

Best regards,
Gerald Rich
@biglocalnews

newsroomdev · 2024-08-02T20:18:50Z

newsroomdev
Aug 2, 2024
Maintainer Author

cc @zstumgoren @tarakc02

0 replies

stucka · 2024-08-02T20:29:53Z

stucka
Aug 2, 2024
Maintainer

I believe @newsroomdev said @pickoffwhite said some assets may belong to multiple cases; in that case, perhaps the case_num should be a list, and the first entry should be used as the definitive location for saving ... ? (If "first" we should explain ... how "first" is determined.)

filesize under details seems to miss the underscoring we're using elsewhere. file_size might be more appropriate.

We have no metadata schema for case information. At an initial pass, I lumped all that into the metatdata details section, which is rather repetitive but not the end of the world. If we go that way, I would recommend using a "case_" prefix for those variables such that it's clear whether things like date_modified refers to the asset or the case.

I don't know if those other "details" sections should be standardized in some way.

We should be clear about whether date_modified refers to the original asset's format (which makes comparisons perhaps a bit faster) or whether we're standardizing on what looks like ISO 8601.

1 reply

newsroomdev Aug 2, 2024
Maintainer Author

Using the same asset for different case numbers is an edge case that must be evaluated more thoroughly. Is the same asset using different URLs? How many instances has this occurred? Does this affect the 90th percentile of assets? Currently, each metadata dictionary is representative of an anchor link <a> on an agency page; multiple anchors to the same link would result in multiple metadata dictionaries with different case_id values and potentially the same or different asset_url strings.
Updated filesize >> file_size ✅
details is optional, allowing for flexibility and variation between sites while capturing information, and would require additional discussion and development after required standardization is checked. Note: Contributors are strongly encouraged to include type hinting and tests in their pull requests to assist other contributors in meeting spec 🙏

tarakc02 · 2024-08-02T20:40:27Z

tarakc02
Aug 2, 2024
Collaborator

regarding case-numbers, it's great to grab those if they're readily available on the site we're scraping from, but our code for processing and organizing files (that we'll still be running these files through) has logic for case number extraction from file content, so don't know if this should be required? and this goes to mike's question above about multiple cases also -- we want to let the downstream processing code figure these things out.
what do we do if there is no static asset_url (e.g. pages with dynamically loaded content)

16 replies

naumansharifwork Aug 8, 2024

@stucka got it I will change the approach to this, so we will use this link as a asset url and handle the case in the scrape phase to construct the pdf out of it.

stucka Aug 8, 2024
Maintainer

I spent entirely too much time trying to see if there was some way to access the native image file but I think that can be done only through the API, and API keys only go to clients.

For audio, at least, there's a simple download schema, e.g.: https://portal.laserfiche.com/Portal/ElectronicFile.aspx?docid=1504576&repo=r-3261686e&mode=Export ...

naumansharifwork Aug 13, 2024

Hey @stucka so shall we keep the pdf download as it is and change the audio and video to use direct link, or shall we make a function to first get all the image block links in meta phase and construct the pdf from them in scrape phase?

stucka Aug 13, 2024
Maintainer

@naumansharifwork Above my paygrade and I may be misunderstanding intent someone's intent here -- I've asked @newsroomdev to look.

newsroomdev Aug 13, 2024
Maintainer Author

hey @naumansharifwork! i think the URL @stucka mentioned earlier, https://portal.laserfiche.com/Portal/DocView.aspx?id=1502660&repo=r-3261686e will suffice as the asset_url for the scrape_meta method and then we can devise a solution for downloading it in the application layer. i'll make a ticket for it, but if there's any additional information that can help us interact with the Laserfiche that's worthwhile to include in details

tarakc02 · 2024-08-19T16:27:33Z

tarakc02
Aug 19, 2024
Collaborator

In some cases, especially with smaller departments, the site will mention that the case is being investigated by a different agency. Nauman highlighted the example of Fremont PD, see the cases for 2018. In these cases, should we include something like:

"related_agencies": "XYZ PD"

As one of the details? It would be connected to each asset. It's not directly connected to the scraping/archival, but might aid downstream file/case organization work.

3 replies

stucka Aug 19, 2024
Maintainer

@tarakc02 If you have that, I think it'd be superb to include it!

I drafted something this morning -- I have even less of an idea if any of this will ever come into use, but, what the heck, can't hurt. I don't even know if these fields are all found in other data sets using the system, or if other datasets will have different variables:

    for item in ["created_at", "description", "redacted_at", "doc_date", "highlights"]:
        if item in entry:
            line['details'][item] = entry[item]

newsroomdev Aug 26, 2024
Maintainer Author

That sounds like a good candidate for the nested metadata.details dictionary

stucka Aug 26, 2024
Maintainer

@newsroomdev If I may suggest something ... any project-specific details can go into details with whatever native names and values are available ... except perhaps for a set of standardized names and values where available, e.g., asset_created and asset_modified in ISO 8601 datetime format, or whatever. So in my example, created_at would remain with the original value, but I'd be encouraged to build out an asset_created field with the converted date.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standardizing `metadata` output #58

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments 20 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Standardizing metadata output #58

newsroomdev Aug 2, 2024 Maintainer

Background

Proposal

Field Descriptions:

Why?

Validation

VSCode plugins

Feedback/Comments

Replies: 4 comments · 20 replies

newsroomdev Aug 2, 2024 Maintainer Author

stucka Aug 2, 2024 Maintainer

newsroomdev Aug 2, 2024 Maintainer Author

tarakc02 Aug 2, 2024 Collaborator

naumansharifwork Aug 8, 2024

stucka Aug 8, 2024 Maintainer

naumansharifwork Aug 13, 2024

stucka Aug 13, 2024 Maintainer

newsroomdev Aug 13, 2024 Maintainer Author

tarakc02 Aug 19, 2024 Collaborator

stucka Aug 19, 2024 Maintainer

newsroomdev Aug 26, 2024 Maintainer Author

stucka Aug 26, 2024 Maintainer

Standardizing `metadata` output #58

newsroomdev
Aug 2, 2024
Maintainer

Replies: 4 comments 20 replies

newsroomdev
Aug 2, 2024
Maintainer Author

stucka
Aug 2, 2024
Maintainer

newsroomdev Aug 2, 2024
Maintainer Author

tarakc02
Aug 2, 2024
Collaborator

stucka Aug 8, 2024
Maintainer

stucka Aug 13, 2024
Maintainer

newsroomdev Aug 13, 2024
Maintainer Author

tarakc02
Aug 19, 2024
Collaborator

stucka Aug 19, 2024
Maintainer

newsroomdev Aug 26, 2024
Maintainer Author

stucka Aug 26, 2024
Maintainer