Harden WAF ETL pipeline #4598

Open
5 tasks
btylerburton opened this issue Jan 30, 2024 · 4 comments
Labels
H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0

Comments

@btylerburton

btylerburton commented Jan 30, 2024

User Story

In order to harvest WAF sources effectively and at scale, datagovteam would like to harden the current WAF ETL pipeline.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN [a contextual precondition]
    [AND optionally another precondition]
    WHEN [a triggering event] happens
    THEN [a verifiable outcome]
    [AND optionally another verifiable outcome]

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

  • add record partition logic into harvesting logic repo
  • benchmark and report metrics on traversal and download (how many files vs. how long each step took), plus total processing time
  • get number of WAF harvest sources
  • consider downloading the XML inside the traversal instead of in a separate function, depending on whether the performance impact is noticeable
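
A minimal sketch of how the benchmark item above could be reported, assuming the traversal and download steps are exposed as two callables. The names `time_stage`, `benchmark_waf`, `traverse`, and `download` are hypothetical and are not taken from the harvesting logic repo; the sketch only records item counts and wall-clock time per stage.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class StageMetrics:
    """Item count and wall-clock time for one pipeline stage."""
    name: str
    items: int
    seconds: float


def time_stage(name: str, func: Callable[[], List]) -> Tuple[List, StageMetrics]:
    """Run func, which must return a list, and record how long it took."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    return result, StageMetrics(name, len(result), elapsed)


def benchmark_waf(
    root_url: str,
    traverse: Callable[[str], List[str]],  # hypothetical: returns document URLs
    download: Callable[[str], bytes],      # hypothetical: fetches one document
) -> Dict[str, object]:
    """Time traversal and download separately, then report the total."""
    urls, traversal = time_stage("traversal", lambda: traverse(root_url))
    _, downloads = time_stage("download", lambda: [download(u) for u in urls])
    return {
        "traversal": traversal,
        "download": downloads,
        "total_seconds": traversal.seconds + downloads.seconds,
    }
```

Keeping the two stages as separate callables makes it possible to compare this layout against a variant that downloads inside the traversal, which is the trade-off the last sketch item raises.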
@btylerburton btylerburton converted this from a draft issue Jan 30, 2024
@btylerburton btylerburton changed the title [Placeholder] Harden WAF ETL pipeline Harden WAF ETL pipeline Jan 30, 2024
@rshewitt rshewitt moved this from 🏗 In Progress [8] to 📔 Product Backlog in data.gov team board Feb 7, 2024
@rshewitt rshewitt removed their assignment Feb 7, 2024
@rshewitt rshewitt moved this from 📔 Product Backlog to 🏗 In Progress [8] in data.gov team board Mar 25, 2024
@rshewitt rshewitt self-assigned this Mar 25, 2024
@rshewitt

noaa waf

@rshewitt

rshewitt commented Mar 26, 2024

Processing reached 12 hours for the NOAA WAF, so I stopped it (the conclusion being that it's going to take a while). I duplicated our WAF test but added a new fixture with an updated URL; I didn't commit anything. Considering how long it had been running, I didn't see the benefit of knowing exactly how much longer it would take. The bottleneck is requesting/downloading the documents. Requesting the initial page, parsing it with BeautifulSoup, and getting a list of all the anchors with a populated href attribute took 46 seconds (this is our WAF traversal function).

conclusion of test

  • how long did the process we control take? 46 seconds
  • how long did the process we don't control take? a long time
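
For reference, a rough sketch of the anchor-collection part of the traversal step described above. This is not the team's traversal function: it swaps BeautifulSoup for the standard-library `html.parser` so that it runs without dependencies, and the names `AnchorCollector` and `extract_links` are hypothetical. It takes the HTML of an already-fetched index page and returns every anchor with a populated `href`, resolved against the page URL.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag that has a non-empty href."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skips anchors with a missing or empty href
            self.hrefs.append(href)


def extract_links(html: str, base_url: str) -> List[str]:
    """Return absolute URLs for all populated anchors in an index page."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [urljoin(base_url, href) for href in parser.hrefs]
```

Fetching the page itself is left out on purpose, since the request/download side is the part identified as the bottleneck and is better measured separately.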

@rshewitt

json with list of all waf urls
waf_sources.json

@rshewitt

pausing on this. more discussion on waf needed.

@rshewitt rshewitt moved this from 🏗 In Progress [8] to 📔 Product Backlog in data.gov team board Mar 26, 2024
@btylerburton btylerburton moved this from 📔 Product Backlog to Harvester 2.0 in data.gov team board May 2, 2024
@btylerburton btylerburton moved this from H2.0 Backlog to 📥 Queue in data.gov team board Oct 10, 2024
@btylerburton btylerburton added the H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 label Nov 14, 2024