
144 - Add fetch records for collect when crawling. #154

Open · wants to merge 1 commit into base: main

Conversation

@luisgmetzger (Contributor) commented Mar 5, 2025

Haiku-length summary

URLs logged in streams
Counting fetches as they crawl
Data flows to store

Story

Additional details

  • This PR adds a call to the collect service from fetch, so that data is collected during a crawl.

Testing

  • To test out a crawl, you can use the following curl command:
 curl -X PUT "http://localhost:10001/api/fetch" \
     -H "Content-Type: application/json" \
     -d '{
       "scheme": "https",
       "host": "www.fac.gov",
       "path": "/",
       "api-key": "test-key",
       "data": { "id": "1", "source": "fetch", "payload": "sample-data" }
     }'
  • Expected JSON collected in data file in S3:
{"data":{"count":2,"id":"ed4f5ccd-7302-4999-9b50-82cae9a03608","payload":"default-payload","source":"fetch","url":"https://www.fac.gov/"}}

PR Checklist: Submitter

  • Link to an issue if possible. If there’s no issue, describe what your branch does. Even if there is an issue, a brief description in the PR is still useful.
  • List any special steps reviewers have to follow to test the PR. For example, adding a local environment variable, creating a local test file, etc.
  • For extra credit, submit a screen recording like this one.
  • Make sure you’ve merged main into your branch shortly before creating the PR. (You should also be merging main into your branch regularly during development.)
  • Make sure that whatever feature you’re adding has tests that cover the feature. This includes test coverage to make sure that the previous workflow still works, if applicable.
  • Make sure the E2E tests pass.
  • Do manual testing locally.
    • If that’s not applicable for some reason, check this box.
  • Once a PR is merged, keep an eye on it until it’s deployed to dev, and do enough testing on dev to verify that it deployed successfully, the feature works as expected, and the happy path for the broad feature area still works.

PR Checklist: Reviewer

  • Pull the branch to your local environment and run "make macup ; make e2e" (FIXME)
  • Manually test out the changes locally
    • Check this box if not applicable in this case.
  • Check that the PR has appropriate tests.

The larger the PR, the stricter we should be about these points.

@luisgmetzger luisgmetzger linked an issue Mar 5, 2025 that may be closed by this pull request
1 task
@jadudm (Contributor) left a comment


I forgot to commit the review with comments. Apologies. Feel free to grab ☕ to discuss any of the questions.

@@ -313,5 +315,42 @@ func (w *FetchWorker) Work(_ context.Context, job *river.Job[common.FetchArgs])
Path: job.Args.Path,
}

// Generate UUID
id := uuid.New().String()

We're going to want IDs to be unique globally, but constant. That is, we need to know that the id for this is the fetch_collect_count, not that it is a unique ID. In the schemas folder, I'd consider adding a constants.go that has the names of the ids, so we can keep them consistent. E.g.

var FetchCountSchemaId = "fetch_count"

or similar, one for each schema. This way, we can also refer to these in our conditionals when we are trying to figure out what schema to apply to the data payload.
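
The suggestion above might look like the following sketch of a hypothetical schemas/constants.go (written as a runnable example; the SchemaIdFor helper and any ID other than fetch_count are illustrative, not from the PR):

```go
package main

import "fmt"

// Stable, globally unique schema IDs, one per schema, kept in one place
// so producers and consumers stay consistent. (Hypothetical names.)
var (
	FetchCountSchemaId = "fetch_count"
)

// SchemaIdFor maps a data source to the schema ID that should be applied
// to its payload, so conditionals elsewhere refer to one source of truth.
func SchemaIdFor(source string) (string, bool) {
	ids := map[string]string{
		"fetch": FetchCountSchemaId,
	}
	id, ok := ids[source]
	return id, ok
}

func main() {
	id, ok := SchemaIdFor("fetch")
	fmt.Println(id, ok)
}
```

With constants like these, the record's id field carries a stable schema name rather than a fresh UUID per record.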

id := uuid.New().String()

// Create data to send to the `collect` service
collectData := map[string]interface{}{

Do you think it is possible to wrap lines 322 through 341 into a common helper function? That is, these all take a map[string]any and convert it into a JSON structure that we send to ChQSHP. However, another question: should we just pass the map[string]any over the channel, and let the process at the other end do this conversion? That is, we should be able to declare RawData to be of type map[string]any, and then pass these hash tables/maps/dictionaries over the channel directly. That way, when it gets to the other end, we can do this work in one place.


I had two ideas in there; I suspect the better idea is to pass the map over the channel, so that RawData is a map[string]any as opposed to type string. This saves us from doing work at every place we want to send data.
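
A minimal sketch of that second idea, assuming a Packet type whose RawData is a map[string]any (the Packet and drain names are illustrative, not the PR's code): senders put the map on the channel as-is, and the JSON conversion happens once at the receiving end.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Packet carries the raw map over the channel; no per-sender marshalling.
type Packet struct {
	RawData map[string]any
}

// drain receives one packet and does the JSON conversion in one place.
func drain(ch chan Packet) (string, error) {
	pkt := <-ch
	b, err := json.Marshal(pkt.RawData)
	return string(b), err
}

func main() {
	ch := make(chan Packet, 1)

	// Sender: just pass the map, no json.Marshal here.
	ch <- Packet{RawData: map[string]any{
		"id":     "example-id",
		"source": "fetch",
	}}

	s, err := drain(ch)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Note that encoding/json serializes map keys in sorted order, so the output shape is deterministic regardless of insertion order.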

@@ -7,7 +7,9 @@
"properties": {
"id": { "type": "string" },
"source": { "type": "string" },
"payload": { "type": "string" }
"payload": { "type": "string" },

More a question: what is the payload in this schema? If the payload is just a string that says "default-payload"... shouldn't that actually be the data?

"properties": {
  "id" ...,
  "payload": {
    "url": ...,
    "count": ...,
}

the payload is what is being carried by the data packet, no? If it isn't, then what is it for? Happy to pair if that doesn't make sense.

if err := json.Unmarshal([]byte(jsonString), &jsonData); err != nil {
zap.L().Error("failed to unmarshal JSON", zap.Error(err))

return nil, fmt.Errorf("deserializeJSON: failed to unmarshal input JSON: %w", err)
}

// Pull in IsFull and hallpass
isFull, _ := jsonData["IsFull"].(bool)
hallPass, _ := jsonData["hallpass"].(bool)

Why do we need to pull these out? Also, are they always present? I don't think they are. I would say that we need to have a case for why these are being singled out as opposed to all the other fields.
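
If the fields really are sometimes absent, the blank-identifier assertions above silently produce false for both "missing" and "present but false". A small helper (hypothetical; not in the PR) can make presence explicit via the comma-ok idiom:

```go
package main

import "fmt"

// boolField reports both the boolean value and whether the key was
// present as a bool, distinguishing "absent" from "present but false".
func boolField(m map[string]any, key string) (value, present bool) {
	v, ok := m[key]
	if !ok {
		return false, false
	}
	b, ok := v.(bool)
	return b, ok
}

func main() {
	// "hallpass" is deliberately absent here.
	jsonData := map[string]any{"IsFull": true}
	isFull, haveIsFull := boolField(jsonData, "IsFull")
	hallPass, haveHallPass := boolField(jsonData, "hallpass")
	fmt.Println(isFull, haveIsFull, hallPass, haveHallPass)
}
```

Whether these two fields deserve special-casing at all is still the open question; this only addresses the "are they always present?" part.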

@@ -7,7 +7,9 @@
"properties": {

Another question I have... which may be because I'm missing something. (These comments did not happen linearly...)

Why do we have an object with a "data" member, and under it is everything? Is there a reason we ended up with this nested design? Should id, source, payload all be at the top level, and payload contains the interesting data? Otherwise, we've just nested everything one level deep unnecessarily?
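For comparison, one possible flattened shape for the record above, with the per-crawl details under payload and no extra "data" wrapper (field layout is a suggestion, reusing values from the expected output earlier in this PR):

```json
{
  "id": "ed4f5ccd-7302-4999-9b50-82cae9a03608",
  "source": "fetch",
  "payload": {
    "url": "https://www.fac.gov/",
    "count": 2
  }
}
```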

Development

Successfully merging this pull request may close these issues.

Log every URL that is being fetched via collect service
2 participants