
144 - Add fetch records for collect when crawling. #154

Open · wants to merge 1 commit into base: main

Conversation

@luisgmetzger (Contributor) commented Mar 5, 2025

Haiku-length summary

URLs logged in streams
Counting fetches as they crawl
Data flows to store

Story

Additional details

  • This PR adds a call to the collect service from fetch, so that data is collected during a crawl.

Testing

  • To test out a crawl, you can use the following curl command:
 curl -X PUT "http://localhost:10001/api/fetch" \
     -H "Content-Type: application/json" \
     -d '{
       "scheme": "https",
       "host": "www.fac.gov",
       "path": "/",
       "api-key": "test-key",
       "data": { "id": "1", "source": "fetch", "payload": "sample-data" }
     }'
  • Expected JSON collected in data file in S3:
{"data":{"count":2,"id":"ed4f5ccd-7302-4999-9b50-82cae9a03608","payload":"default-payload","source":"fetch","url":"https://www.fac.gov/"}}

PR Checklist: Submitter

  • Link to an issue if possible. If there’s no issue, describe what your branch does. Even if there is an issue, a brief description in the PR is still useful.
  • List any special steps reviewers have to follow to test the PR. For example, adding a local environment variable, creating a local test file, etc.
  • For extra credit, submit a screen recording like this one.
  • Make sure you’ve merged main into your branch shortly before creating the PR. (You should also be merging main into your branch regularly during development.)
  • Make sure that whatever feature you’re adding has tests that cover the feature. This includes test coverage to make sure that the previous workflow still works, if applicable.
  • Make sure the E2E tests pass.
  • Do manual testing locally.
    • If that’s not applicable for some reason, check this box.
  • Once a PR is merged, keep an eye on it until it’s deployed to dev, and do enough testing on dev to verify that it deployed successfully, the feature works as expected, and the happy path for the broad feature area still works.

PR Checklist: Reviewer

  • Pull the branch to your local environment and run "make macup ; make e2e" (FIXME)
  • Manually test out the changes locally
    • Check this box if not applicable in this case.
  • Check that the PR has appropriate tests.

The larger the PR, the stricter we should be about these points.

@luisgmetzger luisgmetzger linked an issue Mar 5, 2025 that may be closed by this pull request
1 task
@jadudm (Contributor) left a comment


I forgot to commit the review with comments. Apologies. Feel free to grab ☕ to discuss any of the questions.

@@ -313,5 +315,42 @@ func (w *FetchWorker) Work(_ context.Context, job *river.Job[common.FetchArgs])
Path: job.Args.Path,
}

// Generate UUID
id := uuid.New().String()

We're going to want IDs to be unique globally, but constant. That is, we need to know that the id for this is the fetch_collect_count, not that it is a unique ID. In the schemas folder, I'd consider adding a constants.go that has the names of the ids, so we can keep them consistent. E.g.

var FetchCountSchemaId = "fetch_count"

or similar, one for each schema. This way, we can also refer to these in our conditionals when we are trying to figure out what schema to apply to the data payload.
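
The suggestion above might look like the following sketch of a hypothetical schemas/constants.go (written as a runnable example; the SchemaIdFor helper and any ID other than fetch_count are illustrative, not from the PR):

```go
package main

import "fmt"

// Stable, globally unique schema IDs, one per schema, kept in one place
// so producers and consumers stay consistent. (Hypothetical names.)
var (
	FetchCountSchemaId = "fetch_count"
)

// SchemaIdFor maps a data source to the schema ID that should be applied
// to its payload, so conditionals elsewhere refer to one source of truth.
func SchemaIdFor(source string) (string, bool) {
	ids := map[string]string{
		"fetch": FetchCountSchemaId,
	}
	id, ok := ids[source]
	return id, ok
}

func main() {
	id, ok := SchemaIdFor("fetch")
	fmt.Println(id, ok)
}
```

With constants like these, the record's id field carries a stable schema name rather than a fresh UUID per record.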

id := uuid.New().String()

// Create data to send to the `collect` service
collectData := map[string]interface{}{

Do you think it is possible to wrap lines 322 through 341 into a common helper function? That is, these all take a map[string]any and convert it into a JSON structure that we send to ChQSHP. However, another question: should we just pass the map[string]any over the channel, and let the process at the other end do this conversion? That is, we should be able to declare RawData to be of type map[string]any, and then pass these hash tables/maps/dictionaries over the channel directly. That way, when it gets to the other end, we can do this work in one place.


I had two ideas in there; I suspect the better idea is to pass the map over the channel, so that RawData is a map[string]any as opposed to type string. This saves us from doing work at every place we want to send data.
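
A minimal sketch of that second idea, assuming a Packet type whose RawData is a map[string]any (the Packet and drain names are illustrative, not the PR's code): senders put the map on the channel as-is, and the JSON conversion happens once at the receiving end.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Packet carries the raw map over the channel; no per-sender marshalling.
type Packet struct {
	RawData map[string]any
}

// drain receives one packet and does the JSON conversion in one place.
func drain(ch chan Packet) (string, error) {
	pkt := <-ch
	b, err := json.Marshal(pkt.RawData)
	return string(b), err
}

func main() {
	ch := make(chan Packet, 1)

	// Sender: just pass the map, no json.Marshal here.
	ch <- Packet{RawData: map[string]any{
		"id":     "example-id",
		"source": "fetch",
	}}

	s, err := drain(ch)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Note that encoding/json serializes map keys in sorted order, so the output shape is deterministic regardless of insertion order.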

@@ -7,7 +7,9 @@
"properties": {
"id": { "type": "string" },
"source": { "type": "string" },
"payload": { "type": "string" }
"payload": { "type": "string" },

More a question: what is the payload in this schema? If the payload is just a string that says "default-payload"... shouldn't that actually be the data?

"properties": {
  "id" ...,
  "payload": {
    "url": ...,
    "count": ...,
}

the payload is what is being carried by the data packet, no? If it isn't, then what is it for? Happy to pair if that doesn't make sense.

if err := json.Unmarshal([]byte(jsonString), &jsonData); err != nil {
zap.L().Error("failed to unmarshal JSON", zap.Error(err))

return nil, fmt.Errorf("deserializeJSON: failed to unmarshal input JSON: %w", err)
}

// Pull in IsFull and hallpass
isFull, _ := jsonData["IsFull"].(bool)
hallPass, _ := jsonData["hallpass"].(bool)

Why do we need to pull these out? Also, are they always present? I don't think they are. I would say that we need to have a case for why these are being singled out as opposed to all the other fields.
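
If the fields really are sometimes absent, the blank-identifier assertions above silently produce false for both "missing" and "present but false". A small helper (hypothetical; not in the PR) can make presence explicit via the comma-ok idiom:

```go
package main

import "fmt"

// boolField reports both the boolean value and whether the key was
// present as a bool, distinguishing "absent" from "present but false".
func boolField(m map[string]any, key string) (value, present bool) {
	v, ok := m[key]
	if !ok {
		return false, false
	}
	b, ok := v.(bool)
	return b, ok
}

func main() {
	// "hallpass" is deliberately absent here.
	jsonData := map[string]any{"IsFull": true}
	isFull, haveIsFull := boolField(jsonData, "IsFull")
	hallPass, haveHallPass := boolField(jsonData, "hallpass")
	fmt.Println(isFull, haveIsFull, hallPass, haveHallPass)
}
```

Whether these two fields deserve special-casing at all is still the open question; this only addresses the "are they always present?" part.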

@@ -7,7 +7,9 @@
"properties": {

Another question I have... which may be because I'm missing something. (These comments did not happen linearly...)

Why do we have an object with a "data" member, and under it is everything? Is there a reason we ended up with this nested design? Should id, source, payload all be at the top level, and payload contains the interesting data? Otherwise, we've just nested everything one level deep unnecessarily?
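For comparison, one possible flattened shape for the record above, with the per-crawl details under payload and no extra "data" wrapper (field layout is a suggestion, reusing values from the expected output earlier in this PR):

```json
{
  "id": "ed4f5ccd-7302-4999-9b50-82cae9a03608",
  "source": "fetch",
  "payload": {
    "url": "https://www.fac.gov/",
    "count": 2
  }
}
```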

Development

Successfully merging this pull request may close these issues.

Log every URL that is being fetched via collect service
2 participants