144 - Add fetch records for collect when crawling. #154
base: main
Conversation
I forgot to commit the review with comments. Apologies. Feel free to grab ☕ to discuss any of the questions.
```diff
@@ -313,5 +315,42 @@ func (w *FetchWorker) Work(_ context.Context, job *river.Job[common.FetchArgs])
 		Path: job.Args.Path,
 	}

+	// Generate UUID
+	id := uuid.New().String()
```
We're going to want IDs to be unique globally, but constant. That is, we need to know that the id for this is the `fetch_collect_count`, not that it is a unique ID. In the `schemas` folder, I'd consider adding a `constants.go` that has the names of the ids, so we can keep them consistent. E.g. `var FetchCountSchemaId = "fetch_count"` or similar, one for each schema. This way, we can also refer to these in our conditionals when we are trying to figure out what schema to apply to the data payload.
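Roughly what I'm picturing for that file (names are illustrative, not final; `const` rather than `var`, since these should never change at runtime):

```go
// Package schemas holds the JSON schemas and the shared identifiers
// that name them.
package schemas

// Schema IDs are global constants rather than per-record UUIDs, so every
// component refers to the same name when deciding which schema applies.
const (
	FetchCountSchemaId        = "fetch_count"
	FetchCollectCountSchemaId = "fetch_collect_count"
	// ...one constant per schema.
)
```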
```go
id := uuid.New().String()

// Create data to send to the `collect` service
collectData := map[string]interface{}{
```
Do you think it is possible to wrap lines 322 through 341 into a common helper function? That is, these all take a `map[string]any` and convert it into a JSON structure that we send to ChQSHP. However, another question: should we just pass the `map[string]any` over the channel, and let the process at the other end do this conversion? That is, we should be able to declare `RawData` to be of type `map[string]any`, and then pass these hash tables/maps/dictionaries over the channel directly. That way, when it gets to the other end, we can do this work in one place.
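A sketch of the first idea, the shared helper (the function name and the channel's element type are hypothetical, not what the repo uses):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sendAsJSON is a hypothetical helper that replaces the repeated
// marshal-and-send blocks: it converts a map to JSON once, in one
// place, and sends the result over the channel.
func sendAsJSON(ch chan<- string, data map[string]any) error {
	b, err := json.Marshal(data)
	if err != nil {
		return fmt.Errorf("sendAsJSON: %w", err)
	}
	ch <- string(b)
	return nil
}

func main() {
	ch := make(chan string, 1)
	if err := sendAsJSON(ch, map[string]any{"url": "https://example.gov", "count": 1}); err != nil {
		panic(err)
	}
	fmt.Println(<-ch)
}
```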
I had two ideas in there; I suspect the better idea is to pass the map over the channel, so that `RawData` is a `map[string]any` as opposed to type `string`. This saves us from doing work at every place we want to send data.
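And a sketch of that second, preferred idea, assuming a hypothetical message struct; the only real change is `RawData`'s type:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical message type. RawData carries the map itself rather than
// a pre-serialized string, so senders just build the map and the
// receiver does the JSON conversion in one place.
type Message struct {
	Key     string
	RawData map[string]any // previously: string
}

func main() {
	ch := make(chan Message, 1)

	// Sender side: no marshaling here anymore.
	ch <- Message{
		Key:     "fetch_count",
		RawData: map[string]any{"url": "https://example.gov", "count": 42},
	}

	// Receiver side: the single conversion point.
	msg := <-ch
	b, err := json.Marshal(msg.RawData)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```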
```diff
@@ -7,7 +7,9 @@
   "properties": {
     "id": { "type": "string" },
     "source": { "type": "string" },
-    "payload": { "type": "string" }
+    "payload": { "type": "string" },
```
More a question: what is the `payload` in this schema? If the payload is just a string that says `"default-payload"`... shouldn't that actually be the data?

```json
"properties": {
  "id": ...,
  "payload": {
    "url": ...,
    "count": ...,
  }
```

The payload is what is being carried by the data packet, no? If it isn't, then what is it for? Happy to pair if that doesn't make sense.
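In JSON Schema terms, that sketch might look something like this (the fields under `payload` are examples, not a spec):

```json
{
  "properties": {
    "id": { "type": "string" },
    "source": { "type": "string" },
    "payload": {
      "type": "object",
      "properties": {
        "url": { "type": "string" },
        "count": { "type": "integer" }
      }
    }
  }
}
```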
```go
if err := json.Unmarshal([]byte(jsonString), &jsonData); err != nil {
	zap.L().Error("failed to unmarshal JSON", zap.Error(err))

	return nil, fmt.Errorf("deserializeJSON: failed to unmarshal input JSON: %w", err)
}

// Pull in IsFull and hallpass
isFull, _ := jsonData["IsFull"].(bool)
hallPass, _ := jsonData["hallpass"].(bool)
```
Why do we need to pull these out? Also, are they always present? I don't think they are. I would say that we need to have a case for why these are being singled out as opposed to all the other fields.
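If they do need to stay, a comma-ok check would at least make the optional-field behavior explicit rather than silently defaulting (a runnable sketch using the field names from the snippet above):

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var jsonData map[string]any
	if err := json.Unmarshal([]byte(`{"IsFull": true}`), &jsonData); err != nil {
		panic(err)
	}

	// Comma-ok distinguishes "absent or wrong type" from "present and
	// false", instead of discarding that information with the blank
	// identifier.
	isFull, ok := jsonData["IsFull"].(bool)
	if !ok {
		fmt.Println("IsFull missing or not a bool; defaulting to false")
	}
	hallPass, ok := jsonData["hallpass"].(bool)
	if !ok {
		fmt.Println("hallpass missing or not a bool; defaulting to false")
	}
	fmt.Println(isFull, hallPass)
}
```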
```diff
@@ -7,7 +7,9 @@
   "properties": {
```
Another question I have... which may be because I'm missing something. (These comments did not happen linearly...)

Why do we have an object with a `"data"` member, and under it is everything? Is there a reason we ended up with this nested design? Should `id`, `source`, and `payload` all be at the top level, with `payload` containing the interesting data? Otherwise, we've just nested everything one level deep unnecessarily.
Haiku-length summary
URLs logged in streams
Counting fetches as they crawl
Data flows to store
Story

Additional details

Testing

- `data` file in S3:

PR Checklist: Submitter

- Merge `main` into your branch shortly before creating the PR. (You should also be merging `main` into your branch regularly during development.)

PR Checklist: Reviewer

- Run `make macup ; make e2e` (FIXME)

The larger the PR, the stricter we should be about these points.