-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
144 - Add fetch records for collect when crawling. #154
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,6 +4,7 @@ package main | |
import ( | ||
"context" | ||
_ "embed" | ||
"encoding/json" | ||
"fmt" | ||
"math" | ||
"net/url" | ||
|
@@ -21,6 +22,7 @@ import ( | |
"github.com/GSA-TTS/jemison/internal/postgres/work_db" | ||
"github.com/GSA-TTS/jemison/internal/queueing" | ||
"github.com/GSA-TTS/jemison/internal/util" | ||
"github.com/google/uuid" | ||
"github.com/jackc/pgx/v5/pgtype" | ||
"github.com/riverqueue/river" | ||
"go.uber.org/zap" | ||
|
@@ -313,5 +315,42 @@ func (w *FetchWorker) Work(_ context.Context, job *river.Job[common.FetchArgs]) | |
Path: job.Args.Path, | ||
} | ||
|
||
// Generate UUID | ||
id := uuid.New().String() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We're going to want IDs to be unique globally, but constant. That is, we need to know that the id for this is the
or similar, one for each schema. This way, we can also refer to these in our conditionals when we are trying to figure out what schema to apply to the data payload. |
||
|
||
// Create data to send to the `collect` service | ||
collectData := map[string]interface{}{ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you think it is possible to wrap lines 322 through 341 into a common helper function? That is, these all take a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had two ideas in there; I suspect the better idea is to pass the map over the channel, so that |
||
"data": map[string]interface{}{ | ||
"id": id, | ||
"source": "fetch", | ||
"payload": "default-payload", | ||
"url": hostAndPath(job), // Full URL being fetched | ||
"count": fetchCount.Load(), // Total count of fetched URLs | ||
}, | ||
} | ||
|
||
// Marshal the data to JSON format | ||
collectJSON, err := json.Marshal(collectData) | ||
if err != nil { | ||
// Wrap the error with additional context | ||
wrappedErr := fmt.Errorf("failed to marshal collect data to JSON: %w", err) | ||
zap.L().Error(wrappedErr.Error(), zap.Error(err)) | ||
|
||
return wrappedErr | ||
} | ||
|
||
// Enqueue the data to the `collect` queue | ||
ChQSHP <- queueing.QSHP{ | ||
Queue: "collect", | ||
Scheme: job.Args.Scheme, | ||
Host: job.Args.Host, | ||
Path: job.Args.Path, | ||
RawData: string(collectJSON), // Include data for S3 logging | ||
} | ||
|
||
zap.L().Info("Logged URL to collect service", | ||
zap.String("url", hostAndPath(job)), | ||
zap.Int64("total_count", fetchCount.Load())) | ||
|
||
return nil | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,7 +7,9 @@ | |
"properties": { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another question I have... which may be because I'm missing something. (These comments did not happen linearly...) Why do we have an object with a |
||
"id": { "type": "string" }, | ||
"source": { "type": "string" }, | ||
"payload": { "type": "string" } | ||
"payload": { "type": "string" }, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. More a question: what is the
the payload is what is being carried by the data packet, no? If it isn't, then what is it for? Happy to pair if that doesn't make sense. |
||
"url": { "type": "string" }, | ||
"count": { "type": "integer" } | ||
}, | ||
"required": ["id", "source", "payload"] | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to pull these out? Also, are they always present? I don't think they are. I would say that we need to have a case for why these are being singled out as opposed to all the other fields.