Ingest tool #30
Conversation
Added TruncateAllTables to db (sketch below).
The processor now relies on the goroutine mechanism to find the right balance between the number of goroutines and bottlenecks (worker-pool sketch below).
Check that there are no clashes among batches (per common name).
Partially use the calls in updater and map updater to serialize the set of certificates and store it in the DB.
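A minimal sketch of what a TruncateAllTables helper might look like; the receiver type and the table names here are assumptions for illustration, not the PR's actual code:

```go
package ingest

import "database/sql"

// mysqlDB is a hypothetical wrapper around the DB handle (illustrative only).
type mysqlDB struct {
	db *sql.DB
}

// TruncateAllTables empties every table used by the ingest pipeline.
// The table names below are placeholders; the repository's list may differ.
func (c *mysqlDB) TruncateAllTables() error {
	for _, table := range []string{"certs", "domainEntries", "tree", "updates"} {
		if _, err := c.db.Exec("TRUNCATE " + table); err != nil {
			return err
		}
	}
	return nil
}
```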
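And a sketch of the worker-pool pattern the processor uses to balance goroutine count against bottlenecks; Batch and processBatch are hypothetical names, not identifiers from this PR:

```go
package ingest

import "sync"

// Batch is a placeholder for a set of certificates processed together.
type Batch struct{ /* ... */ }

// processBatches fans batches out to numWorkers goroutines. Tuning
// numWorkers trades goroutine overhead against DB write bottlenecks.
func processBatches(batches <-chan *Batch, numWorkers int) {
	var wg sync.WaitGroup
	wg.Add(numWorkers)
	for i := 0; i < numWorkers; i++ {
		go func() {
			defer wg.Done()
			for b := range batches {
				processBatch(b) // per-batch work: parse, check clashes, insert
			}
		}()
	}
	wg.Wait()
}

func processBatch(b *Batch) { /* hypothetical per-batch processing */ }
```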
The only issue I see is that SAN entries are ignored when checking for clashes. The rest looks good.
The pipeline is not efficient: we read certificates, write them, read them again, update the SMT, and write the SMT. Also, the SMT updater can't update the records in parallel.
Reviewable status: 0 of 16 files reviewed, 3 unresolved discussions / 0 of 1 LGTMs obtained / 0 of 1 approvals obtained
cmd/ingest/batch.go
line 21 at r1 (raw file):
Previously, cyrill-k wrote…
Should be `[][]*string`, since each certificate can contain multiple domains (CN + SANs).
Done. It is, though, a regular slice of pointers to string. We just collect all names in this batch here.
cmd/ingest/batch.go
line 36 at r1 (raw file):
Previously, cyrill-k wrote…
Here we are ignoring SAN entries. You could use the `extractCertDomains` function to extract CN + SAN.
Done. I can't use that exact function because we store pointers to string in the collection instead of strings, but it does the same thing.
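For context, a sketch of what a CN + SAN extraction helper of this kind could look like; the repository's actual extractCertDomains may differ:

```go
package ingest

import "crypto/x509"

// extractCertDomains returns the unique set of names a certificate covers:
// the subject Common Name plus all DNS SAN entries.
func extractCertDomains(cert *x509.Certificate) []string {
	seen := make(map[string]struct{})
	domains := make([]string, 0, 1+len(cert.DNSNames))
	for _, name := range append([]string{cert.Subject.CommonName}, cert.DNSNames...) {
		if name == "" {
			continue // skip certificates without a CN
		}
		if _, ok := seen[name]; !ok {
			seen[name] = struct{}{}
			domains = append(domains, name)
		}
	}
	return domains
}
```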
cmd/ingest/batch.go
line 174 at r1 (raw file):
Previously, cyrill-k wrote…
We are currently ignoring SAN entries (see comments above), which means there could be undetected clashes between batches, because the insertion takes SAN entries into account (see line 117, which calls `GetAffectedDomainAndCertMap`). This should become a nested loop once the above is fixed.
Done. TAL
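A sketch of the nested loop suggested above, operating on the `[][]*string` collection (all names per certificate, CN + SANs); the identifiers are illustrative, not the PR's actual code:

```go
package ingest

// batchesClash reports whether any domain name appears in both batches,
// taking every name (CN + SANs) of every certificate into account.
func batchesClash(namesA, namesB [][]*string) bool {
	seen := make(map[string]struct{})
	for _, certNames := range namesA {
		for _, n := range certNames {
			seen[*n] = struct{}{}
		}
	}
	for _, certNames := range namesB {
		for _, n := range certNames {
			if _, ok := seen[*n]; ok {
				return true // same name in both batches: potential clash
			}
		}
	}
	return false
}
```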
The PR looks good to me.
I only suggest a minor change regarding consistent use of file paths, and I was wondering whether you want to remove the "deleteme ..." debug print statements before merging or keep them.
If the amount of data exceeds 1 GB, MySQL will complain with a fatal error, preventing the transaction from completing. Count the bytes and, if the batch is bigger than 1 GB, split it in two.
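A sketch of the byte-counting split, assuming rows are already serialized to bytes; the PR's actual implementation may differ:

```go
package ingest

// maxPacket is MySQL's hard upper bound for max_allowed_packet: 1 GB.
const maxPacket = 1 << 30

// splitIfTooBig recursively halves a batch of serialized rows until every
// chunk fits into a single MySQL packet.
func splitIfTooBig(rows [][]byte) [][][]byte {
	total := 0
	for _, r := range rows {
		total += len(r)
	}
	if total <= maxPacket || len(rows) < 2 {
		return [][][]byte{rows}
	}
	mid := len(rows) / 2
	return append(splitIfTooBig(rows[:mid]), splitIfTooBig(rows[mid:])...)
}
```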
… tests have succeeded.
MySQL cannot accept packets bigger than 1 GB on a connection. Work around that by inserting huge leaves via direct CSV file insertion.
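A hedged sketch of the CSV workaround, assuming the go-sql-driver/mysql driver; the table name and CSV layout are placeholders, not the PR's actual schema:

```go
package ingest

import (
	"database/sql"

	"github.com/go-sql-driver/mysql"
)

// insertViaCSV bulk-loads rows from a CSV file with LOAD DATA LOCAL INFILE,
// bypassing the 1 GB packet limit that a huge INSERT would hit.
func insertViaCSV(db *sql.DB, csvPath string) error {
	// Whitelist the file so the driver allows LOCAL INFILE for it.
	mysql.RegisterLocalFile(csvPath)
	defer mysql.DeregisterLocalFile(csvPath)

	_, err := db.Exec(
		`LOAD DATA LOCAL INFILE '` + csvPath + `' INTO TABLE tree
		 FIELDS TERMINATED BY ',' ENCLOSED BY '"'
		 LINES TERMINATED BY '\n'`)
	return err
}
```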
This reverts commit 9c34e2a.
This now closes #32
Reviewable status: 0 of 24 files reviewed, 5 unresolved discussions / 0 of 1 LGTMs obtained / 0 of 1 approvals obtained
cmd/ingest/main.go
line 63 at r3 (raw file):
Previously, cyrill-k wrote…
I would use `filepath.Join` consistently:
`gzFiles, err = filepath.Glob(filepath.Join(d, "*.gz"))`
Done.
cmd/ingest/main.go
line 65 at r3 (raw file):
Previously, cyrill-k wrote…
Same:
`csvFiles, err = filepath.Glob(filepath.Join(dir, "*.csv"))`
Done.
This PR looks good to me.
I suggested a minor change to print out the resulting root value, since it is not persisted in the DB.
See PR #36 (once it is merged, you can merge this PR from my side).
Reviewed 11 of 16 files at r1, 4 of 9 files at r3, 7 of 9 files at r6, 2 of 2 files at r7, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @juagargi)
We need an ingest tool that allows us to load local files into the DB.
This PR addresses #28 (but is still WIP).
The design of the DB structure disallows concurrent updates, which limits how efficient the data ingestion can be.
For now, I submit this PR that provides an initial glimpse of how the pipeline could be constructed.
The operations in the pipeline themselves will have to be changed after fixing #29.
TODO: