
Ingest tool #30

Merged 26 commits into master on Mar 16, 2023

Conversation

juagargi
Member

@juagargi juagargi commented Jan 23, 2023

We need an ingest tool that allows us to input local files to the DB.

This PR addresses #28 (but WIP).
The design of the DB structure disallows concurrent updates, which limits how efficient data ingestion can be.
For now, I submit this PR that provides an initial glimpse of how the pipeline could be constructed.
The operations in the pipeline themselves will have to be changed after fixing #29.

TODO:

  • Add a SMT updater.
  • Use the SMT updater on each batch.
  • Store the SMT changes in DB.


Added TruncateAllTables to db.
The processor now relies on the goroutine mechanism to find the
right balance between the number of goroutines and the bottlenecks.
Check there are no clashes among batches (per common name).
Partially use the calls in updater and map updater to serialize
the set of certificates and store it in the DB.
Collaborator

@cyrill-k cyrill-k left a comment


The only issue I see is that SAN entries are ignored when checking for clashes. The rest looks good.

The pipeline is not efficient: we read the certificates, write them,
read them again, update the SMT, and write the SMT.
Also, the SMT updater can't update the records in parallel.
@juagargi juagargi marked this pull request as ready for review January 24, 2023 16:34
Member Author

@juagargi juagargi left a comment


Reviewable status: 0 of 16 files reviewed, 3 unresolved discussions / 0 of 1 LGTMs obtained / 0 of 1 approvals obtained


cmd/ingest/batch.go line 21 at r1 (raw file):

Previously, cyrill-k wrote…

Should be [][]*string since each certificate can contain multiple domains (CN + SANs)

Done. It is, though, a flat slice of pointers to string: we just collect all the names in this batch here.


cmd/ingest/batch.go line 36 at r1 (raw file):

Previously, cyrill-k wrote…

Here we are ignoring SAN entries. You could use the extractCertDomains function to extract CN + SAN.

Done. I can't use that exact function because we store pointers in the collection instead of strings, but it does the same thing.


cmd/ingest/batch.go line 174 at r1 (raw file):

Previously, cyrill-k wrote…

We are currently ignoring SAN entries (see comments above), which means there could be undetected clashes between batches, because the insertion takes SAN entries into account (see line 117, which calls GetAffectedDomainAndCertMap). This should become a nested loop once the above is fixed.

Done. TAL

Collaborator

@cyrill-k cyrill-k left a comment


The PR looks good to me.
I only suggest a minor change regarding consistent use of file paths, and I was wondering whether you want to remove the "deleteme ..." debug print statements before merging or keep them.

juagargi and others added 12 commits February 23, 2023 14:55
If the amount of data exceeds 1 GB, MySQL will fail with a fatal
error, preventing the transaction from completing.
Count the bytes and, if the batch is larger than 1 GB, split it in two.
MySQL cannot accept packets larger than 1 GB on a single connection.
Work around that by inserting huge leaves via direct CSV file insertion.
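The batch-splitting commit above could be sketched as follows. This is a minimal illustration under stated assumptions: `splitBySize` and the `[][]byte` row representation are hypothetical, and the real threshold would be MySQL's `max_allowed_packet` ceiling of 1 GB mentioned in the commit message.

```go
package main

import "fmt"

// maxPacketBytes mirrors MySQL's 1 GB max_allowed_packet ceiling.
const maxPacketBytes = 1 << 30

// splitBySize recursively halves a batch of serialized rows until no
// chunk exceeds the packet limit. A single oversized row is left as
// its own chunk (that case is what the CSV direct insertion handles).
func splitBySize(rows [][]byte, limit int) [][][]byte {
	total := 0
	for _, r := range rows {
		total += len(r)
	}
	if total <= limit || len(rows) <= 1 {
		return [][][]byte{rows}
	}
	mid := len(rows) / 2
	return append(splitBySize(rows[:mid], limit), splitBySize(rows[mid:], limit)...)
}

func main() {
	// Three 600-byte rows against a 1000-byte limit: no pair fits,
	// so the batch ends up as three single-row chunks.
	rows := [][]byte{make([]byte, 600), make([]byte, 600), make([]byte, 600)}
	chunks := splitBySize(rows, 1000)
	fmt.Println(len(chunks))
}
```

Each resulting chunk then fits in one transaction; leaves that are too big even on their own fall back to the CSV file insertion path from the second commit.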
Member Author

juagargi commented Mar 9, 2023

This now closes #32

Member Author

@juagargi juagargi left a comment


Reviewable status: 0 of 24 files reviewed, 5 unresolved discussions / 0 of 1 LGTMs obtained / 0 of 1 approvals obtained


cmd/ingest/main.go line 63 at r3 (raw file):

Previously, cyrill-k wrote…

I would use filepath.Join consistently:

			gzFiles, err = filepath.Glob(filepath.Join(d, "*.gz"))

Done.


cmd/ingest/main.go line 65 at r3 (raw file):

Previously, cyrill-k wrote…

Same:

			csvFiles, err = filepath.Glob(filepath.Join(dir, "*.csv"))

Done.

Collaborator

@cyrill-k cyrill-k left a comment


This PR looks good to me.
I suggested a minor change to print out the resulting root value, since it is not persisted in the DB.
See PR #36 (once it is merged, you can merge this PR from my side).

Collaborator

@cyrill-k cyrill-k left a comment


Reviewed 11 of 16 files at r1, 4 of 9 files at r3, 7 of 9 files at r6, 2 of 2 files at r7, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @juagargi)

@juagargi juagargi merged commit f17ad67 into master Mar 16, 2023
@juagargi juagargi deleted the juagargi/ingest_tool branch March 16, 2023 14:04
@juagargi juagargi mentioned this pull request Mar 24, 2023