Add optional progress information and other enhancements for restore-dump command #910
base: main
Conversation
When working with a larger restore using the `restore-dump` command, by default there is only the "Restoring database ..." indicator, which provides minimal insight into how things are progressing. On the other hand, if you choose to add the `--debug` flag, the amount of information can then be too much, as the large `INSERT` queries are also included in the output. The new `--show-details` flag optionally allows users to see more easily how their restore is going.

Additionally, a `--start-from` flag was added that can be provided with a table name. This allows a user to skip earlier tables and start the import from a later point without having to create a separate copy of their dump folder and manually remove those files.

The new `--allow-different-destination` flag primarily helps simplify the process of taking a folder that was created for one database, e.g. where the files contain a prefix for "first-database", and restoring it into a second database without having to adjust the database prefix on all of those existing files.

Also incorporated is a check to prevent an issue for custom generated files where users might accidentally provide an `INSERT` query that is larger than 16777216 bytes. This provides some useful feedback rather than the `pkt` error that was being received previously.
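For reference, a minimal sketch of how flags like these could be registered on the command using cobra; the flag names come from this PR, while the Go types, function name, and help text below are hypothetical:

```go
package sketch

import "github.com/spf13/cobra"

// restoreFlags and registerRestoreFlags are hypothetical names used only to
// illustrate the new options; the CLI's actual wiring may differ.
type restoreFlags struct {
	showDetails               bool
	startFrom                 string
	allowDifferentDestination bool
}

func registerRestoreFlags(cmd *cobra.Command, f *restoreFlags) {
	// Middle ground between the default output and --debug.
	cmd.Flags().BoolVar(&f.showDetails, "show-details", false,
		"show file and query progress while restoring")
	// Skip tables that sort before the provided table name.
	cmd.Flags().StringVar(&f.startFrom, "start-from", "",
		"table name to start the restore from")
	// Allow filenames prefixed with a different database name.
	cmd.Flags().BoolVar(&f.allowDifferentDestination, "allow-different-destination", false,
		"allow restoring a dump created for a differently named database")
}
```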
From what I can tell this made no functional change and mainly allowed the output I was adding with the new `--show-details` flag to show information such as "Thread 1" rather than "Thread 0" without having to manually add one to the value elsewhere.
BuildKite was failing due to the way I was passing the helper used for printing, so I made some adjustments and simplified how that worked. Afterward I added two more optional flags: `--schema-only` and `--data-only`. If both are used at the same time, that's essentially the equivalent of a normal restore where both are included. The new `--start-from` parameter can also be used together with these.
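A tiny sketch of the selection logic implied here, with hypothetical names: neither flag, or both flags, restores both kinds of files; otherwise only the matching kind is restored:

```go
package sketch

// restoreTargets decides which file kinds to restore from the two flags.
// If neither or both flags are set, schema and data are both included.
func restoreTargets(schemaOnly, dataOnly bool) (restoreSchema, restoreData bool) {
	restoreSchema = schemaOnly || !dataOnly
	restoreData = dataOnly || !schemaOnly
	return restoreSchema, restoreData
}
```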
I was thinking about this some more and it makes sense to allow this option to be more comprehensive: not just including the tables >= a starting table, but also the tables <= an ending table, as well as being able to provide a more specific inclusive starting/ending table range, so I made some adjustments. The last commit removes the `--start-from` option and replaces it with the `--starting-table` and `--ending-table` options, which can be used separately or together.

Examples: If not using the options at all then all tables should be included as they normally would be. If the ending table provided comes before the starting table alphabetically then the restore will short circuit until that is corrected. This example skips the last two tables since the table range being provided will only include the first two:
For the case above, running the same command without the use of
The opposite however, starting from the
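Stepping back to the `--starting-table` / `--ending-table` behavior described above, here is a rough sketch of the inclusive range filtering; only those two option names come from this PR, and the function and parameter names are hypothetical:

```go
package sketch

import (
	"fmt"
	"sort"
)

// filterTables keeps the table names in the inclusive [startingTable, endingTable]
// range. Empty values leave that side of the range open, and an ending table
// that sorts before the starting table is rejected up front so the restore
// can short circuit.
func filterTables(tables []string, startingTable, endingTable string) ([]string, error) {
	if startingTable != "" && endingTable != "" && endingTable < startingTable {
		return nil, fmt.Errorf("ending table %q sorts before starting table %q", endingTable, startingTable)
	}
	sort.Strings(tables)
	var selected []string
	for _, t := range tables {
		if startingTable != "" && t < startingTable {
			continue
		}
		if endingTable != "" && t > endingTable {
			continue
		}
		selected = append(selected, t)
	}
	return selected, nil
}
```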
Should address these two that were noted:

[2024-09-28T00:50:36Z] internal/cmd/database/restore.go:75:10: error strings should not be capitalized (ST1005)
[2024-09-28T00:50:36Z] internal/cmd/database/restore.go:75:10: error strings should not end with punctuation or newlines (ST1005)
Rather than having the 16777216 byte limit be completely hard coded, I thought it might be better to allow it to be adjusted in rare cases, since it may sometimes be configured differently. Generally though, the 16 MiB default will be the limit encountered by our users. Note that this option only affects the quick length check performed prior to running a query and doesn't / cannot adjust the system-configured message size limit for queries.
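A sketch of what that pre-flight check could look like, with hypothetical names; it only validates the statement's byte length before sending it and cannot change the server-configured limit:

```go
package sketch

import "fmt"

// defaultMaxQueryBytes mirrors the common 16 MiB max_allowed_packet default.
const defaultMaxQueryBytes = 16777216

// checkQuerySize validates the statement size before sending it, so users get
// a descriptive error instead of a low-level packet error from the server.
func checkQuerySize(query string, maxBytes int) error {
	if maxBytes <= 0 {
		maxBytes = defaultMaxQueryBytes
	}
	if len(query) > maxBytes {
		return fmt.Errorf("query is %d bytes, which exceeds the maximum allowed size of %d bytes; consider splitting the INSERT into smaller statements", len(query), maxBytes)
	}
	return nil
}
```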
internal/dumper/pool.go
Outdated
@@ -40,7 +40,7 @@ func (conn *Connection) StreamFetch(query string) (driver.Rows, error) {
 // NewPool creates the new pool.
 func NewPool(log *zap.Logger, cap int, address string, user string, password string, vars []string, database string) (*Pool, error) {
 	conns := make(chan *Connection, cap)
-	for i := 0; i < cap; i++ {
+	for i := 1; i <= cap; i++ {
Any reason for this change? Starting from 0 and using `<` is the canonical style.
I had mixed feelings on making that change since I did think I was likely adjusting the canonical approach on that one. The main reason for the change was that it allowed me to make the change in one spot so that the extra `--show-details` output would match up a little more nicely with the number of threads the user may have provided when running the command, but I can also do a +1 adjustment within the individual output lines instead.

For example, with 4 threads provided, the output currently shows the 1-4 values for the threads as shown below rather than 0-3, which might be a little nicer for the user to see:
Started Processing Data File: <DATABASE>.<TABLE>.00002.sql in thread 4 (File 2 of 2993)
Started Processing Data File: <DATABASE>.<TABLE>.00004.sql in thread 2 (File 4 of 2993)
Started Processing Data File: <DATABASE>.<TABLE>.00001.sql in thread 3 (File 1 of 2993)
Started Processing Data File: <DATABASE>.<TABLE>.00003.sql in thread 1 (File 3 of 2993)
Processing Query 1 out of 1 within <DATABASE>.<TABLE>.00002.sql in thread 4
Processing Query 1 out of 1 within <DATABASE>.<TABLE>.00001.sql in thread 3
Processing Query 1 out of 1 within <DATABASE>.<TABLE>.00003.sql in thread 1
Processing Query 1 out of 1 within <DATABASE>.<TABLE>.00004.sql in thread 2
I'm happy to switch back to the canonical style in `pool.go` though, and either allow the output above to show 0-3 instead, or make some minor adjustments to the output so that the potentially friendlier 1-4 output would still be shown for the thread values.
I've reverted this back to the canonical style in `NewPool`. The only remaining question would be whether I should leave things as-is for `--show-details` with the zero-based output like this:
Started Processing Data File: metrics.connections.00001.sql in thread 2 (File 1 of 5)
Started Processing Data File: metrics.connections.00004.sql in thread 1 (File 4 of 5)
Started Processing Data File: metrics.connections.00002.sql in thread 3 (File 2 of 5)
Started Processing Data File: metrics.connections.00003.sql in thread 0 (File 3 of 5)
Or whether I should additionally tweak the output so it appears as 1-based.
If it's fine as-is then no additional updates should be needed.
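For reference, a minimal sketch of the second option mentioned above, with hypothetical names: keep the canonical zero-based loop and add one only where the thread number is printed:

```go
package sketch

import "fmt"

// logFileStart shows the compromise: the pool loop stays zero-based
// (for i := 0; i < cap; i++), and one is added only in the --show-details
// output where the thread number is displayed to the user.
func logFileStart(fileName string, threadIndex, fileNumber, totalFiles int) {
	fmt.Printf("Started Processing Data File: %s in thread %d (File %d of %d)\n",
		fileName, threadIndex+1, fileNumber, totalFiles)
}
```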
internal/dumper/loader.go
Outdated
//for i := range files.tables {
//	j := rand.Intn(i + 1)
//	files.tables[i], files.tables[j] = files.tables[j], files.tables[i]
//}
I don't know if MySQL gives any guarantees about table ordering? So the start / end table logic is maybe not something very reliable to add. Otherwise we might need to change the query that lists the tables so we can enforce the order there? That means changing it to use the `information_schema` tables then.

In any case, if we remove code, we should remove it, not comment it 😄. We have `git` to track things and find old code 😉.
I appreciate the input @dbussink ❤️!
Is your first sentence related to the `--starting-table` and `--ending-table` options? Those primarily help drive the list of files that are being picked up from within the dump folder and mainly provide some extra control over which files are going to be selected and used for the current `restore-dump` operation, so the MySQL table ordering shouldn't necessarily factor in with those options?

I mainly see the `--starting-table`, `--ending-table` and several of the other new options added with this PR helping to avoid needing to create multiple copies of a dump folder, since normally you might have to create a new copy of an existing dump folder and then perform operations on it to keep just the tables / files you want to restore if you want to restore less than the full dataset.

As for the shuffling-related code I commented out, its purpose was initially confusing for me since I thought it did something different and I was unsure what the original rationale for adding it was. I can definitely remove it completely from the file though and will make that correction shortly!
The shuffling only affects the order in which the data files get processed.
For example, with shuffling the data files will get processed out of order:
Started Processing Data File: metrics.connections.00005.sql in thread 3 (File 1 of 5)
Started Processing Data File: metrics.connections.00004.sql in thread 4 (File 2 of 5)
Started Processing Data File: metrics.connections.00001.sql in thread 1 (File 3 of 5)
Started Processing Data File: metrics.connections.00002.sql in thread 2 (File 4 of 5)
I only have 5 files in this example, so it's not a great one, but you can imagine things possibly starting at file 500, 1, 10, and 42 for a table with the shuffling in place, and then on a subsequent execution the data files would be processed again in a different order, etc.
I think removing the shuffling code is better for users since knowing that the files will be processed more or less in numeric order seems like a good thing? There is the caveat of threading still allowing for some processing out of order, but at least you can see that files 1-4 were started first below and then it will progress to file 5 afterward:
Started Processing Data File: metrics.connections.00001.sql in thread 3 (File 1 of 5)
Started Processing Data File: metrics.connections.00004.sql in thread 2 (File 4 of 5)
Started Processing Data File: metrics.connections.00003.sql in thread 1 (File 3 of 5)
Started Processing Data File: metrics.connections.00002.sql in thread 4 (File 2 of 5)
This seems like it would be useful for users since they would have some predictability for the order in which their data files will be processed by the `restore-dump` operation.

This also comes out of working with a current user that is migrating from Postgres and generating files in the format needed for `restore-dump`. Throughout that process different SQL errors have cropped up and it can be unclear which file needs to be reviewed more closely without the extra output that `--show-details` now provides. With the extra context of where you are in the processing of the files with `--show-details`, this can help point a user in the right direction more quickly and they may even be able to more easily know where they can try to continue from in a subsequent `restore-dump` attempt without necessarily having to restore all of their data files again.

Going back to the shuffling example once more, if data file 500 for a table contained a problematic row when running a `restore-dump` attempt, the amount of time it would take to encounter the error right now would vary since it would depend on where that file lands in the shuffled ordering. Without the shuffling, the error should be encountered in a fairly consistent manner since the data files will be processed in the same order.
I removed the commented-out shuffling code. I'm not sure of the original intention behind it, but I think allowing the data files to be processed more or less in numeric order will be of benefit to users, especially combined with the other new options added within this branch.
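For reference, a minimal sketch of the predictable ordering this enables, assuming the loader simply sorts the collected data file names (the actual file-collection code will differ):

```go
package sketch

import "sort"

// orderDataFiles sorts the collected data file names so that, with the
// shuffle removed, processing follows a stable, roughly numeric order
// (<DATABASE>.<TABLE>.00001.sql, 00002.sql, ...) on every run.
func orderDataFiles(files []string) []string {
	sort.Strings(files)
	return files
}
```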
Now that #932 was merged a little earlier, I was able to incorporate some extra show-details output specific to views into this PR. The existing `canRestoreSchema()` check has also been wrapped around the portion of code that restores the view definitions. In the future we may want to differentiate between restoring tables vs. views only, but for now they are both considered part of the schema, so I'm using the same check for both, which should work fine for now.
I had worked with the user I originally created a lot of these options for on their Postgres migration a few weekends ago, but I hadn't confirmed whether they had been able to utilize the custom build including these features during the migration itself. They were able to confirm that earlier today and shared: "...we couldn’t have completed the migration without it. It was super helpful to be able to see progress as before it was a complete black box".
While an improved ability to detect Ctrl+C context cancellation was added in #933, I was noticing in this branch that when using the `--show-details` output, after pressing Ctrl+C I would still see a lot of extra output about data files being processed that were actually getting skipped. This adds an extra cancellation check to avoid that unneeded output.
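A rough sketch of the kind of extra cancellation check described above, with hypothetical names; the idea is simply to stop before announcing each remaining file once the context has been canceled:

```go
package sketch

import (
	"context"
	"fmt"
)

// processFiles bails out before logging or processing each file once Ctrl+C
// has canceled the context, so no --show-details output is printed for files
// that will be skipped anyway.
func processFiles(ctx context.Context, files []string, process func(string) error) error {
	for i, f := range files {
		if err := ctx.Err(); err != nil {
			return err // canceled: skip the remaining files quietly
		}
		fmt.Printf("Started Processing Data File: %s (File %d of %d)\n", f, i+1, len(files))
		if err := process(f); err != nil {
			return err
		}
	}
	return nil
}
```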
While running through the commands again today I noticed that if an error was encountered while a query was executing, which could happen when using the new `--data-only` flag if a prior attempt had been canceled partway through, the subsequent data files would continue to process. That could perhaps be helpful in some cases, but in most we would probably want the restore attempt to error out early so that the issue can be addressed and some useful error information displayed. This commit adds some improved context usage so that things error out earlier, and also includes a snippet of the larger error output we would normally return, so that the user can see the error more easily and then adjust their usage of the command or manually empty the table before trying again.
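A sketch of that fail-fast behavior, again with hypothetical names: the first failing query cancels a shared context so other workers stop picking up files, and only a short snippet of the potentially huge error text is surfaced:

```go
package sketch

import (
	"context"
	"fmt"
)

// errSnippet trims very long error text (which can include a large INSERT)
// down to something readable in the terminal.
func errSnippet(err error, max int) string {
	msg := err.Error()
	if len(msg) > max {
		return msg[:max] + "..."
	}
	return msg
}

// runQuery cancels the shared context on the first failing query so the
// remaining data files error out early instead of continuing to process.
func runQuery(ctx context.Context, cancel context.CancelFunc, file string, exec func(context.Context) error) error {
	if err := exec(ctx); err != nil {
		cancel()
		return fmt.Errorf("restore failed while processing %s: %s", file, errSnippet(err, 200))
	}
	return nil
}
```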
These changes add a few potentially useful capabilities to the `restore-dump` command.

The first is the `--show-details` flag, which helps provide a middle ground between the minimal default "Restoring database ..." output and the potentially too noisy `--debug` output, since that also ends up printing the large `INSERT` queries being executed to the terminal. This flag will output the names of the files that will be included within the restore folder and will also show progress processing the queries within the various data files.

The `--start-from` flag allows you to provide a table name which will then be used to skip earlier tables, so that the restore begins from the table name that is provided. This can be helpful to avoid having to start a restore completely from the very first table again. (Replaced by the `--starting-table` and `--ending-table` options described in the later comment down below.)

The `--allow-different-destination` flag addresses #540 and adds extra flexibility: users are not required to adjust the database name prefix embedded into the files within a dump folder for a different database and can simply enable this option to allow those files to be utilized for a database with a different name.

The `--schema-only` and `--data-only` flags can help address situations where you may have a folder containing both sets of files but only need one type. While a new folder containing only the files needed is an option, this mainly helps add extra flexibility without needing to take that extra step.

Example output:
The following two examples use an example database named `metrics` and restore into its `main` branch.

The first example shows using the `--schema-only` flag:

The second example demonstrates how using `--start-from` prevents files from the earlier `connections` table from being included at all and allows things to start from the provided table name, which happens to be called `placeholder`:

To demonstrate the usage of `--allow-different-destination`, first let's see the default behavior when attempting to restore a dump folder, created for the `metrics` database, into a database named `not-metrics`:

As expected, the restore is unable to complete since the filenames within the dump folder have the embedded `metrics` name, which doesn't match the `not-metrics` name being passed into the command.

But if we add the new `--allow-different-destination` flag, the restore proceeds into the differently named destination database.

I didn't want to include a lot of lines showing data files being processed, but below is a snippet showing the extra details that would be included towards the end of processing a file: