
Add optional progress information and other enhancements for restore-dump command #910

Open · wants to merge 13 commits into main

Conversation

@orware orware (Contributor) commented Sep 27, 2024

These changes add a few potentially useful capabilities to the restore-dump command.

The first is the --show-details flag, which provides a middle ground between the minimal default "Restoring database ..." output and the potentially too noisy --debug output (which also prints the large INSERT queries being executed to the terminal). This flag outputs the names of the files included from the restore folder and shows progress as the queries within the various data files are processed.

The --start-from flag takes a table name and skips all earlier tables so that the restore begins from the provided table, which helps avoid restarting a restore from the very first table.
(Replaced by the --starting-table and --ending-table options described in a later comment below.)

The --allow-different-destination flag addresses #540 and adds extra flexibility. Instead of renaming the database-name prefix embedded in each file of a dump folder created for a different database, users can enable this option and restore those files into a database with a different name.
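
As a rough sketch of how that filename handling could work (the `parseDumpFilename` helper below is hypothetical, not the PR's implementation), the check only needs to relax the comparison against the database prefix embedded in each filename:

```go
package main

import (
	"fmt"
	"strings"
)

// parseDumpFilename splits a dump filename such as
// "metrics.placeholder-schema.sql" or "metrics.connections.00001.sql"
// into its database prefix and the remainder. Hypothetical helper.
func parseDumpFilename(name string) (db, rest string, ok bool) {
	base := strings.TrimSuffix(name, ".sql")
	parts := strings.SplitN(base, ".", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	return parts[0], parts[1], true
}

// shouldInclude decides whether a dump file belongs to this restore.
// With allowDifferentDestination enabled, the database prefix embedded
// in the filename no longer has to match the destination database name.
func shouldInclude(filename, destDB string, allowDifferentDestination bool) bool {
	db, _, ok := parseDumpFilename(filename)
	if !ok {
		return false
	}
	return allowDifferentDestination || db == destDB
}

func main() {
	fmt.Println(shouldInclude("metrics.placeholder-schema.sql", "not-metrics", false)) // false
	fmt.Println(shouldInclude("metrics.placeholder-schema.sql", "not-metrics", true))  // true
}
```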

The --schema-only and --data-only flags address situations where a folder contains both sets of files but only one type is needed. Creating a new folder with only the needed files is always an option, but these flags provide the same flexibility without that extra step.
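
Since schema files end in `-schema.sql` and data files carry a numbered `.00001.sql`-style suffix (see the example output below), here is a sketch of how these two flags could filter a folder listing; the helper names are illustrative, not the PR's code:

```go
package main

import (
	"fmt"
	"strings"
)

// isSchemaFile reports whether a dump file holds a table definition;
// schema files end in "-schema.sql", data files in a numbered ".00001.sql" suffix.
func isSchemaFile(name string) bool {
	return strings.HasSuffix(name, "-schema.sql")
}

// filterFiles applies the --schema-only / --data-only semantics: passing
// both flags (or neither) keeps everything, matching a normal full restore.
func filterFiles(files []string, schemaOnly, dataOnly bool) []string {
	if schemaOnly == dataOnly { // both or neither: keep all files
		return files
	}
	var kept []string
	for _, f := range files {
		if isSchemaFile(f) == schemaOnly {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	files := []string{
		"metrics.connections-schema.sql",
		"metrics.connections.00001.sql",
		"metrics.placeholder-schema.sql",
	}
	fmt.Println(filterFiles(files, true, false)) // schema files only
}
```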

Example output:

The following two examples restore into the main branch of a database named metrics.

The first example shows using the --schema-only flag:

pscale database restore-dump metrics main --dir="<DIRECTORY_NAME>" --org=<ORG> --overwrite-tables --threads=2 --schema-only --show-details

Starting to restore database metrics from folder <DIRECTORY_NAME>
Restoring database ...
The schema only option is enabled for this restore.
Collecting files from folder <DIRECTORY_NAME>
  |- Table file: metrics.connections-schema.sql
    |- Data file: metrics.connections.00001.sql
    |- Data file: metrics.connections.00002.sql
    |- Data file: metrics.connections.00003.sql
    |- Data file: metrics.connections.00004.sql
    |- Data file: metrics.connections.00005.sql
  |- Table file: metrics.placeholder-schema.sql
Dropping Existing Table (if it exists): `metrics`.`connections`
Creating Table: `metrics`.`connections` (Table 1 of 2)
Dropping Existing Table (if it exists): `metrics`.`placeholder`
Creating Table: `metrics`.`placeholder` (Table 2 of 2)
Skipping restoring data files...
Restore is finished! (elapsed time: 1.189019208s)

The second example demonstrates how --start-from excludes files from the earlier connections table entirely and starts the restore from the provided table, here named placeholder:

pscale database restore-dump metrics main --dir="<DIRECTORY_NAME>" --org=<ORG> --overwrite-tables --threads=2 --schema-only --show-details --start-from="placeholder"

Starting to restore database metrics from folder <DIRECTORY_NAME>
Restoring database ...
The schema only option is enabled for this restore.
Collecting files from folder <DIRECTORY_NAME>
Skipping files associated with the connections table...
  |- Table file: metrics.placeholder-schema.sql
Starting from placeholder table...
Dropping Existing Table (if it exists): `metrics`.`placeholder`
Creating Table: `metrics`.`placeholder` (Table 1 of 1)
Skipping restoring data files...
Restore is finished! (elapsed time: 562.755959ms)

To demonstrate --allow-different-destination, first let's look at the default behavior when attempting to restore a dump folder created for the metrics database into a database named not-metrics:

pscale database restore-dump not-metrics main --dir="<DIRECTORY_NAME>" --org=<ORG> --overwrite-tables --threads=2 --schema-only --show-details --start-from="placeholder"

Starting to restore database not-metrics from folder <DIRECTORY_NAME>
Restoring database ...
The schema only option is enabled for this restore.
Collecting files from folder <DIRECTORY_NAME>
Skipping files associated with the connections table...
  |- Table file: metrics.placeholder-schema.sql
Starting from placeholder table...
Error: failed to restore database: VT05003: unknown database 'metrics' in vschema (errno 1105) (sqlstate HY000)

As expected, the restore cannot complete since the filenames within the dump folder embed the metrics name, which doesn't match the not-metrics name passed to the command.

But adding the new --allow-different-destination flag allows the restore to proceed into the differently named destination database:

pscale database restore-dump not-metrics main --dir="<DIRECTORY_NAME>" --org=<ORG> --overwrite-tables --threads=2 --schema-only --show-details --start-from="placeholder" --allow-different-destination

Starting to restore database not-metrics from folder <DIRECTORY_NAME>
Restoring database ...
The allow different destination option is enabled for this restore.
Files that do not begin with the provided database name of not-metrics will still be processed without having to rename them first.
The schema only option is enabled for this restore.
Collecting files from folder <DIRECTORY_NAME>
Skipping files associated with the connections table...
  |- Table file: metrics.placeholder-schema.sql
Starting from placeholder table...
Dropping Existing Table (if it exists): `not-metrics`.`placeholder`
Creating Table: `not-metrics`.`placeholder` (Table 1 of 1)
Skipping restoring data files...
Restore is finished! (elapsed time: 543.931875ms)

I didn't want to include many lines of data files being processed, but below is a snippet showing the extra details included toward the end of processing a file:

  Processing Query 126 out of 134 within metrics.connections.00005.sql in thread 1
  Processing Query 127 out of 134 within metrics.connections.00005.sql in thread 1
  Processing Query 128 out of 134 within metrics.connections.00005.sql in thread 1
  Processing Query 129 out of 134 within metrics.connections.00005.sql in thread 1
  Processing Query 130 out of 134 within metrics.connections.00005.sql in thread 1
  Processing Query 131 out of 134 within metrics.connections.00005.sql in thread 1
  Processing Query 132 out of 134 within metrics.connections.00005.sql in thread 1
  Processing Query 133 out of 134 within metrics.connections.00005.sql in thread 1
  Processing Query 134 out of 134 within metrics.connections.00005.sql in thread 1
Finished Processing Data File: metrics.connections.00005.sql in 2m45.760835334s with 6m20.652604042s elapsed so far (File 5 of 5)

When working with a larger restore, the `restore-dump` command by default shows only the "Restoring database ..." indicator, which provides minimal insight into how things are progressing.

On the other hand, adding the `--debug` flag can produce too much information, since the large `INSERT` queries are then also included in the output.

The new `--show-details` flag lets users more easily see how their restore is progressing.

Additionally, a `--start-from` flag was added that takes a table name. This lets a user skip earlier tables and start the import from a later point without creating a separate copy of the dump folder and manually removing files.

The new `--allow-different-destination` flag simplifies taking a folder that was created for one database (e.g. files prefixed with "first-database") and restoring it into a second database without adjusting the database prefix on all of the existing files.

Also incorporated a check for custom-generated files where users might accidentally provide an `INSERT` query larger than 16777216 bytes. This provides useful feedback instead of the `pkt` error that was previously returned.

From what I can tell this made no functional change; it mainly let the new `--show-details` output show information such as "Thread 1" rather than "Thread 0" without manually adding one to the value elsewhere.

BuildKite was failing due to the way I was passing the helper used for printing, so I made some adjustments and simplified how that works.

Afterward, I added two more optional flags: `--schema-only` and `--data-only`.

Using both at the same time is essentially equivalent to a normal restore, where both sets of files are included.

The new `--start-from` parameter can be used together with these as well.
@orware orware requested a review from a team as a code owner September 27, 2024 22:06
@orware orware (Contributor, Author) commented Sep 28, 2024

I was thinking about this some more, and it makes sense to make this option more comprehensive: not just including tables >= a starting table, but also the tables <= an ending table, as well as allowing a more specific inclusive starting/ending table range. So I made some adjustments.

The last commit removes the --start-from option and replaces it with the --starting-table and --ending-table options which can be used separately or together.

Examples:

If neither option is used, all tables are included as they normally would be. If the provided ending table comes before the starting table alphabetically, the restore short-circuits until that is corrected.
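
A sketch of the inclusive range check and the short-circuit described above; the helpers are hypothetical, not the PR's exact code:

```go
package main

import (
	"errors"
	"fmt"
)

// tableInRange implements the inclusive --starting-table / --ending-table
// filter: an empty bound leaves that side of the range open.
func tableInRange(table, starting, ending string) bool {
	if starting != "" && table < starting {
		return false
	}
	if ending != "" && table > ending {
		return false
	}
	return true
}

// validateRange short-circuits when the ending table sorts before the
// starting table, since that range can never match any table.
func validateRange(starting, ending string) error {
	if starting != "" && ending != "" && ending < starting {
		return errors.New("ending table sorts before starting table; no tables would be restored")
	}
	return nil
}

func main() {
	if err := validateRange("abc", "bcd"); err != nil {
		fmt.Println(err)
		return
	}
	for _, t := range []string{"abc", "bcd", "connections", "placeholder"} {
		fmt.Println(t, tableInRange(t, "abc", "bcd")) // abc, bcd kept; the rest skipped
	}
}
```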

This example skips the last two tables since the table range being provided will only include the first two:

pscale database restore-dump metrics main --dir="<DIRECTORY_NAME>" --org=<ORG> --overwrite-tables --threads=2 --schema-only --show-details --starting-table="abc" --ending-table="bcd"

Starting to restore database metrics from folder <DIRECTORY_NAME>
Restoring database ...
The schema only option is enabled for this restore.
Collecting files from folder <DIRECTORY_NAME>
  |- Table file: metrics.abc-schema.sql
  |- Table file: metrics.bcd-schema.sql
Skipping files associated with the connections table...
Skipping files associated with the placeholder table...
Restore will be starting from the abc table...
Restore will be ending at the bcd table...
Dropping Existing Table (if it exists): `metrics`.`abc`
Creating Table: `metrics`.`abc` (Table 1 of 2)
Dropping Existing Table (if it exists): `metrics`.`bcd`
Creating Table: `metrics`.`bcd` (Table 2 of 2)
Skipping restoring data files...
Restore is finished! (elapsed time: 1.005391166s)

For the case above, running the same command without --starting-table would produce the same output, since abc is the only table before bcd:

pscale database restore-dump metrics main --dir="<DIRECTORY_NAME>" --org=<ORG> --overwrite-tables --threads=2 --schema-only --show-details --ending-table="bcd"

The opposite, however, starting from the bcd table and continuing to the end, automatically includes the remaining tables:

pscale database restore-dump metrics main --dir="<DIRECTORY_NAME>" --org=<ORG> --overwrite-tables --threads=2 --schema-only --show-details --starting-table="bcd"

Starting to restore database metrics from folder <DIRECTORY_NAME>
Restoring database ...
The schema only option is enabled for this restore.
Collecting files from folder <DIRECTORY_NAME>
Skipping files associated with the abc table...
  |- Table file: metrics.bcd-schema.sql
  |- Table file: metrics.connections-schema.sql
    |- Data file: metrics.connections.00001.sql
    |- Data file: metrics.connections.00002.sql
    |- Data file: metrics.connections.00003.sql
    |- Data file: metrics.connections.00004.sql
    |- Data file: metrics.connections.00005.sql
  |- Table file: metrics.placeholder-schema.sql
Restore will be starting from the bcd table...
Dropping Existing Table (if it exists): `metrics`.`bcd`
Creating Table: `metrics`.`bcd` (Table 1 of 3)
Dropping Existing Table (if it exists): `metrics`.`connections`
Creating Table: `metrics`.`connections` (Table 2 of 3)
Dropping Existing Table (if it exists): `metrics`.`placeholder`
Creating Table: `metrics`.`placeholder` (Table 3 of 3)
Skipping restoring data files...
Restore is finished! (elapsed time: 1.321406s)

This should address these two lint errors that were noted:
[2024-09-28T00:50:36Z] internal/cmd/database/restore.go:75:10: error strings should not be capitalized (ST1005)
[2024-09-28T00:50:36Z] internal/cmd/database/restore.go:75:10: error strings should not end with punctuation or newlines (ST1005)
Rather than hard-coding the 16777216-byte limit, I thought it better to allow it to be adjusted in the rare cases where the server is configured differently.

Generally, though, the default 16 MiB limit will be the one our users encounter.

Note that this option only affects the quick length check performed before running a query; it doesn't (and cannot) adjust the server-configured message size limit for queries.
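
A sketch of this kind of pre-flight check, with the 16 MiB default overridable; the helper and constant names here are illustrative, not the PR's actual identifiers:

```go
package main

import "fmt"

// defaultMaxQuerySize mirrors MySQL's common 16 MiB max_allowed_packet default.
const defaultMaxQuerySize = 16777216

// checkQuerySize rejects an oversized query up front with a readable error
// instead of letting the server fail later with an opaque packet-size ("pkt")
// error. This only guards the client-side check; it cannot change the
// server's actual configured limit.
func checkQuerySize(query string, maxQuerySize int) error {
	if len(query) > maxQuerySize {
		return fmt.Errorf("query is %d bytes, which exceeds the %d byte limit; split the INSERT into smaller statements",
			len(query), maxQuerySize)
	}
	return nil
}

func main() {
	q := "INSERT INTO t VALUES (1)"
	if err := checkQuerySize(q, defaultMaxQuerySize); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("query size ok")
}
```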
```diff
@@ -40,7 +40,7 @@ func (conn *Connection) StreamFetch(query string) (driver.Rows, error) {
 // NewPool creates the new pool.
 func NewPool(log *zap.Logger, cap int, address string, user string, password string, vars []string, database string) (*Pool, error) {
 	conns := make(chan *Connection, cap)
-	for i := 0; i < cap; i++ {
+	for i := 1; i <= cap; i++ {
```
Collaborator:

Any reason for this change? Starting from 0 and using < is the canonical style.

Contributor Author:

I had mixed feelings about making that change since I knew I was likely departing from the canonical approach.

The main reason for the change was that it let me make the adjustment in one spot, so the extra --show-details output would line up with the number of threads the user provided when running the command. But I can also do a +1 adjustment within the individual output lines instead.

For example, with 4 threads the output currently shows thread values 1-4, as shown below, rather than 0-3, which might be a little nicer for the user to see:

Started Processing Data File: <DATABASE>.<TABLE>.00002.sql in thread 4 (File 2 of 2993)
Started Processing Data File: <DATABASE>.<TABLE>.00004.sql in thread 2 (File 4 of 2993)
Started Processing Data File: <DATABASE>.<TABLE>.00001.sql in thread 3 (File 1 of 2993)
Started Processing Data File: <DATABASE>.<TABLE>.00003.sql in thread 1 (File 3 of 2993)
  Processing Query 1 out of 1 within <DATABASE>.<TABLE>.00002.sql in thread 4
  Processing Query 1 out of 1 within <DATABASE>.<TABLE>.00001.sql in thread 3
  Processing Query 1 out of 1 within <DATABASE>.<TABLE>.00003.sql in thread 1
  Processing Query 1 out of 1 within <DATABASE>.<TABLE>.00004.sql in thread 2

I'm happy to switch back to the canonical style in pool.go, though, and either let the output above show 0-3 instead, or make minor adjustments to the output so that the potentially friendlier 1-4 thread values are still shown.

Contributor Author:

I've reverted this back to the canonical style in NewPool.

The only remaining question would be whether I should leave things as-is for --show-details with the zero-based output like this:

Started Processing Data File: metrics.connections.00001.sql in thread 2 (File 1 of 5)
Started Processing Data File: metrics.connections.00004.sql in thread 1 (File 4 of 5)
Started Processing Data File: metrics.connections.00002.sql in thread 3 (File 2 of 5)
Started Processing Data File: metrics.connections.00003.sql in thread 0 (File 3 of 5)

Or whether I should additionally tweak the output so it appears as 1-based.

If it's fine as-is then no additional updates should be needed.
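
For reference, a minimal runnable sketch of the 1-based display approach: keep the canonical zero-based loop and shift the index only where it is shown to the user (nothing here is the PR's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const threads = 4
	files := []string{"a.00001.sql", "a.00002.sql", "a.00003.sql", "a.00004.sql"}

	var wg sync.WaitGroup
	// Canonical zero-based loop; the +1 happens only in the output, so the
	// user sees threads 1..threads while the code keeps idiomatic indexing.
	for i := 0; i < threads; i++ {
		wg.Add(1)
		go func(threadID int) {
			defer wg.Done()
			fmt.Printf("Started Processing Data File: %s in thread %d\n",
				files[threadID], threadID+1)
		}(i)
	}
	wg.Wait()
}
```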

//for i := range files.tables {
// j := rand.Intn(i + 1)
// files.tables[i], files.tables[j] = files.tables[j], files.tables[i]
//}
Collaborator:

I don't know if MySQL gives any guarantees about table ordering, so the start/end table logic may not be very reliable to add.

Otherwise we might need to change the query that lists the tables so we can enforce the order there. That would mean changing it to use the information_schema tables.

In any case, if we remove code, we should remove it, not comment it out 😄. We have git to track things and find old code 😉.

@orware orware (Contributor, Author) commented Oct 10, 2024

I appreciate the input @dbussink ❤️!

Is your first sentence related to the --starting-table and --ending-table options?

Those primarily drive the list of files picked up from the dump folder and provide extra control over which files are selected for the current restore-dump operation, so MySQL table ordering shouldn't necessarily factor into those options.

I mainly see --starting-table, --ending-table, and several of the other new options in this PR as helping avoid the need to create multiple copies of a dump folder; normally, to restore less than the full dataset, you might have to copy an existing dump folder and then prune it down to just the tables/files you want to restore.


As for the shuffling-related code I commented out, its purpose initially confused me; I thought it did something different and was unsure of the original rationale for adding it.

I can definitely remove it completely from the file though and will make that correction shortly!

The shuffling only affects the order in which the data files get processed.

For example, with shuffling the data files will get processed out of order:

Started Processing Data File: metrics.connections.00005.sql in thread 3 (File 1 of 5)
Started Processing Data File: metrics.connections.00004.sql in thread 4 (File 2 of 5)
Started Processing Data File: metrics.connections.00001.sql in thread 1 (File 3 of 5)
Started Processing Data File: metrics.connections.00002.sql in thread 2 (File 4 of 5)

I only have 5 files in this example, so it's not a great one, but with shuffling in place you can imagine a table starting at files 500, 1, 10, and 42, and then the data files being processed in a different order again on a subsequent execution, etc.

I think removing the shuffling code is better for users, since knowing the files will be processed more or less in numeric order seems like a good thing. The caveat is that threading still allows some out-of-order processing, but at least you can see below that files 1-4 were started first and that it progresses to file 5 afterward:

Started Processing Data File: metrics.connections.00001.sql in thread 3 (File 1 of 5)
Started Processing Data File: metrics.connections.00004.sql in thread 2 (File 4 of 5)
Started Processing Data File: metrics.connections.00003.sql in thread 1 (File 3 of 5)
Started Processing Data File: metrics.connections.00002.sql in thread 4 (File 2 of 5)

This seems useful since users gain some predictability in the order their data files are processed by the restore-dump operation.

This also comes out of working with a current user who is migrating from Postgres and generating files in the format restore-dump needs.

Throughout that process, different SQL errors have cropped up, and without the extra output --show-details now provides, it can be unclear which file needs closer review.

The extra context from --show-details about where you are in processing the files can point a user in the right direction more quickly, and they may even be able to tell where to continue from in a subsequent restore-dump attempt without necessarily having to restore all of their data files again.

Going back to the shuffling example once more: if data file 500 for a table contained a problematic row, the time it takes a restore-dump attempt to hit the error currently varies, since it depends on where that file lands in the shuffled ordering. Without the shuffling, the error would be encountered in a fairly consistent manner since the data files are processed in the same order.
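
For illustration, a standalone sketch of the deterministic ordering that results once shuffling is removed; this just sorts a slice of example filenames, not the PR's actual `files.tables` structure:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Zero-padded chunk numbers mean a plain lexicographic sort already
	// yields numeric order, so an error in a given file surfaces at a
	// consistent point in every restore attempt.
	files := []string{
		"metrics.connections.00003.sql",
		"metrics.connections.00001.sql",
		"metrics.connections.00005.sql",
		"metrics.connections.00002.sql",
		"metrics.connections.00004.sql",
	}
	sort.Strings(files)
	for i, f := range files {
		fmt.Printf("File %d of %d: %s\n", i+1, len(files), f)
	}
}
```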

Contributor Author:

I removed the commented out shuffling code.

I'm not sure of the original intention behind the shuffling code, but I think letting the data files be processed more or less in numeric order will benefit users, especially with the other new options added in this branch.

Now that #932 was merged a little earlier, I was able to incorporate some extra show-details output specific to views into this PR.

The current `canRestoreSchema()` check has also been wrapped around the portion of code that restores the view definitions. In the future we may want to differentiate between restoring tables vs. views, but for now both are considered part of the schema, so I'm using the same check for both, which should work fine for now.
@orware orware (Contributor, Author) commented Nov 4, 2024

A few weekends ago I worked with the user I originally created many of these options for on their Postgres migration, but I hadn't confirmed whether they were able to use the custom build with these features during the migration itself.

They were able to confirm that earlier today and shared: "...we couldn’t have completed the migration without it. It was super helpful to be able to see progress as before it was a complete black box".

While improved detection of Ctrl+C context cancellation was added in #933, I noticed in this branch that with `--show-details` enabled, pressing Ctrl+C still produced a lot of extra output about data files being processed that were actually being skipped, so this adds an extra cancellation check to avoid that unneeded output.

While running through the commands again today, I noticed that if an error was encountered while a query was executing (which could happen with the new `--data-only` flag if a prior attempt had been canceled partway through), the subsequent data files would continue to be processed.

This could perhaps be helpful in some cases, but in most we would want the restore attempt to error out early so the issue can be addressed, with some useful error information displayed.

This commit improves the context usage so that things error out earlier, and it also surfaces a snippet of the larger error output we would normally return, so the user can see the error more easily and then adjust their usage of the command or manually empty the table before trying again.
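
A sketch of the pattern this commit describes, assuming a worker that checks a shared context before each file and cancels it on the first failure; all names here are illustrative, not the PR's actual code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// restoreFiles processes data files in order, stopping early if the context
// is canceled (e.g. Ctrl+C) or a query fails; canceling the context on the
// first error also keeps the remaining skipped files from producing noisy
// --show-details output.
func restoreFiles(ctx context.Context, cancel context.CancelFunc, files []string) error {
	for _, f := range files {
		// Bail out before starting the next file if a prior failure or
		// Ctrl+C already canceled the restore.
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := processFile(f); err != nil {
			cancel() // stop the other workers early
			// Surface a short snippet of the error so the user can adjust
			// the command or empty the table before retrying.
			return fmt.Errorf("restoring %s failed: %w", f, err)
		}
	}
	return nil
}

// processFile stands in for executing a data file's queries.
func processFile(name string) error {
	if name == "metrics.connections.00002.sql" {
		return errors.New("duplicate entry for key 'PRIMARY'")
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	files := []string{
		"metrics.connections.00001.sql",
		"metrics.connections.00002.sql",
		"metrics.connections.00003.sql",
	}
	if err := restoreFiles(ctx, cancel, files); err != nil {
		fmt.Println("Error:", err)
	}
}
```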