Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Barman postgres restore on different server failing due to "unable to prepare 'XXX' tablespace" #1048

Open
adam-shenker opened this issue Dec 22, 2024 · 5 comments
Labels

Comments

@adam-shenker
Copy link

Hi,

I took a backup and then tried to restore the database on a cleanly installed postgres db on a different server.
I am getting the following output from the command:

[barman@postgres-dev ~]$ barman restore --remote-ssh-command="ssh postgres@lisdevnew" streaming-postgres-dev-server test-backup /var/lib/pgsql/17/data/
Starting remote restore for server streaming-postgres-dev-server using backup 20241219T094317
Destination directory: /var/lib/pgsql/17/data/
Remote command: ssh postgres@lisdevnew
16988, user_data, /u01/postgres/datafiles/ora_tasks
ERROR: unable to prepare 'maharlogs' tablespace (destination '/u01/postgres/datafiles/maharlogs'): mkdir execution failed

The obvious answer would be that there is a permissions problem, but this is the second directory that is being created - not the first.

The barmon log states:
2024-12-22 06:56:53,804 [3403307] barman.recovery_executor INFO: Starting remote restore for server streaming-postgres-dev-server using backup 20241219T094317
2024-12-22 06:56:53,804 [3403307] barman.recovery_executor INFO: Destination directory: /var/lib/pgsql/17/data/
2024-12-22 06:56:53,805 [3403307] barman.recovery_executor INFO: Remote command: ssh postgres@lisdevnew
2024-12-22 06:56:53,983 [3403307] barman.recovery_executor WARNING: Unable to retrieve safe horizon time for smart rsync copy: The /var/lib/pgsql/17/data/.barman-recover.info file does not exist
2024-12-22 06:56:56,689 [3403307] barman.recovery_executor INFO: 16988, user_data, /u01/postgres/datafiles/ora_tasks
2024-12-22 06:56:56,885 [3403307] Command WARNING: kex_exchange_identification: read: Connection reset by peer^M
2024-12-22 06:56:56,885 [3403307] Command WARNING: Connection reset by 192.168.26.182 port 22^M
2024-12-22 06:56:56,907 [3403307] Command WARNING: kex_exchange_identification: read: Connection reset by peer^M
2024-12-22 06:56:56,907 [3403307] Command WARNING: Connection reset by 192.168.26.182 port 22^M
2024-12-22 06:56:56,908 [3403307] barman.recovery_executor ERROR: unable to prepare 'maharlogs' tablespace (destination '/u01/postgres/datafiles/maharlogs'): mkdir execution failed
2024-12-22 06:57:01,264 [3403341] barman.utils INFO: Cleaning up lockfiles directory.

Prior to trying the restore the directory structure on the remote server was :
rwxr-xr-x. 4 postgres postgres 40 Dec 19 10:48 datafiles
[postgres@lisdevnew postgres]$ chmod 777 datafiles/
[postgres@lisdevnew postgres]$ ll
total 0
drwxrwxrwx. 4 postgres postgres 40 Dec 19 10:48 datafiles
[postgres@lisdevnew postgres]$ cd datafiles/
[postgres@lisdevnew datafiles]$ ll /u01/postgres/datafiles/
total 0
drwxrwxrwx. 2 postgres postgres 6 Dec 19 10:48 maharlogs
drwxrwxrwx. 2 postgres postgres 6 Dec 22 06:55 ora_tasks
[postgres@lisdevnew datafiles]$ pwd
/u01/postgres/datafiles
[postgres@lisdevnew datafiles]$ rmdir maharlogs/ ora_tasks/
[postgres@lisdevnew datafiles]$ ll
total 0
[postgres@lisdevnew datafiles]$ pwd
/u01/postgres/datafiles

As you can see - the directory structure is wide open (dangerously so as this is for testing).

After the restore, the directory structure was:
After failed upgrade
[postgres@lisdevnew datafiles]$ ll
total 0
drwxr-xr-x. 2 postgres postgres 6 Dec 22 06:56 ora_tasks

Showing that one directory was created - and therefore the issue is not obviously a permissions one.

Just to confirm that the directories on the server where the database is backed up are:
root@postgres-dev barman.d]# cd /u01/postgres/
[root@postgres-dev postgres]# ll
total 0
drwxrwxr-x. 5 postgres postgres 58 Nov 18 07:18 datafiles
[root@postgres-dev postgres]# cd datafiles/
[root@postgres-dev datafiles]# ll
total 0
drwx------. 3 postgres postgres 29 Nov 18 07:17 maharlogs
drwx------. 3 postgres postgres 29 Nov 13 10:27 ora_tasks
drwx------. 3 postgres postgres 29 Nov 18 07:18 ora_tasks2
[root@postgres-dev datafiles]#

Showing that they are owned by postgres and don't have any strange permissions.

ssh is obviously working from the barman server (I only created the ssh in a unidirection):
[barman@postgres-dev ~]$ ssh postgres@lisdevnew date
Sun Dec 22 07:37:38 IST 2024

The command being used to create the new backup (from the barman server is:
barman restore --remote-ssh-command="ssh postgres@lisdevnew" streaming-postgres-dev-server test-backup /var/lib/pgsql/17/data/

There is nothing of interest on either server within /var/log/messages.

As this doesn't appear to be a permissions issue and I cannot find anything Online - I thought that I would request your aid/direction.

With thanks,

Adam.

@gustabowill
Copy link
Contributor

Hi Adam, can you provide the output of the barman diagnose command?

@adam-shenker
Copy link
Author

barman_log_20241223.txt
barman_diagnose.txt

Hi,
Thank you for your very quick response.

Please find attached the requested output.
FYI - I tried restoring this to another empty postgresdb (just run init and created extensions) on a clean server and we have the same issue exactly where the first directory (ora_tasks) is created and the second fails.

I have also set the log_level = DEBUG and so I have attached a copy of the relevant part of the barman log as well.
Please note there is only trusted ssh between the barman server and the postgres server - but not the other way around - not sure if this is important.

With thanks,

Adam.

@gustabowill
Copy link
Contributor

Hi Adam, I don't think this is a Barman problem, but most likely something on your remote host. The logs show the connection being reset for some reason. Have you tried creating these directories manually using ssh from a second host to see if it works? I also think it is worth restarting your ssh service and checking your firewall rules, just in case.

@adam-shenker
Copy link
Author

adam-shenker commented Dec 24, 2024

Hi,

I completely understand where you are coming from.
If I directly run the command:
ssh postrgres@server mkdir -p directory
The directory is created on the server.

I looked at the firewall connections page between the two servers (we use panorama) this morning.

There were no connections
I ran the restore and there were 19 connections within 4 seconds (until it failed).
All of these connections had an "allow" tag on the action.

Hence, I do not think that the issue that we are having is directly related to either the ssh command or firewall.
Especially as one directory is being created and the other one isn't.

I still agree with you that this issue is probably something local to our environment (tried on two servers which are on the same vlan) but I have no idea what....

With thanks for your help (and speedy replies),

Adam.

NB. The postgres backup is stored on a shared directory that can be seen by both servers.
Is there a way that on the "restore" server I can create a new barman install which accesses (and sees) the same backups but does not process the actual WAL files?
This way I could do an in place restore rather than through SSH and we can see if this helps to solve the issue.

@barthisrael
Copy link
Contributor

Hello @adam-shenker,

There were no connections
I ran the restore and there were 19 connections within 4 seconds (until it failed).
All of these connections had an "allow" tag on the action.

Hence, I do not think that the issue that we are having is directly related to either the ssh command or firewall.
Especially as one directory is being created and the other one isn't.

I still agree with you that this issue is probably something local to our environment (tried on two servers which are on the same vlan) but I have no idea what....

So you observed 19 connections within 4 seconds.

I was wondering if you are facing issues related with MaxStartups in SSH?

The default installation of sshd in my environment brings the following value for that setting:

MaxStartups 10:30:100

From sshd_config docs:

       MaxStartups
               Specifies the maximum number of concurrent
               unauthenticated connections to the SSH daemon.
               Additional connections will be dropped until
               authentication succeeds or the LoginGraceTime expires for
               a connection.  The default is 10:30:100.

               Alternatively, random early drop can be enabled by
               specifying the three colon separated values
               start:rate:full (e.g. "10:30:60").  sshd(8) will refuse
               connection attempts with a probability of rate/100 (30%)
               if there are currently start (10) unauthenticated
               connections.  The probability increases linearly and all
               connection attempts are refused if the number of
               unauthenticated connections reaches full (60).

With the default value that I shared above, that means:

  • 10: number of unauthenticated connections before sshd starts dropping connections
  • 30: chance (in %) of dropping once sshd reaches 10 connections
  • 100: maximum number of connections at which sshd starts dropping every connection

I was able to reproduce the error message that you are seeing by configuring MaxStartups with problematic values then firing a barman restore command with several parallel jobs:

  • My config value:
MaxStartups 5:75:10
  • The barman restore execution:
$ barman restore postgres 20250106T145019 /var/lib/pgsql/testrestore --remote-ssh-command "ssh postgres@localhost" -j 10
Starting remote restore for server postgres using backup 20250106T145019
Destination directory: /var/lib/pgsql/testrestore
Remote command: ssh postgres@localhost
Copying the base backup.
ERROR: Failure copying base backup: data transfer failure
rsync error:
kex_exchange_identification: read: Connection reset by peer
Connection reset by ::1 port 22
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(228) [sender=3.2.3]

In my case the issue occurred during a rsync call, but the connection was dropped with a similar error message that can be seen in your logs.

The issue was gone after I tweaked the value of MaxStartups to handle the amount of SSH connections.

Could you please check if you are also facing issues because of MaxStartups configuration in sshd?

NB. The postgres backup is stored on a shared directory that can be seen by both servers.
Is there a way that on the "restore" server I can create a new barman install which accesses (and sees) the same backups but does not process the actual WAL files?
This way I could do an in place restore rather than through SSH and we can see if this helps to solve the issue.

Although there might be ways of setting up Barman like that by doing some hacking, I'd advise against that.

Ideally you should troubleshoot the issue with the SSH connections and use Barman the way it was designed to.

Let us know if playing with MaxStartups help fixing your issue.

Best regards,
Israel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants