Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replication.log remains empty (and without any error in firebird.log) until concurrent FB instance is running under different account and generates segments on its master. Significant delay required after stopping concurrent FB to allow first one to write in its replication log. #7896

Closed
pavel-zotov opened this issue Nov 30, 2023 · 3 comments · Fixed by #7919

Comments

@pavel-zotov
Copy link

Our QA set has several tests which deal with replication ( https://github.com/FirebirdSQL/firebird-qa/tree/master/tests/functional/replication ).
All of them except one make verification by querying replica DB and searching there data that have been added/modified in master DB.
This approach allows to avoid periodic check of content of replication log which leads to cumbersome and error-prone code.

But one test has to deal exactly with replication log (its title: "Invalid message in replication.log (and possibly crash in the case of synchronous replication) when the target DB has no its "replica" flag set")
UDL to this test: https://github.com/FirebirdSQL/firebird-qa/blob/master/tests/functional/replication/test_invalid_msg_if_target_db_has_no_replica_flag.py

Since the beginning of nov 2023 almost every run of this test fails, and this occurs only on Windows.
To be more concrete about first fails:
* FB 4.x -- since 4.0.4.3010, date of run: 04-nov-2024
* FB 5.x -- since 5.0.0.1261, date of run: 08-nov-2024
* FB 6.x -- since 6.0.0.124, date of run: 14-nov-2024

First alerts about problem that were encountered by QA batch (before any tests run, at the 'preparing phase' of enrite job) was received 30-oct-2023.
Several reports have been sent to fb-dev team, letters:
* 11-nov-2023 13:47
* 11-nov-2023 16:14
* 13-nov-2023 10:24
* 14-nov-2023 11:53
(subj: "QA-RUNDAILY. FB 4.x ...6.x, WINDOWS ONLY, repl.conf, verbose = true: 1) replication - works, but 2) replicatin.log remains empty...")

The essence of problem: for some reason replication log (of QA instance) remains empty and no any error message about it appear in the firebird.log.
What's even more surprising is that replication per se does work correct, i.e. all segments are generated and applied to replica normally.

Log of replication is created on every QA run in the 'clean directory', together with other files (from 'fresh FB snapshot') which are unpacked there.
There are no restrictions from OS to access to any of these files (firebird.exe is launched by fbguard, which, in turn, is running under user ('IBASE\pz') who has admin rights).

The search for reason of failure was complicated because these fails were not 'stable': at some dates all was fine.
Moreover, failures could not be reproduced when testing FB instance was launched as application from QA batch which also was run 'by hands', i.e. not from scheduler.

Futher investigations have shown that failures occur only when concurrent FB 4.x is running on this machine (being launched as service; of course, using another port, IpcName etc).
This consurrent FB is working under NT AUTHORITY\SYSTEM account and its checkbox "Allow service to interact with desktop" was UNchecked.

Further studies (to reproduce problem) have shown important thing: QA and consurrent instances (FB 4.x service) must run under DIFFERENT accounts.

Also, this (concurrent) FB instance must do some operations with its master and replica databases (some changes in master in order to start generation of segments and their transfer to replica).
Then this instance does backup/restore of master DB and online replica re-initialization (via 'nbackup -L' of just replaced master DB and its file copy onto replica).
Folder with operation logs was not cleaned (and maybe this could also affect on final outcome).

If after this QA batch is started (once again: only from scheduler!) then it will not be able to write into replication.log: it will remain empty and no errors will appear in the firebird.log.

If we stop after this 'concurrent' FB-4 and repeat test which is problematic for QA then - !NB! - its outcome depends on how long we waited before this!
Namely: we have to wait not less than ~2 minutes before attempt to run QA test if we want it to complete successfully.
Otherwise it still fails with the same 'silent outcome': no data appear in its own replication.log

Further consulting with Vlad about this problem was 17-nov-2023 (privately).
Special build with additional debug features (provided by Vlad; version: 5.0.0.1268_2-x64.7z) has shown in the firebird.log that problem somehow relates to inability to get mutex on replication.log:

IBASE-CLIENT Fri Nov 17 14:21:18 2023
Guardian starting: "H:\QA\rundaily-2023\unpacked-snapshot.tmp\firebird.exe"

IBASE-CLIENT Fri Nov 17 14:21:18 2023
Failed to create 'firebird_repl_mutex', error 5

IBASE-CLIENT Fri Nov 17 14:21:18 2023
Failed to lock log file 'H:\QA\rundaily-2023\unpacked-snapshot.tmp\replication.log', error 6

... lot of similar messages ...

IBASE-CLIENT Fri Nov 17 14:22:00 2023
Failed to lock log file 'H:\QA\rundaily-2023\unpacked-snapshot.tmp\replication.log', error 6

So, the problem exists when:
* QA-FB and concurrent-FB instances are running under different accounts;
* concurrent-FB must do somewhat that leads to appearance of segments and they must be transferred and applied on replica;
* pause after termination of concurrent-FB (if it ever occurs) before QA test repliaction start is too small, less than ~2 minutes.

@pavel-zotov
Copy link
Author

PS.
Also, there will be no any error messages in firebird log if replication.log file will be locked by any other external process, so firebird engine can't write into replication.log.

@hvlad
Copy link
Member

hvlad commented Dec 21, 2023

Re-open for backporting

@hvlad hvlad reopened this Dec 21, 2023
hvlad added a commit that referenced this issue Dec 21, 2023
@hvlad
Copy link
Member

hvlad commented Dec 21, 2023

Will be ported into v5 later

hvlad added a commit that referenced this issue Jan 12, 2024
@hvlad hvlad closed this as completed Jan 12, 2024
@mrotteveel mrotteveel changed the title replication.log remains empty (and without any error in firebird.log) until concurrent FB instance is running under different account and generates segments on its master. Significant delay required after stop concurrent FB it in order allow first one to write in its replication log. replication.log remains empty (and without any error in firebird.log) until concurrent FB instance is running under different account and generates segments on its master. Significant delay required after stopping concurrent FB to allow first one to write in its replication log. Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment