Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database defined UNIQUE or FOREIGN KEY constraint causes rebase to fail #210

Open
leorudczenko opened this issue May 3, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@leorudczenko
Copy link

This is a follow-up from MerginMaps/mobile#2933. After further investigation into this bug, we have found that the bug exists in the rebase function from geodiff.

Summary

Our database/GeoPackage structure requires the use of the UNIQUE constraint on certain fields, specifically for primary/foreign keys. When trying to rebase a GeoPackage file with a UNIQUE constraint on any column, it fails with this error: pygeodiff.geodifflib.GeoDiffLibError: rebase. This error is then what leads to the strange conflict bug here: MerginMaps/mobile#2933.

To Reproduce

We have created a Python script to re-produce this error repeatedly, using a minimal table definition with 2 fields:

"""
geodiff_unique_rebase_bug.py
"""
import sqlite3
from pathlib import Path

import pygeodiff
import etlhelper as etl


def main() -> None:
    # Define GeoPackage filepaths
    gpkg_dir = Path(__file__).parent
    base = gpkg_dir / "base.gpkg"
    ours = gpkg_dir / "ours.gpkg"
    theirs = gpkg_dir / "theirs.gpkg"
    reset_gpkg_files(base, ours, theirs)
    create_gpkg_files(base, ours, theirs)
    geodiff_rebase(gpkg_dir, base, ours, theirs)


def reset_gpkg_files(base: Path, ours: Path, theirs: Path) -> None:
    """
    Delete the GeoPackage files if they already exist.
    """
    for gpkg in [base, ours, theirs]:
        gpkg.unlink(missing_ok=True)


def create_gpkg_files(base: Path, ours: Path, theirs: Path) -> None:
    """
    Create 3 GeoPackage files at the given filepaths.
    All 3 of the files will have the same table created in them.
    But, for the file 'ours' and 'theirs', they will each have 1 row inserted into the table too.
    These rows will abide by the database table constraints.
    """
    # UNIQUE constraint in this CREATE TABLE statement causes geodiff.rebase to fail
    # If you re-run this script, but remove the UNIQUE constraint, it will succeed
    create_table = """
        CREATE TABLE test (
            "fid" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
            "name" TEXT UNIQUE
        )
    """

    # Add empty table to all 3
    for gpkg in [base, ours, theirs]:
        with sqlite3.connect(gpkg) as conn:
            etl.execute(create_table, conn)

            # Add 1 row to ours and theirs
            if gpkg.stem == "ours":
                etl.load("test", conn, [{"name": "our_change"}])
            elif gpkg.stem == "theirs":
                etl.load("test", conn, [{"name": "their_change"}])


def geodiff_rebase(gpkg_dir: Path, base: Path, ours: Path, theirs: Path) -> None:
    """
    Call geodiff.rebase.
    When it works, you should end up with both of the inserted rows in the file 'ours'.
    When it fails, you should see 'pygeodiff.geodifflib.GeoDiffLibError: rebase' and no conflict file is created.
    """
    geodiff = pygeodiff.GeoDiff()
    geodiff.rebase(str(base), str(theirs), str(ours), str(gpkg_dir / "conflict.gpkg"))
    print(f"Rebased files successfully into: {ours}")


if __name__ == "__main__":
    main()

Running the Script

To run the script, you need to install the following Python libraries:

  • pygeodiff
  • etlhelper

Run the script with:
python geodiff_unique_rebase_bug.py

Results

The script will fail at the rebase step, demonstrating the bug.

If you then remove the UNIQUE constraint from the test table definition, it will allow the script to run to completion. In this circumstance, the GeoPackage ours will end up with 2 rows as expected.

Software Versions

  • python version 3.9.18
  • pygeodiff version 2.0.2
@wonder-sk
Copy link
Contributor

Thanks for the detailed bug report @leorudczenko!

I have looked into this and the reason for things falling apart is this:

  • In the final stage of rebase, we have ours.gpkg (containing a single row fid=1, name=our_change) and we have a diff to be applied, in order to contain both changes from ours.gpkg and theirs.gpkg:
    • INSERT fid=2, name=our_change
    • UPDATE fid=1, name=our_change -> name=their_change
  • The final rebase diff looks correct and it is according to the rules - the "our" changes are applied on top of "their" changes.
  • The real problem is that while we are applying the difference, the UNIQUE constraint temporarily gets violated with the first INSERT, and thus the whole rebase fails (even though at the end of the whole operation the unique constraint would fine again)

Now how to solve this issue: Ideally we would want to tell sqlite to temporarily disable constraints while we're doing updates (like what postgres has with SET CONSTRAINTS ALL DEFERRED), but this does not seem to be supported (only foreign keys can be deferred, and only "check" constraints can be disabled).

I can see two ways out:

  • when applying diffs, if a INSERT/UPDATE/DELETE fails due to a constraint, try to continue, accumulate a buffer with pending statements, and try to apply them at the end (this may need multiple rounds)
  • drop constraints before applying the rebased diff, and then add them again at the end - this seems a bit too brute force

The question is what to do when there would be simultaneous edits that actually break the unique constraint - when geodiff is used in context of Mergin Maps, this would mean that a user would end up being unable to sync their changes, due to the unique constraint error, and no way to fix it. This would be quite tricky to handle, to give mobile users enough information what has happened and offer some ways out - for the typical use case with field surveys, it is typically easier to let some data admin fix such issues in the data afterwards... I am keen to hear your thoughts!

@leorudczenko
Copy link
Author

Thanks so much for all the information and insightful discussion @wonder-sk! It's great to finally understand what's going on with this bug 😄

I personally think the first of your two potential solutions sounds like the more complex but ideal one, where you try to apply the difference and then create a buffer of changes on a failed constraint. It seems like it would better fit the current process used to apply differences, whilst avoiding potential issues with directly modifying the database schema.

I think when there are simultaneous edits that break a constraint, it would make sense then for it to create a conflict version of the database, like Mergin Maps already does. In an ideal world, the user would also be informed that this has been done, but it isn't required. Then again, Mergin Maps currently does not display any message when a conflict occurs. However, the important difference between these 2 scenarios would be the reason for the conflict, which would be very helpful to know as a developer. Especially if it could include the exact constraint that failed.

@volcan01010
Copy link

volcan01010 commented May 22, 2024

Hi @wonder-sk,

Thanks for picking this up. Although it is a bit awkward that we have constraints on the GeoPackage, I think that it is likely that others will take advantage of these features of SQLite, too.

Without being able to switch off constraints, applying the diff is a tricky problem. I have two comments:

  • This is probably a stupid question but, in the diff that you apply above can you apply the steps in reverse order? Then the constraint wouldn't fail.

  • Another possible workaround is to SELECT INTO a temporary table that doesn't have the constraints. Then you can make all the changes there. The hard part would be getting the results back into the original table. I don't know how to do that with SQLite commands, but you would probably have to UPDATE each row in turn then INSERT the new ones at the end. Below is a SQL implementation of the scenario that we described, which at least demonstrates the temporary table creation:

DROP TABLE IF EXISTS test;
DROP TABLE IF EXISTS geodiff_working;

CREATE TABLE test (
	"fid" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
	"name" TEXT UNIQUE
);

INSERT INTO test (fid, name) VALUES (1, 'our_change');

CREATE TEMPORARY TABLE geodiff_working AS SELECT * FROM test;

INSERT INTO geodiff_working (fid, name) VALUES (2, 'our_change');
UPDATE geodiff_working SET name = 'their_change' WHERE fid = 1;

-- Getting the data back isn't so easy...
-- We can't do this in real life because wiping the table will break foreign key
-- constraints and inserting all the records will fire insert triggers and
-- increment sequences.
DELETE FROM test;
INSERT INTO test SELECT *  FROM geodiff_working;

SELECT * FROM test;

Please let us know if there is anything that we can do to help. We can arrange a call if you would like to discuss options (and it would be good for us to learn how geodiff works internally) and we can do also test different versions of code.

@volcan01010
Copy link

volcan01010 commented May 27, 2024

I had another thought for a workaround. In this case you create a stash of failed values inserting a placeholder in the failed column, then update them into the correct place at the end.

  • Apply the diffs in order
  • If any fail for a unique constraint, take the table and column name from the error report
  • Replace the failed value in the row with a UUID and try the insert again. It should work. SQLite isn't strict about types, so you can get away with this, whatever type the underlying column is. Store the table, column, UUID and correct value in memory
  • Once you have made the full pass through with the inserts, go back through your stored list with the UUIDs and do "UPDATE table SET column = correct_value WHERE column= UUID"

@volcan01010
Copy link

We had another discussion about workarounds for this. One question that we had was why the UNIQUE constraint issue only affects us in the mobile app and not in the QGIS plugin.

@volcan01010
Copy link

It seems like GDAL and QGIS are both adding support for auto-detection of UNIQUE constraints in GeoPackages, so there must be other people than us that use them. They already detect foreign key relationships.

qgis/QGIS#57823

@leorudczenko
Copy link
Author

leorudczenko commented Jul 31, 2024

We tried a workaround for this issue by dropping all of the UNIQUE constraints from our table definitions, however this causes SQLite to throw an error regarding a foreign key mismatch. This happens because SQLite requires that parent key columns to have a UNIQUE constraint.

The script below will fail when geodiff tries to rebase because of the UNIQUE constraint. But if you remove the UNIQUE constraint from the test_parent.uuid column, then it will result in the following SQLite error when you try to insert data: foreign key mismatch - "test_child" referencing "test_parent"

"""
geodiff_fk_rebase_bug.py
"""
import sqlite3
from pathlib import Path

import pygeodiff
import etlhelper as etl


def main() -> None:
    # Define GeoPackage filepaths
    gpkg_dir = Path(__file__).parent
    base = gpkg_dir / "base.gpkg"
    ours = gpkg_dir / "ours.gpkg"
    theirs = gpkg_dir / "theirs.gpkg"
    reset_gpkg_files(base, ours, theirs)
    create_gpkg_files(base, ours, theirs)
    geodiff_rebase(gpkg_dir, base, ours, theirs)


def reset_gpkg_files(base: Path, ours: Path, theirs: Path) -> None:
    """
    Delete the GeoPackage files if they already exist.
    """
    for gpkg in [base, ours, theirs]:
        gpkg.unlink(missing_ok=True)


def create_gpkg_files(base: Path, ours: Path, theirs: Path) -> None:
    """
    Create 3 GeoPackage files at the given filepaths.
    All 3 of the files will have the same tables created in them.
    But, for the file 'ours' and 'theirs', they will each have 1 row inserted into each table too.
    These rows will abide by the database table constraints.
    """
    # We have to setup the "uuid" column with a UNIQUE constraint to comply with SQLite database constraints for foreign key columns
    # It is not possible to define foreign key relationships in SQLite without UNIQUE constraints on the parent key column
    # Typically, the PRIMARY KEY column is used as the parent key column, where the PRIMARY KEY by default also applies the UNIQUE constraint

    # If you remove the UNIQUE constraint from the "uuid" column in the table "test_parent", it will result in the following SQLite error:
    # foreign key mismatch - "test_child" referencing "test_parent"

    create_test_table = """
        CREATE TABLE test_parent (
            "fid" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,  -- This is the PRIMARY KEY for the GeoPackage
            "uuid" TEXT UNIQUE,  -- This is the FOREIGN KEY for parent/child relationships
            "name" TEXT
        )
    """
    create_test_child_table = """
        CREATE TABLE test_child (
            "fid" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
            "parent_fuid" TEXT,
            "name" TEXT,
            FOREIGN KEY("parent_fuid") REFERENCES "test_parent"("uuid")
        )
    """

    # Add both tables to all 3 databases
    for gpkg in [base, ours, theirs]:
        with sqlite3.connect(gpkg) as conn:
            etl.execute("PRAGMA foreign_keys = ON", conn)

            # Create the 2 tables
            for create_sql in [create_test_table, create_test_child_table]:
                etl.execute(create_sql, conn)

            # Add 1 row to ours and theirs
            if gpkg.stem == "ours":
                etl.load("test_parent", conn, [{"name": "our_change", "uuid": "{ac267d95-db70-4299-b019-c52599ca1e5f}"}])
                etl.load("test_child", conn, [{"parent_fuid": "{ac267d95-db70-4299-b019-c52599ca1e5f}", "name": "our_child"}])
            elif gpkg.stem == "theirs":
                etl.load("test_parent", conn, [{"name": "their_change", "uuid": "{abc43098-fe9b-4da0-b008-7518694466bb}"}])
                etl.load("test_child", conn, [{"parent_fuid": "{abc43098-fe9b-4da0-b008-7518694466bb}", "name": "their_child"}])


def geodiff_rebase(gpkg_dir: Path, base: Path, ours: Path, theirs: Path) -> None:
    """
    Call geodiff.rebase.
    When it works, you should end up with both of the inserted rows in the file 'ours'.
    When it fails, you should see 'pygeodiff.geodifflib.GeoDiffLibError: rebase' and no conflict file is created.
    """
    geodiff = pygeodiff.GeoDiff()
    geodiff.rebase(str(base), str(theirs), str(ours), str(gpkg_dir / "conflict.gpkg"))
    print(f"Rebased files successfully into: {ours}")


if __name__ == "__main__":
    main()

We require the foreign key relationships to be defined in the database because our tool depends on QGIS auto-detecting the relationships.

This issue may relate to these other issues concerning foreign key relationships:

@leorudczenko leorudczenko changed the title Database defined UNIQUE constraint causes rebase to fail Database defined UNIQUE or FOREIGN KEY constraint causes rebase to fail Jul 31, 2024
@wonder-sk
Copy link
Contributor

I have looked into this again and the case with foreign keys boil down again to the violation of the UNIQUE constraint on test_parent table. After all, foreign key enforcement can be disabled with PRAGMA foreign_keys=0;, but UNIQUE constraint cannot be disabled in SQLite, and one needs to go through a painful 12-step process which is going to be slow and probably prone to other issues.

Looking further into the sqlite3 session extension (which inspired geodiff implementation), it supports the approach with accumulation of failed insert/update/delete statements due to failed constraints, and re-applying them later - I am convinced now that if we want to solve these issues, we should take that route. I have done a quick&dirty proof of concept (https://gist.github.com/wonder-sk/60fa63a7b0e51e7276f5773dda6142a3) and verified this would help to fix the issue. A proper fix would need to be of course more elaborate and auto-tested...

Simply reversing the order of statements (as suggested in an earlier comment) is likely to be only a partial solution - sometimes it could work, but probably in more complex scenarios we would have failures again. Even with the solution of accumulating failed statements and retrying them, this may need to be done in multiple passes (as long as we were able to apply at least one failed statement).

By the way, as for why this issue only happens on mobile app, and not in QGIS, this is most likely simply a matter of luck, that relevant changes get written in different order, so the unique constraint does not fail.

@volcan01010
Copy link

Thank you for looking at this, @wonder-sk and for taking the time to make a proof-of-concept. Repeatedly looping through and retrying failed statements seems like a good approach then, particularly if that is the way that SQLite themselves does it.

That's interesting about the difference between mobile app and QGIS. I will keep an eye out for conflicts in other Mergin projects.

@volcan01010
Copy link

I just remembered that geodiff also works with PostGIS. We haven't tested to see if that issue also occurs there, but I would expect so as the underlying mechanism is the same. I suspect PostGIS users would be even more likely to make use of UNIQUE constraints and FOREIGN KEYs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants