feat: create a document with a provided identifier #263

Draft
wants to merge 1 commit into
base: main

Conversation

@sroze sroze commented Dec 29, 2023

This allows creating documents with a given identifier. In particular, when integrating other systems with Automerge, it is very useful to be able to create documents from predictable identifiers (e.g. UUID v5) so we don't need to store any 'reference' within existing systems.
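To make "predictable identifiers" concrete, here is a minimal sketch of deriving a UUID v5 from an existing object ID using only Node's `crypto` module. The names `INTEGRATION_NAMESPACE` and `documentKeyForRow`, and the namespace string, are hypothetical and chosen for illustration; the sketch covers the derivation only and does not show how such a value would be turned into an Automerge document ID.

```typescript
import { createHash } from "node:crypto";

// RFC 4122 predefined namespace for DNS names.
const DNS_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8";

// Parse a canonical UUID string into its 16 raw bytes.
function uuidToBytes(uuid: string): Buffer {
  const hex = uuid.replace(/-/g, "");
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) {
    throw new Error(`not a UUID: ${uuid}`);
  }
  return Buffer.from(hex, "hex");
}

// UUID version 5: SHA-1 over the namespace bytes followed by the name,
// truncated to 16 bytes, with the version and variant bits overwritten.
function uuidV5(name: string, namespace: string): string {
  const digest = createHash("sha1")
    .update(uuidToBytes(namespace))
    .update(name, "utf8")
    .digest();
  const bytes = Buffer.from(digest.subarray(0, 16));
  bytes.writeUInt8((bytes.readUInt8(6) & 0x0f) | 0x50, 6); // version 5
  bytes.writeUInt8((bytes.readUInt8(8) & 0x3f) | 0x80, 8); // RFC 4122 variant
  const hex = bytes.toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-");
}

// Hypothetical: a per-integration namespace, derived once from a name you control.
const INTEGRATION_NAMESPACE = uuidV5("automerge.my-app.example", DNS_NAMESPACE);

// Hypothetical helper: the same Postgres object ID always yields the same UUID,
// so nothing about the mapping needs to be stored.
function documentKeyForRow(postgresObjectId: string): string {
  return uuidV5(postgresObjectId, INTEGRATION_NAMESPACE);
}
```

Calling `documentKeyForRow("42")` on any machine returns the same string, which is the property the PR description relies on.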

@sroze sroze changed the title from "feat: create a document with a stable identifier" to "feat: create a document with a provided identifier" Dec 29, 2023
@@ -31,6 +31,7 @@ export {
   isValidAutomergeUrl,
   parseAutomergeUrl,
   stringifyAutomergeUrl,
+  interpretAsDocumentId,
Author

Exposed so that applications can do the same as this piece of code in the tests and fetch such a document.

@pvh
Member

pvh commented Dec 29, 2023

Hi @sroze, thanks for the PR. I appreciate you taking the time and including a test. I've had to turn down variations on this patch a few times but I'd be happy to help you find some kind of a solution that works for you. Let me explain.

In early versions of Automerge-Repo, we actually required the user to provide a document ID, but this led to serious problems where users would create documents without shared ancestry but with the same document IDs. In the most naive case, the ID would just be a string like "my-document", but the same problem would exist with any externally sourced UUID.

The issue is that Automerge needs shared object history to merge. If you import the same document on two different nodes, it will have small differences (such as time stamps) that will result in the hashes in the change graph not matching and merges will result in full document conflicts or just be rejected outright. For an analogy, imagine two people importing the same codebase into git, or pasting the same text into a Google Doc. The files don't have a shared history even if they have very similar contents.
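The effect of those small differences can be shown with a toy model. This is a sketch only: `ToyChange`, `hashChange`, and `canApply` are hypothetical names and do not reflect Automerge's real change encoding. It only illustrates why two independent imports of identical content end up with unrelated hashes, and why a later change built on one import has nothing to attach to on the other node.

```typescript
import { createHash } from "node:crypto";

// Toy stand-in for a change record. NOT Automerge's real change format.
interface ToyChange {
  actor: string; // which node made the change
  timestamp: number; // when it was made (ms since epoch)
  parents: string[]; // hashes of the changes this one builds on
  payload: string; // the content carried by the change
}

// Hash every field of the change, so any difference changes the hash.
function hashChange(change: ToyChange): string {
  const encoded = JSON.stringify([
    change.actor,
    change.timestamp,
    change.parents,
    change.payload,
  ]);
  return createHash("sha256").update(encoded).digest("hex");
}

// A change can only be applied if all of its parents are already known.
function canApply(change: ToyChange, knownHashes: Set<string>): boolean {
  return change.parents.every((parent) => knownHashes.has(parent));
}

// The same content imported independently on two nodes.
const content = '{"title":"Hello"}';
const importOnA: ToyChange = { actor: "node-a", timestamp: 1703850000000, parents: [], payload: content };
const importOnB: ToyChange = { actor: "node-b", timestamp: 1703850000123, parents: [], payload: content };
const hashA = hashChange(importOnA);
const hashB = hashChange(importOnB);

// Identical payloads, but the root hashes differ.
const sameRoot = hashA === hashB;

// A follow-up edit on node A builds on A's import.
const editOnA: ToyChange = { actor: "node-a", timestamp: 1703850060000, parents: [hashA], payload: '{"title":"Hello!"}' };

// Node B only knows its own import, so the edit has no known parent there.
const applicableOnB = canApply(editOnA, new Set([hashB]));
const applicableOnA = canApply(editOnA, new Set([hashA]));
```

In this model `sameRoot` and `applicableOnB` are false while `applicableOnA` is true, which mirrors the git analogy: similar contents, no shared history.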

As a result, I moved to a model where we treat the documentIds as opaque system-generated identifiers. My feeling is that storing an extra ~16b per document (plus some key overhead, I suppose) is probably a good trade to avoid introducing corruption bugs in your synchronization system.

Deriving a document ID from a content hash at import might seem at first glance to improve the situation, but that would leave us in a position where everyone starting from the same documentID (who happens to share a sync-path) would wind up merging all the changes for their documents. We could add a salt, I suppose, which would help... but I want to be very careful whatever we do here and would want to think about both correctness/expectations and any potential security problems that could be introduced.
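A rough sketch of the trade-off described here, under the assumption that an ID is simply a truncated SHA-256 of the imported content. `contentDerivedId` is a hypothetical helper, not an Automerge-Repo function, and the truncation length is arbitrary.

```typescript
import { createHash, randomBytes } from "node:crypto";

// Hypothetical: derive an identifier from content, optionally mixed with a salt.
function contentDerivedId(content: string, salt: Buffer = Buffer.alloc(0)): string {
  return createHash("sha256")
    .update(salt)
    .update(content, "utf8")
    .digest("hex")
    .slice(0, 32);
}

const template = '{"todos":[]}';

// Without a salt, two unrelated users importing the same starting content
// land on the same ID and would end up syncing into one document.
const aliceUnsalted = contentDerivedId(template);
const bobUnsalted = contentDerivedId(template);

// With a per-import random salt the IDs diverge, but the ID can no longer be
// recomputed from the content alone: the salt has to be stored somewhere.
const aliceSalted = contentDerivedId(template, randomBytes(16));
const bobSalted = contentDerivedId(template, randomBytes(16));
```

Note that the salted variant gives up the very predictability that motivated deriving the ID in the first place, which is one reason to treat it with care.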

Anyway, sorry to be the bearer of bad news! One thing I have been considering is adding support for local-only "pet names" for documents. This would allow something like repo.openMy("rootDocument") (likely not with this name). This may not solve your problem completely but would it help?

It might also help to hear a little bit more about your integration story. Maybe there are other approaches we can take to solving the underlying problem.

@sroze
Author

sroze commented Dec 29, 2023

Thank you so much for the detailed answer. The problem completely makes sense. I fully appreciate the challenge of resolving the (real) conflict that arises from merging two documents with the same identifier but coming from different peers/histories. Technically, this will even be an issue at some point with system-generated identifiers (i.e. UUID collisions). Is there currently any mechanism in Automerge to report or handle impossible synchronisations (i.e. conflicts), or has the approach been to avoid them altogether, given the conflict-free nature of each document?

> My feeling is that storing an extra ~16b per document (plus some key overhead, I suppose) is probably a good trade to avoid introducing corruption bugs in your synchronization system.

From Automerge's perspective, I tend to agree, given that it moves complexity away from the library and onto its users. To illustrate the example I have at hand today, in case it was unclear: I have a system storing its state in a traditional Postgres database and I would like to use Automerge alongside it. In order to start storing things in Automerge, I need a document identifier. In other systems, I use a UUIDv5 of the object ID stored in Postgres, which gives me my 'new' identifier without having to manage any state whatsoever for the integration. With Automerge, I'd have to 1) create an empty document and 2) store the document ID in Postgres. It's completely feasible but more work for most users of the library (who, I assume, will always use Automerge alongside something else).
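The two-step alternative might look roughly like the sketch below. Everything here is a stand-in: `DocumentCreator` abstracts whatever creates the document and returns its system-generated ID, and `DocumentIdStore` is an in-memory substitute for a Postgres mapping table. Neither is an Automerge-Repo type.

```typescript
// Stand-in for the component that creates a document and returns its
// system-generated identifier (hypothetical shape).
interface DocumentCreator {
  create(): { documentId: string };
}

// In-memory stand-in for a Postgres table mapping object IDs to document IDs.
class DocumentIdStore {
  private readonly rows = new Map<string, string>();

  get(objectId: string): string | undefined {
    return this.rows.get(objectId);
  }

  // Behaves like "insert unless the key already exists" and returns
  // whichever document ID ends up stored for the object.
  insertIfAbsent(objectId: string, documentId: string): string {
    const existing = this.rows.get(objectId);
    if (existing !== undefined) {
      return existing;
    }
    this.rows.set(objectId, documentId);
    return documentId;
  }
}

// Step 1: create an empty document. Step 2: store its ID next to the object.
// Later calls for the same object reuse the stored ID.
function documentIdFor(
  objectId: string,
  store: DocumentIdStore,
  creator: DocumentCreator,
): string {
  const existing = store.get(objectId);
  if (existing !== undefined) {
    return existing;
  }
  const { documentId } = creator.create();
  return store.insertIfAbsent(objectId, documentId);
}
```

The extra state is a single column per object, which is the cost being weighed against deriving the identifier with UUIDv5.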

@sroze
Author

sroze commented Dec 30, 2023

@pvh as described in this comment, there is a need for sync servers to reject proposed changes for authorisation reasons anyway. Couldn't we see this problem as part of the same category? (i.e. sync servers might reject the creation of existing documents).

@pvh pvh marked this pull request as draft January 24, 2024 00:50
@pvh pvh force-pushed the main branch 2 times, most recently from e61f8e3 to d3d1a7d Compare July 26, 2024 20:13
2 participants