Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

username uniqueness interpretation issue between database and code #664

Open
osmaa opened this issue Nov 20, 2023 · 6 comments
Open

username uniqueness interpretation issue between database and code #664

osmaa opened this issue Nov 20, 2023 · 6 comments

Comments

@osmaa
Copy link
Contributor

osmaa commented Nov 20, 2023

Encountered an error, extract from Sentry below:

Identity.MultipleObjectsReturned get() returned more than one Identity -- it returned 2! 
users/models/identity.py in by_username_and_domain at line 412
407                        username__iexact=username,
408                        domain_id=domain,
409                        local=True,
410                    )
411                else:
412                    return cls.objects.get(
413                        username__iexact=username,
414                        domain_id=domain,
415                    )
416            except cls.DoesNotExist:
417                if fetch and not local:

https://github.com/jointakahe/takahe/blob/main/users/models/identity.py#L412

the object in question: an actor presented in the database twice, once with Capitalized username, once with lowercase. Hard to say why the remote server changed the username presentation, but that's out of our control. The search above uses iexact (case insensitive) search, while the username column uses PostgreSQL's default collation, which is case sensitive, so the database didn't enforce the uniqueness as the code expected.

I couldn't find a WebFinger spec mention of username case sensitivity, but Mastodon treats them as case insensitive - so the code appears correct, but the database schema is not.

Possible solutions:

  1. redefine the users_identity (username, domain_id) unique index as (lower(username), domain_id), and quite possibly also a unique lower(domain) index for users_domain (the primary key index won't guarantee case insensitivity) as well as changing the service_domain unique index to lower(service_domain)
  2. create a custom, "nondeterministic" collation in the database init and use that for the column(s) which are expected to be case-insensitive, alter all relevant columns
  3. enforce the same with insert triggers
@andrewgodwin
Copy link
Member

This is meant to be enforced at the code level, in that it's meant to try and look up usernames before insertion to see if they exist (that's what that code block supposedly does) but the transaction isolation clearly isn't good enough to actually pull that off.

It might be worth changing the unique indexes to enforce this just as a backstop against transaction race conditions, though those will also need to be fixed to actually work fully, or users will get error pages when this happens rather than polluting the database as they do now (which is an improvement, but not a total fix).

@osmaa
Copy link
Contributor Author

osmaa commented Nov 23, 2023

The transaction isolation is good enough to allow this situation, because two concurrent transactions can pass the code block without running into each others' records prior to commit -- and some of the fetch_actor related paths take fairly long (especially before the sync_pins part was isolated). Because the index doesn't enforce case-insensitive uniqueness, both records get commited.

Changing the indexing will enforce the uniqueness independent of transaction isolation, so I'd say that would be the correct action. Users shouldn't see error pages, because it will enforce just one record existing, and the iexact search will then find it. If errors occur, they'd be occuring in stator, not the the UI, and a TryAgain should resolve them as one of the commits should remain..

A clean migration is difficult, though. Many tables refer to users_identity, and if data loss was to be avoided, all those references would need to be updated to point to the record which will be left after deleting duplicates and adding the new unique constraints.

Or, if we simply accept that this bug will cause (recoverable) data loss for remote identities, cascade-delete all the offending records first, and then change the indices.

In any event, while I know how to do this in SQL, I don't know how to do it in Django migrations...

I deleted the offending record earlier, but I still appear to have one which wasn't caught before. The second case appears to be a bunch of identity records will NULL usernames and domain_ids - a separate issue, perhaps.

takahe=# select count(id) dupes,lower(username),lower(domain_id) from users_identity group by lower(username),lower(domain_id) having count(id) > 1;
 dupes |   lower   |  lower   
-------+-----------+----------
     2 | fediverse | lemmy.ml
   797 |           | 
(2 rows)

takahe=# select id,actor_uri,username,name,domain_id from users_identity where lower(username)='fediverse' and lower(domain_id)='lemmy.ml';
         id         |          actor_uri           | username  |   name    | domain_id 
--------------------+------------------------------+-----------+-----------+-----------
 197408691327720714 | https://lemmy.ml/u/Fediverse | Fediverse |           | lemmy.ml
 201262153399299226 | https://lemmy.ml/c/fediverse | fediverse | Fediverse | lemmy.ml
(2 rows)
create temporary table migrate_users_identity
as
with b as (
    select min(id) id, lower(username) username, lower(domain_id) domain_id from users_identity
    group by lower(username), lower(domain_id) having count(*) > 1 
)
select a.* from users_identity a join b on (
    lower(a.username) = b.username and lower(a.domain_id) = b.domain_id 
    and a.id > b.id);

delete from users_identity a using migrate_users_identity b
where a.id = b.id;

create unique index lowercase_actor on users_identity (lower(username),lower(domain_id));

drop table migrate_users_identity;

@andrewgodwin
Copy link
Member

Yeah, this is a pretty nasty problem, because if you take the "correct" activitypub view - that everyone is identified purely by their Actor IRI - then it's actually fine to have two things on the same domain with the same username, as those aren't really first-class concepts in AP (and mostly come from WebFinger and the username field on Actor and a guessed domain).

It might be worth re-examining the whole usage of this function - which really should just be to link mentions and power searches - and see if it's acceptable to have multiple results and just take the newest one.

@osmaa
Copy link
Contributor Author

osmaa commented Dec 1, 2023

Right.. So this becomes a question of is Takahe an ActivityPub implementation (thus keying on actor_uri), or a "fediverse" implementation where both WebFinger and ActivityPub are relevant and discovery is based on WebFinger. I'd argue that in the latter case, though the specs are unclear, we should fall back on convention and expect that username local-parts are case insensitive. In any case, I've been running my instance with the additional unique lowercase index on users_identity for the last couple of weeks with no noticeable adverse effects.

@osmaa
Copy link
Contributor Author

osmaa commented Dec 1, 2023

I pinged the fedi wisdom on this, but so far haven't gotten any closer to a clear spec. Apparently it is undefined whether acct scheme usernames should be treated case insensitive or not (but Mastodon is case insensitive), and failing any specs to the contrary, the default for URLs (which "all" of the fedi servers use for actor id IRIs) will have case-sensitive path parts. Lemmy is particularly stupid by not mapping a canonical form of id even through its WebFinger endpoint.

https://mas.to/@osma/111504184001363691

@andrewgodwin
Copy link
Member

Yup, welcome to the world of trying to get fediverse agreement on things!

At its heart, Takahē is much more of an ActivityPub server than specifically caring about WebFinger, so I would rather keep things centered around a canonical Actor IRI and then if we need username mapping, doing it case-insensitive as WebFinger would on Mastodon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants