Wrong text encoding on lists.openstreetmap.org sites #1179

Open
gy-mate opened this issue Nov 12, 2024 · 6 comments

gy-mate commented Nov 12, 2024

Mailman and/or Pipermail archives and/or displays some accented characters in the wrong encoding on https://lists.openstreetmap.org. For example, at https://lists.openstreetmap.org/pipermail/talk-hu/2021-October/016347.html, következő is displayed as kďż˝vetkezďż˝ on macOS in both Safari 18.1 and Firefox 132.0.2:

Screenshot 2024-11-12 at 21 51 45

Same issue with French text at https://lists.openstreetmap.org/listinfo:

Screenshot 2024-11-12 at 21 51 11

Monthly archives downloaded from https://lists.openstreetmap.org/pipermail/talk-hu/ (e.g. https://lists.openstreetmap.org/pipermail/talk-hu/2024-October.txt.gz) also contain .txt files in ISO 8859-1 encoding, which macOS apps like BBEdit or TextEdit do not detect automatically.
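As a workaround on the reader's side, a downloaded archive can be decoded with an explicitly named encoding instead of relying on an editor's auto-detection. The following is a minimal sketch, not anything the list server provides: the function names are made up for illustration, and the default of ISO 8859-1 is only the encoding reported above, so a given archive may need a different value.

```python
import gzip


def read_archive_text(gz_bytes: bytes, encoding: str = "iso-8859-1") -> str:
    """Decompress a gzipped archive and decode it with an explicit encoding.

    The encoding is an assumption supplied by the caller; nothing here
    detects it. errors="replace" only matters for encodings that can reject
    bytes (ISO 8859-1 accepts every byte value).
    """
    raw = gzip.decompress(gz_bytes)
    return raw.decode(encoding, errors="replace")


def to_utf8(gz_bytes: bytes, encoding: str = "iso-8859-1") -> bytes:
    """Return the archive contents re-encoded as UTF-8 bytes."""
    return read_archive_text(gz_bytes, encoding).encode("utf-8")


if __name__ == "__main__":
    # Stand-in for a downloaded .txt.gz file.
    sample = gzip.compress("Sénégal".encode("iso-8859-1"))
    print(read_archive_text(sample))
```

Writing the result of `to_utf8` to a new file gives editors a UTF-8 file, which they generally detect without help.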

@tomhughes (Member)

It's probably impossible to fix this completely, as we have no control over the encoding of individual messages in the archive, and the archiving is way too dumb to do any transcoding. Basically each list has a language set, each language has an associated encoding, and the archiver will assume that everything to do with that list is in that encoding.

It gets even worse with listinfo because that is merging data from different lists that may use different encodings.
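The effect of that per-list assumption can be sketched in a few lines. This is an illustration only: the two lookup tables and the function names are hypothetical and are not read from any real Mailman configuration. The point is that a message is decoded with whatever charset its list's language implies, regardless of the encoding the sender actually used.

```python
# Hypothetical tables for illustration; not taken from a real Mailman setup.
LIST_LANGUAGE = {"talk-hu": "hu", "talk-sn": "fr"}
LANGUAGE_CHARSET = {"hu": "iso-8859-2", "fr": "iso-8859-1"}


def assumed_charset(list_name: str) -> str:
    """Charset implied by the list's language (list -> language -> charset)."""
    return LANGUAGE_CHARSET[LIST_LANGUAGE[list_name]]


def decode_with(raw: bytes, charset: str) -> str:
    """Decode bytes with a given charset, substituting U+FFFD on errors."""
    return raw.decode(charset, errors="replace")


def render(message_bytes: bytes, list_name: str) -> str:
    """Decode a message with the list's charset, ignoring its real encoding."""
    return decode_with(message_bytes, assumed_charset(list_name))


if __name__ == "__main__":
    word = "következő"
    # Message really is in the list's charset: renders correctly.
    print(render(word.encode("iso-8859-2"), "talk-hu"))
    # Message was sent as UTF-8: the same decoding step yields mojibake.
    print(repr(render(word.encode("utf-8"), "talk-hu")))
    # Legacy bytes interpreted as UTF-8: replacement characters appear.
    print(decode_with(word.encode("iso-8859-2"), "utf-8"))
```

A page that merges text from several lists has the same problem in a stronger form, because only one charset can be declared for the whole page.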

Some of the list descriptions are definitely wrong, though, as they contain double-encoded characters or Unicode replacement characters. I've fixed a few of those where I can figure out what they should be, but without speaking the relevant languages I can't always work it out.
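The two kinds of damage behave differently when it comes to repair. A minimal sketch, assuming the double encoding came from UTF-8 bytes being read as ISO 8859-1 (the helper names and that default charset are assumptions for illustration): double-encoded text can often be reversed mechanically, while a replacement character marks information that is already lost and has to be restored by someone who knows the word.

```python
def fix_double_encoding(text: str, wrong_charset: str = "iso-8859-1") -> str:
    """Undo one round of 'UTF-8 bytes were decoded as wrong_charset'.

    If the text does not round-trip (it was probably not double-encoded
    this way), it is returned unchanged.
    """
    try:
        return text.encode(wrong_charset).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def has_replacement_chars(text: str) -> bool:
    """True if the text contains U+FFFD, i.e. the original bytes are gone."""
    return "\ufffd" in text


if __name__ == "__main__":
    # Simulate the damage: UTF-8 bytes misread as ISO 8859-1.
    damaged = "Sénégal".encode("utf-8").decode("iso-8859-1")
    print(damaged, "->", fix_double_encoding(damaged))
    print(has_replacement_chars("S\ufffdn\ufffdgal"))
```

Text that is already correct, such as `"Sénégal"`, fails the round trip and is returned as-is, so the helper is safe to run over descriptions that may or may not be damaged.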


gy-mate commented Nov 15, 2024

> I've fixed a few of those where I can figure out what they should be

@tomhughes Thank you very much!

> but without speaking the relevant languages I can't always work it out.

I think talk-latam (French) is missing an é character but that's all I know.


gy-mate commented Nov 15, 2024

> the archiving is way too dumb to do any transcoding.

How could this be improved?

@tomhughes (Member)

> I think talk-latam (French) is missing an é character but that's all I know.

I doubt it will be French. Spanish is more likely...

> > the archiving is way too dumb to do any transcoding.
>
> How could this be improved?

Realistically, it can't. I mean, obviously moving to Mailman 3 might help but it's a huge amount of work and the mailing lists are gradually dying out anyway.

I suspect that in practice forcing UTF-8 for everything would probably improve things a lot, but I haven't had time yet to figure out how to do that.
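How to force UTF-8 in Mailman itself is left open here. As a generic illustration of the idea only, and not a Mailman setting or API, one common heuristic is to try UTF-8 first and fall back to a legacy charset when the bytes are not valid UTF-8. The function name and the fallback default below are assumptions.

```python
def to_unicode(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode bytes as UTF-8 if they are valid UTF-8, else use the fallback.

    Non-ASCII text in a legacy single-byte charset is rarely valid UTF-8
    by accident, so a strict UTF-8 attempt is a reasonable first guess.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    word = "következő"
    print(to_unicode(word.encode("utf-8")))
    print(to_unicode(word.encode("iso-8859-2"), fallback="iso-8859-2"))
```

The fallback still has to be chosen per list or per message, so this reduces the problem rather than removing it.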


gy-mate commented Nov 16, 2024

> I doubt it will be French. Spanish is more likely...

@tomhughes My bad, I mixed it up with talk-sn. Its description should be Sénégal (French), I think.


gy-mate commented Nov 16, 2024

> moving to Mailman 3 might help but it's a huge amount of work and the mailing lists are gradually dying out anyway.

Yeah, you're right. This is a low-priority task then.
Should I leave the issue open anyway?
