UTF-8 characters in the CSS sometimes get interpreted as another encoding instead #2039

Closed
Yorwba opened this issue Dec 16, 2019 · 5 comments
Labels
bug Issue that describes a problem with a feature that doesn't work as expected.
Milestone

Comments

@Yorwba
Contributor

Yorwba commented Dec 16, 2019

Reported by Pandaa on the wall

The character before "Advanced Search" should be ▸. The â–¸ visible in the screenshot is the result of decoding the UTF-8 bytes with one of several other encodings, such as Windows code page 1250.
I'm guessing that the browser is incorrectly using that encoding because it has not been declared explicitly by the point the layout.css file is being parsed.
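
As a quick illustration of that misdecoding, here is a minimal Python sketch. It assumes the intended character is U+25B8 (▸), which is what the bytes behind â–¸ decode to as UTF-8; the helper name `mojibake` is made up for this example.

```python
def mojibake(text: str, wrong_codec: str) -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


intended = "\u25b8"  # assumption: the intended character is ▸ (U+25B8)

# The character is three bytes long in UTF-8.
assert intended.encode("utf-8") == b"\xe2\x96\xb8"

# Windows code pages 1250 and 1252 both map those bytes to the same
# three characters, which matches what the screenshot shows.
assert mojibake(intended, "cp1250") == "â–¸"
assert mojibake(intended, "cp1252") == "â–¸"
```

Several single-byte code pages agree on these three byte values, so the garbled output alone cannot tell us which one the browser picked.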

W3.org says the following about declaring the character encoding of CSS:

You should always use UTF-8 as the character encoding of your style sheets and your HTML pages, and declare that encoding in your HTML. If you do that, there is no need to declare the encoding of your style sheet.
Other approaches are only needed if your style sheet contains non-ASCII characters and, for some reason, you can't rely on the encoding of the HTML and the associated style sheet to be the same. In this case you should use @charset or HTTP headers to declare the encoding. (If your HTML and CSS files use the same encoding, the latest versions of major browsers will apply the encoding of the HTML file to the CSS stylesheet.)

There is no @charset declaration in layout.css, but the CSS uses the same encoding as the HTML. Under the assumption that someone who knows how to open the developer tools also keeps their browser up to date, I guess that makes it likely the HTML encoding was not declared properly.

There is a <meta charset="utf-8"> tag in the HTML, but it comes after a long list of inline styles. W3.org says that the declaration should be within the first 1024 bytes:

Always declare the encoding of your document using a meta element with a charset attribute, or using the http-equiv and content attributes (called a pragma directive). The declaration should fit completely within the first 1024 bytes at the start of the file, so it's best to put it immediately after the opening head tag.

Maybe that is the root cause.
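
One way to test this hypothesis against a saved copy of the page is sketched below. This is a deliberately simplified check and not the prescan algorithm browsers actually implement; the function name and the regular expression are assumptions made for illustration.

```python
import re

# Simplified pattern: matches <meta charset="..."> as well as the
# http-equiv pragma form, and requires the closing ">" so that the
# whole tag has to fit inside the window being searched.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_-]+)[^>]*>",
    re.IGNORECASE,
)


def charset_declared_early(html: bytes, window: int = 1024):
    """Return the declared encoding if a meta declaration fits completely
    within the first `window` bytes of the document, else None."""
    match = META_CHARSET.search(html[:window])
    return match.group(1).decode("ascii").lower() if match else None


early = b'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="utf-8"/>'
late = b"<html><head><style>" + b"a{}" * 400 + b'</style><meta charset="utf-8">'

assert charset_declared_early(early) == "utf-8"
assert charset_declared_early(late) is None  # declaration starts after byte 1024
```

Note that this has to be run on the bytes the server sends, not on the DOM shown in the developer tools, because scripts can insert elements into the head at runtime.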

@trang added the bug label on Jan 1, 2020
@AndiPersti
Contributor

There is a <meta charset="utf-8"> tag in the HTML, but it comes after a long list of inline styles. W3.org says that the declaration should be within the first 1024 bytes:

But the meta tag comes directly after the opening head tag:

$ curl -s https://tatoeba.org/eng | head -n 5
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8"/>    <title>
        Tatoeba: Collection of sentences and translations    </title>

And the inline styles are generated by AngularJS Material at runtime. So I'm not sure that this is the problem.

I'll ask Pandaa whether s/he can give us more information (OS, browser, ...).

@AndiPersti
Contributor

I've checked layout.css with the W3C CSS validator and while it shows a lot of warnings which can be ignored, it also shows the CSS it has validated. The important part is:

.search-bar-extra a:before {
  content : '� ';
} 

So the validator also doesn't interpret the file as UTF-8.

It looks like we have to add @charset "utf-8"; at the beginning of that file to be sure that it is interpreted with the correct encoding.

(Adding the charset declaration to layout.css and testing it with the W3C validator gives the correct result.)
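
For context, here is a rough Python sketch of the order in which a stylesheet's encoding is determined, as I understand it from the CSS Syntax specification: byte order mark first, then the HTTP charset, then an @charset rule at the very start of the file, then the encoding of the referring document, and finally UTF-8. The function and parameter names are hypothetical, and real implementations also validate the encoding label, which this sketch skips.

```python
import codecs
import re

# The @charset rule only counts if it is the very first thing in the file,
# written exactly as: @charset "<label>";
AT_CHARSET = re.compile(rb'\A@charset "([\x00-\x21\x23-\x7f]*)";')


def css_encoding(css_bytes, http_charset=None, environment_encoding=None):
    """Simplified sketch of how a stylesheet's encoding is chosen."""
    # A UTF-8 byte order mark overrides every other signal.
    if css_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 1. charset parameter of the HTTP Content-Type header
    if http_charset:
        return http_charset.lower()
    # 2. @charset rule within the first 1024 bytes
    match = AT_CHARSET.match(css_bytes[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. encoding of the referring document, if there is one
    if environment_encoding:
        return environment_encoding.lower()
    # 4. default
    return "utf-8"


css = b'@charset "utf-8";\n.search-bar-extra a:before { content: "x"; }'
assert css_encoding(css, environment_encoding="windows-1250") == "utf-8"
assert css_encoding(b".a{}", environment_encoding="windows-1250") == "windows-1250"
```

Under this reading, an @charset rule protects the file whenever no HTTP charset is sent, regardless of what encoding the client assumes for the referring document.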

@AndiPersti
Contributor

I think I've found another reason why css files aren't recognized as UTF-8.

In the nginx configuration we set charset utf-8; which works for html and js files:

$ curl -s -i -I https://tatoeba.org/cache_js/layout.js | grep '^content-type'
content-type: application/javascript; charset=utf-8
$ curl -s -i -I https://tatoeba.org/index | grep '^content-type'
content-type: text/html; charset=UTF-8

But that doesn't work for css files:

$ curl -s -i -I https://tatoeba.org/cache_css/layout.css | grep '^content-type'
content-type: text/css

The reason is that nginx only adds the charset to responses whose MIME type is listed in charset_types. By default that list doesn't include text/css, so we need to add it ourselves.
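
The same check as the grep calls above can be scripted. The sketch below only parses header values that have already been fetched, using the standard library's email header parser; the function name is made up for this example.

```python
from email.message import Message


def charset_from_content_type(value: str):
    """Return the lower-cased charset parameter of a Content-Type value,
    or None if the header carries no charset."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


# Header values taken from the curl output above.
assert charset_from_content_type("application/javascript; charset=utf-8") == "utf-8"
assert charset_from_content_type("text/html; charset=UTF-8") == "utf-8"
assert charset_from_content_type("text/css") is None  # the problematic case
```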

I've made this change on my private server and sent an unmodified layout.css from it to the W3C CSS validator and the validator recognized the UTF-8 character.

So there are two possible solutions:

  1. Add @charset "utf-8"; to the beginning of layout.css
  2. Add charset_types text/html text/xml text/plain text/css application/javascript; to the nginx configuration. (We don't need application/rss+xml and text/vnd.wap.wml, do we?)

W3C recommends doing both:

However, we recommend that if you need to use an HTTP declaration to set the correct encoding, you also include an @charset declaration inside the style sheet. This will ensure that the encoding is still known if the style sheet is used locally or moved, eg. for testing or editing.

What do you think?

@jiru
Member

jiru commented Feb 29, 2020

I think putting @charset directives is the most robust way to solve the problem. However, I am not sure how to go about that because some of our CSS files, like layout.css, are generated using the asset_compress shell. We could insert a file containing only the charset directive at the beginning of the CSS file list in config/asset_compress.ini. Or maybe use the filters[] thing: create a filter that just inserts the charset directive.
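
To make the idea concrete, here is a minimal sketch of what such a build step has to guarantee, written in Python purely for illustration. It does not use asset_compress or its filter API, and the function name is hypothetical. The point is that the bundle must start with exactly one @charset rule, so any rule carried over from an individual source file is dropped first.

```python
import re

# Naive matcher for @charset rules inside the individual source files.
# It does not try to skip comments or strings, which a real filter should.
CHARSET_RULE = re.compile(r'@charset\s+"[^"]*"\s*;\s*', re.IGNORECASE)


def build_css_bundle(sources, encoding="utf-8"):
    """Concatenate CSS sources and put a single @charset rule first."""
    body = "\n".join(CHARSET_RULE.sub("", source) for source in sources)
    return f'@charset "{encoding}";\n{body}'


bundle = build_css_bundle([
    '@charset "utf-8";\nbody { margin: 0; }',
    ".search-bar-extra a:before { content: '\u25b8 '; }",
])
assert bundle.startswith('@charset "utf-8";')
assert bundle.count("@charset") == 1
```

The output still has to be written to disk as UTF-8 without a preceding comment or blank line, since the rule is only honoured when it is the very first thing in the file.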

@AndiPersti
Contributor

AndiPersti commented Feb 29, 2020

We could insert a file containing only the charset directive at the beginning of the CSS file list in config/asset_compress.ini.

That's what I had planned to suggest in a PR, but I was rather busy this week and haven't had time yet.

Or maybe use the filters[] thing: create a filter that just inserts the charset directive.

We could do that, but there is a recent proposal to add a generic filter which executes a given command which is exactly what we would need. But I'm not sure how soon it will be available.
