
Special LaTeX characters in alt-text #804

Open
lstonys opened this issue Feb 25, 2025 · 17 comments
lstonys commented Feb 25, 2025

In alt text we need some encoding that lets us hide any character from TeX (like HTML, which has entities through which every character can be typed as a Unicode reference).
For example, what happens if I want to describe brace characters:

alt ={%
{ - small variant of curly brace....
{ - large variant ...
}%

I can hide it with \{, but in alt text the backslash is useless and needs to be stripped.
So I tried adding the common special characters directly, and I see that:

  • ~ is changed to space
  • % is gone (of course)
  • # is doubled

Example:

\DocumentMetadata{testphase=phase-III}
\documentclass{article}
\usepackage{graphicx}

\begin{document}
    \begin{figure}[h]
    \centering
    \includegraphics[alt={Alternative, `'@~#$^_{}%
text},width=\textwidth]{example-image}
    \end{figure}
some text
\end{document}

[Image: screenshot of the result]

u-fischer (Member) commented

Well, tilde and the superscript character are a bit special, but besides those:


\DocumentMetadata{testphase=phase-III}
\documentclass{article}
\usepackage{graphicx}

\begin{document}
\begin{figure}
    \centering
    \includegraphics[alt={Alternative, `'@\string~\#\$\string^\_\{\}\% text},width=\textwidth]{example-image}
\end{figure}
some text
\end{document}

[Image: screenshot of the result]

u-fischer transferred this issue from latex3/tagpdf on Feb 25, 2025

lstonys (Author) commented Feb 25, 2025

OK, I tried \detokenize, \string, and \, but didn't mix them. Could you update the docs, section 4 "Alternative text, ActualText and text-to-speech software", with some notes about special characters? Thanks a lot!


FrankMittelbach (Member) commented Feb 25, 2025

Arguably, standard commands like \textbraceleft etc. should work too in this place, but they don't as far as I can see.
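
For concreteness, the kind of input that arguably should work here but currently does not (a minimal sketch, with made-up wording):

\includegraphics[alt={A \textbraceleft\ is the small variant of
  the curly brace},width=\textwidth]{example-image}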


lstonys (Author) commented Feb 25, 2025

Some publishers now require adding alternative texts in their TeX files, and authors who don't care much about tags just write verbatim text in the alt={} field. Mostly their goal is to get a correct PDF view. They could escape characters, but we need clear instructions. Better still would be raw text input (for alt text), but we can't do that in a key=val system.

An alternative:

\defineAltText{ID}
some verbatim text
\enddefineAltText
\includegraphics[alttextid={ID}]{...}

where \defineAltText drops all catcodes and reads every character until \enddefineAltText; it's easy to guess what I'm pointing at :) Of course \defineAltText couldn't appear inside another macro.

Another idea: maybe the \defineAltText ID could be the image name (the same as the mandatory \includegraphics argument); then we wouldn't need to pass alttextid={ID}, and \includegraphics could load these alt texts automatically.
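
A rough sketch of how that interface could look (command names as proposed above; using an xparse verbatim argument instead of scanning for \enddefineAltText is a shortcut of mine):

\ExplSyntaxOn
\prop_new:N \g_alttext_prop

% store text under an ID; the "+v" argument is read with all
% catcodes disabled, delimited by a pair of identical characters
\NewDocumentCommand \defineAltText { m +v }
  { \prop_gput:Nnn \g_alttext_prop {#1} {#2} }

% retrieve the stored text by ID
\NewDocumentCommand \getAltText { m }
  { \prop_item:Nn \g_alttext_prop {#1} }
\ExplSyntaxOff

% usage: the specials survive literally (but, as noted, such a
% command cannot appear inside the argument of another macro)
\defineAltText{fig1}|small { and large { variants, % and # too|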

u-fischer (Member) commented

> Arguably, standard commands like \textbraceleft etc. should work too in this place, but they don't as far as I can see.

Hm, no. \text_purify:n leaves them in the stream. It works with \text_declare_purify_equivalent:Nn \textbraceleft {\{}, but it would be a bit of an overkill to do that for every such command. @josephwright, any thoughts about this?
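
So as a document-level workaround, one can declare the equivalents oneself (the \textbraceright line is the obvious companion):

\ExplSyntaxOn
\text_declare_purify_equivalent:Nn \textbraceleft  { \{ }
\text_declare_purify_equivalent:Nn \textbraceright { \} }
\ExplSyntaxOff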

FrankMittelbach (Member) commented

> Hm, no. \text_purify:n leaves them in the stream. It works with \text_declare_purify_equivalent:Nn \textbraceleft {\{}, but it would be a bit of an overkill to do that for every such command. @josephwright, any thoughts about this?

There are a few dozen such commands, but for what it is worth they are standard input methods, and if they appear in, say, a heading, they should be replaced by something suitable when the text is moved to the bookmark, for example. So I think there is some argument that purify should handle them. On the other hand, it is a noticeable overhead for a marginal use case.

If you could get all the special chars using \string, that would be a relatively easy way to input them, but unfortunately that isn't the case for % { # }, where you really need \% etc. So it is somewhat awkward in any case, and it would need documentation whatever is done about it (if anything).

josephwright (Member) commented

I can certainly adjust \text_purify:n to cover more chars; it would be sensible to collect a proper list. There are several (partial) lists of this form about, but one clear one from the team would likely be best.

A related issue for me is that currently \text_expand:n leaves the input as far as possible unchanged, leaving 'Unicode-ification' to \text_purify:n. My feeling is that it would be a lot easier if that were handled by \text_expand:n, as then all text functions would get as much as possible 'just chars'. But that is a change, in that at present \text_expand:n is described as similar to \protected@edef. Thoughts?
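
To make the split concrete (a sketch; behavior as I understand the current documentation):

\ExplSyntaxOn
% \text_expand:n acts like \protected@edef, so a robust text
% command such as \textbraceleft stays in the stream ...
\tl_set:Ne \l_tmpa_tl { \text_expand:n { \textbraceleft } }
\tl_show:N \l_tmpa_tl   % -> \textbraceleft

% ... while \text_purify:n is the step that is supposed to
% reduce everything to plain characters
\tl_set:Ne \l_tmpb_tl { \text_purify:n { \textbraceleft } }
\tl_show:N \l_tmpb_tl
\ExplSyntaxOff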

FrankMittelbach (Member) commented

TLC3 pages I-768 to I-776 document the encoding-specific commands.

It's a mouthful already, plus packages (e.g. babel) might add more, so you would also need an interface for adding to whatever list is handled automatically.

Redefining all of them is not feasible. Instead, I think what should be done is to define a PU encoding, provide definitions in that encoding, and change to that encoding during purification. That uses the encoding-change approach to avoid doing all the redefinitions up front and only does them on the fly when the commands actually show up in the input (handwaving, may not work easily with the purify approach).

josephwright (Member) commented

@FrankMittelbach Surely that's not much worse than loading puenc.def, just a question of where you store the data? (The \text_... functions all work in expansion contexts, so we can't read data as-and-when.)

FrankMittelbach (Member) commented

Much worse: puenc.def is loaded once, but the redefinitions for the encoding-specific commands would happen each time purify is done. In contrast, a text-encoding-specific command checks whether it already has a definition suitable for the current encoding and, if not, changes to the one in the right encoding. But that happens only for the encoding-specific commands that are actually used (so only a few, if any, not a few hundred each time), and it is all done expandably.

josephwright (Member) commented

Note that special chars are already covered:

\text_declare_purify_equivalent:Nn \\ { }
\tl_map_inline:nn
  { \{ \} \# \$ \% \_ }
  { \text_declare_purify_equivalent:Ne #1 { \cs_to_str:N #1 } }
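
A quick test of those declarations (my sketch) gives the plain characters back:

\ExplSyntaxOn
\tl_set:Ne \l_tmpa_tl { \text_purify:n { \{ \# \$ \% \_ \} } }
\tl_show:N \l_tmpa_tl   % -> {#$%_} (catcode-12 characters)
\ExplSyntaxOff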


FrankMittelbach (Member) commented Feb 25, 2025

But this is what I mean: you do these mappings each time, even if none of the commands show up. Not a problem for 5, but a bit different for a few hundred.

In contrast, if we are in a PU encoding, the expansion of \{ would check, find that its current definition is for T1, and so change to \PU\{ and run that.


josephwright (Member) commented Feb 25, 2025

@FrankMittelbach I have a feeling we are talking at cross-purposes here! When you pass something like \textbraceleft to \text_purify:n, we see exactly that token and then just need to check whether there is an equivalent 'purification' definition; there is no encoding change. As I said, my personal preference would be to move this to \text_expand:n (along with things like composing accent commands), but the data loading doesn't worry me at all. See the latter part of l3text-purify.dtx for what we load ATM.

FrankMittelbach (Member) commented

Ah OK, so your purify does something similar to what the encoding-specific command mechanism does (which makes me wonder if it could have used that mechanism in the first place; probably not, as you have to get rid of other stuff).

josephwright (Member) commented

> But this is what I mean: you do these mappings each time, even if none of the commands show up. Not a problem for 5, but a bit different for a few hundred.
>
> In contrast, if we are in a PU encoding, the expansion of \{ would check, find that its current definition is for T1, and so change to \PU\{ and run that.

No, it's more-or-less the same as PU. There, we have hundreds of

\DeclareTextCommand ...

which store the data (once) and are then looked up in the hash table. For \text_purify:n we need the same idea but with lots of \text_declare_purify_equivalent:Nn, which again stores the data in control sequences, so we look up in the hash table. At the point of use, it's just a question of \cs_if_exist_use:cF with the right name.
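
Schematically, the point-of-use side amounts to something like this (a sketch, not the actual kernel code; the storage-name convention is invented):

\ExplSyntaxOn
% stored once, at load time: one "equivalent" per known command
\cs_new:cpn { purify~eq~\cs_to_str:N \textbraceleft } { \{ }

% at point of use: expandably use the stored equivalent if it
% exists, otherwise leave the token untouched
\cs_new:Npn \my_purify_token:N #1
  {
    \cs_if_exist_use:cF { purify~eq~\cs_to_str:N #1 }
      { \exp_not:N #1 }
  }
\ExplSyntaxOff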

josephwright (Member) commented

> Ah OK, so your purify does something similar to what the encoding-specific command mechanism does (which makes me wonder if it could have used that mechanism in the first place; probably not, as you have to get rid of other stuff).

Yes, very similar to the encoding mechanism, and of course even more similar to what hyperref already had for the same idea. But as this is a generic expl3 function, it works using just expl3's own data structures etc., it's expandable, and it does try to cover more stuff.


lstonys (Author) commented Feb 26, 2025

I don't think we need to cover all LaTeX input in alt text. TeX in \section{...} has to deal with macros because the same string also goes to the bookmarks. Alt text doesn't go to the typeset output, so we only need to deal with % { } #: simply do \detokenize{#1} and later replace these few patterns with a regex. tex4ht could do the same replacement.
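
A rough sketch of that (the details are my guess; the regex strips the escape backslashes that detokenizing leaves in front of the specials):

\ExplSyntaxOn
% detokenize the alt text, then turn e.g. "\{" into plain "{"
\tl_set:Ne \l_tmpa_tl { \tl_to_str:n { Braces~\{...\},~\%~and~\# } }
\regex_replace_all:nnN { \\([\{\}\%\#]) } { \1 } \l_tmpa_tl
\tl_show:N \l_tmpa_tl   % -> Braces {...}, % and #
\ExplSyntaxOff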
