HTML Transform: Erroneous HTML Table Parsing #582

Open
jetlime opened this issue Jan 8, 2024 · 1 comment
jetlime commented Jan 8, 2024

Bug Report 🐛

Whenever an HTML table is defined with a caption, the transformation to Markdown yields an invalid Markdown table.

Expected Behavior

The following HTML table,

<table>
<caption>Average monthly active recipients of the service, in the EU region over prior 6 months (est.)
</caption>
<tbody><tr>
<th>
</th>
<th>Aug. 2022 - Jan. 2023
</th>
<th>Feb. 2023 - July 2023
</th></tr>
<tr>
<td>Wikibooks
</td>
<td>6,919,000
</td>
<td>1,611,000
</td></tr>
<tr>
<td>Wikidata
</td>
<td>1,056,000
</td>
<td>1,051,000
</td></tr>
<tr>
<td>Wikimedia Commons
</td>
<td>2,845,000
</td>
<td>3,272,000
</td></tr>
<tr>
<td>Wikinews
</td>
<td>6,283,000
</td>
<td>1,035,000
</td></tr>
<tr>
<td>Wikipedia
</td>
<td><b>151,556,000</b>
</td>
<td><b>151,088,000</b>
</td></tr>
<tr>
<td>Wikiquote
</td>
<td>6,811,000
</td>
<td>1,548,000
</td></tr>
<tr>
<td>Wikisource
</td>
<td>7,106,000
</td>
<td>1,845,000
</td></tr>
<tr>
<td>Wikispecies
</td>
<td>29,000
</td>
<td>37,000
</td></tr>
<tr>
<td>Wikiversity
</td>
<td>6,360,000
</td>
<td>1,082,000
</td></tr>
<tr>
<td>Wikivoyage
</td>
<td>616,000
</td>
<td>632,000
</td></tr>
<tr>
<td>Wiktionary
</td>
<td>8,955,000
</td>
<td>8,425,000
</td></tr>
<tr>
<td><i><span style="color: gray; white-space: pre-wrap">Est. devices per person</span></i>
</td>
<td>2.4<sup id="cite_ref-Cisco_1-0" class="reference"><a href="#cite_note-Cisco-1">&#91;1&#93;</a></sup>
</td>
<td>2.4<sup id="cite_ref-Cisco_1-1" class="reference"><a href="#cite_note-Cisco-1">&#91;1&#93;</a></sup>
</td></tr></tbody></table>

should be transformed into the following valid Markdown,

|     |     |     |
| --- | --- | --- |  
Average monthly active recipients of the service, in the EU region over prior 6 months (est.)
|     | Aug. 2022 - Jan. 2023 | Feb. 2023 - July 2023 |
| Wikibooks | 6,919,000 | 1,611,000 |
| Wikidata | 1,056,000 | 1,051,000 |
| Wikimedia Commons | 2,845,000 | 3,272,000 |
| Wikinews | 6,283,000 | 1,035,000 |
| Wikipedia | **151,556,000** | **151,088,000** |
| Wikiquote | 6,811,000 | 1,548,000 |
| Wikisource | 7,106,000 | 1,845,000 |
| Wikispecies | 29,000 | 37,000 |
| Wikiversity | 6,360,000 | 1,082,000 |
| Wikivoyage | 616,000 | 632,000 |
| Wiktionary | 8,955,000 | 8,425,000 |
|     | 2.4[\[1\]](#cite_note-Cisco-1) | 2.4[\[1\]](#cite_note-Cisco-1) |

which renders as a valid Markdown table:

|     |     |     |
| --- | --- | --- |
| Average monthly active recipients of the service, in the EU region over prior 6 months (est.) |     |     |
|     | Aug. 2022 - Jan. 2023 | Feb. 2023 - July 2023 |
| Wikibooks | 6,919,000 | 1,611,000 |
| Wikidata | 1,056,000 | 1,051,000 |
| Wikimedia Commons | 2,845,000 | 3,272,000 |
| Wikinews | 6,283,000 | 1,035,000 |
| Wikipedia | 151,556,000 | 151,088,000 |
| Wikiquote | 6,811,000 | 1,548,000 |
| Wikisource | 7,106,000 | 1,845,000 |
| Wikispecies | 29,000 | 37,000 |
| Wikiversity | 6,360,000 | 1,082,000 |
| Wikivoyage | 616,000 | 632,000 |
| Wiktionary | 8,955,000 | 8,425,000 |
|     | 2.4[1] | 2.4[1] |

Current Behavior

Given the HTML table above, including its caption, the tool transforms the HTML into the following Markdown content,

Average monthly active recipients of the service, in the EU region over prior 6 months (est.)
|  | Aug. 2022 - Jan. 2023
 | Feb. 2023 - July 2023
 |
| Wikibooks
 | 6,919,000
 | 1,611,000
 |
| Wikidata
 | 1,056,000
 | 1,051,000
 |
| Wikimedia Commons
 | 2,845,000
 | 3,272,000
 |
| Wikinews
 | 6,283,000
 | 1,035,000
 |
| Wikipedia
 | 151,556,000 | 151,088,000 |
| Wikiquote
 | 6,811,000
 | 1,548,000
 |
| Wikisource
 | 7,106,000
 | 1,845,000
 |
| Wikispecies
 | 29,000
 | 37,000
 |
| Wikiversity
 | 6,360,000
 | 1,082,000
 |
| Wikivoyage
 | 616,000
 | 632,000
 |
| Wiktionary
 | 8,955,000
 | 8,425,000
 |
| Est. devices per person | 2.4[[1]](#cite_note-Cisco-1 "") | 2.4[[1]](#cite_note-Cisco-1 "") |

which is an invalid Markdown table and renders as:

Average monthly active recipients of the service, in the EU region over prior 6 months (est.)
| | Aug. 2022 - Jan. 2023
| Feb. 2023 - July 2023
|
| Wikibooks
| 6,919,000
| 1,611,000
|
| Wikidata
| 1,056,000
| 1,051,000
|
| Wikimedia Commons
| 2,845,000
| 3,272,000
|
| Wikinews
| 6,283,000
| 1,035,000
|
| Wikipedia
| 151,556,000 | 151,088,000 |
| Wikiquote
| 6,811,000
| 1,548,000
|
| Wikisource
| 7,106,000
| 1,845,000
|
| Wikispecies
| 29,000
| 37,000
|
| Wikiversity
| 6,360,000
| 1,082,000
|
| Wikivoyage
| 616,000
| 632,000
|
| Wiktionary
| 8,955,000
| 8,425,000
|
| Est. devices per person | 2.4[1] | 2.4[1] |
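
Comparing this output with the input suggests a pattern: a cell is split across lines when its text ends with a newline right before `</td>` or `</th>`, while cells that end with an inline element (the `<b>`, `<i>` and `<sup>` cells) stay on one line. Assuming that trailing whitespace is indeed the trigger, a possible pre-processing workaround is to strip it before running the transform. The sketch below is Python using only the standard library; `tidy_table_cells` is a hypothetical helper, not part of markdown-cli, and it has not been verified against the tool.

```python
import re

# Whitespace (including newlines) immediately before a closing cell tag.
_TRAILING_WS = re.compile(r"\s+(</t[dh]\s*>)", re.IGNORECASE)
# Whitespace immediately after an opening cell tag.
_LEADING_WS = re.compile(r"(<t[dh]\b[^>]*>)\s+", re.IGNORECASE)


def tidy_table_cells(html: str) -> str:
    """Strip leading and trailing whitespace inside <td>/<th> cells.

    Assumption: the stray newline before </td> or </th> is what breaks
    the generated Markdown rows. Other markup is left untouched.
    """
    html = _TRAILING_WS.sub(r"\1", html)
    return _LEADING_WS.sub(r"\1", html)


if __name__ == "__main__":
    sample = "<tr>\n<td>Wikibooks\n</td>\n<td>6,919,000\n</td></tr>"
    print(tidy_table_cells(sample))
```

This is a regex-based sketch, so it only suits simple, well-formed tables like the one in this report. It does nothing about the caption itself.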

Steps to Reproduce

  1. npm install -g @accordproject/markdown-cli
  2. wget https://foundation.wikimedia.org/wiki/Legal:EU_DSA_Userbase_Statistics --output-document test.html
  3. markus transform --from html --to markdown --input test.html --output test.md
  4. Open test.md in a Markdown renderer to see the invalid table.
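
Step 4 can also be approximated without a renderer. The following is a heuristic sketch in Python (standard library only), not a full GFM parser: it assumes every table row starts and ends with `|` on a single line, which is how the expected output above is written, and it looks for a delimiter row such as `| --- | --- |`. The function name `check_pipe_tables` is hypothetical.

```python
import re

# GFM-style delimiter row, e.g. "| --- | :---: | ---: |".
_DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def check_pipe_tables(markdown: str) -> list[str]:
    """Return a list of problems found in pipe tables; empty means none found."""
    lines = markdown.splitlines()
    problems = []

    has_rows = any(line.lstrip().startswith("|") for line in lines)
    has_delimiter = any("|" in line and _DELIMITER_ROW.match(line) for line in lines)
    if has_rows and not has_delimiter:
        problems.append("no delimiter row (| --- | --- |) found")

    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        # A row that opens with "|" but does not close with "|" was
        # probably split across several lines. A lone "|" is the tail
        # of such a split row and is not reported separately.
        if stripped.startswith("|") and stripped != "|" and not stripped.endswith("|"):
            problems.append(f"line {number}: row is split across lines: {stripped!r}")
    return problems


if __name__ == "__main__":
    with open("test.md", encoding="utf-8") as handle:
        for problem in check_pipe_tables(handle.read()):
            print(problem)
```

Run against the output shown under Current Behavior, this reports the missing delimiter row and each row that was broken across lines.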

Context (Environment)

Parsing HTML to Markdown for web archiving.

Desktop

dselman (Contributor) commented Jan 11, 2024

Thank you for the clear issue report. Very helpful.

It appears that there is no agreed way to represent captions for Markdown tables. Given that we use markdown-it for parsing Markdown, this would appear to be the best option:

https://github.com/martinring/markdown-it-table-captions
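
For illustration only, here is one possible shape for the conversion, sketched in Python with the standard library. It is not the markdown-cli implementation and does not use the syntax of the plugin linked above. It assumes a single, simple table whose first row is the header, emits the caption as a plain paragraph above the table (an arbitrary choice, given that no representation is agreed), and drops inline formatting such as bold and links. `TableToMarkdown` is a hypothetical name.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collect the caption and cell text of a single, simple HTML table."""

    def __init__(self):
        super().__init__()
        self.caption = ""
        self.rows = []        # list of rows, each a list of cell strings
        self._target = None   # "caption", "cell" or None

    def handle_starttag(self, tag, attrs):
        if tag == "caption":
            self._target = "caption"
        elif tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self.rows[-1].append("")
            self._target = "cell"

    def handle_endtag(self, tag):
        if tag in ("caption", "td", "th"):
            self._target = None

    def handle_data(self, data):
        if self._target == "caption":
            self.caption += data
        elif self._target == "cell":
            self.rows[-1][-1] += data

    def to_markdown(self):
        def clean(text):
            # Collapse newlines and runs of spaces so each row stays on
            # one line, and escape pipes so they do not end a cell.
            return " ".join(text.split()).replace("|", "\\|")

        header, *body = self.rows
        lines = [clean(self.caption), ""] if self.caption.strip() else []
        lines.append("| " + " | ".join(clean(c) for c in header) + " |")
        lines.append("| " + " | ".join("---" for _ in header) + " |")
        lines += ["| " + " | ".join(clean(c) for c in row) + " |" for row in body]
        return "\n".join(lines)
```

Usage would be `parser = TableToMarkdown(); parser.feed(html); parser.close(); print(parser.to_markdown())`. Keeping the caption outside the table means the result stays a valid pipe table in any GFM renderer, at the cost of losing the explicit link between caption and table.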
