bug/reading html file returns empty list #3708

Open
lwollenbergfuzzy opened this issue Oct 9, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@lwollenbergfuzzy

When reading HTML files with partition_html, we end up with an empty list of elements.

Here is a small example:

from unstructured.partition.html import partition_html
html_content="""
<!DOCTYPE html>
<html class="client-nojs" lang="de" dir="ltr">
<head>
</head>
<body>
  <div id="content" class="mw-body" role="main">
    <Seitenname>Bestellvorschläge weiterbearbeiten</Seitenname>
    <hr>
    <div class="content">
      <hauptteil_AB>
        <div class="rumpftabelle"></div>
        <table>
          <tbody>
            <tr>
              <th>Intern</th>
              <th>Feldwerte
              </th>
            </tr>
            <tr>
              <td>J</td>
              <td>Ja
              </td>
            </tr>
            <tr>
              <td>N</td>
              <td>Nein
              </td>
            </tr>
          </tbody>
        </table>
    </div>
    </hauptteil_AB>
    <fussteil_B></fussteil_B>
  </div>
  </div>
</body>
</html>
"""
elements = partition_html(text=html_content)
print(elements)
out[0]: []

We would expect something like

out[0]: [<unstructured.documents.elements.Title object>, <unstructured.documents.elements.Table object>]

We use version 0.15.13 of unstructured.

Through trial and error, we found that the problem seems to come from the custom tags <hauptteil_AB> and <fussteil_B>.
We appreciate any help on this issue.

lwollenbergfuzzy added the bug label Oct 9, 2024

@deku0818

The same problem.

@MiloMoerkerke

Same here

@PhorstenkampFuzzy

+1

@PhorstenkampFuzzy

Is there anything new here? Even a comment on how to preprocess something like this would help.

@ajainfuzzy

+1

@jjleng

jjleng commented Nov 18, 2024

Same problem.

I guess the cause is here:

It seems this is a feature, but it makes the HTML parser unable to parse custom HTML tags appropriately.
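
As a possible preprocessing workaround, here is a minimal sketch that renames unknown tags to div before the HTML is handed to partition_html. It assumes the custom tags really are what trips the parser, as suggested above; this has not been verified against unstructured 0.15.13. The helper name replace_custom_tags and the KNOWN_TAGS allow-list are hypothetical and not part of unstructured.

```python
import re

# Assumption: a hand-picked allow-list of standard HTML tags, not an exhaustive one.
# Extend it with any standard tags your documents use.
KNOWN_TAGS = {
    "html", "head", "body", "title", "meta", "link", "script", "style",
    "div", "span", "p", "a", "br", "hr", "img", "pre", "code", "blockquote",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "ol", "li", "dl", "dt", "dd",
    "table", "thead", "tbody", "tfoot", "tr", "th", "td", "caption",
    "b", "i", "u", "em", "strong", "small", "sub", "sup",
    "section", "article", "header", "footer", "nav", "main", "aside",
    "form", "input", "button", "label", "select", "option", "textarea",
}

# Matches opening, closing, and self-closing tags; "<!DOCTYPE ...>" and
# "<!-- ... -->" are skipped because the tag name must start with a letter.
_TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9_:.\-]*)([^<>]*)>")


def replace_custom_tags(html: str, replacement: str = "div") -> str:
    """Rename every tag not in KNOWN_TAGS to `replacement`, keeping attributes."""

    def _swap(match: re.Match) -> str:
        slash, name, rest = match.groups()
        if name.lower() in KNOWN_TAGS:
            return match.group(0)  # standard tag: leave untouched
        return f"<{slash}{replacement}{rest}>"

    return _TAG_RE.sub(_swap, html)
```

With the example from the report, the call would become partition_html(text=replace_custom_tags(html_content)). Being regex-based, this sketch does not handle attribute values that contain ">" and will also rewrite tag-like text inside scripts, so it is only a stopgap for simple documents.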

7 participants
@jjleng @deku0818 @MiloMoerkerke @PhorstenkampFuzzy @lwollenbergfuzzy @ajainfuzzy and others