Failure of developer.mozilla.org_en: Can not serialize <ParseError invalid> #159

Closed
benoit74 opened this issue Jan 26, 2024 · 2 comments

@benoit74
Collaborator

https://farm.openzim.org/pipeline/7dd39f31-3223-475a-b522-f2f914deaabc running with zimit2 / warc2zim2 has failed

Error is in warc2zim2:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/zimit/zimit.py", line 584, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.11/site-packages/zimit/zimit.py", line 485, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/main.py", line 89, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/converter.py", line 277, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/converter.py", line 490, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/items.py", line 72, in __init__
    ).rewrite(self.content)
      ^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/html.py", line 76, in rewrite
    self.feed(content)
  File "/usr/lib/python3.11/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/usr/lib/python3.11/html/parser.py", line 164, in goahead
    self.handle_data(rawdata[i:j])
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/html.py", line 119, in handle_data
    data = self.css_rewriter.rewrite(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/css.py", line 24, in rewrite
    output = serialize(rules)
             ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/tinycss2/serializer.py", line 16, in serialize
    _serialize_to(nodes, chunks.append)
  File "/app/zimit/lib/python3.11/site-packages/tinycss2/serializer.py", line 116, in _serialize_to
    node._serialize_to(write)
  File "/app/zimit/lib/python3.11/site-packages/tinycss2/ast.py", line 115, in _serialize_to
    raise TypeError('Can not serialize %r' % self)
TypeError: Can not serialize <ParseError invalid>
FATAL: exception not rethrown

This looks like a new problem that we have not encountered before. Unfortunately, as usual, the WARC files are lost and the scrape takes hours.
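The traceback above ends in a `TypeError` raised while serializing rewritten CSS, and that one exception aborts a scrape that takes hours. One possible mitigation, shown here only as a minimal sketch and not necessarily what the eventual fix does, is to fall back to the original CSS when serialization fails. The names `rewrite_css_safely` and `failing_rewriter` are hypothetical; `failing_rewriter` is a stand-in that imitates the failure and does not call tinycss2.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def rewrite_css_safely(css: str, rewriter: Callable[[str], str]) -> str:
    """Return rewriter(css), or the untouched css if serialization fails.

    Hypothetical helper: `rewriter` stands for any CSS rewriting callable.
    """
    try:
        return rewriter(css)
    except TypeError as exc:
        # The traceback shows the serializer raising TypeError for a node
        # it cannot serialize; keep the original CSS instead of crashing.
        logger.warning("CSS rewrite failed, keeping original: %s", exc)
        return css


def failing_rewriter(css: str) -> str:
    # Stand-in that imitates the error message from the traceback.
    raise TypeError("Can not serialize <ParseError invalid>")


broken_css = "a { color: red"
result = rewrite_css_safely(broken_css, failing_rewriter)
```

The trade-off is that URLs inside the affected CSS are left unrewritten, so the page may render imperfectly offline, but the conversion keeps going and the failure is logged for later investigation.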

@benoit74 benoit74 added the bug Something isn't working label Jan 26, 2024
@benoit74 benoit74 added this to the 2.0.0 milestone Jan 26, 2024
@benoit74
Collaborator Author

I confirm the issue is still there with the latest zimit2 / warc2zim2.

@benoit74
Collaborator Author

benoit74 commented Feb 9, 2024

Fixed by #175

@benoit74 benoit74 closed this as completed Feb 9, 2024
@kelson42 kelson42 reopened this Mar 19, 2024