pyth 0.7, with improved partial Python 3 port #44

Open
wants to merge 24 commits into
base: master
cc97895
Result of running python-modernize -w pyth6
prechelt Jun 13, 2015
600b7aa
I have repaired many of the bytes-vs-str issues in
prechelt Jun 28, 2015
caef7b7
corrected remaining mistakes in rtf15/reader
prechelt Jul 11, 2015
d369a36
examples/reading/rtf15.py can now generate reference test output
prechelt Jul 11, 2015
f226808
established end-to-end tests for Rtf15Reader
prechelt Jul 11, 2015
b9acf0e
some cleanup of rtf15/reader.py
prechelt Jul 12, 2015
e602974
documentation update
prechelt Jul 12, 2015
1cd7d46
encodings/symbol.py: py3-migrated and improved
prechelt Jul 18, 2015
6f0a3a8
completely reorganized the RTF-to-XHTML-based tests
prechelt Jul 20, 2015
a040390
uncommited files??
eukreign Aug 5, 2015
c4e54bb
python3 support fixes
eukreign Aug 5, 2015
1e13afa
Merge pull request #1 from damoti/pyth-py3
prechelt Sep 4, 2015
5fe8e35
late final touches, setup.py version set to 0.7.dev77
prechelt Jul 20, 2015
4f9b4f9
Merge branch 'pyth-py3' of https://github.com/prechelt/pyth into pyth…
prechelt Jun 3, 2016
96ffe96
fix PlaintextWriter encoding
randlet Aug 19, 2016
0f02c72
plaintext seems to work now
robertour Aug 1, 2017
10a38ad
Merge pull request #2 from robertour/pyth-py3
prechelt Sep 8, 2017
1c4cd24
done a small bit of further Python 3 porting work
prechelt Sep 8, 2017
85ee4ba
test_readrtf15.py extended and much improved
prechelt Sep 8, 2017
0f571e7
*** Version 0.7 ***
prechelt Sep 8, 2017
1806716
Version 0.7: README is now README.md
prechelt Sep 8, 2017
c94af6d
pyth3 0.7.0, uploaded to PyPI
prechelt Feb 20, 2019
5b8ba49
fix signature of handle_super in rtf parser
camilstaps Jul 16, 2020
5d7d57c
Merge pull request #4 from camilstaps/patch-1
prechelt Jul 16, 2020
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
*.rtf eol=crlf
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,3 +1,5 @@
 *~
+*.bak
 *.py[co]
-*.egg-info
+*.egg-info
+tests/currentoutput/
56 changes: 0 additions & 56 deletions README

This file was deleted.

107 changes: 107 additions & 0 deletions README.md
@@ -0,0 +1,107 @@
pyth3 - Python text markup and conversion
=========================================

Pyth is intended to make it easy to convert marked-up text between different common formats.
This is a (rather incomplete so far) port of pyth 0.6.0 to Python 3.

*Marked-up text* means text which has:

* Paragraphs
* Headings
* Bold, italic, and underlined text
* Hyperlinks
* Bullet lists
* Simple tables
* Very little else


Formats with (widely varying) degrees of support are:

* Plain text
* XHTML
* RTF (Rich Text Format)
* PDF (output only)


Design principles/goals
=======================

* Ignore unsupported information in input formats (e.g. page layout)
* Ignore font issues -- output in a single font.
* Ignore specific text sizes, but maintain italics, boldface, subscript/superscript
* Have no dependencies unless they are written in Python, and work
* Make it easy to add support for new formats, by using an architecture based on *plugins* and *adapters*.



Examples
========

See directory `examples`.
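
As a rough sketch of what the reading example does, assuming the `Rtf15Reader.read` and `XHTMLWriter.write(doc, pretty=True)` calls used in `examples/reading/rtf15.py` (the helper names `html_output_path` and `rtf_to_xhtml` are made up for this illustration and are not part of pyth):

```python
import os.path


def html_output_path(inputfile, outputdir):
    """Return outputdir/<input basename>.html (hypothetical helper)."""
    basename = os.path.basename(inputfile)
    return os.path.join(outputdir, os.path.splitext(basename)[0] + ".html")


def rtf_to_xhtml(inputfile):
    """Read an RTF file and return its XHTML rendering.

    pyth is imported lazily so that this module can be loaded
    even where pyth3 is not installed.
    """
    from pyth.plugins.rtf15.reader import Rtf15Reader
    from pyth.plugins.xhtml.writer import XHTMLWriter

    # The reader works on the raw bytes of the RTF file, hence "rb".
    with open(inputfile, "rb") as rtf:
        doc = Rtf15Reader.read(rtf)
        return XHTMLWriter.write(doc, pretty=True).read()
```

Something like `rtf_to_xhtml("tests/rtfs/sample.rtf")` would then be expected to return the XHTML for that file, provided pyth3 is installed and the file exists.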



Python 3 migration
==================

The code was originally written for Python 2.
It has been partially(!) upgraded to Python 3 compatibility (starting via 'modernize').
This does not mean it will actually work!

pyth.plugins.rtf15.reader has been debugged and now appears to work correctly.
pyth.plugins.xhtml.writer has been debugged and now appears to work correctly.
pyth.plugins.plaintext.writer has been debugged and now appears to work correctly.
Everything else is unknown (or definitely broken on Python 3: even many
of the tests fail).
See directory `py3migration` for a bit more detail.
(If you find something is broken on Python 2 that worked before, please
either fix it or simply stick to pyth version 0.6.0.)
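
The bytes-versus-text distinction is where a Python 2 to Python 3 port of a byte-oriented parser typically breaks. The following is a minimal Python 3 sketch of the idea, not code taken from pyth: `decode_hex_escapes` is a hypothetical function, and the `cp1252` default is an assumption made only for this example. It keeps the data as bytes while resolving RTF `\'hh` escapes and decodes to text only at the end.

```python
import re

# Matches an RTF hex escape such as \'e9 inside a byte string.
HEX_ESCAPE = re.compile(rb"\\'([0-9a-fA-F]{2})")


def decode_hex_escapes(raw, codepage="cp1252"):
    """Turn RTF bytes containing \\'hh escapes into text (illustrative only)."""
    if not isinstance(raw, bytes):
        raise TypeError("expected bytes, got %s" % type(raw).__name__)
    # Replace each escape by the single byte it denotes; still bytes here.
    unescaped = HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    # Decode to text only once the code page is known.
    return unescaped.decode(codepage)
```

Mixing the two types by accident (for instance, passing a `str` to this function) fails loudly on Python 3, which is the kind of error such a port has to chase down one call site at a time.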


Limitations
===========

pyth.plugins.rtf15.reader:
- bulleted or enumerated items will be returned
  as plain paragraphs (no indentation, no bullets).
- cannot cope with Symbol font correctly:
  - from MS Word: lower-coderange characters (Greek mostly) work
  - from MS Word: higher-coderange characters are missing, because
    Word encodes them in a horribly complicated manner not supported
    by pyth currently
  - from Wordpad: lower- and higher-coderange characters come out in
    the wrong encoding (ANSI, I think)
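
To illustrate what "lower-coderange characters work" refers to: in the Symbol font, ASCII letter codes select Greek letters. Below is a minimal sketch with a deliberately tiny, hand-written table. It is not the table from `encodings/symbol.py`, and `SYMBOL_SAMPLE` and `decode_symbol` are hypothetical names used only here.

```python
# A few Symbol-font code points and the Unicode characters they display as.
SYMBOL_SAMPLE = {
    0x61: "\u03b1",  # a -> alpha
    0x62: "\u03b2",  # b -> beta
    0x67: "\u03b3",  # g -> gamma
    0x64: "\u03b4",  # d -> delta
    0x70: "\u03c0",  # p -> pi
}


def decode_symbol(raw):
    """Map Symbol-font bytes to Unicode; unknown codes become U+FFFD."""
    # Iterating over bytes yields integers on Python 3.
    return "".join(SYMBOL_SAMPLE.get(byte, "\ufffd") for byte in raw)
```

With this table, `decode_symbol(b"abg")` yields the three letters alpha, beta, gamma; any code missing from the table shows up as the replacement character rather than being silently dropped.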

pyth.plugins.xhtml.writer:
- very limited functionality

pyth.plugins.plaintext.writer:
- very very limited functionality

Others:
- will not work on Python 3 without some porting love-and-care


Tests
=====

Don't try to run them all, it's frustrating.
`py.test -v test_readrtf15.py` is a good way to run the least frustrating
subset of them.
It is normal that most others will fail on Python 3.
`test_readrtf15.py` generates test cases dynamically based on
existing input files in `tests/rtfs` and
existing reference output files in `tests/rtf-as-html` and `tests/rtf-as-html`.
The empty or missing output files indicate where functionality is missing,
which nicely indicates possible places to jump in if you want to help.
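
The idea of deriving test cases from the files on disk can be sketched roughly as follows. This is an illustration under assumptions, not the code of `test_readrtf15.py`: the function name `collect_cases`, the `.html` reference extension, and the returned structure are all hypothetical.

```python
import os


def collect_cases(input_dir, reference_dir, ref_ext=".html"):
    """Pair each .rtf input with its reference output file.

    Returns (cases, gaps): cases are (input, reference) path pairs whose
    reference exists and is non-empty; gaps are inputs without one.
    """
    cases, gaps = [], []
    for name in sorted(os.listdir(input_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".rtf":
            continue  # ignore anything that is not an RTF input
        source = os.path.join(input_dir, name)
        reference = os.path.join(reference_dir, stem + ref_ext)
        if os.path.isfile(reference) and os.path.getsize(reference) > 0:
            cases.append((source, reference))
        else:
            gaps.append(source)
    return cases, gaps
```

The `cases` list could be handed to `pytest.mark.parametrize`, while `gaps` lists the inputs whose reference output is empty or missing, i.e. the places where functionality is still lacking.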


Dependencies
============

Only the two most important dependencies
are actually declared in `setup.py`, because the others are large, yet
are required only in pyth components not yet ported to Python 3.
The undeclared ones are:
- `reportlab` for PDFWriter
- `docutils` for LatexWriter
28 changes: 18 additions & 10 deletions examples/reading/rtf15.py
@@ -1,17 +1,25 @@
+from __future__ import absolute_import
+from __future__ import print_function
 import sys
 import os.path

 from pyth.plugins.rtf15.reader import Rtf15Reader
-from pyth.plugins.xhtml.writer import XHTMLWriter
+from pyth.plugins.xhtml.writer import XHTMLWriter, write_html_file
+
+numargs = len(sys.argv) - 1

-if len(sys.argv) > 1:
-    filename = sys.argv[1]
+if numargs not in [1, 2]:
+    print("usage: rtf15 inputfile.rtf [outputdir]")
 else:
-    filename = os.path.normpath(os.path.join(
-        os.path.dirname(__file__),
-        '../../tests/rtfs/sample.rtf'))
-
-doc = Rtf15Reader.read(open(filename, "rb"))
-
-print XHTMLWriter.write(doc, pretty=True).read()
+    inputfile = sys.argv[1]
+    doc = Rtf15Reader.read(open(inputfile, "rb"))
+    the_output = XHTMLWriter.write(doc, pretty=True).read()
+    if numargs == 1:
+        print("<!-- ##### RTF file" + inputfile + "as XHTML: -->")
+        print(the_output)
+    else:
+        basename = os.path.basename(inputfile)
+        outputdir = sys.argv[2]
+        outputfile = os.path.join(outputdir,
+                                  os.path.splitext(basename)[0] + ".html")
+        write_html_file(outputfile, the_output, print_msg=True)
4 changes: 3 additions & 1 deletion examples/reading/sampleWithImage.py
@@ -1,3 +1,5 @@
+from __future__ import absolute_import
+from __future__ import print_function
 from pyth.plugins.rtf15.reader import Rtf15Reader
 import sys

@@ -8,4 +10,4 @@

 doc = Rtf15Reader.read(open(filename, "rb"))

-print [x.content for x in doc.content]
+print([x.content for x in doc.content])
6 changes: 4 additions & 2 deletions examples/reading/xhtml.py
@@ -1,10 +1,12 @@
+from __future__ import absolute_import
+from __future__ import print_function
 # -*- coding: utf-8 -*-

 from pyth.plugins.xhtml.reader import XHTMLReader
 from pyth.plugins.xhtml.writer import XHTMLWriter
 import xhtml

-from cStringIO import StringIO
+from six import StringIO

 # A simple xhtml document with limited features.
 content = StringIO(r"""
@@ -49,4 +51,4 @@
 # Parse the document and then reconstruct it using the xhtml
 # writer.
 doc = XHTMLReader.read(content, css)
-print XHTMLWriter.write(doc).getvalue()
+print(XHTMLWriter.write(doc).getvalue())
4 changes: 3 additions & 1 deletion examples/writing/latex.py
@@ -1,6 +1,8 @@
+from __future__ import absolute_import
+from __future__ import print_function
 from pyth.plugins.latex.writer import LatexWriter
 import pythonDoc

 if __name__ == "__main__":
     doc = pythonDoc.buildDoc()
-    print LatexWriter.write(doc).getvalue()
+    print(LatexWriter.write(doc).getvalue())
1 change: 1 addition & 0 deletions examples/writing/pdf.py
@@ -1,3 +1,4 @@
+from __future__ import absolute_import
 # -*- coding: utf-8 -*-

 from pyth.plugins.rtf15.reader import Rtf15Reader
4 changes: 3 additions & 1 deletion examples/writing/plaintext.py
@@ -1,6 +1,8 @@
+from __future__ import absolute_import
+from __future__ import print_function
 from pyth.plugins.plaintext.writer import PlaintextWriter
 import pythonDoc

 doc = pythonDoc.buildDoc()

-print PlaintextWriter.write(doc).getvalue()
+print(PlaintextWriter.write(doc).getvalue())
4 changes: 3 additions & 1 deletion examples/writing/pythonDoc.py
@@ -1,6 +1,8 @@
+from __future__ import absolute_import
 # -*- coding: utf-8 -*-

 from pyth.plugins.python.reader import *
+import six

 def buildDoc():
     return PythonReader.read((
@@ -9,7 +11,7 @@ def buildDoc():
         u", hee hee hee! ", T(url=u'http://www.google.com') [ u"This seems to work" ]
       ],
       L [
-        [unicode(word) for word in ("One", "Two", "Three", "Four")]
+        [six.text_type(word) for word in ("One", "Two", "Three", "Four")]
       ],
       L [
         u"Introduction",
4 changes: 3 additions & 1 deletion examples/writing/rst.py
@@ -1,6 +1,8 @@
+from __future__ import absolute_import
+from __future__ import print_function
 from pyth.plugins.rst.writer import RSTWriter
 import pythonDoc

 if __name__ == "__main__":
     doc = pythonDoc.buildDoc()
-    print RSTWriter.write(doc).getvalue()
+    print(RSTWriter.write(doc).getvalue())
4 changes: 3 additions & 1 deletion examples/writing/rtf15.py
@@ -1,6 +1,8 @@
+from __future__ import absolute_import
+from __future__ import print_function
 from pyth.plugins.rtf15.writer import Rtf15Writer
 import pythonDoc

 doc = pythonDoc.buildDoc()

-print Rtf15Writer.write(doc).getvalue()
+print(Rtf15Writer.write(doc).getvalue())
4 changes: 3 additions & 1 deletion examples/writing/xhtml.py
@@ -1,3 +1,5 @@
+from __future__ import absolute_import
+from __future__ import print_function
 from pyth.plugins.xhtml.writer import XHTMLWriter
 import pythonDoc

@@ -17,4 +19,4 @@

 if __name__ == "__main__":
     doc = pythonDoc.buildDoc()
-    print docTemplate % XHTMLWriter.write(doc, pretty=True).getvalue()
+    print(docTemplate % XHTMLWriter.write(doc, pretty=True).getvalue())
39 changes: 39 additions & 0 deletions py3migration/STATUS.txt
@@ -0,0 +1,39 @@
as of 2015-06-28:

I have made the code nearly python2/python3 dual-compatible by calling
python-modernize.
Committed.

I have inserted requires = ['six'] into setup.py.
Not committed.

I have then repaired many of the bytes-vs-str issues in
pyth\plugins\rtf15\reader.py.
Ditto for pyth\plugins\xhtml\writer.py.
The former in particular was tricky because most strings have to be handled
as bytestrings -- but not all of them.
See http://pythonhosted.org/six/

These two now appear to work correctly for simple RTF files (without
images, tables, headers etc).
Complex files remain to be tested.

I have established a set of system-level test cases,
with various input files with the relevant RTF features
(paragraphs, line breaks, page breaks, various characters, fonts,
bold, italics, underline, hyperlink)
and coming from MS Word, Wordpad, OpenOffice.
They are handled correctly (as per comparison with how
MS Word 2013 shows them) with one exception.

TO DO:

- For tests/rtfs/zh-cn, the conversion produces some additional
text that is not shown in MS Word 2013.
The RTF is very complicated, so I am not sure whether this is
a defect or maybe the RTF is incorrect (but even then...).

- Introduce proper handling of itemized lists (well, that is a
new feature actually).

- Debug the other plugins.