Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python3 compatibility #319

Open
wants to merge 23 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
c4d3823
WIP: py3 compat with six
kba Jan 2, 2019
c62ebc1
require six
kba Jan 3, 2019
51d908c
circleci: update to 2.0
kba Jan 3, 2019
d5db950
simplify unpickling, py3 compatible, use GZipfile instead of gunzip c…
kba Jan 3, 2019
77ceabe
py3-forward compatible unicode decoding in normalize_text
kba Jan 3, 2019
90ac1ff
bump scipy, numpy, matplotlib, replace scipy.misc.imsave with imageio…
kba Jan 3, 2019
6bfa1ef
run-test-ci: Overrideable TRAIN_ITERATIONS
kba Jan 3, 2019
7aa8367
run-test-ci: smoke test to make sure ocropus-* --help works
kba Jan 3, 2019
949588d
sixify various py2-isms
kba Jan 3, 2019
6651063
don't smoke test lpred, ltrain (requires clstm)
kba Jan 3, 2019
3ff4cd6
:construction_worker: Test pythion 3.4 - 3.7
kba Jan 3, 2019
581e225
:green_heart: test with python3 in travis, with non-conda setup
kba Jan 3, 2019
bb76524
run-test-ci: test saving model
kba Jan 9, 2019
7db9cbc
Use gzip module instead of os.popoen callout for gzipping pickle
kba Jan 9, 2019
c773dd2
Merge remote-tracking branch 'upstream/master' into py3-again
kba Jan 14, 2019
ac5f5f7
invert requirements.txt / setup.py install_requires logic (sigh)
kba Jan 14, 2019
21b1aa8
make old-style classes inherit from object
kba Jan 14, 2019
89d982b
relax six requirement to be compatible with coremltools/kraken
kba Jan 15, 2019
332ff6f
Add a --version flag to all commands (except lpred/ltrain)
kba Jan 17, 2019
babbd42
:green_heart: travis: re-enable unit tests
kba Jan 21, 2019
3b424e8
:green_heart: run-unit with generic python, not python3
kba Jan 21, 2019
8d9b990
ocropus-gtedit: remove redundant unicode statements
kba Jan 21, 2019
d890047
change back defaults.py to default.py
kba Jan 21, 2019
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 49 additions & 0 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
version: 2

jobs:

build-python27: &job-template
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previously circleci.yml. What caused the renaming? Unfortunately git did not recognize this as a renaming...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Circle CI switched from 1.0 to 2.0 in August last year which is very different (.circle/config.yml instead of .circleci.yml, job/workflow based semantics, containers based directly on docker etc.)

docker:
- image: python:2.7.13
working_directory: ~/ocropy-build
steps:
- checkout
- restore_cache:
keys:
- default-model-{{ .Environment.CIRCLE_JOB }}
- run: '(cd models && wget -nc http://www.tmbdev.net/en-default.pyrnn.gz)'
- save_cache:
key: default-model-{{ .Environment.CIRCLE_JOB }}
paths:
- models
- run: pip install --user .
- run: ./run-test-ci

build-python34:
<<: *job-template
docker:
- image: circleci/python:3.5

build-python35:
<<: *job-template
docker:
- image: circleci/python:3.5

build-python36:
<<: *job-template
docker:
- image: circleci/python:3.6

build-python37:
<<: *job-template
docker:
- image: circleci/python:3.7

workflows:
version: 2
build:
jobs:
- build-python27
- build-python35
- build-python36
- build-python37
Comment on lines +46 to +49
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test

3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -15,3 +15,6 @@ build/
.~*.vue
doc/.ipynb_checkpoints/
htmlcov/
dist
*.egg-info
linegen/
46 changes: 10 additions & 36 deletions .travis.yml
Original file line number Diff line number Diff line change
@@ -1,49 +1,23 @@
language: python
python:
# This is not actually used. Because it would take an overly long time
# to build scipy we cannot use the virtual env of travis. Instead, we
# use miniconda.
- "2.7"
- "3.4"
- "3.5"
- "3.6"
- "3.7-dev"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this work now without this whole miniconda stuff? This makes it much easier...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It always worked without miniconda (see circle config and install instructions) but since @QuLogic went through the effort of setting it up with conda, we retained it in travis. I don't know enough about how to change the conda setup to test the various versions, so this seemed the simplest solution.

Copy link
Contributor

@QuLogic QuLogic Jan 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conda environment is created with python=$TRAVIS_PYTHON_VERSION, which comes from this key; you didn't really need to re-write everything to get it to work.

But originally, miniconda was only necessary because SciPy took forever to compile from source; I assume there are wheels now.

Also, 3.7 is GA, you should use it and not 3.7-dev which is a very old snapshot.


sudo: false

cache:
directories:
- $HOME/.cache/matplotlib
- $HOME/.cache/pip
- models

install:
# Install miniconda
# -----------------
- if [[ "$TRAVIS_PYTHON_VERSION" == 2* ]]; then
wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh;
else
wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
fi
- bash miniconda.sh -b -p $HOME/miniconda
- export PATH="$HOME/miniconda/bin:$PATH"

# Create the basic testing environment
# ------------------------------------
- conda config --set always_yes yes --set changeps1 no --set show_channel_urls yes
- conda update conda
- conda create -n test-environment python=$TRAVIS_PYTHON_VERSION
- source activate test-environment

# Customise the testing environment
# ---------------------------------
- conda install --file requirements.txt

# Conda debug
#-----------
- conda list

# Install ocropy
# --------------
- wget -O models/en-default.pyrnn.gz http://www.tmbdev.net/en-default.pyrnn.gz
- python setup.py install
- (cd models && wget -nc http://www.tmbdev.net/en-default.pyrnn.gz)
- pip install .

script:
- mkdir ../test_folder
- cd ../test_folder
- ../ocropy/tests/run-unit
- ../ocropy/run-test-ci
- ./tests/run-unit
- ./run-test-ci
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened with run-unit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix. Just an oversight.

2 changes: 1 addition & 1 deletion PACKAGES
Original file line number Diff line number Diff line change
@@ -1 +1 @@
python-scipy python-matplotlib python-lxml
python-scipy python-matplotlib python-lxml python-six
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ ocropy
======

[![Build Status](https://travis-ci.org/tmbdev/ocropy.svg?branch=master)](https://travis-ci.org/tmbdev/ocropy)
[![CircleCI](https://circleci.com/gh/UB-Mannheim/ocropy.png)](https://circleci.com/gh/UB-Mannheim/ocropy.png)
[![CircleCI](https://circleci.com/gh/OCR-D/ocropy.png)](https://circleci.com/gh/OCR-D/ocropy.png)
[![Docker Automated build](https://img.shields.io/docker/automated/ubma/ocropy.svg?maxAge=86400)](https://hub.docker.com/r/ubma/ocropy/) [![Docker Pulls](https://img.shields.io/docker/pulls/ubma/ocropy.svg?maxAge=86400)](https://hub.docker.com/r/ubma/ocropy/)
[![license](https://img.shields.io/github/license/tmbdev/ocropy.svg)](https://github.com/tmbdev/ocropy/blob/master/LICENSE)
[![Wiki](https://img.shields.io/badge/wiki-11%20pages-orange.svg)](https://github.com/tmbdev/ocropy/wiki)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is mentioned in the README and done in the Dockerfile:

python setup.py install

But then some warnings show up:

/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution opti
on: 'python_requires'

Should this be changed? As you already do it now with pip in Travis and CircleCI...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to install the new version on Python 2.7 with python setup.py install and it then tried to install mathplotlib 3.02 which is not compatible with the 2.x, which then lead to some error. The pip install --user . seems to work better...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python_requires line is a feature of relatively recent setuptools/pip versions (which is ironic of course but there you are). https://packaging.python.org/guides/distributing-packages-using-setuptools/#python-requires

The best way to install IMHO is to create virtualenv with a specific python version, source it, update pip and do pip install -e ..

Otherwise it's always possible that either the python version, setuptools, pip or system-wide dependency requirements can mess up things. I would avoid system-wide pip install and esp. pip install --user at all costs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using setup.py to install is deprecated; you should always use pip now.

Expand Down
17 changes: 0 additions & 17 deletions circle.yml

This file was deleted.

9 changes: 3 additions & 6 deletions ocrolib/__init__.py
Original file line number Diff line number Diff line change
@@ -1,15 +1,12 @@
__all__ = [
"common",
"hocr",
"lang",
"trace",
"default",
"lineest",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these changes break possibly for someone who is relying on importing from ocrolib? But even if yes, is that a realistic scenario? For whom?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, then the __all__ variable is just used for supporting from ocrolib import * which we currently don't use in this project. Should we support that then anyway? What principles to follow then? Or could probably also delete more in this file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might indeed break stuff, because default was renamed default. I would argue that from ocrolib import * really is bad practice. common.py has some ~70 functions and a few classes. If we want to make absolutely sure, nobody using the code as a library (which few people do I suppose) will experience breaks from wildcard imports, it would be better to list all those exports explicitly in __all__.

]

################################################################
### top level imports
################################################################

import default
from common import *
from default import traceback as trace
from .default import traceback as trace
from .common import *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what are these import statements needed at all?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. They are imported to be exported, so users can write

from ocrolib import allsplitext

instead of

from ocrolib.common import allsplitext

97 changes: 49 additions & 48 deletions ocrolib/chars.py
Original file line number Diff line number Diff line change
@@ -1,21 +1,22 @@
# -*- encoding: utf-8 -*-
from __future__ import unicode_literals

import re

# common character sets

digits = u"0123456789"
letters = u"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
symbols = ur"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
ascii = digits+letters+symbols
digits = "0123456789"
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
symbols = """!"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is ur treated here for symbols the same as lines with only u, but some lines below similar lines result into r?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understood that. u"bla" is not necessary with Python3 anymore (strings are unicode not bytes) and not in Python2 by from __future__ import unicode_literals. r"foo" is a string variant where escape sequences with backslash are not treated as such, useful in regexes, so as not to have to escape the backslash itself.

ascii = digits+letters+symbols # pylint: disable=redefined-builtin

xsymbols = u"""€¢£»«›‹÷©®†‡°∙•◦‣¶§÷¡¿▪▫"""
german = u"ÄäÖöÜüß"
french = u"ÀàÂâÆæÇçÉéÈèÊêËëÎîÏïÔôŒœÙùÛûÜüŸÿ"
turkish = u"ĞğŞşıſ"
greek = u"ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω"
portuguese = u"ÁÃÌÍÒÓÕÚáãìíòóõú"
telugu = u" ఁంఃఅఆఇఈఉఊఋఌఎఏఐఒఓఔకఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరఱలళవశషసహఽాిీుూృౄెేైొోౌ్ౘౙౠౡౢౣ౦౧౨౩౪౫౬౭౮౯"
xsymbols = """€¢£»«›‹÷©®†‡°∙•◦‣¶§÷¡¿▪▫"""
german = "ÄäÖöÜüß"
french = "ÀàÂâÆæÇçÉéÈèÊêËëÎîÏïÔôŒœÙùÛûÜüŸÿ"
turkish = "ĞğŞşıſ"
greek = "ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω"
portuguese = "ÁÃÌÍÒÓÕÚáãìíòóõú"
telugu = " ఁంఃఅఆఇఈఉఊఋఌఎఏఐఒఓఔకఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరఱలళవశషసహఽాిీుూృౄెేైొోౌ్ౘౙౠౡౢౣ౦౧౨౩౪౫౬౭౮౯"

default = ascii+xsymbols+german+french+portuguese

Expand All @@ -35,53 +36,53 @@
# there seems to be left vs right leaning, and top-heavy vs bottom-heavy

replacements = [
(u'[_~#]',u"~"), # OCR control characters
(u'"',u"''"), # typewriter double quote
(u"`",u"'"), # grave accent
(u'[“”]',u"''"), # fancy quotes
(u"´",u"'"), # acute accent
(u"[‘’]",u"'"), # left single quotation mark
(u"[“”]",u"''"), # right double quotation mark
(u"“",u"''"), # German quotes
(u"„",u",,"), # German quotes
(u"…",u"..."), # ellipsis
(u"′",u"'"), # prime
(u"″",u"''"), # double prime
(u"‴",u"'''"), # triple prime
(u"〃",u"''"), # ditto mark
(u"µ",u"μ"), # replace micro unit with greek character
(u"[–—]",u"-"), # variant length hyphens
(u"fl",u"fl"), # expand Unicode ligatures
(u"fi",u"fi"),
(u"ff",u"ff"),
(u"ffi",u"ffi"),
(u"ffl",u"ffl"),
('[_~#]', "~"), # OCR control characters
('"', "''"), # typewriter double quote
("`", "'"), # grave accent
('[“”]', "''"), # fancy quotes
("´", "'"), # acute accent
("[‘’]", "'"), # left single quotation mark
("[“”]", "''"), # right double quotation mark
("“", "''"), # German quotes
("„", ",,"), # German quotes
("…", "..."), # ellipsis
("′", "'"), # prime
("″", "''"), # double prime
("‴", "'''"), # triple prime
("〃", "''"), # ditto mark
("µ", "μ"), # replace micro unit with greek character
("[–—]", "-"), # variant length hyphens
("fl", "fl"), # expand Unicode ligatures
("fi", "fi"),
("ff", "ff"),
("ffi", "ffi"),
("ffl", "ffl"),
]


def requote(s):
s = unicode(s)
s = re.sub(ur"''",u'"',s)
s = re.sub("''", '"', s)
return s

def requote_fancy(s,germanic=0):
s = unicode(s)

def requote_fancy(s, germanic=0):
if germanic:
# germanic quoting style reverses the shapes
# straight double quotes
s = re.sub(ur"\s+''",u"”",s)
s = re.sub(u"''\s+",u"“",s)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this r?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above:

r"foo" is a string variant where escape sequences with backslash are not treated as such, useful in regexes, so as not to have to escape the backslash itself.

s = re.sub(ur"\s+,,",u"„",s)
s = re.sub(r"\s+''", "”", s)
s = re.sub(r"''\s+", "“", s)
s = re.sub(r"\s+,,", "„", s)
# straight single quotes
s = re.sub(ur"\s+'",u"’",s)
s = re.sub(ur"'\s+",u"‘",s)
s = re.sub(ur"\s+,",u"‚",s)
s = re.sub(r"\s+'", "’", s)
s = re.sub(r"'\s+", "‘", s)
s = re.sub(r"\s+,", "‚", s)
else:
# straight double quotes
s = re.sub(ur"\s+''",u"“",s)
s = re.sub(ur"''\s+",u"”",s)
s = re.sub(ur"\s+,,",u"„",s)
s = re.sub(r"\s+''", "“", s)
s = re.sub(r"''\s+", "”", s)
s = re.sub(r"\s+,,", "„", s)
# straight single quotes
s = re.sub(ur"\s+'",u"‘",s)
s = re.sub(ur"'\s+",u"’",s)
s = re.sub(ur"\s+,",u"‚",s)
s = re.sub(r"\s+'", "‘", s)
s = re.sub(r"'\s+", "’", s)
s = re.sub(r"\s+,", "‚", s)
return s
Loading