Add flag for tokens based on foreign lemmas #39

Merged
merged 7 commits into from
Apr 15, 2024
8 changes: 4 additions & 4 deletions .github/workflows/docs.yml
@@ -15,16 +15,16 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v3
-      - uses: actions/setup-python@v4
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
         with:
           python-version: '3.10'
 
       - run: pip install -e .
       - run: pip install pdoc
       - run: pdoc -o docs/ --logo https://github.com/polm/cutlet/raw/master/cutlet.png cutlet
 
-      - uses: actions/upload-pages-artifact@v1
+      - uses: actions/upload-pages-artifact@v3
         with:
           path: docs/
 
@@ -41,4 +41,4 @@ jobs:
       url: ${{ steps.deployment.outputs.page_url }}
     steps:
       - id: deployment
-        uses: actions/deploy-pages@v1
+        uses: actions/deploy-pages@v4
5 changes: 3 additions & 2 deletions .github/workflows/linux.yml
@@ -9,9 +9,9 @@ jobs:
       matrix:
         python-version: [3.7, 3.8, 3.9, "3.10", "3.11"]
     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+      uses: actions/setup-python@v5
       with:
         python-version: ${{ matrix.python-version }}
     - name: Install deps
@@ -33,4 +33,5 @@ jobs:
         pip install twine setuptools-scm build
         python -m build
         twine upload dist/cutlet*.tar.gz
+        twine upload dist/cutlet*.whl

4 changes: 2 additions & 2 deletions README.md
@@ -1,4 +1,4 @@
-[![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/polm/cutlet-demo/main/demo.py)
+[![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://polm-cutlet-demo-demo-0tur8v.streamlit.app/)
 [![Current PyPI packages](https://badge.fury.io/py/cutlet.svg)](https://pypi.org/project/cutlet/)
 
 # cutlet
@@ -7,7 +7,7 @@
 
 Cutlet is a tool to convert Japanese to romaji. Check out the [interactive demo][demo]! Also see the [docs](https://polm.github.io/cutlet/cutlet.html) and the [original blog post](https://www.dampfkraft.com/nlp/cutlet-python-romaji-converter.html).
 
-[demo]: https://share.streamlit.io/polm/cutlet-demo/main/demo.py
+[demo]: https://polm-cutlet-demo-demo-0tur8v.streamlit.app/
 
 **issueを英語で書く必要はありません。**
 
11 changes: 7 additions & 4 deletions cutlet/cli.py
@@ -2,12 +2,15 @@
 import fileinput
 import sys
 
-# Don't print an error on SIGPIPE
-from signal import signal, SIGPIPE, SIG_DFL
+if sys.platform != "win32":
+    # Don't print an error on SIGPIPE
+    from signal import signal, SIGPIPE, SIG_DFL
+
+    signal(SIGPIPE, SIG_DFL)
+
 
 def main():
-    signal(SIGPIPE, SIG_DFL)
-    system = sys.argv[1] if len(sys.argv) > 1 else 'hepburn'
+    system = sys.argv[1] if len(sys.argv) > 1 else "hepburn"
 
     katsu = Cutlet(system)
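
The platform guard above matters because `signal.SIGPIPE` is not defined on Windows, so an unconditional `from signal import SIGPIPE` raises `ImportError` there. A minimal sketch of the same pattern follows; `install_sigpipe_default` is a hypothetical helper written for illustration and is not part of cutlet.

```python
import signal
import sys


def install_sigpipe_default():
    """Restore default SIGPIPE handling where the platform defines it.

    Hypothetical helper for illustration; not part of cutlet.
    Returns True if the handler was installed, False otherwise.
    """
    # signal.SIGPIPE does not exist on Windows, so look it up defensively.
    sigpipe = getattr(signal, "SIGPIPE", None)
    if sys.platform == "win32" or sigpipe is None:
        return False
    # With SIG_DFL the process exits quietly when the reader closes the pipe
    # (e.g. `cutlet | head`) instead of printing a BrokenPipeError traceback.
    signal.signal(sigpipe, signal.SIG_DFL)
    return True


if __name__ == "__main__":
    print("SIGPIPE default handler installed:", install_sigpipe_default())
```

Doing the lookup with `getattr` is one alternative to the `sys.platform` check used in the diff; either way the import no longer fails at module load on Windows.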

22 changes: 6 additions & 16 deletions cutlet/cutlet.py
@@ -24,19 +24,6 @@
     'nihon': NIHONSHIKI,
 }
 
-if sys.version_info >= (3, 7):
-    def is_ascii(s):
-        """Check if a given string is ASCII."""
-        return s.isascii()
-else:
-    def is_ascii(s):
-        """Check if a given string is ASCII."""
-        # this version is for old Pythons
-        for c in s:
-            if c > '\x7f':
-                return False
-        return True
-
 def has_foreign_lemma(word):
     """Check if a word (node) has a foreign lemma.
 
@@ -62,7 +49,7 @@ def has_foreign_lemma(word):
         # NOTE: some words have 外国 instead of a foreign spelling. ジル
         # (Jill?) is an example. Unclear why this is the case.
         # There are other hyphenated lemmas, like 私-代名詞.
-        if is_ascii(cand):
+        if cand.isascii():
             return True
 
 def normalize_text(text):
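
`str.isascii()` was added in Python 3.7, the oldest version in the CI matrix, which is presumably why the version-gated wrapper can be dropped. The sketch below checks that the removed fallback loop and the built-in agree on a few inputs; `is_ascii_fallback` is simply the deleted `else` branch renamed for illustration.

```python
def is_ascii_fallback(s):
    """The removed pre-3.7 branch: True if every character is <= U+007F."""
    for c in s:
        if c > "\x7f":
            return False
    return True


samples = ["Jill", "", "ジル", "私-代名詞", "cutlet123"]
for s in samples:
    # The built-in and the old loop should agree on every input.
    assert s.isascii() == is_ascii_fallback(s)

print([s.isascii() for s in samples])  # [True, True, False, False, True]
```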
@@ -100,6 +87,8 @@ def load_exceptions():
 class Token:
     surface: str
     space: bool # if a space should follow
+    # whether this comes from a foreign lemma
+    foreign: bool = False
 
     def __str__(self):
         sp = " " if self.space else ""
@@ -242,7 +231,8 @@ def romaji_tokens(self, words, capitalize=True, title=False):
                     not (pw and pw.feature.pos1 == '接頭辞')):
                 roma = roma.title()
 
-            tok = Token(roma, False)
+            foreign = self.use_foreign_spelling and has_foreign_lemma(word)
+            tok = Token(roma, False, foreign)
             # handle punctuation with atypical spacing
             if word.surface in '「『':
                 if po:
@@ -322,7 +312,7 @@ def romaji_word(self, word):
         if word.surface.isdigit():
             return word.surface
 
-        if is_ascii(word.surface):
+        if word.surface.isascii():
             return word.surface
 
         # deal with unks first
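
To illustrate how the new `foreign` flag might be consumed, here is a standalone sketch. It assumes `Token` is a dataclass (the decorator is not visible in the diff), completes the truncated `__str__` body by assumption, and uses a hypothetical helper `foreign_surfaces` that is not part of cutlet. The sample romaji strings are made up.

```python
from dataclasses import dataclass


@dataclass  # assumption: the decorator is outside the visible hunk
class Token:
    surface: str
    space: bool  # if a space should follow
    # whether this comes from a foreign lemma
    foreign: bool = False

    def __str__(self):
        sp = " " if self.space else ""
        # Assumption: the real method body is truncated in the diff.
        return f"{self.surface}{sp}"


def foreign_surfaces(tokens):
    """Hypothetical helper: surfaces of tokens flagged as foreign."""
    return [t.surface for t in tokens if t.foreign]


tokens = [
    Token("Jill", True, True),    # positional, as in romaji_tokens
    Token("to", True),            # foreign defaults to False
    Token("hanashita", False),
]
print("".join(str(t) for t in tokens))  # Jill to hanashita
print(foreign_surfaces(tokens))         # ['Jill']
```

Because the field has a default, existing two-argument `Token(roma, False)` calls keep working. As far as the visible hunk shows, `has_foreign_lemma` returns `True` or falls through without a return, so the flag may end up as `None` rather than `False`; filtering on truthiness, as above, handles both.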