Automate README translations #624
Great work! @itsamziii For parallelizing you can use parmapper library. Don't use the |
Perhaps something like this:

````python
from deep_translator import GoogleTranslator
from pathlib import Path
import re
from typing import List
import parmapper
from functools import partial


def create_translator(target: str) -> GoogleTranslator:
    """
    Create a translator for a given target language.
    """
    return GoogleTranslator(source="en", target=target)


def translate_raw_html(translator: GoogleTranslator, html_content: str) -> str:
    """
    Translate a given raw html content using the provided translator, preserving HTML tags and newlines.
    """
    html_tags_pattern = r"(<[^>]+>)"
    segments = re.split(html_tags_pattern, html_content)
    translated_segments = []
    for segment in segments:
        if re.fullmatch(html_tags_pattern, segment):
            translated_segments.append(segment)
        else:
            try:
                if re.fullmatch(r'^[!"#$%&\'()*+,\-./:;<=>?@[\]^_`{|}~]+$', segment):
                    translated_segments.append(segment)
                    continue
                translated = translator.translate(segment)
                translated_segments.append(translated if translated else segment)
            except Exception as e:
                print(f"Error translating segment '{segment}': {e}")
                translated_segments.append(segment)
    return "".join(translated_segments)


def translate_readme(source: str, target: str) -> str:
    """
    Translate a README file from source to target language, preserving code blocks and newlines.
    """
    file_content = Path(source).read_text(encoding='utf-8')
    translator = create_translator(target)
    code_block_pattern = r"(```[\s\S]*?```|\n)"
    segments = re.split(code_block_pattern, file_content)
    # Prepare a partial function with the translator fixed
    translate_segment = partial(translate_raw_html, translator)
    # Parallelize the translation of segments
    translated_segments = list(parmapper.parmap(translate_segment, segments))
    return ''.join(translated_segments)


def save_translated_readme(translated_content: str, lang: str) -> None:
    """
    Save the translated README content to a file.
    """
    filename = f"README-{lang.split('-')[-1].upper()}.md"
    with open(filename, "w", encoding='utf-8') as file:
        file.write(translated_content)


def process_language(lang: str, source_file: str) -> None:
    """
    Process the translation for a single language.
    """
    translated_readme = translate_readme(source_file, lang)
    save_translated_readme(translated_readme, lang)


def main() -> None:
    """
    Main function to translate README.md to multiple languages.
    """
    source_file = "README.md"
    destination_langs = ["zh-CN", "ja", "fr"]
    # Prepare a partial function with the source_file fixed
    process_lang = partial(process_language, source_file=source_file)
    # Parallelize the translation of README for different languages
    list(parmapper.parmap(process_lang, destination_langs))


if __name__ == "__main__":
    main()
````
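To illustrate how the splitting step in the suggestion above keeps code blocks intact, here is a minimal offline sketch. It relies on the fact that `re.split` with a capturing group returns the delimiters along with the text between them. The helper name `translate_outside_code` and the use of `str.upper` as a stand-in translator are assumptions made for this example only; they are not part of the PR.

```python
import re

# Build the fence marker programmatically so this example stays easy to embed.
FENCE = "`" * 3

# Equivalent to the code_block_pattern in the suggestion: the capturing group
# makes re.split return code blocks and newlines as their own segments.
CODE_BLOCK_PATTERN = "(" + FENCE + r"[\s\S]*?" + FENCE + r"|\n)"


def translate_outside_code(text, translate):
    """Apply `translate` to prose segments only (hypothetical helper)."""
    out = []
    for segment in re.split(CODE_BLOCK_PATTERN, text):
        if not segment.strip() or segment.startswith(FENCE):
            # Empty strings, newlines, and fenced code blocks pass through unchanged.
            out.append(segment)
        else:
            out.append(translate(segment))
    return "".join(out)


sample = "Install it:\n" + FENCE + "sh\npip install demo\n" + FENCE + "\nThen run it."
# str.upper stands in for a real translator so the example runs offline.
print(translate_outside_code(sample, str.upper))
```

Skipping empty and whitespace-only segments before calling the translator also avoids sending requests that carry no text.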
I see. Parallelizing the tasks should indeed improve performance. I'll make those changes once I find a way to preserve the TOC hyperlinks. Thank you for the guidance, it was really helpful ^^
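As a point of comparison for the per-language parallelism discussed here, the same idea can be sketched with the standard library's `concurrent.futures` instead of parmapper. This is a swapped-in technique, not what the PR uses. The names `translate_all` and `fake_translate` are hypothetical, and the stand-in function only builds the output filename rather than calling a translation service.

```python
from concurrent.futures import ThreadPoolExecutor


def translate_all(languages, translate_one, max_workers=4):
    """Run translate_one(lang) concurrently and return {lang: result}."""
    languages = list(languages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        results = pool.map(translate_one, languages)
        return dict(zip(languages, results))


def fake_translate(lang):
    # Hypothetical stand-in for a network-bound translation call.
    return f"README-{lang.split('-')[-1].upper()}.md"


print(translate_all(["zh-CN", "ja", "fr"], fake_translate))
```

Threads are usually a reasonable fit when the work is dominated by waiting on network responses, which is the assumption here.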
Don't worry about the TOC section, actually. It is autogenerated by a different GitHub Action, so it's not a big deal.
Alright, I can put this draft PR up for review.
👍 Looks good to me! Reviewed everything up to 020e102 in 55 seconds
More details
- Looked at 3443 lines of code in 6 files
- Skipped 0 files when reviewing.
- Skipped posting 3 drafted comments based on config settings.
1. scripts/readme_translator.py:9
- Draft comment:
The `CODEBLOCK_PATTERN` regex is incorrect for matching code blocks. It should use triple backticks (```) instead of single backticks (`).
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The comment is incorrect because the regex pattern in the code already uses triple backticks, which is the correct syntax for code blocks in markdown. The comment seems to be based on a misunderstanding of the code.
I might be missing some context about how the regex is used elsewhere, but based on the provided code, the pattern seems correct.
The code provided is complete for the context of this comment, and the regex pattern is indeed correct as it stands.
The comment should be deleted because it incorrectly states that the regex pattern is wrong, while the code actually uses the correct pattern.
2. scripts/readme_translator.py:17
- Draft comment:
The script does not handle the translation of hyperlinks in the TOC correctly. Consider implementing a solution to translate TOC links properly.
- Reason this comment was not posted:
Comment was on unchanged code.
3. scripts/readme_translator.py:52
- Draft comment:
Consider batching translation requests or implementing a rate limiter to handle potential API rate limits more efficiently.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The comment is relevant to the changes made in the diff, as it addresses a potential issue with the current implementation. The code does not handle API rate limits, which could be a problem if the script is used frequently or with large files. The suggestion is actionable and clear, as it proposes a specific improvement to the code's robustness.
The comment is speculative, as it assumes that API rate limits will be an issue without evidence. The code might work fine without these changes if the usage is within acceptable limits.
Even though the comment is speculative, it addresses a potential issue that could arise in real-world usage. Implementing a rate limiter or batching could prevent future problems, making the suggestion valuable.
Keep the comment, as it provides a clear and actionable suggestion to improve the code's robustness against potential API rate limits.
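The rate limiter mentioned in the third draft comment could look roughly like the sketch below. It enforces a minimum interval between calls and uses only the standard library. The `Throttle` class, the `throttled` wrapper, and the interval value are assumptions for illustration. Nothing here comes from the PR or from deep-translator, and the actual limits of the translation backend are not stated in this thread.

```python
import threading
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = self._clock()
            wait_for = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if wait_for > 0:
            self._sleep(wait_for)


def throttled(func, throttle):
    """Return a wrapper that waits on the throttle before each call."""
    def wrapper(*args, **kwargs):
        throttle.wait()
        return func(*args, **kwargs)
    return wrapper


# str.upper stands in for a real translation call so the example runs offline.
limiter = Throttle(min_interval=0.05)
slow_upper = throttled(str.upper, limiter)
print([slow_upper(s) for s in ["hello", "world"]])
```

Note that the lock only coordinates threads within a single process. If segments are translated in separate worker processes, each process would need its own budget or a shared limiter.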
Workflow ID: wflow_pBV9kqjbDzptPgld
Draft PR for #600
Currently the translator uses the Google AJAX API, so the translations may not be "good". The best approach would perhaps be to use machine translation instead.
Important
Automates translation of `README.md` into multiple languages using a GitHub workflow and Python script, preserving code blocks and HTML tags.
- `.github/workflows/translate-readme.yml` to automate translation of `README.md` on push.
- `deep-translator` and `parmapper` for translation and parallel processing.
- `scripts/readme_translator.py` translates `README.md`, preserving code blocks and HTML tags.
- `GoogleTranslator` from `deep-translator` for language translation.
- `README-CN.md`, `README-JA.md`, `README-FR.md`.

This description was created for 020e102. It will automatically update as commits are pushed.