Automate README translations #624

itsamziii · 2024-10-09T20:03:37Z

Draft PR for #600

Known issues still to be fixed (this is only a draft PR):

- The workflow takes about 10 minutes to run the translations. Would batching the content help?
- Code blocks are excluded from translation because translating them mangles the syntax.
- TOC hyperlinks are not translated yet.
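
On the batching question, one option is to group short segments so that fewer requests are made. The sketch below covers only the grouping step. `pack_segments` and the 4500-character budget are illustrative assumptions, not part of this PR; the budget should be set from the translator's actual per-request limit, and any separator used to join a batch would need to survive translation.

```python
from typing import Iterable, List


def pack_segments(segments: Iterable[str], max_chars: int = 4500) -> List[List[str]]:
    """Group segments into batches whose combined length stays within max_chars.

    A single segment longer than max_chars still gets a batch of its own.
    Separator characters used when joining a batch are not counted here.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    size = 0
    for segment in segments:
        # Start a new batch when adding this segment would exceed the budget
        if current and size + len(segment) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(segment)
        size += len(segment)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(pack_segments(["aaaa", "bbbb", "cc"], max_chars=8))
```

Each batch would then be sent as one request and split back into segments afterwards.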

The translator currently uses the Google AJAX API, so the translation quality may be poor. A better approach might be to use proper machine translation instead.

Important

Automates translation of README.md into multiple languages using a GitHub workflow and Python script, preserving code blocks and HTML tags.

Workflow:
- Adds .github/workflows/translate-readme.yml to automate translation of README.md on push.
- Uses deep-translator and parmapper for translation and parallel processing.
- Translates into Chinese, Japanese, and French.
Script:
- scripts/readme_translator.py translates README.md, preserving code blocks and HTML tags.
- Uses GoogleTranslator from deep-translator for language translation.
- Saves translated files as README-CN.md, README-JA.md, README-FR.md.
Known Issues:
- Translation process is slow (~10 mins), potential for batching.
- Code blocks and TOC links are not translated.

This description was created by the ellipsis-dev bot for 020e102. It will automatically update as commits are pushed.

creatorrr · 2024-10-10T23:31:03Z

Great work! @itsamziii

To parallelize, you can use the parmapper library. Don't use the translate_batch method of deep-translator: it doesn't do any parallel work underneath, it is just a for i in ... wrapper.
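
To illustrate why a sequential loop is slow here: each translation is a network round trip, so the calls can overlap. The sketch below uses the standard library's `concurrent.futures.ThreadPoolExecutor` in place of parmapper, and `fake_translate` is a hypothetical stand-in for a real translator call so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_translate(text: str) -> str:
    """Hypothetical stand-in for a network-bound translate call."""
    time.sleep(0.05)  # simulate request latency
    return text.upper()


def translate_parallel(
    translate: Callable[[str], str], segments: List[str], workers: int = 8
) -> List[str]:
    """Translate segments concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(translate, segments))


if __name__ == "__main__":
    print(translate_parallel(fake_translate, ["hello", "world", "readme"]))
```

Threads fit this case because the work is I/O-bound; the waiting time of the requests overlaps instead of adding up.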

creatorrr · 2024-10-10T23:36:31Z

Perhaps something like this:

from deep_translator import GoogleTranslator
from pathlib import Path
import re
from typing import List
import parmapper
from functools import partial

def create_translator(target: str) -> GoogleTranslator:
    """
    Create a translator for a given target language.
    """
    return GoogleTranslator(source="en", target=target)

def translate_raw_html(translator: GoogleTranslator, html_content: str) -> str:
    """
    Translate a given raw html content using the provided translator, preserving HTML tags and newlines.
    """
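    # Code blocks and whitespace-only segments must pass through untranslated
    if not html_content.strip() or html_content.startswith("```"):
        return html_content

    html_tags_pattern = r"(<[^>]+>)"
    segments = re.split(html_tags_pattern, html_content)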

    translated_segments = []
    for segment in segments:
        if re.fullmatch(html_tags_pattern, segment):
            translated_segments.append(segment)
        else:
            try:
                if re.fullmatch(r'^[!"#$%&\'()*+,\-./:;<=>?@[\]^_`{|}~]+$', segment):
                    translated_segments.append(segment)
                    continue
                translated = translator.translate(segment)
                translated_segments.append(translated if translated else segment)
            except Exception as e:
                print(f"Error translating segment '{segment}': {e}")
                translated_segments.append(segment)
    return "".join(translated_segments)

def translate_readme(source: str, target: str) -> str:
    """
    Translate a README file from source to target language, preserving code blocks and newlines.
    """
    file_content = Path(source).read_text(encoding='utf-8')
    translator = create_translator(target)
    code_block_pattern = r"(```[\s\S]*?```|\n)"
    segments = re.split(code_block_pattern, file_content)
    
    # Prepare a partial function with the translator fixed
    translate_segment = partial(translate_raw_html, translator)
    
    # Parallelize the translation of segments
    translated_segments = list(parmapper.parmap(translate_segment, segments))
    
    return ''.join(translated_segments)

def save_translated_readme(translated_content: str, lang: str) -> None:
    """
    Save the translated README content to a file.
    """
    filename = f"README-{lang.split('-')[-1].upper()}.md"
    with open(filename, "w", encoding='utf-8') as file:
        file.write(translated_content)

def process_language(lang: str, source_file: str) -> None:
    """
    Process the translation for a single language.
    """
    translated_readme = translate_readme(source_file, lang)
    save_translated_readme(translated_readme, lang)

def main() -> None:
    """
    Main function to translate README.md to multiple languages.
    """
    source_file = "README.md"
    destination_langs = ["zh-CN", "ja", "fr"]
    
    # Prepare a partial function with the source_file fixed
    process_lang = partial(process_language, source_file=source_file)
    
    # Parallelize the translation of README for different languages
    list(parmapper.parmap(process_lang, destination_langs))

if __name__ == "__main__":
    main()
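
The split in translate_readme relies on a capturing group, so `re.split` returns the matched code blocks and newlines along with the prose instead of discarding them. A small offline check, using a made-up sample string, shows the pieces it produces:

```python
import re

CODE_BLOCK_PATTERN = r"(```[\s\S]*?```|\n)"

sample = "# Title\n```py\nx = 1\n```\nSome text"

# Because the pattern is wrapped in a capturing group, the separators
# (code blocks and newlines) are returned alongside the prose segments.
segments = re.split(CODE_BLOCK_PATTERN, sample)

for segment in segments:
    kind = "code" if segment.startswith("```") else "text"
    print(kind, repr(segment))

# Joining the pieces reproduces the input exactly
assert "".join(segments) == sample
```

Since the code blocks come back as ordinary list items, whatever function is mapped over the segments has to recognize and skip them.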

itsamziii · 2024-10-11T13:01:31Z

I see, parallelizing the tasks should indeed improve the run time. I'll make those changes once I find a way to keep the TOC hyperlinks working. Thank you for the guidance! It was really helpful ^^

creatorrr · 2024-10-11T19:16:59Z

Don't worry about the TOC section, actually. It is autogenerated by a different GitHub action, so it's not a big deal.

itsamziii · 2024-10-11T19:45:34Z

Alright, I can mark this draft PR as ready for review.

ellipsis-dev

👍 Looks good to me! Reviewed everything up to 020e102 in 55 seconds

More details

Looked at 3443 lines of code in 6 files
Skipped 0 files when reviewing.
Skipped posting 3 drafted comments based on config settings.

1. scripts/readme_translator.py:9

Draft comment:
The CODEBLOCK_PATTERN regex is incorrect for matching code blocks. It should use triple backticks (```) instead of single backticks (`).
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The comment is incorrect because the regex pattern in the code already uses triple backticks, which is the correct syntax for code blocks in markdown. The comment seems to be based on a misunderstanding of the code.
I might be missing some context about how the regex is used elsewhere, but based on the provided code, the pattern seems correct.
The code provided is complete for the context of this comment, and the regex pattern is indeed correct as it stands.
The comment should be deleted because it incorrectly states that the regex pattern is wrong, while the code actually uses the correct pattern.

2. scripts/readme_translator.py:17

Draft comment:
The script does not handle the translation of hyperlinks in the TOC correctly. Consider implementing a solution to translate TOC links properly.
Reason this comment was not posted:
Comment was on unchanged code.

3. scripts/readme_translator.py:52

Draft comment:
Consider batching translation requests or implementing a rate limiter to handle potential API rate limits more efficiently.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The comment is relevant to the changes made in the diff, as it addresses a potential issue with the current implementation. The code does not handle API rate limits, which could be a problem if the script is used frequently or with large files. The suggestion is actionable and clear, as it proposes a specific improvement to the code's robustness.
The comment is speculative, as it assumes that API rate limits will be an issue without evidence. The code might work fine without these changes if the usage is within acceptable limits.
Even though the comment is speculative, it addresses a potential issue that could arise in real-world usage. Implementing a rate limiter or batching could prevent future problems, making the suggestion valuable.
Keep the comment, as it provides a clear and actionable suggestion to improve the code's robustness against potential API rate limits.
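
If rate limits do turn out to matter, a retry wrapper with exponential backoff is one lightweight option. This is a sketch only: `with_retries`, its parameters, and the `flaky` stub are hypothetical and not part of the PR, and the `sleep` argument is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable, List


def with_retries(
    func: Callable[[str], str],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str], str]:
    """Wrap func so failed calls are retried with exponentially growing delays."""

    def wrapper(text: str) -> str:
        for attempt in range(attempts):
            try:
                return func(text)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries: surface the last error
                sleep(base_delay * 2**attempt)
        raise RuntimeError("attempts must be at least 1")

    return wrapper


if __name__ == "__main__":
    calls: List[int] = []

    def flaky(text: str) -> str:
        """Hypothetical translator stub that fails on its first two calls."""
        calls.append(1)
        if len(calls) < 3:
            raise ConnectionError("simulated rate limit")
        return text[::-1]

    delays: List[float] = []
    print(with_retries(flaky, sleep=delays.append)("abc"), delays)
```

Wrapping the per-segment translate call this way keeps the retry policy separate from the translation logic.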

Workflow ID: wflow_pBV9kqjbDzptPgld


itsamziii and others added 9 commits October 10, 2024 00:38

feat(script): python script for translating readme

2e5f20a

feat(ci): readme translator GitHub action

c2f4dff

test

5b932f6

Merge branch 'dev' of https://github.com/itsamziii/julep into dev

cd35df3

chore(ci): push changes to the same branch

15fa044

chore(readme): translate README.md

06a0a31

chore(ci): update base name for translated readme

74a7a4e

test(ci): invoke readme translator gh action

f78e5ff

chore(readme): translate README.md

c88693e

itsamziii and others added 2 commits October 11, 2024 18:26

Merge branch 'dev' into dev

b437a59

chore(docs): update TOC

ad8586f

itsamziii added 2 commits October 11, 2024 19:59

chore(ci): parallelize translators utilizing parmapper

6618a77

feat(ci): install parmapper during gh action

a23786c

Merge branch 'dev' into dev

020e102

itsamziii marked this pull request as ready for review October 11, 2024 19:45

ellipsis-dev bot reviewed Oct 11, 2024


github-actions bot and others added 4 commits October 11, 2024 19:48

chore(readme): translate README.md

8b872e9

Merge branch 'dev' into dev

1d2f744

chore(docs): update TOC

ba61398

Merge branch 'dev' into dev

09ef866

creatorrr merged commit 958217f into julep-ai:dev Oct 13, 2024
1 of 2 checks passed

creatorrr mentioned this pull request Oct 13, 2024

Automated translation of README.md (GitHub action) #629

Closed
