Automate README translations #624

Merged
merged 18 commits into from
Oct 13, 2024

Conversation

itsamziii
Contributor

@itsamziii itsamziii commented Oct 9, 2024

Draft PR for #600

Some known issues that are yet to be fixed (this is just a draft PR):

  • The workflow takes too long to translate, roughly 10 minutes. Would batching the content to translate help?
  • Code blocks are exempt from translation, because translating them mangles the syntax
  • Hyperlinks for the TOC currently aren't being translated

Currently the translator uses the Google AJAX API for translations, so the translations may not be "good". The best approach would perhaps be to use machine translation instead.
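
To illustrate the code-block exemption mentioned above, here is a minimal sketch of the idea, not the script from this PR. It assumes fenced blocks are delimited by triple backticks; `translate_outside_code` and `fake_translate` are hypothetical names, and `fake_translate` only stands in for a real translator call.

```python
import re

# Capturing group: re.split keeps the matched code blocks in its result
CODE_BLOCK = re.compile(r"(```[\s\S]*?```)")


def translate_outside_code(text, translate):
    """Apply `translate` to prose only; fenced code blocks pass through unchanged."""
    parts = CODE_BLOCK.split(text)
    out = []
    for index, part in enumerate(parts):
        # With one capturing group, odd indices hold the captured code blocks
        if index % 2 == 1:
            out.append(part)
        else:
            out.append(translate(part))
    return "".join(out)


def fake_translate(segment):
    # Hypothetical stand-in for a real translation request
    return segment.upper()


sample = "Install it:\n```bash\npip install example\n```\nThen run it.\n"
result = translate_outside_code(sample, fake_translate)
```

Only the prose around the fence is passed to the translator, so shell commands and code syntax stay intact.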


Important

Automates translation of README.md into multiple languages using a GitHub workflow and Python script, preserving code blocks and HTML tags.

  • Workflow:
    • Adds .github/workflows/translate-readme.yml to automate translation of README.md on push.
    • Uses deep-translator and parmapper for translation and parallel processing.
    • Translates into Chinese, Japanese, and French.
  • Script:
    • scripts/readme_translator.py translates README.md, preserving code blocks and HTML tags.
    • Uses GoogleTranslator from deep-translator for language translation.
    • Saves translated files as README-CN.md, README-JA.md, README-FR.md.
  • Known Issues:
    • Translation process is slow (~10 mins), potential for batching.
    • Code blocks and TOC links are not translated.

This description was created by Ellipsis for 020e102. It will automatically update as commits are pushed.

@creatorrr
Contributor

Great work! @itsamziii

For parallelizing, you can use the parmapper library. Don't use the translate_batch method of deep-translator: it doesn't actually do any parallel work underneath, it's just a for i in ... wrapper.
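
A rough sketch of why this matters, using the standard library's `ThreadPoolExecutor` as a stand-in for parmapper (an assumption made for the sake of a self-contained example, not what the PR uses). `slow_translate` is a hypothetical placeholder for one network-bound translation request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_translate(segment: str) -> str:
    """Hypothetical stand-in for one network-bound translation request."""
    time.sleep(0.05)
    return segment[::-1]


segments = [f"segment {i}" for i in range(8)]

# Sequential: roughly what a plain for-loop wrapper does
start = time.perf_counter()
sequential = [slow_translate(s) for s in segments]
sequential_time = time.perf_counter() - start

# Concurrent: the waits overlap, and map() still returns results in input order
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(slow_translate, segments))
parallel_time = time.perf_counter() - start
```

Both runs produce the same list, but the concurrent one takes about as long as a single request rather than the sum of all of them.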

@creatorrr
Contributor

Perhaps something like this:

from deep_translator import GoogleTranslator
from pathlib import Path
import re
import parmapper
from functools import partial

def create_translator(target: str) -> GoogleTranslator:
    """
    Create a translator for a given target language.
    """
    return GoogleTranslator(source="en", target=target)

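def translate_raw_html(translator: GoogleTranslator, html_content: str) -> str:
    """
    Translate a given raw html content using the provided translator, preserving HTML tags and newlines.
    Fenced code blocks are returned unchanged.
    """
    # translate_readme also passes code-block segments here; they must not be translated
    if html_content.startswith("```"):
        return html_content

    html_tags_pattern = r"(<[^>]+>)"
    segments = re.split(html_tags_pattern, html_content)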

    translated_segments = []
    for segment in segments:
        if re.fullmatch(html_tags_pattern, segment):
            translated_segments.append(segment)
        else:
            try:
                if re.fullmatch(r'^[!"#$%&\'()*+,\-./:;<=>?@[\]^_`{|}~]+$', segment):
                    translated_segments.append(segment)
                    continue
                translated = translator.translate(segment)
                translated_segments.append(translated if translated else segment)
            except Exception as e:
                print(f"Error translating segment '{segment}': {e}")
                translated_segments.append(segment)
    return "".join(translated_segments)

def translate_readme(source: str, target: str) -> str:
    """
    Translate a README file from source to target language, preserving code blocks and newlines.
    """
    file_content = Path(source).read_text(encoding='utf-8')
    translator = create_translator(target)
    code_block_pattern = r"(```[\s\S]*?```|\n)"
    segments = re.split(code_block_pattern, file_content)
    
    # Prepare a partial function with the translator fixed
    translate_segment = partial(translate_raw_html, translator)
    
    # Parallelize the translation of segments
    translated_segments = list(parmapper.parmap(translate_segment, segments))
    
    return ''.join(translated_segments)

def save_translated_readme(translated_content: str, lang: str) -> None:
    """
    Save the translated README content to a file.
    """
    filename = f"README-{lang.split('-')[-1].upper()}.md"
    with open(filename, "w", encoding='utf-8') as file:
        file.write(translated_content)

def process_language(lang: str, source_file: str) -> None:
    """
    Process the translation for a single language.
    """
    translated_readme = translate_readme(source_file, lang)
    save_translated_readme(translated_readme, lang)

def main() -> None:
    """
    Main function to translate README.md to multiple languages.
    """
    source_file = "README.md"
    destination_langs = ["zh-CN", "ja", "fr"]
    
    # Prepare a partial function with the source_file fixed
    process_lang = partial(process_language, source_file=source_file)
    
    # Parallelize the translation of README for different languages
    list(parmapper.parmap(process_lang, destination_langs))

if __name__ == "__main__":
    main()
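
Two details of the snippet above can be checked without calling the translator. This is just a sketch: `output_filename` is a hypothetical helper that copies the expression from `save_translated_readme`, and the sample text is made up.

```python
import re


def output_filename(lang: str) -> str:
    # Same expression as in save_translated_readme
    return f"README-{lang.split('-')[-1].upper()}.md"


names = [output_filename(lang) for lang in ["zh-CN", "ja", "fr"]]

# The pattern is wrapped in a capturing group, so re.split keeps the code
# blocks and newlines as their own segments instead of discarding them.
code_block_pattern = r"(```[\s\S]*?```|\n)"
text = "# Title\n\n```python\nprint('hi')\n```\nDone\n"
segments = re.split(code_block_pattern, text)

# Joining the untouched segments reproduces the original text exactly
roundtrip = "".join(segments)
```

Because the split is lossless, any difference between the source and the output comes from the translated segments and not from the splitting itself.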

@itsamziii
Contributor Author

I see, indeed parallelizing the tasks should improve the time performance. I'll make those changes once I find out a way to ensure TOC hyperlinking. Thank you for the guidance! It was indeed helpful ^^

@creatorrr
Contributor

Don't worry about the TOC section, actually. It is autogenerated by a different GitHub Action, so it's not a big deal.

@itsamziii
Contributor Author

Alright, I can mark this draft PR as ready for review.

@itsamziii itsamziii marked this pull request as ready for review October 11, 2024 19:45
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

👍 Looks good to me! Reviewed everything up to 020e102 in 55 seconds

More details
  • Looked at 3443 lines of code in 6 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 drafted comments based on config settings.
1. scripts/readme_translator.py:9
  • Draft comment:
    The CODEBLOCK_PATTERN regex is incorrect for matching code blocks. It should use triple backticks (```) instead of single backticks (`).
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The comment is incorrect because the regex pattern in the code already uses triple backticks, which is the correct syntax for code blocks in markdown. The comment seems to be based on a misunderstanding of the code.
    I might be missing some context about how the regex is used elsewhere, but based on the provided code, the pattern seems correct.
    The code provided is complete for the context of this comment, and the regex pattern is indeed correct as it stands.
    The comment should be deleted because it incorrectly states that the regex pattern is wrong, while the code actually uses the correct pattern.
2. scripts/readme_translator.py:17
  • Draft comment:
    The script does not handle the translation of hyperlinks in the TOC correctly. Consider implementing a solution to translate TOC links properly.
  • Reason this comment was not posted:
    Comment was on unchanged code.
3. scripts/readme_translator.py:52
  • Draft comment:
    Consider batching translation requests or implementing a rate limiter to handle potential API rate limits more efficiently.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The comment is relevant to the changes made in the diff, as it addresses a potential issue with the current implementation. The code does not handle API rate limits, which could be a problem if the script is used frequently or with large files. The suggestion is actionable and clear, as it proposes a specific improvement to the code's robustness.
    The comment is speculative, as it assumes that API rate limits will be an issue without evidence. The code might work fine without these changes if the usage is within acceptable limits.
    Even though the comment is speculative, it addresses a potential issue that could arise in real-world usage. Implementing a rate limiter or batching could prevent future problems, making the suggestion valuable.
    Keep the comment, as it provides a clear and actionable suggestion to improve the code's robustness against potential API rate limits.
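
For reference, the rate limiter suggested in the third draft comment could look roughly like this. It is a minimal single-threaded sketch and not part of the PR; `RateLimiter`, `min_interval`, and `limited_translate` are hypothetical names, and the interval is arbitrary. With parallel workers the limiter would additionally need a lock.

```python
import time


class RateLimiter:
    """Allow at most one call every `min_interval` seconds (single-threaded sketch)."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, if any

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


limiter = RateLimiter(min_interval=0.05)


def limited_translate(segment: str) -> str:
    # Hypothetical stand-in: a real version would call the translator here
    limiter.wait()
    return segment.upper()


start = time.monotonic()
results = [limited_translate(s) for s in ["a", "b", "c"]]
elapsed = time.monotonic() - start
```

Three calls are spaced by two waits, so the loop takes roughly two intervals no matter how fast each call returns.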

Workflow ID: wflow_pBV9kqjbDzptPgld


@creatorrr creatorrr merged commit 958217f into julep-ai:dev Oct 13, 2024
1 of 2 checks passed