Punctuate Chinese Historical Text Using Large Language Model
Punc-by-LLM is a Python-based project that uses the Deep Seek V2 API to punctuate Chinese historical texts and convert Simplified Chinese to Traditional Chinese. By utilizing a large language model, it automatically inserts punctuation into long texts, which is especially useful for processing historical documents that lack punctuation.
- Automated Punctuation: Leverages the
deepseek-chat
model to intelligently punctuate Chinese text and convert Simplified Chinese to Traditional Chinese. - Handles Long Sentences: Automatically breaks down long sentences and ensures proper punctuation insertion.
- Customizable: Configure maximum text length and stop sentences at appropriate punctuation marks.
- Input Text: The script reads unpunctuated text from
input.txt
. - Process: It processes each line, calling the
deepseek
function to retrieve punctuated text. - Punctuation: Sentences longer than the defined
MAX_LENGTH
are split and processed iteratively. - Output: The punctuated text is saved to
output.txt
.
punc-by-llm/
│
├── api_key.txt # Deep Seek API key file
├── input.txt # Input text file with unpunctuated Chinese historical text
├── output.txt # Output file with punctuated text
├── prompt.txt # Prompt for the language model to punctuate and convert Chinese
├── punc_by_llm.py # Main script
└── README.md # This file
- Python >= 3.9
- Deep Seek V2 API key
-
Download the repository.
-
Install required dependencies:
pip install openai
-
Create
api_key.txt
in the root directory. -
Add your Deep Seek V2 API key to
api_key.txt
. You can get the API key from https://www.deepseek.com.sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
-
Place your unpunctuated text in
input.txt
. -
Run the script:
python punc_by_llm.py
-
The punctuated text will be saved in
output.txt
.
- MAX_LENGTH: Set the maximum length for each text chunk. Default is
2000
.
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license