Punc-by-LLM

Punctuate Chinese Historical Text Using Large Language Model

Overview

Punc-by-LLM is a Python-based project that uses the Deep Seek V2 API to punctuate Chinese historical texts and convert Simplified Chinese to Traditional Chinese. By utilizing a large language model, it automatically inserts punctuation into long texts, which is especially useful for processing historical documents that lack punctuation.

Features

Automated Punctuation: Leverages the deepseek-chat model to intelligently punctuate Chinese text and convert Simplified Chinese to Traditional Chinese.
Handles Long Sentences: Automatically breaks down long sentences and ensures proper punctuation insertion.
Customizable: Configure maximum text length and stop sentences at appropriate punctuation marks.

How It Works

Input Text: The script reads unpunctuated text from input.txt.
Process: It processes each line, calling the deepseek function to retrieve punctuated text.
Punctuation: Sentences longer than the defined MAX_LENGTH are split and processed iteratively.
Output: The punctuated text is saved to output.txt.

Project Structure

punc-by-llm/
│
├── api_key.txt          # Deep Seek API key file
├── input.txt            # Input text file with unpunctuated Chinese historical text
├── output.txt           # Output file with punctuated text
├── prompt.txt           # Prompt for the language model to punctuate and convert Chinese
├── punc_by_llm.py       # Main script
└── README.md            # This file

Prerequisites

Python >= 3.9
Deep Seek V2 API key

Installation

Download the repository.
Install required dependencies:
```
pip install openai
```
Create api_key.txt in the root directory.
Add your Deep Seek V2 API key to api_key.txt. You can get the API key from https://www.deepseek.com.
```
sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

Usage

Place your unpunctuated text in input.txt.
Run the script:
```
python punc_by_llm.py
```
The punctuated text will be saved in output.txt.

Configuration

MAX_LENGTH: Set the maximum length for each text chunk. Default is 2000.

License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.gitignore		.gitignore
README.md		README.md
input-long-prompt-test.txt		input-long-prompt-test.txt
input.txt		input.txt
output.txt		output.txt
prompt.txt		prompt.txt
punc.py		punc.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Punc-by-LLM

Overview

Features

How It Works

Project Structure

Prerequisites

Installation

Usage

Configuration

License

About

Releases

Packages

Languages

cbdb-project/punc-by-llm

Folders and files

Latest commit

History

Repository files navigation

Punc-by-LLM

Overview

Features

How It Works

Project Structure

Prerequisites

Installation

Usage

Configuration

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages