Transcript Buddy is a tool for processing Microsoft Teams meeting transcripts. It uses language models to clean and format raw transcripts into a coherent, readable document. It was originally developed for onboarding lesson transcripts, but the prompts can be adapted to other use cases.
Key features:

- Processes Microsoft Teams meeting transcripts (`.docx` files)
- Cleans and formats transcripts using AI language models (OpenAI/Groq)
- Handles multiple files in a single batch run
- Preserves the original content while improving readability
- Generates clean, structured output files
- Validates token limits to ensure model compatibility (see the sketch after this list)
- Supports both OpenAI and Groq language models
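As a rough illustration of the token limit check (not the repository's actual code; names are hypothetical), a transcript can be counted and compared against the configured limit with `tiktoken`:

```python
# Hypothetical sketch of token-limit validation, assuming tiktoken is installed.
import tiktoken

MAX_TOKENS = 128_000  # assumed default, mirrors the MAX_TOKENS setting described below

def fits_in_context(text: str, model: str = "gpt-4o-2024-11-20") -> bool:
    """Return True if `text` fits within MAX_TOKENS tokens for the given model."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model names fall back to a general-purpose encoding.
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text)) <= MAX_TOKENS
```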
To install:

- Clone the repository:

  ```bash
  git clone <repository_url>
  cd transcript_buddy
  ```

- Create and activate a virtual environment (recommended):

  ```bash
  python -m venv .venv
  .venv\Scripts\activate  # On Windows
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file in the root directory and add your API keys (or ask your colleague Alessio Piraccini 😉):

  ```
  OPENAI_API_KEY=your_openai_api_key
  GROQ_API_KEY=your_groq_api_key
  ```
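At runtime the keys are picked up from the environment. A minimal sketch of how this is typically done with `python-dotenv` (the exact loading code in the repository may differ):

```python
# Minimal sketch of reading the API keys, assuming python-dotenv is installed.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the project root / current directory

openai_key = os.getenv("OPENAI_API_KEY")
groq_key = os.getenv("GROQ_API_KEY")

if not (openai_key or groq_key):
    raise RuntimeError("No API key found: add OPENAI_API_KEY or GROQ_API_KEY to .env")
```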
The application can be configured through the `config.py` file:

- `USE_OPENAI`: Toggle between OpenAI (`True`) and Groq (`False`) models
- `MAX_TOKENS`: Maximum token limit for text processing (default: 128k)
- `OPENAI_MODEL`: OpenAI model to use (default: `"gpt-4o-2024-11-20"`)
- `GROQ_MODEL`: Groq model to use (default: `"llama-3.1-70b-versatile"`)
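For orientation, a `config.py` exposing these settings could look roughly like the sketch below (illustrative, not copied from the repository):

```python
# Illustrative shape of config.py; the real file may differ.
USE_OPENAI = True                       # True -> OpenAI, False -> Groq
MAX_TOKENS = 128_000                    # maximum tokens per transcript
OPENAI_MODEL = "gpt-4o-2024-11-20"      # default OpenAI model
GROQ_MODEL = "llama-3.1-70b-versatile"  # default Groq model
```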
To run the tool:

- Create the necessary directories:

  ```bash
  mkdir -p data/input
  ```

- Place your `.docx` files in the `data/input` directory (see the reading sketch after this list).
- Run the main script to process the files:

  ```bash
  python src/main.py
  ```

- The processed transcripts will be saved in the `data/output` directory.
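Input files are Word documents, which `python-docx` reads paragraph by paragraph. A hypothetical sketch of loading a transcript (the helper name and file name are placeholders):

```python
# Hypothetical helper for loading a Teams transcript with python-docx.
from pathlib import Path
from docx import Document  # provided by the python-docx package

def read_transcript(path: Path) -> str:
    """Concatenate the non-empty paragraphs of a .docx transcript into one string."""
    doc = Document(str(path))
    lines = [p.text.strip() for p in doc.paragraphs if p.text.strip()]
    return "\n".join(lines)

# Example call; the file name is just a placeholder.
raw_text = read_transcript(Path("data/input/example_lesson.docx"))
```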
The processed transcripts are saved as HTML files with:
- A header indicating the source (Amplifon CoE)
- The cleaned and formatted transcript content
- A disclaimer about AI-generated content
- Markdown formatting for better readability
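One plausible way such output could be assembled from the model's markdown, using the `markdown` package (the header and disclaimer wording here are placeholders, not the tool's exact strings):

```python
# Illustrative assembly of the HTML output, assuming the `markdown` package.
import markdown

def build_html(cleaned_markdown: str) -> str:
    """Wrap the cleaned transcript (markdown) in a header and a disclaimer."""
    header = "<p><em>Amplifon CoE</em></p>"  # placeholder header wording
    disclaimer = ("<p><em>This transcript was cleaned with an AI language model "
                  "and may contain errors.</em></p>")  # placeholder disclaimer wording
    body = markdown.markdown(cleaned_markdown)  # convert markdown to HTML
    return f"{header}\n{body}\n{disclaimer}"
```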
Project structure:

```
transcript_buddy/
├── data/
│   ├── input/                  # Place input .docx files here
│   └── output/                 # Processed transcripts are saved here
├── src/
│   ├── utils/
│   │   ├── file_processing.py  # File handling functions
│   │   └── llm.py              # Language model interactions
│   ├── config.py               # Configuration settings
│   └── main.py                 # Main processing script
├── .env                        # Environment variables (API keys)
├── .gitignore
├── README.md
└── requirements.txt
```
Key dependencies:

- `python-docx`: For reading and writing Word documents
- `tiktoken`: Token counting for OpenAI models
- `markdown`: Markdown processing
- `python-dotenv`: Environment variable management
- `openai`: OpenAI API client
- `groq`: Groq API client
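The OpenAI and Groq Python clients expose a very similar chat-completions interface, so the model toggle can be as simple as this hedged sketch (a simplified illustration, not the repository's actual `llm.py`; the prompt is a placeholder):

```python
# Simplified sketch of the OpenAI/Groq toggle; prompt and parameters are illustrative.
import os
from openai import OpenAI
from groq import Groq

USE_OPENAI = True  # mirrors the USE_OPENAI toggle in config.py

def clean_transcript(raw_text: str) -> str:
    """Send the raw transcript to the selected model and return the cleaned text."""
    prompt = "Clean and format this meeting transcript."  # placeholder system prompt
    if USE_OPENAI:
        client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        model = "gpt-4o-2024-11-20"
    else:
        client = Groq(api_key=os.getenv("GROQ_API_KEY"))
        model = "llama-3.1-70b-versatile"
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content
```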
The application includes comprehensive error handling:
- Logs all operations with timestamps
- Validates API keys before processing
- Checks token limits to prevent model errors
- Provides clear error messages for common issues
- Continues processing remaining files if one file fails
- Creates detailed processing summaries
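In spirit, the batch loop behaves like the sketch below: each file is processed independently, failures are logged with a timestamp, and the run continues. The `process_file` callable stands in for the per-file pipeline and is hypothetical:

```python
# Hedged sketch of fault-tolerant batch processing with timestamped logging.
import logging
from pathlib import Path
from typing import Callable

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("transcript_buddy")

def process_all(process_file: Callable[[Path], None],
                input_dir: Path = Path("data/input")) -> None:
    """Process every .docx file, logging failures without stopping the batch."""
    succeeded, failed = [], []
    for path in sorted(input_dir.glob("*.docx")):
        try:
            process_file(path)  # per-file pipeline (read, clean, write output)
            succeeded.append(path.name)
            logger.info("Processed %s", path.name)
        except Exception:
            failed.append(path.name)
            logger.exception("Failed to process %s", path.name)
    logger.info("Summary: %d succeeded, %d failed", len(succeeded), len(failed))
```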
Common issues and solutions:

- **Missing API Keys**
  - Ensure your `.env` file exists and contains the required API keys
  - Check that the API keys are valid and not expired
- **Token Limit Exceeded**
  - The file content exceeds the maximum token limit (128k)
  - Consider splitting the file into smaller parts (see the sketch after this list)
- **File Format Issues**
  - Ensure input files are in `.docx` format
  - Check that files are not corrupted or password-protected
- **Processing Errors**
  - Check the logs for specific error messages
  - Verify that the input file is a valid Teams transcript
  - Ensure you have an active internet connection
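If a transcript exceeds the limit, one simple approach is to split it by paragraph while tracking a token budget. A hypothetical sketch using `tiktoken` (not part of the tool itself):

```python
# Hypothetical paragraph-based splitter for transcripts that exceed the token limit.
import tiktoken

def split_by_tokens(text: str, max_tokens: int = 100_000,
                    encoding_name: str = "o200k_base") -> list[str]:
    """Greedily pack paragraphs into chunks of at most `max_tokens` tokens each."""
    enc = tiktoken.get_encoding(encoding_name)
    chunks, current, current_tokens = [], [], 0
    for paragraph in text.split("\n"):
        n = len(enc.encode(paragraph)) + 1  # +1 for the newline re-added on join
        if current and current_tokens + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n
    if current:
        chunks.append("\n".join(current))
    return chunks
```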
For any other issues, please check the logs or contact the maintainers.