Generalized Scraper is a Python script that retrieves HTML content from a specified URL, divides it into chunks, processes each chunk with OpenAI's GPT models, and produces a final evaluation and response. The aim is to automate the evaluation and processing of HTML content.
Generalized Scraper automates the following tasks:
- Retrieving HTML content from a specified URL.
- Chunking the HTML content into manageable pieces.
- Processing each chunk using OpenAI's GPT models.
- Evaluating and summarizing the processed chunks.
- Generating a Python script based on the final combined response.
The script relies on several environment variables defined in a .env file. An example configuration is provided in .env.example:
OPENAI_API_KEY=your_openai_api_key
TARGET_SCHEMA_PROMPT_PATH=path_to_your_target_schema_prompt.txt
CHUNK_EVALUATOR_PATH=path_to_your_chunk_evaluator.txt
PYTHON_SCRIPT_GENERATOR_PATH=path_to_your_python_script_generator.txt
Generalized Scraper requires the following Python packages:
- python-dotenv
- openai
- selenium
- beautifulsoup4
- tiktoken
These can be installed from requirements.txt using pip:
pip install -r requirements.txt
The script starts by loading environment variables using the dotenv package.
It initializes the OpenAI client using the API key from environment variables.
The paths to the prompt files (TARGET_SCHEMA_PROMPT_PATH, CHUNK_EVALUATOR_PATH, and PYTHON_SCRIPT_GENERATOR_PATH) are read from environment variables.
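As a rough illustration of this setup step, the sketch below checks that the four variables from .env.example are present before anything else runs. `load_config` and `REQUIRED_VARS` are hypothetical names, not taken from the script; in practice, python-dotenv's `load_dotenv()` would be called first so that the values from .env appear in `os.environ`.

```python
import os

# Variable names taken from .env.example.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "TARGET_SCHEMA_PROMPT_PATH",
    "CHUNK_EVALUATOR_PATH",
    "PYTHON_SCRIPT_GENERATOR_PATH",
)


def load_config(env=None):
    """Return the required settings, or raise if any are missing or empty.

    Hypothetical helper: the real script may read these values differently.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing early with the list of missing names tends to be easier to debug than a later error from the OpenAI client or from opening a prompt file.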
Generalized Scraper contains several utility functions:
- read_file_content: Reads the content of a file.
- calculate_token_count: Calculates the number of tokens in a text.
- calculate_word_count: Calculates the number of words in a text.
- estimate_memory_size: Estimates the memory size of a text in MB.
- estimate_page_count: Estimates the number of pages based on character count.
- save_to_file: Saves text to a file.
- process_chunks: Processes chunks of text using OpenAI's GPT models.
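To give a sense of what the simpler helpers might look like, here is a minimal sketch using only the standard library. The function names come from the list above, but the signatures, the UTF-8 encoding, and the `CHARS_PER_PAGE` value are assumptions; the actual script may differ. `calculate_token_count` is left out because it depends on tiktoken.

```python
import math

# Hypothetical constant: this README does not say how many characters make a page.
CHARS_PER_PAGE = 3000


def read_file_content(path):
    """Return the text of a file, decoded as UTF-8."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def save_to_file(text, path):
    """Write text to a file, replacing any existing content."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def calculate_word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def estimate_memory_size(text):
    """Size of the UTF-8 encoding in MB (1 MB = 1024 * 1024 bytes)."""
    return len(text.encode("utf-8")) / (1024 * 1024)


def estimate_page_count(text, chars_per_page=CHARS_PER_PAGE):
    """Number of pages, rounded up, at chars_per_page characters per page."""
    return math.ceil(len(text) / chars_per_page)
```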
The main function performs the following steps:
- Retrieves HTML content from a specified URL.
- Parses and prettifies the HTML content using BeautifulSoup.
- Calculates various metrics (token count, word count, memory size, page count).
- Divides the content into chunks based on user input.
- Saves each chunk to a file.
- Reads system prompts from configuration files.
- Processes the chunks and generates responses.
- Concatenates and saves the final response.
- Evaluates the final response using another prompt.
- Combines the evaluation summary with the final response and saves it.
- Generates a Python script based on the combined response.
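The chunking and processing steps above can be sketched as follows. This assumes the user input is a number of chunks and that the text is split by character position, neither of which this README confirms. `split_into_chunks` is a hypothetical name, and `complete` is a hypothetical callable standing in for the request to the OpenAI API, so the sketch runs without network access.

```python
def split_into_chunks(text, num_chunks):
    """Split text into num_chunks contiguous pieces of near-equal length."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    size, extra = divmod(len(text), num_chunks)
    chunks, start = [], 0
    for index in range(num_chunks):
        # The first `extra` chunks take one additional character each.
        end = start + size + (1 if index < extra else 0)
        chunks.append(text[start:end])
        start = end
    return chunks


def process_chunks(chunks, system_prompt, complete):
    """Send each chunk to the model and join the replies with newlines.

    `complete(system_prompt, chunk)` is an assumed interface: any callable
    that returns the model's reply as a string.
    """
    responses = [complete(system_prompt, chunk) for chunk in chunks]
    return "\n".join(responses)
```

For example, `split_into_chunks("abcdefghij", 3)` yields `["abcd", "efg", "hij"]`. Splitting by character position can cut an HTML tag in half, so a real implementation might prefer to break on element boundaries or on token counts.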
Generalized Scraper includes basic error handling for file operations and HTTP requests.
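One possible shape for this kind of error handling is sketched below. It is an illustration only: both function names are hypothetical, and `urllib` is used as a standard-library stand-in for whatever retrieval mechanism the script actually uses (selenium is among its dependencies).

```python
import urllib.request


def read_file_or_none(path):
    """Return the file's text, or None if it cannot be read."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read()
    except OSError as error:
        print(f"Could not read {path}: {error}")
        return None


def fetch_html_or_none(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError) as error:
        # urllib's URLError and HTTPError are subclasses of OSError;
        # ValueError is raised for a URL with no recognizable scheme.
        print(f"Could not fetch {url}: {error}")
        return None
```

Returning None lets the caller decide whether to stop or continue; raising a custom exception would be an equally reasonable choice.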
To use the script:
- Configure the environment variables in a .env file based on the provided .env.example.
- Install the required Python packages.
- Run the script:
python generalized_scraper.py
Generalized Scraper provides an automated way to process and evaluate HTML content from a web page. By splitting the content into chunks and passing them to OpenAI's GPT models, it can handle pages that are too large to process in a single request and generate an evaluation and a Python script from the result. Feedback and suggestions for improvement are welcome.