This Python script scrapes articles from a list of URLs and saves the content in Markdown format. It provides a simple graphical user interface (GUI) for selecting the URLs file, the configuration file, and the directory where the scraped articles will be saved. It also logs all operations to a log.txt file in the same directory as the scraped articles.
- Scrapes articles: Extracts article content from a list of URLs.
- Markdown conversion: Converts HTML content to Markdown format.
- GUI: Provides an easy-to-use interface for selecting files and directories.
- Console Output: Displays real-time console output in the GUI.
- Log File: Saves console output to log.txt in the save directory.
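
As a rough illustration of the Log File feature, the sketch below sends every message both to the console and to log.txt in the save directory, using only the standard `logging` module. The function name `make_logger` and the logger name `scraper_sketch` are hypothetical and not taken from the script, and the sketch writes to the terminal rather than to a GUI widget.

```python
import logging
from pathlib import Path


def make_logger(save_dir):
    """Return a logger that writes to the console and to save_dir/log.txt."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger("scraper_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)

    # Drop handlers left over from an earlier call so lines are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handlers = (
        logging.StreamHandler(),  # console output
        logging.FileHandler(save_dir / "log.txt", encoding="utf-8"),  # log file
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Calling `make_logger("some_dir").info("Scraping started")` would then print the line and append it to `some_dir/log.txt`.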
- Python 3.x
- Required Python packages (listed in requirements.txt):
  - requests
  - beautifulsoup4
  - markdownify
  - tkinter (included with Python on most systems)
git clone https://github.com/WastedInside/DelfiMarkdownScraper.git
cd DelfiMarkdownScraper
On Unix-based systems, run the installation script:
./install.sh
On Windows, run the installation script:
install.bat
Alternatively, you can manually install the required packages using pip:
pip install -r requirements.txt
- Prepare Files:
  - Create a text file containing the list of URLs you want to scrape (one URL per line).
  - Create a JSON configuration file specifying the HTML tags and class names used to locate the article title and content. Example:
    {
      "content_tag": "div",
      "content_class": "article-content",
      "title_tag": "h1",
      "title_class": "article-title"
    }
- Run the Script:
  - For Unix-based systems:
    python3 DelfiMarkdownScraper.py
  - For Windows:
    python DelfiMarkdownScraper.py
- Use the GUI:
  - Use the "Browse" buttons to select the URLs file and the save directory. Configuration files are loaded from the configs directory.
  - Click "Start Scraping" to begin the process.
- View Output:
  - Scraped articles are saved in Markdown format in the selected save directory.
  - The console output is displayed in the GUI and saved to log.txt in the same directory.
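
The two input files from the Prepare Files step could be read and checked along these lines. This is a minimal sketch, not the script's actual code: `load_urls`, `load_config`, and `REQUIRED_KEYS` are hypothetical names, and the check assumes that all four keys from the example configuration are required.

```python
import json
from pathlib import Path

# Keys used in the example configuration; assumed here to all be required.
REQUIRED_KEYS = ("content_tag", "content_class", "title_tag", "title_class")


def load_urls(path):
    """Return the non-empty lines of a URLs file, stripped of whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def load_config(path):
    """Load a JSON config and fail early if an expected key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {', '.join(missing)}")
    return config
```

Validating the keys up front means a missing or misspelled key is reported before any URL is requested, rather than partway through a scraping run.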
{
"content_tag": "div",
"content_class": "article-body-container row",
"title_tag": "h1",
"title_class": "text-size-1 bg-delfi-black fw-semibold"
}
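
To show how a configuration like the one above can drive extraction, here is a small sketch that pulls the text of the first element matching a tag and class. It deliberately uses the standard-library `html.parser` as a dependency-free stand-in for beautifulsoup4, which the script lists in its requirements, so the script's real matching behavior may differ. `ConfiguredTextExtractor`, `extract_text`, and `SAMPLE` are hypothetical names, and treating a multi-class value as "all of these classes must be present" is an assumption of this sketch.

```python
from html.parser import HTMLParser


class ConfiguredTextExtractor(HTMLParser):
    """Collect text inside the first element matching a tag and class list."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.wanted = set(class_name.split())
        self.depth = 0  # nesting depth of `tag` while inside the match
        self.parts = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag != self.tag:
            return
        if self.depth:
            self.depth += 1  # same tag nested inside the matched element
        else:
            classes = set((dict(attrs).get("class") or "").split())
            if self.wanted <= classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # only the first match is collected

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_text(html, tag, class_name):
    """Return whitespace-normalized text of the first matching element."""
    parser = ConfiguredTextExtractor(tag, class_name)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Made-up page fragment that uses the class names from the configuration.
SAMPLE = """
<h1 class="text-size-1 bg-delfi-black fw-semibold">Example title</h1>
<div class="article-body-container row">
  <div class="ad">Sponsored</div>
  <p>First paragraph.</p>
</div>
<div class="footer">Footer text</div>
"""

config = {
    "content_tag": "div",
    "content_class": "article-body-container row",
    "title_tag": "h1",
    "title_class": "text-size-1 bg-delfi-black fw-semibold",
}
title = extract_text(SAMPLE, config["title_tag"], config["title_class"])
body = extract_text(SAMPLE, config["content_tag"], config["content_class"])
```

Note that the sketch returns plain text only; converting the matched HTML to Markdown is a separate step that the script handles with markdownify.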
Feel free to fork this repository and submit pull requests.