DelfiMarkdownScraper - Article Scraper with GUI

This Python script scrapes articles from a list of URLs and saves their content in Markdown format. A simple graphical user interface (GUI) lets you select the URLs file, the configuration file, and the directory where the scraped articles will be saved. All operations are also logged to a log.txt file in that same directory.

Features

Scrapes articles: Extracts article content from a list of URLs.
Markdown conversion: Converts HTML content to Markdown format.
GUI: Provides an easy-to-use interface for selecting files and directories.
Console Output: Displays real-time console output in the GUI.
Log File: Saves console output to log.txt in the save directory.

Requirements

Python 3.x
Required Python packages (listed in requirements.txt):
- requests
- beautifulsoup4
- markdownify
- tkinter (included with Python on most systems)

Installation

1. Clone the Repository

git clone https://github.com/WastedInside/DelfiMarkdownScraper.git
cd DelfiMarkdownScraper

2. Install Dependencies

For Unix-based Systems (Linux, macOS):

Run the installation script:

./install.sh

For Windows:

Run the installation script:

install.bat

Alternatively, you can manually install the required packages using pip:

pip install -r requirements.txt

Usage

Prepare Files:
- Create a text file containing the list of URLs you want to scrape (one URL per line).
- Create a JSON configuration file specifying the HTML tag and class name used to locate the article title and the article content. Example:

{
    "content_tag": "div",
    "content_class": "article-content",
    "title_tag": "h1",
    "title_class": "article-title"
}
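
The two input files can be read with the standard library alone. The sketch below is not the script's actual code: the helper names `load_urls` and `load_config` are hypothetical, and it assumes that all four keys shown in the example are required. A check like this reports a misspelled key before any scraping starts.

```python
import json
from pathlib import Path

# Keys taken from the example configuration above; treating all four as
# required is an assumption of this sketch.
REQUIRED_KEYS = ("content_tag", "content_class", "title_tag", "title_class")


def load_urls(path):
    """Return the non-empty lines of a URLs file, one URL per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def load_config(path):
    """Load a JSON config and fail early if a required key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"config is missing keys: {', '.join(missing)}")
    return config
```

Blank lines in the URLs file are skipped, and a config with a mistyped key raises a `ValueError` that names the missing key.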

Run the Script:

For Unix-based systems:
```
python3 DelfiMarkdownScraper.py
```
For Windows:
```
python DelfiMarkdownScraper.py
```

Use the GUI:
- Use the "Browse" buttons to select the URLs file and the save directory. Configurations are loaded from the configs directory.
- Click "Start Scraping" to begin the process.

View Output:
- Scraped articles will be saved in Markdown format in the selected save directory.
- The console output will be displayed in the GUI and saved to log.txt in the same directory.
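
One way to produce this output is sketched below using only the standard library. The function names, the file-naming rule (a slug derived from the article title), and the logger name are assumptions made for illustration. The script itself may name files and format log lines differently.

```python
import logging
import re
from pathlib import Path


def make_logger(save_dir):
    """Return a logger that appends messages to log.txt inside save_dir."""
    logger = logging.getLogger("scraper_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.FileHandler(Path(save_dir) / "log.txt", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


def save_article(save_dir, title, markdown_body, logger):
    """Write one article as <slug>.md and record the operation in the log."""
    # Assumed naming rule: keep word characters, join the rest with hyphens.
    slug = re.sub(r"[^\w]+", "-", title).strip("-").lower() or "untitled"
    path = Path(save_dir) / f"{slug}.md"
    path.write_text(f"# {title}\n\n{markdown_body}\n", encoding="utf-8")
    logger.info("Saved %s", path.name)
    return path
```

Calling `save_article(save_dir, "Hello, World!", "Body text", make_logger(save_dir))` would write `hello-world.md` and append a "Saved hello-world.md" line to log.txt in the same directory.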

Example Configuration File for delfi.lv

{
    "content_tag": "div",
    "content_class": "article-body-container row",
    "title_tag": "h1",
    "title_class": "text-size-1 bg-delfi-black fw-semibold"
}

Contributing

Feel free to fork this repository and submit pull requests.
