This project is a web scraping tool designed to extract and organize company data from the Estonian web directory, Neti.ee. The parser automates the process of gathering information such as registration codes, VAT numbers, addresses, and contact details from Neti.ee's business listings, and stores the data in a structured format using a database or CSV files.
The Neti.ee Company Parser is built to efficiently scrape business information from the Neti.ee website. It leverages Selenium for web automation, Peewee for database management, and Python's multiprocessing capabilities to handle large-scale data scraping and parsing tasks concurrently.
- Automated Web Scraping: Uses Selenium WebDriver to automate the browsing and extraction of company data from Neti.ee.
- Multithreading Support: Utilizes Python's
multiprocessing
library to enhance performance by processing multiple pages concurrently. - Data Storage: Extracted data is stored in an SQLite database using the Peewee ORM or can be exported to CSV files for further analysis.
- Configurable: Supports different operating systems (Windows and Linux) and can be configured with various service arguments and capabilities for Selenium.
- Error Handling: Includes basic exception handling to manage errors during web scraping and database operations.
To run the project locally, follow these steps:
-
Clone the Repository:
git clone https://github.com/your-username/Neti.ee-company-parser.git cd Neti.ee-company-parser
-
Install Dependencies: Make sure to have Python and pip installed, then run:
pip install -r requirements.txt
-
Configure Settings: Adjust the configuration settings in the
config.py
file, such as the operating system type and PhantomJS path. -
Run the Parser: Execute the main script to start parsing:
python main.py
- Python 3.x
- Selenium WebDriver
- PhantomJS (for headless browsing)
- Peewee (for database operations)
- Pandas (for CSV export)
Parser.py
: Contains the main parsing logic and web scraping functions.Database.py
: Handles database connections and operations using Peewee ORM.Logger.py
: A simple logger for tracking scraping progress and errors.CompanyStruct.py
: Defines theCompanyStruct
class for managing company data objects.config.py
: Configuration file for setting up scraping parameters and Selenium capabilities.main.py
: Entry point to start the parsing process.
To scrape company data from a specific Neti.ee category:
- Open
main.py
. - Update the
url
variable with the desired category URL from Neti.ee. - Run the script to start scraping and saving the data.
Contributions are welcome! Please submit a pull request or open an issue to suggest improvements or report bugs.
If you have any questions or suggestions, you can contact me via email: [email protected]
.