To improve the existing code for parsing web pages from the Neti.ee website, you need to account for the possibility that the site has changed its HTML structure, class names, and the way it displays data. Here is a detailed improvement plan to help adapt the code to the current state of the Neti.ee site and make it more efficient and reliable:

Code Improvement Plan:

  1. Analyze the Current Structure of the Website:

    • Open the Neti.ee website and navigate to the sections you want to parse.
    • Use browser developer tools (e.g., Chrome DevTools) to analyze the current HTML structure.
    • Identify which HTML classes and elements are used to display company information.
  2. Update Element Selectors in the Code:

    • Update the selectors used in the parser code to match the current structure of the website.
    • For example, if the HTML classes have changed from fc-bi-regcode-value to something else, replace the old selectors with the new ones.
    • Check every place the code calls find_element_by_class_name, find_element_by_css_selector, or similar lookup methods; these helpers were deprecated and then removed in Selenium 4, so each call site must move to the find_element(By.<strategy>, value) form, as in the sketch below.
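
A minimal sketch of the updated selector style, reusing the class names from the existing code (verify them against the live page first):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def read_reg_code(url):
        """Fetch one field using the Selenium 4 selector API."""
        driver = webdriver.Chrome()
        try:
            driver.get(url)
            container = driver.find_element(By.CLASS_NAME, 'info-tabel')  # verify on the live site
            return container.find_element(By.CLASS_NAME, 'fc-bi-regcode-value').text
        finally:
            driver.quit()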
  3. Switch to More Modern Parsing Tools:

    • Consider using a more modern parsing tool, such as BeautifulSoup or Scrapy, instead of or in addition to Selenium.
    • This can significantly improve performance and simplify the code, especially if the pages do not require JavaScript to render; a sketch follows below.
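
If the company pages turn out to render without JavaScript, a requests + BeautifulSoup version could replace the browser entirely. A minimal sketch, reusing the class names from the existing code (these are assumptions to verify against the live site):

    import requests
    from bs4 import BeautifulSoup

    def parse_company_page(url):
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, 'html.parser')
        info = soup.find(class_='info-tabel')   # same container the Selenium code targets
        if info is None:
            return None                         # structure changed, or the page is JS-rendered

        def field(cls):
            node = info.find(class_=cls)
            return node.get_text(strip=True) if node else None

        return {
            'reg_code': field('fc-bi-regcode-value'),
            'KMKR': field('fc-bi-kmkr-value'),
            'address': field('fc-bi-address-value'),
            'email': field('fc-bi-contact-value'),
        }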
  4. Switch to a Modern Web Driver:

    • Update the browser driver to the latest version. For instance, PhantomJS is outdated; use Selenium with ChromeDriver or Firefox GeckoDriver instead.
    • Ensure that the web drivers are properly configured to work with the latest browser versions; a minimal setup is sketched below.
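
A possible replacement for a PhantomJS-based Parser.create_driver(), using headless Chrome; with Selenium 4.6+ the bundled Selenium Manager resolves a matching chromedriver automatically:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    def create_driver():
        options = Options()
        options.add_argument('--headless=new')         # modern headless mode, replaces PhantomJS
        options.add_argument('--window-size=1280,1024')
        return webdriver.Chrome(options=options)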
  5. Optimize Multithreading:

    • The current multithreading implementation uses ThreadPool; its error handling and connection management can be improved.
    • Use concurrent.futures.ThreadPoolExecutor for more flexible thread management.
    • Add exception handling and retry mechanisms for failures and page loading errors (see the sketch below).
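
A sketch of the executor-based approach, assuming parse_neti_company_page is the function explained later in this document:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def parse_all(urls, max_workers=4):
        results = []
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(parse_neti_company_page, u): u for u in urls}
            for future in as_completed(futures):
                url = futures[future]
                try:
                    data = future.result()          # re-raises any exception from the worker
                except Exception as exc:
                    print(f'{url} failed: {exc}')   # candidate for a retry queue
                    continue
                if data is not None:
                    results.append(data)
        return results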
  6. Error Handling and Logging:

    • Improve error handling by emitting more detailed error messages that record the conditions under which each failure occurred.
    • Use the standard logging module instead of plain print output to produce structured, multi-level logs.
    • Implement retry logic to handle temporary connection failures or page loading errors, as sketched below.
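
A sketch combining the logging module with a simple retry wrapper; the attempt count and backoff values here are arbitrary placeholders:

    import logging
    import time

    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(name)s: %(message)s')
    logger = logging.getLogger('neti_parser')

    def with_retries(func, url, attempts=3, delay=2.0):
        for attempt in range(1, attempts + 1):
            try:
                return func(url)
            except Exception:
                logger.warning('Attempt %d/%d failed for %s',
                               attempt, attempts, url, exc_info=True)
                time.sleep(delay * attempt)   # simple linear backoff
        logger.error('Giving up on %s', url)
        return None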
  7. Improve Performance and Resilience:

    • Add timeouts and wait limits to Selenium requests to prevent hanging on slow or unresponsive pages (see the wait sketch below).
    • Implement caching for data that rarely changes to reduce server load on Neti.ee and shorten parsing time.
    • Optimize database queries using transactions or batch operations when saving data.
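
A sketch of the timeout and explicit-wait part, which replaces the fixed time.sleep() used in the current code:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def load_info_table(url):
        driver = webdriver.Chrome()
        driver.set_page_load_timeout(30)   # abort pages that never finish loading
        try:
            driver.get(url)
            # Wait at most 10 s for the container instead of sleeping blindly.
            return WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'info-tabel'))).text
        finally:
            driver.quit()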
  8. Enhance Code Structure and Organization:

    • Break down the code into smaller modules and classes for better readability and ease of testing.
    • Use OOP (Object-Oriented Programming) approaches for structuring the code and data handling.
    • Update the code according to modern Python standards (PEP 8) and improve formatting for better readability.
  9. Documentation and Testing:

    • Improve code documentation by adding comments and docstrings to each method and class.
    • Develop and write unit tests for critical functions and methods.
    • Use an automated testing framework (e.g., pytest) to ensure the stability and correctness of the code after making changes; a minimal example follows.
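
A minimal pytest example; it assumes the dictionary-building step has been split into its own function (build_company_data is a hypothetical name) so it can be tested without a browser:

    # test_parser.py -- run with: pytest test_parser.py
    def build_company_data(reg_code, kmkr, address, email):
        """Hypothetical pure function extracted from the scraper."""
        return {'reg_code': reg_code, 'KMKR': kmkr,
                'address': address, 'email': email}

    def test_build_company_data():
        data = build_company_data('12345678', 'EE123456789',
                                  'Tallinn', 'info@example.com')
        assert data['KMKR'] == 'EE123456789'
        assert set(data) == {'reg_code', 'KMKR', 'address', 'email'}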
  10. Support for New Data Formats:

    • Add the ability to save data in various formats, such as JSON, Excel, or SQL databases, for flexibility in data usage.
    • Improve the data export functions to support any new formats users may require; a sketch follows below.
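
A sketch of JSON and CSV export using only the standard library (an Excel export would additionally need a package such as openpyxl):

    import csv
    import json

    def export_results(results, json_path='companies.json', csv_path='companies.csv'):
        """results: list of company_data dictionaries produced by the parser."""
        with open(json_path, 'w', encoding='utf-8') as f:
            json.dump(results, f, ensure_ascii=False, indent=2)
        if results:
            with open(csv_path, 'w', encoding='utf-8', newline='') as f:
                writer = csv.DictWriter(f, fieldnames=results[0].keys())
                writer.writeheader()
                writer.writerows(results)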

Conclusion:

This plan will help you modernize the parser so that it matches the current state of the Neti.ee site while improving its performance, resilience, and usability. Implementing these changes should significantly raise the quality and efficiency of your parser.

The code you provided is a Python function designed to scrape information about a company from a web page using Selenium, a tool for automating web browsers. Here's a detailed explanation of how this code works:

Code Explanation

def parse_neti_company_page(url):
  • Function Definition: This defines a function named parse_neti_company_page that takes a single argument, url. This URL is expected to be a webpage containing the company information you want to scrape.
    # Moved out to fix multiprocessing problem - can't pickle local object
    print('Current url', url)
  • Comment: The comment suggests that some part of the code was refactored to fix an issue related to multiprocessing. In Python, certain objects (like local functions or objects) can't be pickled, which is necessary for multiprocessing. This comment might indicate that the driver creation code was moved to a different location to avoid such issues.

  • Print Statement: print('Current url', url) prints the current URL being processed to the console. This is useful for debugging and logging purposes.

    driver = Parser.create_driver()
  • Create WebDriver: This line calls a method create_driver() from an object or module named Parser. This method likely initializes a Selenium WebDriver (like ChromeDriver or FirefoxDriver). The WebDriver is what Selenium uses to automate browser actions, such as opening web pages and interacting with elements on the page.
    driver.get(url)
    time.sleep(1.5)
  • Open the Web Page: driver.get(url) instructs the WebDriver to navigate to the URL provided as an argument to the function. This effectively opens the web page in a browser window controlled by Selenium.

  • Wait for Page to Load: time.sleep(1.5) makes the script wait for 1.5 seconds. This is a simple way to ensure that the page has fully loaded before attempting to find any elements. A more robust solution would be to use WebDriver's implicit or explicit waits.
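
For example, a one-line implicit wait would make every subsequent element lookup poll for up to the given number of seconds before failing:

    driver.implicitly_wait(10)   # replaces time.sleep(1.5); applies to all lookups on this driver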

    try:
        business_info_raw = driver.find_element_by_class_name('info-tabel')
  • Try Block: The code within the try block is where the actual scraping happens. If any error occurs during this process, the except block will handle it.

  • Locate Main Container: driver.find_element_by_class_name('info-tabel') finds the first HTML element with the class name info-tabel. This is expected to be a container (e.g., a <div> or <table>) that holds all the business information.

        reg_code = business_info_raw.find_element_by_class_name('fc-bi-regcode-value').text
        KMKR = business_info_raw.find_element_by_class_name('fc-bi-kmkr-value').text
        address = business_info_raw.find_element_by_class_name('fc-bi-address-value').text
        email = business_info_raw.find_element_by_class_name('fc-bi-contact-value').text
  • Extract Specific Information:
    • Each line uses find_element_by_class_name to locate a specific child element within business_info_raw by its class name. The .text attribute retrieves the visible text content of the element.
    • reg_code: This retrieves the company's registration code.
    • KMKR: This retrieves the company's VAT number (KMKR is an Estonian abbreviation for VAT number).
    • address: This retrieves the company's address.
    • email: This retrieves the company's email address.
        company_data = {
            'reg_code': reg_code,
            'KMKR': KMKR,
            'address': address,
            'email': email
        }
  • Store Data in a Dictionary: All the extracted information is stored in a dictionary named company_data. This makes it easier to return and use the data later in the code.
        driver.close()
        driver.quit()
  • Close Browser:
    • driver.close() closes the current browser window.
    • driver.quit() shuts down the WebDriver entirely, closing all associated browser windows; calling quit() alone would actually suffice here. This is important to free up system resources and avoid potential memory leaks.
        return company_data
  • Return Extracted Data: If everything was successful, the function returns the company_data dictionary containing the scraped information.
    except Exception as e:
        print(e)
        print('Exception at: ', url)
        return None
  • Exception Handling:
    • except Exception as e: Catches any exception that occurs in the try block.
    • print(e): Prints the error message to the console for debugging purposes.
    • print('Exception at: ', url): Prints a message indicating which URL caused the exception.
    • return None: Returns None to indicate that the function failed to scrape data from the page. Note that this except path never closes the driver, so every failed page leaks a browser process; moving driver.quit() into a finally block would fix that.

Summary

This function automates a web browser to open a given URL, scrape specific pieces of information about a company (registration code, VAT number, address, and email), and then return that information in a dictionary. If it encounters any errors during the scraping process, it handles them gracefully by printing the error and returning None.

To use this function effectively:

  • Ensure that the Parser.create_driver() method is correctly defined and sets up the Selenium WebDriver appropriately.
  • Make sure to handle the None return value in your code where you call this function, as this indicates an error occurred during scraping.
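
For example (company_urls and save_company are hypothetical names for your URL list and persistence step):

    for url in company_urls:
        data = parse_neti_company_page(url)
        if data is None:
            print('Skipping', url)   # scrape failed; consider retrying later
            continue
        save_company(data)           # hypothetical: write to your database or file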