
Implement deployment API for pandas and polars #5

Open
johnnylarner opened this issue Aug 18, 2023 · 3 comments
@johnnylarner
Owner

Description

Our business logic is implemented in two modules, one for polars and one for pandas. Both modules expose the same set of public functions, which can be imported into a script to run data transformations. This common API means we can parameterise our imports at runtime and use a single script for both modules.

However, in its current implementation the script also calls static and instance methods of the polars and pandas public APIs directly. We need to design a wrapper API which:

  1. Takes one input parameter, based on which:
  2. Either module can be imported
  3. Reading parquet and writing CSV can be called
  4. Display functions for logging are covered
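These requirements could be met by a single entry point that resolves the backend module at runtime. A minimal sketch, using the hypothetical name `get_backend` and assuming `ppp.pandas` and `ppp.polars` expose the same public functions:

```python
from importlib import import_module


def get_backend(backend_name: str):
    """Return the ppp backend module for backend_name ("pandas" or "polars").

    Hypothetical sketch: assumes ppp.pandas and ppp.polars expose the same
    public functions (read_parquet, write_csv, display helpers, ...).
    """
    if backend_name not in {"pandas", "polars"}:
        raise ValueError(f"unsupported backend: {backend_name}")
    return import_module(f"ppp.{backend_name}")
```

A script would then call, say, `get_backend(backend).read_parquet(path)` without ever importing pandas or polars directly.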

Acceptance criteria

  1. The functionality in the description is covered
  2. Unit tests for each component
  3. An integration test exists that demonstrates usage in a trivial example
  4. The API is available via the ppp module

Out of scope

  1. The API does not need to support PySpark, but it would be nice to keep it in mind in the design.
@chberreth chberreth self-assigned this Aug 18, 2023
@chberreth
Collaborator

Hi @johnnylarner, as I mentioned yesterday, I have already started on this task. I will prepare a first version and open a draft PR. Based on that we can discuss whether it matches what you originally had in mind and what is missing... :-)

@chberreth
Collaborator

chberreth commented Aug 18, 2023

Hey @johnnylarner, one question: in "3. Reading parquet and writing CSV can be called", is there a typo? Do we need a write_csv method, or did you mean read_csv?

@chberreth
Collaborator

chberreth commented Aug 20, 2023

During implementation we encountered two possible approaches.

Option 1: Simple wrapper functions in polars.py and pandas.py

For each method we need, we could add a wrapper in ./src/ppp/polars.py and ./src/ppp/pandas.py and call it from our feature_engineering.py script, e.g.:

# % pandas.py
import pandas

def read_parquet(file_path: str) -> pandas.DataFrame:
    return pandas.read_parquet(file_path)

# % polars.py
import polars

def read_parquet(file_path: str) -> polars.DataFrame:
    return polars.read_parquet(file_path)

# % feature_engineering.py
from importlib import import_module

mod = import_module("ppp." + mod_name)  # mod_name is "pandas" or "polars"
df = mod.read_parquet(parquet_path)

Upsides

  • A simple approach that lets us add all the methods we need quickly
  • The individual wrappers are easy to read and understand

Downsides

  • We have to write a wrapper for every method we need -> many wrapper functions
  • A lot of code duplication. As the example above shows, we could easily use just one function, since the method name is the same and the parameters we need (the file path) exist in both public APIs
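The duplication noted here could be collapsed into one shared function that resolves the module by name at runtime. A sketch (the `backend` parameter is an illustrative assumption, not part of the current codebase):

```python
from importlib import import_module


def read_parquet(file_path: str, backend: str):
    """Dispatch to <backend>.read_parquet.

    Works for any importable module that exposes a read_parquet function
    taking a file path, e.g. pandas or polars.
    """
    return import_module(backend).read_parquet(file_path)
```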

Option 2: Fewer, more complex wrappers that are reusable across use cases

We write fewer but more complex wrappers. API-specific parameters are set in the config file, which steers the behavior of the wrapper, e.g. which public API should be used. The example below sketches the idea.

# % common.py
import importlib

def read_file(config):
    reader_settings = config["reader_settings"]
    module = importlib.import_module(reader_settings["module_name"])
    reader_method = getattr(module, reader_settings["method_name"])
    return reader_method(**reader_settings["read_kwargs"])

# % feature_engineering.py
config = load_config(CONFIG_PATH)
df = read_file(config)
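For illustration, the config shape this implies might look as follows. Here the stdlib `json` module stands in for pandas/polars so the getattr-based dispatch can be run end to end without either library; the key names are taken from the snippet above:

```python
import importlib

# Hypothetical config matching the read_file sketch; "json"/"loads" stand in
# for e.g. "pandas"/"read_parquet" so the snippet is self-contained.
config = {
    "reader_settings": {
        "module_name": "json",
        "method_name": "loads",
        "read_kwargs": {"s": '{"a": 1}'},
    }
}

settings = config["reader_settings"]
module = importlib.import_module(settings["module_name"])
reader = getattr(module, settings["method_name"])
result = reader(**settings["read_kwargs"])  # {'a': 1}
```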

Upsides

  • Fewer wrappers are needed, as they can be reused for multiple methods.

Downsides

  • The complexity of the wrappers increases drastically, and so do the unit tests, since one always has to consider both APIs.
  • Methods like read_parquet may share a name across both APIs, and some parameters like the file path exist in both, but in general the available parameters differ.
  • The design and handling of the two APIs is not very similar. For example, when casting the data types of all columns in a data frame, the approaches differ so much that we have to cover each case separately anyway. A common wrapper is then not beneficial at all: it increases complexity without really unifying the code.
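To illustrate the casting point: pandas accepts a column-to-dtype mapping, while polars casts through column expressions. A small sketch (the polars part is shown as a comment so the snippet runs with pandas alone):

```python
import pandas as pd

pdf = pd.DataFrame({"a": ["1", "2"]})
pdf = pdf.astype({"a": "int64"})  # pandas: mapping of column -> dtype

# polars equivalent (expression-based, structurally quite different):
#   import polars as pl
#   plf = pl.DataFrame({"a": ["1", "2"]})
#   plf = plf.with_columns(pl.col("a").cast(pl.Int64))
```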

Decision

We go for Option 2 and see how it works out.
