
Implement deployment API for pandas and polars #5

Open
johnnylarner opened this issue Aug 18, 2023 · 3 comments
@johnnylarner
Owner

Description

Our business logic is implemented in two modules, one for polars and one for pandas. Both modules expose the same set of public functions, which can be imported into a script to run data transformations. This common API means we can parameterise our imports at runtime and use a single script for both modules.

However, in its current implementation the script also calls static and instance methods of the polars and pandas public APIs directly. We need to design a wrapper API which:

  1. Takes one input parameter, based on which:
  2. Either module can be imported
  3. Reading parquet and writing CSV can be called
  4. Display functions for logging are covered
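These requirements could be met by a single entry point that resolves the backend module at runtime. A minimal sketch, using the hypothetical name `get_backend` and assuming `ppp.pandas` and `ppp.polars` expose the same public functions:

```python
from importlib import import_module


def get_backend(backend_name: str):
    """Return the ppp backend module for backend_name ("pandas" or "polars").

    Hypothetical sketch: assumes ppp.pandas and ppp.polars expose the same
    public functions (read_parquet, write_csv, display helpers, ...).
    """
    if backend_name not in {"pandas", "polars"}:
        raise ValueError(f"unsupported backend: {backend_name}")
    return import_module(f"ppp.{backend_name}")
```

A script would then call, say, `get_backend(backend).read_parquet(path)` without ever importing pandas or polars directly.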

Acceptance criteria

  1. The functionality in the description is covered
  2. Unit tests for each component
  3. An integration test exists that demonstrates usage in a trivial example
  4. The API is available via the ppp module

Out of scope

  1. The API does not need to support PySpark, but it would be nice to keep it in mind in the design.
@chberreth chberreth self-assigned this Aug 18, 2023
@chberreth
Collaborator

Hi @johnnylarner, as I mentioned yesterday, I have already started on this task. I will prepare a first version and open a draft PR. Based on that we can discuss whether it matches what you originally had in mind and what is missing... :-)

@chberreth
Collaborator

chberreth commented Aug 18, 2023

Hey @johnnylarner, one question: in "3. Reading parquet and writing CSV can be called", is there a typo? Do we need a write_csv method, or did you mean read_csv?

@chberreth
Collaborator

chberreth commented Aug 20, 2023

During implementation we encountered two possible approaches.

Option 1: Simple wrapper functions in polars.py and pandas.py

For each method we need, we could add a wrapper in ./src/ppp/polars.py and ./src/ppp/pandas.py and call it from our feature_engineering.py script, e.g.:

# % pandas.py
import pandas

def read_parquet(file_path: str) -> pandas.DataFrame:
    return pandas.read_parquet(file_path)

# % polars.py
import polars

def read_parquet(file_path: str) -> polars.DataFrame:
    return polars.read_parquet(file_path)

# % feature_engineering.py
from importlib import import_module

mod = import_module("ppp." + mod_name)  # mod_name is "pandas" or "polars"
df = mod.read_parquet(parquet_path)

Upsides

  • A simple approach that lets us add all the methods we need quickly
  • The individual wrappers are easy to read and understand

Downsides

  • We have to write a wrapper for every method we need -> many wrapper functions
  • A lot of code duplication. As the example above shows, we could easily use just one function, since the method name is the same and the parameters we need (the file path) exist in both public APIs
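The duplication noted here could be collapsed into one shared function that resolves the module by name at runtime. A sketch (the `backend` parameter is an illustrative assumption, not part of the current codebase):

```python
from importlib import import_module


def read_parquet(file_path: str, backend: str):
    """Dispatch to <backend>.read_parquet.

    Works for any importable module that exposes a read_parquet function
    taking a file path, e.g. pandas or polars.
    """
    return import_module(backend).read_parquet(file_path)
```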

Option 2: Fewer, more complex wrappers that are reusable across use cases

We write fewer but more complex wrappers. API-specific parameters are set in the config file, which steers the behavior of the wrapper, e.g. which public API should be used. The example below sketches the idea.

# % common.py
import importlib

def read_file(config):
    reader_settings = config["reader_settings"]
    module = importlib.import_module(reader_settings["module_name"])
    reader_method = getattr(module, reader_settings["method_name"])
    return reader_method(**reader_settings["read_kwargs"])

# % feature_engineering.py
config = load_config(CONFIG_PATH)
df = read_file(config)
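For illustration, the config shape this implies might look as follows. Here the stdlib `json` module stands in for pandas/polars so the getattr-based dispatch can be run end to end without either library; the key names are taken from the snippet above:

```python
import importlib

# Hypothetical config matching the read_file sketch; "json"/"loads" stand in
# for e.g. "pandas"/"read_parquet" so the snippet is self-contained.
config = {
    "reader_settings": {
        "module_name": "json",
        "method_name": "loads",
        "read_kwargs": {"s": '{"a": 1}'},
    }
}

settings = config["reader_settings"]
module = importlib.import_module(settings["module_name"])
reader = getattr(module, settings["method_name"])
result = reader(**settings["read_kwargs"])  # {'a': 1}
```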

Upsides

  • Fewer wrappers are needed, as they can be reused for multiple methods.

Downsides

  • The complexity of the wrappers increases drastically, and so do the unit tests, since one always has to consider both APIs.
  • Methods like read_parquet may share a name across both APIs, and some parameters like the file path exist in both, but in general the available parameters differ.
  • The design and handling of the two APIs is not very similar. For example, when casting the data types of all columns in a data frame, the approaches differ so much that we have to cover each case separately anyway. A common wrapper is then not beneficial at all: it increases complexity without really unifying the code.
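To illustrate the casting point: pandas accepts a column-to-dtype mapping, while polars casts through column expressions. A small sketch (the polars part is shown as a comment so the snippet runs with pandas alone):

```python
import pandas as pd

pdf = pd.DataFrame({"a": ["1", "2"]})
pdf = pdf.astype({"a": "int64"})  # pandas: mapping of column -> dtype

# polars equivalent (expression-based, structurally quite different):
#   import polars as pl
#   plf = pl.DataFrame({"a": ["1", "2"]})
#   plf = plf.with_columns(pl.col("a").cast(pl.Int64))
```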

Decision

We go for Option 2 and see how it works out.
