Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(raw): add arrow support in backend #2083

Open
wants to merge 27 commits into
base: dev
Choose a base branch
from

Conversation

MartinBelthle
Copy link
Contributor

@MartinBelthle MartinBelthle commented Jul 5, 2024

There are few things to note when using the arrow format:
1- For output files the index is now inside the dataframe as the column "Index".
2- The columns names are now in string (forced by pyarrow)

@@ -59,6 +60,13 @@
".json": ("application/json", "utf-8"),
}


class MATRIX_FORMAT(EnumIgnoreCase):
JSON = "json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Selon les cas d'usage dans l'application, vous pouvez avoir "json-split" (cas général) ou "json-index" (cas des Table Form du scenario builder, par exemple).

Exemple :

>>> import pandas as pd
>>> df = pd.DataFrame(data=[[1, 2, 3.14], [4, 5, 6.18]], index=["00:00", "01:00"], columns=["TS-1", "TS-2", "TS-3"])
>>> df.to_json(orient="split")
'{"columns":["TS-1","TS-2","TS-3"],"index":["00:00","01:00"],"data":[[1,2,3.14],[4,5,6.18]]}'
>>> df.to_json(orient="index")
'{"00:00":{"TS-1":1,"TS-2":2,"TS-3":3.14},"01:00":{"TS-1":4,"TS-2":5,"TS-3":6.18}}'

antarest/study/web/raw_studies_blueprint.py Outdated Show resolved Hide resolved
antarest/study/storage/abstract_storage_service.py Outdated Show resolved Hide resolved
antarest/study/web/raw_studies_blueprint.py Outdated Show resolved Hide resolved
else:
real_format = "json" if formatted else "bytes"

output = study_service.get(uuid, path, depth=depth, format=real_format, params=parameters)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a global point of view, I think it's weird to push the formatting down all the call stack to serialize the matrix there.

Usually, application manipulate "objects" in memory, and then we serialize it according to the requested format only when we need it.

So here for example, I think it would make more sense to retrieve a DataFrame object, and serialize it here.

Do you think it's possible ? Maybe in another PR because it has a bigger impact on the codebase ?
In particular, we still want matrices to be formatted as JSON when getting a whole tree from the file tree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's possible and I agree it's a better way to do it. I tried inside this PR but it's too big of a work so we should tackle it inside another PR.

Copy link
Contributor

@laurent-laporte-pro laurent-laporte-pro Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember that, some binary files in the study directory are not DataFrame at all : *.ico, XML files, user-defined files stored in the user directory…

@@ -174,33 +155,63 @@ def test_get_study(

# If we ask for a matrix, we should have a JSON content if formatted is True
rel_path = "/input/links/de/fr"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably test it with an output matrix too, since they are handled by different code paths

Copy link
Contributor Author

@MartinBelthle MartinBelthle Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've duplicated the input tests to do the output matrix ones. The code of the test is a bit more complicated but it doesn't bother me.

antarest/study/web/raw_studies_blueprint.py Outdated Show resolved Hide resolved
@MartinBelthle MartinBelthle force-pushed the add-arrow-support-in-backend branch from 850cff5 to 9ac959f Compare July 18, 2024 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants