Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(matrices): allow other formats for internal matrices storage #2113

Open
wants to merge 39 commits into
base: dev
Choose a base branch
from

Conversation

MartinBelthle
Copy link
Contributor

@MartinBelthle MartinBelthle commented Aug 7, 2024

This PR does several things:

  • Introduces a new field inside the application.yaml: matrixstore_format that dictates the internal storage format. Default value is still tsv to ensure backward compatibility
  • When a new format is selected, and an user saves a matrix:
    • If the matrix didn't exist, it will be saved in the new format
    • If it already existed, we will migrate the matrix from its existing format to the new one. This way we'll migrate the matrixstore on the fly without having to interrupt the app.
  • It also introduces a script that can be run if one day we want to migrate all matrices at once in a new format.

@MartinBelthle MartinBelthle force-pushed the feat/store-matrices-as-hdf5 branch from b986251 to 0f0a6eb Compare August 19, 2024 07:38
@sylvlecl sylvlecl marked this pull request as draft August 19, 2024 08:41
@laurent-laporte-pro
Copy link
Contributor

Hi @MartinBelthle,

Given that the script successfully reduces the size of the matrices folder and needs to be run while the app is down, I was thinking it might be a good idea to integrate this script into the application startup process—specifically during the FastAPI setup phase.

This way, the migration would happen automatically each time the app starts. Since the script won’t do anything once all the TSV files have been converted to HDF5, it wouldn’t matter if it runs multiple times. If it doesn’t find any TSV files, it simply won’t perform any actions.

Of course, we'd need to ensure there’s enough space for the migration, especially on production where the data size could be much larger. Testing on integration and recette environments first, as you suggested, would be crucial.

What do you think about this approach?

Here is a possible implementation using a FastAPI event:

from fastapi import FastAPI

app = FastAPI()

def migrate_tsv_to_hdf5():
    print("Migrating TSV files to HDF5 format...")

@app.on_event("startup")
async def startup_event():
    migrate_tsv_to_hdf5()
    print("Startup event completed.")

@MartinBelthle
Copy link
Contributor Author

MartinBelthle commented Sep 20, 2024

Indeed I think that's a better way to do it.
I'll try to implement this behavior.

@MartinBelthle
Copy link
Contributor Author

I believe that with the new solution Laurent proposed, this PR is mature and can be reviewed.

@MartinBelthle MartinBelthle marked this pull request as ready for review September 20, 2024 09:28
@MartinBelthle
Copy link
Contributor Author

Seen with Sylvain, we have to discuss on this

@MartinBelthle MartinBelthle marked this pull request as draft October 7, 2024 08:52
@MartinBelthle MartinBelthle marked this pull request as ready for review November 18, 2024 16:55
@MartinBelthle MartinBelthle changed the title feat(matrices): store input matrices in hdf5 format instead of tsv feat(matrices): allow other formats for internal matrices storage Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants