Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alpha pipeline] Write to hub component #140

Merged
merged 3 commits into from
May 17, 2023

Conversation

PhilippeMoussalli
Copy link
Contributor

PR that adds the write to hub component.

Right now we're running to_parquet method twice, once within the component itself to write to the hub and another time in fondant to write the dataset. We should work on implementing a DataWriter class huggingface/huggingface_hub#138.

Another thing that we need to configure is how to properly pass secrets huggingface/huggingface_hub#139. Currently the hf token is passed as an argument that is visible in kubeflow to all users who have access to it.

Linking also the issue for properly representing submitted images on the hub (right now they are represented as byte strings:
huggingface/datasets#5869

@PhilippeMoussalli PhilippeMoussalli self-assigned this May 16, 2023
@PhilippeMoussalli PhilippeMoussalli added the Components Implementation of components label May 16, 2023
@PhilippeMoussalli PhilippeMoussalli added this to the Alpha milestone May 16, 2023
@PhilippeMoussalli PhilippeMoussalli linked an issue May 16, 2023 that may be closed by this pull request
Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PhilippeMoussalli!

Now this component isn't really generic though. Would be great if we could automatically define which data to write based on the component spec instead of hardcoded.

@@ -0,0 +1,57 @@
"""
This component write an images dataset to the hub.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This component write an images dataset to the hub.
This component writes an images dataset to the hub.

@PhilippeMoussalli
Copy link
Contributor Author

Thanks @PhilippeMoussalli!

Now this component isn't really generic though. Would be great if we could automatically define which data to write based on the component spec instead of hardcoded.

I think for this we will need to implement the DataWriter class. Don't see how this could be done within the component if we don't have access to the subsets/index columns

@RobbeSneyders
Copy link
Member

We do have access to the component spec as an attribute on the component class. Agree that we'll want to refactor this into a DataWriter class though.

For me it's fine to implement this as a custom component first as well, so it works in the pipeline, and we can do the work to make it reusable later.

# login
huggingface_hub.login(token=hf_token)

# Create hub
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Create hub
# Create repository

# TODO: ADD Arguments for embedding component later on
# MODEL_ID = "openai/clip-vit-large-patch14"
# BATCH_SIZE = 10
write_to_hub_op = FondantComponentOp(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you going to include this op in the pipeline?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah indeed, gonna change the simple pipeline to the sd one and then i'll connect all the components together

@@ -0,0 +1,57 @@
"""
This component write an images dataset to the hub.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This component write an images dataset to the hub.
This component write an image dataset to the hub.

Comment on lines 19 to 20
*,
dataframe: dd.DataFrame,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*,
dataframe: dd.DataFrame,
dataframe: dd.DataFrame,
*,

It might be that you'll need to change this to this to run the component successfully

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be fine since the FondantTransformComponent is currently defined like this

def transform(self, *args, **kwargs) -> dd.DataFrame

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it's documented that the dataframe is passed in as a positional argument. We should change this to actually be the case 😛

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh 😅 i'll change it, though would opt for something better than documentation for enforcing certain constrains

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PhilippeMoussalli PhilippeMoussalli merged commit 4688014 into main May 17, 2023
@PhilippeMoussalli PhilippeMoussalli deleted the component/write-image-to-hub branch May 17, 2023 09:35
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
PR that adds the write to hub component. 

Right now we're running `to_parquet` method twice, once within the
component itself to write to the hub and another time in fondant to
write the dataset. We should work on implementing a `DataWriter` class
huggingface/huggingface_hub#138.

Another thing that we need to configure is how to properly pass secrets
huggingface/huggingface_hub#139. Currently the hf token is passed as an
argument that is visible in kubeflow to all users who have access to it.

Linking also the issue for properly representing submitted images on the
hub (right now they are represented as byte strings:
huggingface/datasets#5869
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Components Implementation of components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Write to Hugging Face hub component
3 participants