-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Alpha pipeline] Write to hub component #140
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @PhilippeMoussalli!
Now this component isn't really generic though. Would be great if we could automatically define which data to write based on the component spec instead of hardcoded.
@@ -0,0 +1,57 @@ | |||
""" | |||
This component write an images dataset to the hub. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This component write an images dataset to the hub. | |
This component writes an images dataset to the hub. |
I think for this we will need to implement the |
We do have access to the component spec as an attribute on the component class. Agree that we'll want to refactor this into a For me it's fine to implement this as a custom component first as well, so it works in the pipeline, and we can do the work to make it reusable later. |
# login | ||
huggingface_hub.login(token=hf_token) | ||
|
||
# Create hub |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Create hub | |
# Create repository |
# TODO: ADD Arguments for embedding component later on | ||
# MODEL_ID = "openai/clip-vit-large-patch14" | ||
# BATCH_SIZE = 10 | ||
write_to_hub_op = FondantComponentOp( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you going to include this op in the pipeline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah indeed, gonna change the simple pipeline to the sd one and then i'll connect all the components together
@@ -0,0 +1,57 @@ | |||
""" | |||
This component write an images dataset to the hub. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This component write an images dataset to the hub. | |
This component write an image dataset to the hub. |
*, | ||
dataframe: dd.DataFrame, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*, | |
dataframe: dd.DataFrame, | |
dataframe: dd.DataFrame, | |
*, |
It might be that you'll need to change this to this to run the component successfully
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be fine since the FondantTransformComponent is currently defined like this
def transform(self, *args, **kwargs) -> dd.DataFrame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it's documented that the dataframe is passed in as a positional argument. We should change this to actually be the case 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohh 😅 i'll change it, though would opt for something better than documentation for enforcing certain constrains
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @PhilippeMoussalli!
PR that adds the write to hub component. Right now we're running `to_parquet` method twice, once within the component itself to write to the hub and another time in fondant to write the dataset. We should work on implementing a `DataWriter` class huggingface/huggingface_hub#138. Another thing that we need to configure is how to properly pass secrets huggingface/huggingface_hub#139. Currently the hf token is passed as an argument that is visible in kubeflow to all users who have access to it. Linking also the issue for properly representing submitted images on the hub (right now they are represented as byte strings: huggingface/datasets#5869
PR that adds the write to hub component.
Right now we're running
to_parquet
method twice, once within the component itself to write to the hub and another time in fondant to write the dataset. We should work on implementing aDataWriter
class huggingface/huggingface_hub#138.Another thing that we need to configure is how to properly pass secrets huggingface/huggingface_hub#139. Currently the hf token is passed as an argument that is visible in kubeflow to all users who have access to it.
Linking also the issue for properly representing submitted images on the hub (right now they are represented as byte strings:
huggingface/datasets#5869