FileNotFoundError when reading from Hugging Face #554

Closed
lhoestq opened this issue Oct 30, 2024 · 1 comment · Fixed by #555
lhoestq commented Oct 30, 2024

Description

Hi, I'm Quentin from HF :) I wanted to play with datachain after #375 by @dberenbaum, but I'm getting this error:

from datachain import DataChain

DataChain.from_csv("hf://datasets/infinite-dataset-hub/MobilePlanAssistant/data.csv").show()
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-2-1e396698d13d> in <cell line: 3>()
      1 from datachain import DataChain
      2 
----> 3 DataChain.from_csv("hf://datasets/infinite-dataset-hub/MobilePlanAssistant/data.csv").show()

5 frames
/usr/local/lib/python3.10/dist-packages/datachain/lib/dc.py in from_csv(cls, path, delimiter, header, output, object_name, model_name, source, nrows, session, settings, column_types, **kwargs)
   1860             convert_options=convert_options,
   1861         )
-> 1862         return chain.parse_tabular(
   1863             output=output,
   1864             object_name=object_name,

/usr/local/lib/python3.10/dist-packages/datachain/lib/dc.py in parse_tabular(self, output, object_name, model_name, source, nrows, **kwargs)
   1743         if col_names or not output:
   1744             try:
-> 1745                 schema = infer_schema(self, **kwargs)
   1746                 output = schema_to_output(schema, col_names)
   1747             except ValueError as e:

/usr/local/lib/python3.10/dist-packages/datachain/lib/arrow.py in infer_schema(chain, **kwargs)
    112     schemas = []
    113     for file in chain.collect("file"):
--> 114         ds = dataset(file.get_path(), filesystem=file.get_fs(), **kwargs)  # type: ignore[union-attr]
    115         schemas.append(ds.schema)
    116     return pa.unify_schemas(schemas)

/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    792 
    793     if _is_path_like(source):
--> 794         return _filesystem_dataset(source, **kwargs)
    795     elif isinstance(source, (tuple, list)):
    796         if all(_is_path_like(elem) or isinstance(elem, FileInfo) for elem in source):

/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    474             fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
    475     else:
--> 476         fs, paths_or_selector = _ensure_single_source(source, filesystem)
    477 
    478     options = FileSystemFactoryOptions(

/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem)
    439         paths_or_selector = [path]
    440     else:
--> 441         raise FileNotFoundError(path)
    442 
    443     return filesystem, paths_or_selector

FileNotFoundError: /infinite-dataset-hub/MobilePlanAssistant/data.csv
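
A side note on the path in the error message: it is exactly what remains of the `hf://` URL once a generic URL parser has split off the scheme and treated `datasets` as the network location (the way a bucket name is handled in `s3://bucket/key`). The sketch below uses only the standard library to show this. Whether datachain splits the URL this way internally is an assumption and is not confirmed in this thread.

```python
from urllib.parse import urlsplit

url = "hf://datasets/infinite-dataset-hub/MobilePlanAssistant/data.csv"
parts = urlsplit(url)

# A generic parser sees "datasets" as the network location, not as part of the path.
print(parts.scheme)  # hf
print(parts.netloc)  # datasets
print(parts.path)    # /infinite-dataset-hub/MobilePlanAssistant/data.csv

# The path reported by the FileNotFoundError in the traceback.
error_path = "/infinite-dataset-hub/MobilePlanAssistant/data.csv"
matches_error = parts.path == error_path
print(matches_error)  # True
```

If that is what happens, the `datasets/` prefix that the Hub needs to locate the repository is lost before the path reaches pyarrow.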

It looks like _ensure_single_source ends up checking the path against a LocalFileSystem instead of the HfFileSystem.
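
To illustrate that suspicion: if the path from the error is looked up on the local disk rather than on the Hub, the lookup fails in the same way. The function below is a simplified, hypothetical stand-in written with `os.path`. It is not pyarrow's code; it only mirrors the "directory, file, or raise" shape visible in the traceback, and it assumes the path does not exist on the machine running it.

```python
import os


def ensure_single_source_local(path):
    """Hypothetical stand-in: resolve `path` against the local disk only."""
    if os.path.isdir(path):
        return ("directory", path)
    if os.path.isfile(path):
        return ("file", path)
    # Neither a local directory nor a local file: same failure as in the traceback.
    raise FileNotFoundError(path)


# Path exactly as reported in the error; it is meant to live on the Hub, not locally.
remote_path = "/infinite-dataset-hub/MobilePlanAssistant/data.csv"

raised = False
message = None
try:
    ensure_single_source_local(remote_path)
except FileNotFoundError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```

A check like this can only succeed when it runs against the filesystem that actually holds the file, which is why the filesystem object handed to pyarrow matters here.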

The same path works from pandas via fsspec:

>>> import pandas as pd
>>> df = pd.read_csv("hf://datasets/infinite-dataset-hub/MobilePlanAssistant/data.csv")
>>> df.head()
   idx                                         user_input  \
0    0                 Hi, I'm looking for a mobile plan.   
1    1   I need unlimited data and international calling.   
2    2            I want at least 10GB of data per month.   
3    3  That's too expensive, do you have anything che...   
4    4    I'm allergic to cats, will this affect my plan?   

                                        bot_response            labels  
0  Hello! I'd be happy to help you find the best ...          Greeting  
1  Great, do you have a preferred data limit and ...      Data Inquiry  
2  I found a plan with unlimited data and interna...   Plan Suggestion  
3  I found another plan with 8GB of data and inte...  Price Comparison  
4  I'm sorry, but my abilities are focused on mob...  Unexpected Topic 

Version Info

0.6.3
Python 3.10.12

@lhoestq lhoestq added the bug Something isn't working label Oct 30, 2024
@shcheklein shcheklein self-assigned this Oct 30, 2024
shcheklein commented

@lhoestq I created a draft PR to fix this: #555. Please try running it.

Btw, it would be great to catch up and discuss further steps and the kind of use cases you have in mind. We've recently made some progress in #516 as part of https://github.com/orgs/iterative/projects/518.
