
Making Data access more uniform #55

Open
gordonwatts opened this issue Dec 12, 2022 · 0 comments
Labels: enhancement (New feature or request)

gordonwatts (Member) commented:

It would be nice to have a more carefully thought out and straightforward way to ask for data to come back from ServiceX; in short, to normalize the access patterns for ServiceX. The current interface has grown organically: there are now so many operations that it is hard to surface them consistently from one place to another. Time to take a step back, perhaps.

What we have now

        from servicex import ServiceXDataset
        from func_adl_servicex import ServiceXSourceUpROOT

        # Dataset handle and a func_adl source over the 'mini' tree
        sx = ServiceXDataset([uproot_single_file],
                             backend_name=endpoint_uproot,
                             status_callback_factory=None)
        src = ServiceXSourceUpROOT(sx, 'mini')

        # Select one branch and materialize it as an awkward array
        r = (src.Select(lambda e: {'lep_pt': e['lep_pt']})
                .AsAwkwardArray()
                .value())

AsAwkwardArray can be replaced by a number of other terminal methods (a sketch of swapping one in follows the list):

  • AsPandasDF, as_pandas - a pandas DataFrame (this does not support nested objects!)
  • AsROOTTTree, as_ROOT_tree - a list of file(s) containing a ROOT TTree object
  • AsParquetFiles, as_parquet - a list of parquet file(s) built with awkward's to_parquet method
  • AsAwkwardArray, as_awkward - an awkward array of all the data
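
For example, swapping the terminal call is all it takes to get the same column back as a flat table rather than an awkward array. This is a sketch only; the exact arguments each terminal method accepts are not spelled out above and may differ:

        # Same query as before, but ending in AsPandasDF instead of AsAwkwardArray
        df = (src.Select(lambda e: {'lep_pt': e['lep_pt']})
                 .AsPandasDF()
                 .value())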

These methods do not return the actual data - just the request to generate it. The value() call at the end is what actually triggers the infrastructure to generate the data. There is an asynchronous version, value_async(), that does the same thing but makes it easy to queue up many requests at once (see the sketch below).
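
A minimal sketch of the async pattern, assuming the interface shown above (the second dataset name, another_dataset, is just a placeholder):

        import asyncio

        async def fetch(ds_name):
            # One query per dataset; value_async() returns once ServiceX has delivered
            sx = ServiceXDataset([ds_name], backend_name=endpoint_uproot)
            src = ServiceXSourceUpROOT(sx, 'mini')
            return await (src.Select(lambda e: {'lep_pt': e['lep_pt']})
                             .AsAwkwardArray()
                             .value_async())

        async def fetch_all(names):
            # Queue up many requests at once and let them run in parallel
            return await asyncio.gather(*(fetch(n) for n in names))

        results = asyncio.run(fetch_all([uproot_single_file, another_dataset]))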

There are at least two axes here:

  • What data format should come back from the ServiceX query
  • Should the programming interface be synchronous or asynchronous?

There is yet another axis for the ROOT and parquet queries - do you want the files downloaded locally into a cache, or just a URI to access them over the web? This option is only accessible via direct calls to the servicex library (e.g. see get_root_files_async, get_root_files_stream, get_data_rootfiles_uri_stream, and get_data_rootfiles_uri_async); a sketch of the two flavors follows.
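
A sketch of those two flavors, assuming the servicex-library methods named above take the translated selection query as their argument (query here is a placeholder for that string; the exact signatures may differ):

        async def fetch_files(query):
            sx = ServiceXDataset([uproot_single_file], backend_name=endpoint_uproot)

            # Download the output ROOT files into the local cache; local paths come back
            local_files = await sx.get_root_files_async(query)

            # Or skip the download and get URIs pointing at the files over the web
            uris = await sx.get_data_rootfiles_uri_async(query)

            return local_files, uris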

What do users of func_adl want?

Let's look at each one and reason about why different choices are made.

  • Data Format
    • What the user is familiar with ("what is a parquet file?")
    • How the data will be consumed downstream of ServiceX. Input to awkward distributed processing?
  • URIs to files (for ROOT and parquet) or locally copied files?
    • At an analysis facility one probably wants the files downloaded locally. On a laptop, developing downstream code, the local-file option is also what one wants.
    • Downstream code will be quite opinionated about what it can and can't use.
  • Async or sync access
    • One or many datasets? With many datasets one probably wants to parallelize access.
    • In a notebook one almost certainly wants sync access to make a "demo" work well.
  • Streaming URIs
    • Only interesting if downstream access can process a stream of URIs.
    • Uses well-understood but not widely known async streaming infrastructure in Python (see the sketch after this list).
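
A sketch of what consuming such a stream looks like, assuming get_data_rootfiles_uri_stream (named above) is an async generator and process_file is a placeholder for downstream handling:

        async def consume(sx, query):
            # Handle each file as soon as ServiceX finishes it, instead of
            # waiting for the whole transform to complete
            async for uri in sx.get_data_rootfiles_uri_stream(query):
                process_file(uri)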

Starting from Scratch

@gordonwatts gordonwatts added the enhancement New feature or request label Dec 12, 2022
@gordonwatts gordonwatts self-assigned this Dec 12, 2022