
Making Data access more uniform #55

Open
gordonwatts opened this issue Dec 12, 2022 · 0 comments
Labels: enhancement (New feature or request)

gordonwatts (Member) commented:

It would be nice to have a more carefully thought out and straightforward way to ask for data to come back from ServiceX; in short, to normalize the access patterns for ServiceX. The current interface has grown organically: there are now so many operations that it is hard to surface them consistently from one place to another. Time to take a step back, perhaps.

What we have now

        from servicex import ServiceXDataset
        from func_adl_servicex import ServiceXSourceUpROOT

        # Dataset handle and a func_adl source over the 'mini' tree
        sx = ServiceXDataset([uproot_single_file],
                             backend_name=endpoint_uproot,
                             status_callback_factory=None)
        src = ServiceXSourceUpROOT(sx, 'mini')

        # Select one branch and materialize it as an awkward array
        r = (src.Select(lambda e: {'lep_pt': e['lep_pt']})
                .AsAwkwardArray()
                .value())

AsAwkwardArray can be replaced by a number of other terminal methods (a sketch of swapping one in follows the list):

  • AsPandasDF, as_pandas - a pandas DataFrame (this does not support nested objects!)
  • AsROOTTTree, as_ROOT_tree - a list of file(s) containing a ROOT TTree object
  • AsParquetFiles, as_parquet - a list of parquet file(s) built with awkward's to_parquet method
  • AsAwkwardArray, as_awkward - an awkward array of all the data
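
For example, swapping the terminal call is all it takes to get the same column back as a flat table rather than an awkward array. This is a sketch only; the exact arguments each terminal method accepts are not spelled out above and may differ:

        # Same query as before, but ending in AsPandasDF instead of AsAwkwardArray
        df = (src.Select(lambda e: {'lep_pt': e['lep_pt']})
                 .AsPandasDF()
                 .value())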

These methods do not return the actual data - just the request to generate it. The value() call at the end is what actually triggers the infrastructure to generate the data. There is an asynchronous version, value_async(), that does the same thing but makes it easy to queue up many requests at once (see the sketch below).
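
A minimal sketch of the async pattern, assuming the interface shown above (the second dataset name, another_dataset, is just a placeholder):

        import asyncio

        async def fetch(ds_name):
            # One query per dataset; value_async() returns once ServiceX has delivered
            sx = ServiceXDataset([ds_name], backend_name=endpoint_uproot)
            src = ServiceXSourceUpROOT(sx, 'mini')
            return await (src.Select(lambda e: {'lep_pt': e['lep_pt']})
                             .AsAwkwardArray()
                             .value_async())

        async def fetch_all(names):
            # Queue up many requests at once and let them run in parallel
            return await asyncio.gather(*(fetch(n) for n in names))

        results = asyncio.run(fetch_all([uproot_single_file, another_dataset]))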

There are at least two axes here:

  • What data format should come back from the ServiceX query
  • Should the programming interface be synchronous or asynchronous?

There is yet another axis for the ROOT and parquet queries - do you want the files downloaded locally into a cache, or just a URI to access them over the web? This option is only accessible via direct calls to the servicex library (e.g. see get_root_files_async, get_root_files_stream, get_data_rootfiles_uri_stream, and get_data_rootfiles_uri_async); a sketch of the two flavors follows.
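
A sketch of those two flavors, assuming the servicex-library methods named above take the translated selection query as their argument (query here is a placeholder for that string; the exact signatures may differ):

        async def fetch_files(query):
            sx = ServiceXDataset([uproot_single_file], backend_name=endpoint_uproot)

            # Download the output ROOT files into the local cache; local paths come back
            local_files = await sx.get_root_files_async(query)

            # Or skip the download and get URIs pointing at the files over the web
            uris = await sx.get_data_rootfiles_uri_async(query)

            return local_files, uris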

What do users of func_adl want?

Let's look at each one and reason about why different choices are made.

  • Data Format
    • What the user is familiar with ("what is a parquet file?")
    • How the data will be consumed downstream of ServiceX. Input to awkward distributed processing?
  • URIs to files (for ROOT and parquet) or locally copied files?
    • At an analysis facility one probably wants the files downloaded locally. On a laptop, developing downstream code, the local-file option is also what one wants.
    • Downstream code will be quite opinionated about what it can and can't use.
  • Async or sync access
    • One or many datasets? With many datasets one probably wants to parallelize access.
    • In a notebook one almost certainly wants sync access to make a "demo" work well.
  • Streaming URIs
    • Only interesting if downstream access can process a stream of URIs.
    • Uses well-understood but not widely known async streaming infrastructure in Python (see the sketch after this list).
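
A sketch of what consuming such a stream looks like, assuming get_data_rootfiles_uri_stream (named above) is an async generator and process_file is a placeholder for downstream handling:

        async def consume(sx, query):
            # Handle each file as soon as ServiceX finishes it, instead of
            # waiting for the whole transform to complete
            async for uri in sx.get_data_rootfiles_uri_stream(query):
                process_file(uri)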

Starting from Scratch

@gordonwatts gordonwatts added the enhancement New feature or request label Dec 12, 2022
@gordonwatts gordonwatts self-assigned this Dec 12, 2022