Improve of design of filtering of branches when concatenating #1388

Open
acampove opened this issue Feb 23, 2025 · 3 comments

@acampove

I am using uproot version 5.5.2. With the snippet below:

import uproot
import numpy as np

def _make_file(fname : str):
    n_entries = 10
    branch1_data = np.random.rand(n_entries)
    branch2_data = np.random.rand(n_entries)

    with uproot.recreate(fname) as f:
        f["tree"] = {
            "a_1": branch1_data,
            "a_2": branch2_data,
            "a_3": branch1_data,
            "a_4": branch2_data,  
            "b_1": branch1_data,
            "b_2": branch2_data,  
            "b_3": branch1_data,
            "b_4": branch2_data,  
        }

def main():
    _make_file('file_1.root')
    _make_file('file_2.root')

    df = uproot.concatenate(
        {'file_1.root': 'tree', 'file_2.root': 'tree'},
        expressions={'a_1', 'a_2'},
        filter_name='b*',
        library='pd',
    )
    print(df)

if __name__ == "__main__":
    main()

I get columns a_1 and a_2.

From a user's point of view, I would expect both the a columns (a_1, a_2) and the b columns. Would it be possible to change uproot's behavior so that expressions and filter_name are combined inclusively (a union of both selections) rather than exclusively?

acampove added the feature (New feature or request) label on Feb 23, 2025
pfackeldey self-assigned this on Feb 27, 2025
@pfackeldey
Collaborator

Hi @acampove,
expressions is meant for transforming your data on the fly into a specific array/column. If you're not interested in that, you probably don't need to pass it.
Omitting expressions yields a dataframe with the branches selected by filter_name:

df = uproot.concatenate({'file_1.root': 'tree', 'file_2.root' : 'tree'}, filter_name='b*', library='pd')
print(df)
#         b_1       b_2       b_3       b_4
# 0   0.096449  0.953901  0.096449  0.953901
# 1   0.637242  0.259867  0.637242  0.259867
# 2   0.515761  0.313249  0.515761  0.313249
# 3   0.740748  0.940448  0.740748  0.940448
# 4   0.869072  0.624719  0.869072  0.624719
# 5   0.041706  0.761446  0.041706  0.761446
# 6   0.883716  0.163284  0.883716  0.163284
# 7   0.156949  0.922057  0.156949  0.922057
# 8   0.651333  0.548299  0.651333  0.548299
# 9   0.622364  0.334150  0.622364  0.334150
# 10  0.937513  0.810083  0.937513  0.810083
# 11  0.055701  0.186211  0.055701  0.186211
# 12  0.611302  0.091394  0.611302  0.091394
# 13  0.862566  0.001212  0.862566  0.001212
# 14  0.710977  0.217308  0.710977  0.217308
# 15  0.250999  0.273506  0.250999  0.273506
# 16  0.286835  0.993268  0.286835  0.993268
# 17  0.990380  0.014993  0.990380  0.014993
# 18  0.256301  0.610082  0.256301  0.610082
# 19  0.690280  0.854935  0.690280  0.854935

Here is an example of what expressions can be used for:

df = uproot.concatenate({'file_1.root': 'tree', 'file_2.root' : 'tree'}, expressions="sqrt(b_1**2 + a_1**2)", library='pd')
print(df)
#     sqrt(b_1**2 + a_1**2)
# 0                0.136400
# 1                0.901196
# 2                0.729396
# 3                1.047576
# ...

Best, Peter
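
As a quick sanity check on the expression output: _make_file writes the same array to a_1 and b_1, so sqrt(b_1**2 + a_1**2) should equal sqrt(2) * b_1. The sketch below redoes that arithmetic in plain Python on the first four printed rows. It assumes both printed tables came from the same pair of files, which the numbers suggest but the thread does not state.

```python
import math

# First rows of b_1 as printed in the filter_name='b*' output
# (pandas rounds the display to six decimals).
b_1 = [0.096449, 0.637242, 0.515761, 0.740748]

# _make_file stores the same array under a_1 and b_1.
a_1 = list(b_1)

# Values printed for the expression sqrt(b_1**2 + a_1**2).
printed = [0.136400, 0.901196, 0.729396, 1.047576]

computed = [math.hypot(b, a) for b, a in zip(b_1, a_1)]

for got, want in zip(computed, printed):
    # The tolerance covers the six-decimal rounding of the printed inputs.
    assert math.isclose(got, want, abs_tol=2e-6)
    print(f"{got:.6f}  (printed: {want:.6f})")
```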

@pfackeldey
Collaborator

pfackeldey commented Feb 27, 2025

Oh, sorry, I just realized you wanted all b* branches as well as a_1 and a_2. You can get this filter logic with a regex:

df = uproot.concatenate({'file_1.root': 'tree', 'file_2.root' : 'tree'}, filter_name="/(b.+)|(a_[1,2])/i", library='pd')
print(df)
#          a_1       a_2       b_1       b_2       b_3       b_4
# 0   0.404026  0.505566  0.404026  0.505566  0.404026  0.505566
# 1   0.806364  0.069890  0.806364  0.069890  0.806364  0.069890
# 2   0.966566  0.872194  0.966566  0.872194  0.966566  0.872194
# 3   0.226920  0.983254  0.226920  0.983254  0.226920  0.983254
# ...
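
To see which branch names that pattern selects, it can be tried with Python's re module outside of uproot. This is only a sketch: it assumes the text between the slashes is compiled as a regular expression, that the trailing i means case-insensitive, and that the pattern is applied from the start of each name (re.match). None of this is taken from uproot's source.

```python
import re

# Branch names written by _make_file in the original snippet.
branches = ["a_1", "a_2", "a_3", "a_4", "b_1", "b_2", "b_3", "b_4"]

# Pattern body and flag from the "/(b.+)|(a_[1,2])/i" filter string.
# Assumption: matching starts at the beginning of the name (re.match).
pattern = re.compile(r"(b.+)|(a_[1,2])", re.IGNORECASE)

selected = [name for name in branches if pattern.match(name)]
print(selected)  # ['a_1', 'a_2', 'b_1', 'b_2', 'b_3', 'b_4']

# Inside [1,2] the comma is a literal member of the character class,
# so the pattern would also accept a name such as "a_,".
# Writing [12] selects the same branches here and avoids that.
tight = re.compile(r"(b.+)|(a_[12])", re.IGNORECASE)
assert [name for name in branches if tight.match(name)] == selected
assert pattern.match("a_,") and not tight.match("a_,")
```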

@NJManganelli
Contributor

I think a list of filters, e.g. ["b*", "a_1", ...], should also work.
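
For illustration, the "any filter matches" semantics that a list of filters would need can be sketched in plain Python with fnmatch. The helper name keep and the filter list ["b*", "a_1", "a_2"] are hypothetical choices for the case in this issue, and this is not uproot's implementation. Whether uproot combines list entries exactly this way is an assumption here.

```python
from fnmatch import fnmatchcase

# Branch names written by _make_file in the original snippet.
branches = ["a_1", "a_2", "a_3", "a_4", "b_1", "b_2", "b_3", "b_4"]

# Hypothetical filter list for this issue: every b* branch plus a_1 and a_2.
filters = ["b*", "a_1", "a_2"]


def keep(name: str) -> bool:
    """Return True if the branch name matches at least one glob in filters."""
    return any(fnmatchcase(name, pat) for pat in filters)


selected = [name for name in branches if keep(name)]
print(selected)  # ['a_1', 'a_2', 'b_1', 'b_2', 'b_3', 'b_4']
```

A predicate like keep makes the inclusive (union) selection explicit, which is the behavior asked for at the top of this issue.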
