Additional Data Formats? #156

Open
ax3l opened this issue May 30, 2023 · 12 comments

@ax3l
Contributor

ax3l commented May 30, 2023

Thank you for the JOSS submission in openjournals/joss-reviews#5375.

I really like the support of the IAEA data loaders.

Based on the extended abstract and the motivating discussion linked in it, I am curious whether, for phase space data, the openPMD standard [1] [2] (disclaimer: I lead this effort) could be helpful as an additional input loader source. We now have a relatively large selection of accelerator codes that support openPMD as their output format, and we are also trying to use it more in experimental laser-plasma accelerator work.

The paper summarizes so far:

[...] extensible library enabling import/analysis/export of PhaseSpace data of arbitrary format.

If one were to implement another loader, how much work would be needed?
I am looking at
https://bwheelz36.github.io/ParticlePhaseSpace/new_data_loader.html

and am further curious about data sizes: #158

Update: I found https://bwheelz36.github.io/ParticlePhaseSpace/code_docs.html#ParticlePhaseSpace.DataLoaders.Load_PandasData which might be pretty easy to couple to openPMD with https://github.com/openPMD/openPMD-api/blob/0.15.1/examples/11_particle_dataframe.py (example data sets here). (Our Pandas reader supports chunked processing - let's continue the discussion on lazy loading/streaming/out-of-core processing in #158)

Note that the linked reference Kuschel, S. (2022). Postpic. https://github.com/skuschel/postpic implemented openPMD early on.
Minor correction: I think it should read (2014) as of the first release for this reference.

[1] https://github.com/openPMD
[2] https://www.openPMD.org

This was referenced May 30, 2023
@bwheelz36
Owner

Hey @ax3l - I have to admit that I was embarrassingly not actually aware of openPMD! It looks great.

It is a fairly minimal amount of work to add new Loaders/Exporters (depending of course on how complex the data source is). I would be happy to take a look at loading openPMD data. I don't suppose you already have some files handy I could test on?
Also, I notice that openPMD supports multiple data formats. It might be quite some work to write a DataLoader that handles several formats, but as a proof of principle would it be acceptable to just demonstrate on one format?

@ax3l
Contributor Author

ax3l commented May 31, 2023

Hi @bwheelz36, sorry for the edit in my original message.

I added a few example files and what is probably a four-liner to load data via an edit :)

import openpmd_api as io

s = io.Series("../samples/git-sample/data%T.h5", io.Access.read_only)
electrons = s.iterations[400].particles["electrons"]  # 400 or another "step" in the data series

df = electrons.to_df()  # careful: all SI at this point
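
Since everything in the data frame is in SI at this point, a loader that works in other units needs a conversion step. Below is a minimal sketch, assuming the target units are mm for positions and MeV/c for momenta. That choice is an assumption for illustration only, so please check which units the receiving loader actually expects. The helper names are hypothetical; the two constants are the exact SI values of the speed of light and the elementary charge.

```python
# Hypothetical helpers for converting SI values to mm and MeV/c.
# The target units are an assumption, not taken from either library.

SPEED_OF_LIGHT = 299_792_458.0         # m/s, exact by SI definition
ELEMENTARY_CHARGE = 1.602_176_634e-19  # C, exact by SI definition

# 1 MeV/c expressed in kg*m/s
MEV_PER_C_IN_SI = 1.0e6 * ELEMENTARY_CHARGE / SPEED_OF_LIGHT


def metres_to_mm(value_m):
    """Convert a length from metres to millimetres."""
    return value_m * 1.0e3


def si_momentum_to_mev_per_c(value_si):
    """Convert a momentum from kg*m/s to MeV/c."""
    return value_si / MEV_PER_C_IN_SI
```

Both helpers only use scalar multiplication and division, so they should work element-wise on a position or momentum column of df as well as on plain floats.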

After finishing the docs, I would also be excited to attempt an exporter 🤩

(Please do not feel that my implementation questions are required for the JOSS review to pass. I am just truly curious, and the other comments on the manuscript in between are more important to address, please :) )

@bwheelz36
Owner

Hi @ax3l

That's all good - given there is a defined open dataset format, it absolutely makes sense that this package should support it.

Having said that - I'm a bit confused tbh. I'm trying to run the first read example from the openpmd-api site with the following code:

import openpmd_api as io
series = io.Series( "data%T.h5", io.Access.read_only)

I pointed this code at each of the three examples (example-2d, example-3d, example-thetaMode); it is actually not that clear from the example that this is what you are supposed to do. In each case the data loads, but there is no information in the 'iterations' attribute?

@ax3l
Contributor Author

ax3l commented Jun 5, 2023

Hi @bwheelz36,

Thanks for trying the example datasets!
The iterations concept is explained here:
https://openpmd-api.readthedocs.io/en/latest/usage/concepts.html

there is no information in the 'iterations' attribute?

Please let me know if you have more questions on this in case I missed the point of the question :)

Once you open a data Series, you can loop over the available iterations in it, read the data in each iteration, etc.

@bwheelz36
Owner

Hi @ax3l

Ok, here's an end to end example of what I tried. Maybe I'm doing something extremely stupid...

in a terminal:

# inside a fresh virtual environment
git clone https://github.com/openPMD/openPMD-example-datasets.git
cd openPMD-example-datasets
tar -zxvf example-2d.tar.gz
tar -zxvf example-3d.tar.gz
tar -zxvf example-thetaMode.tar.gz

pip install openpmd-api
python  # enter python session

inside python:

import openpmd_api as io

data_loc = "example-2d/hdf5/data%T.h5"
s = io.Series(data_loc, io.Access.read_only)

Here's the explorer view of s; it appears to simply have nothing in it?

[Screenshot: explorer view of s]

@ax3l
Contributor Author

ax3l commented Jun 25, 2023

Oh that is wild, thanks for reporting!
We check against most of those files in CI, but maybe something slipped in that we did not cover :-o

I will double check this after my conferences and summer break.

@franzpoeschel

For this, see my comment here:

The string representations of many classes are counterintuitive and have led to confusion, e.g. series.iterations printed will look as if it is empty

I guess this proves the issue once again.
The data is there, it just does not look like it:

>>> import openpmd_api as io
>>> s = io.Series("data%T.h5", io.Access.read_only)
>>> s.iterations
<openPMD.Attributable with '0' attributes>
>>> [index for index in s.iterations]
[255, 260, 265, 270, 275, 280, 285, 290, 295, 300, 305, 310, 315, 320, 325, 330, 335, 340, 345, 350, 355, 360, 365, 370, 375, 380, 385, 390, 395, 400]

@franzpoeschel

Fixed in openPMD/openPMD-api#1476

@ax3l
Contributor Author

ax3l commented Jul 17, 2023

Thank you for updating the representation strings, @franzpoeschel! This will be shipped with the next patch release, 0.15.2.

@bwheelz36 for your example above, all looks good and you can keep exploring what is inside the data series s like this:

for k_i, i in s.iterations.items():
    print("Iteration: {0}".format(k_i))

    for k_p, p in i.particles.items():
        print("  Particle species '{0}':".format(k_p))

Each particle species p then contains records such as "momentum". A record is a key-value mapping from a string to a record component, which can be accessed like a numpy array, e.g., u_x = p["momentum"]["x"][()]. Note that the array u_x is only filled with actual data once s.flush() has been called.
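
The access pattern described here (request a component with [()], then call s.flush()) can be wrapped in a small helper so the flush is not forgotten. This is only a sketch: load_component is a hypothetical name, and it relies on nothing beyond the [()] access and the flush() call mentioned in this thread.

```python
def load_component(series, species, record, component):
    """Return one record component of a particle species as a filled array.

    series    -- the opened data series (s in this thread)
    species   -- a particle species object (p in this thread)
    record    -- record name, e.g. "momentum"
    component -- component name, e.g. "x"
    """
    # Request the whole component; the returned array is not filled yet.
    data = species[record][component][()]
    # The flush performs the actual read and fills the array.
    series.flush()
    return data


# Usage with the names from this thread:
#   u_x = load_component(s, p, "momentum", "x")
```

Flushing after every single request is simple but not the most efficient choice; when several components are needed, requesting all of them first and flushing once should avoid repeated reads.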

Even easier is the access as a data frame, as in the 11_particle_dataframe.py example:

for i in s.iterations:
    for p in i.particles:
        df = p.to_df()
        print(df)

@ax3l
Contributor Author

ax3l commented Aug 16, 2023

@bwheelz36 did this help? :)

@bwheelz36
Owner

Hi @ax3l - the first loop you posted above helps, yes - it is clear there is some data there! In that example, doing p.to_df() gives a dataframe, which would allow a close to one-to-one read into ParticlePhaseSpace.

the second loop crashes with AttributeError: 'int' object has no attribute 'particles'. I added a line if hasattr(i, 'particles'): however this was never entered...

Can I make sure I understand the intent behind iterations - each iteration would represent for instance a time interval?

@franzpoeschel

franzpoeschel commented Aug 16, 2023

the second loop crashes with AttributeError: 'int' object has no attribute 'particles'. I added a line if hasattr(i, 'particles'): however this was never entered...

I think there is a slight bug in the second loop; try this one:

for it_index, it in s.iterations.items():
    # iterating a container directly yields its keys, so use .items() here too
    for p_name, p in it.particles.items():
        df = p.to_df()
        print(df)
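
To keep the data frames instead of printing them, the traversal can be wrapped in a helper that calls .items() at both levels, as in the first loop posted earlier in this thread, so that both the key and the object are available. This is a sketch only: collect_particle_frames is a hypothetical name, and to_df() is the call used in the 11_particle_dataframe.py example.

```python
def collect_particle_frames(series):
    """Map (iteration index, species name) to that species' data frame."""
    frames = {}
    # .items() yields (key, object) pairs; iterating the container
    # directly appears to yield only the keys (see the index list above).
    for it_index, it in series.iterations.items():
        for species_name, species in it.particles.items():
            frames[(it_index, species_name)] = species.to_df()
    return frames


# Usage with the series from this thread:
#   frames = collect_particle_frames(s)
#   df = frames[(400, "electrons")]
```

Note that this holds every data frame in memory at once, which is fine for the small example datasets but may not suit large series (see the data size discussion in #158).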
