Update create data package utility #104

Open
wants to merge 6 commits into
base: master

@r-b-g-b r-b-g-b commented Jan 23, 2025

Two changes to d1_util.create_data_packages so that it runs correctly. I added more detail inline in the "Files changed" tab.

@r-b-g-b r-b-g-b left a comment

Adding some inline comments to explain the changes!

  sys_meta = generate_system_metadata_for_science_object(
      pid, SYSMETA_FORMATID, sci_obj
  )
- client.create(pid, io.StringIO(sci_obj), sys_meta)
+ client.create(pid, io.BytesIO(sci_obj), sys_meta)

Currently, this line results in the error:

TypeError                                 Traceback (most recent call last)
Cell In[62], line 8
      6 for file_path in files_in_group:
      7     print("  File: {}".format(file_path))
----> 8     create_science_object_on_member_node(client, file_path)

File ~/projects/cib-data-infrastructure/.venv/lib/python3.13/site-packages/d1_util/create_data_packages.py:178, in create_science_object_on_member_node(client, file_path)
    174 sci_obj = open(file_path, "rb").read()
    175 sys_meta = generate_system_metadata_for_science_object(
    176     pid, SYSMETA_FORMATID, sci_obj
    177 )
--> 178 client.create(pid, io.StringIO(sci_obj), sys_meta)

TypeError: initial_value must be str or None, not bytes

That makes sense: the file is opened in "rb" (binary) mode, so its contents are bytes and need to be wrapped in io.BytesIO rather than io.StringIO (which expects a decoded str).
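
For reference, the mismatch can be reproduced without DataONE at all. This is a minimal sketch using only the standard library; the sci_obj value is a made-up stand-in for the file contents, not data from the actual package.

```python
import io

# Stand-in for the contents of a file read with open(path, "rb").read().
sci_obj = b"\x89binary science object\n"

# io.StringIO only accepts str (or None), so passing bytes raises TypeError.
string_io_error = None
try:
    io.StringIO(sci_obj)
except TypeError as exc:
    string_io_error = exc

# io.BytesIO accepts the raw bytes and exposes them as a file-like stream.
stream = io.BytesIO(sci_obj)
round_tripped = stream.read()
```

Wrapping the bytes in io.BytesIO keeps the payload untouched, whereas decoding it to str first would fail or corrupt any science object that is not valid text.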

# default, use the DataONE production environment for resolving the object
# URIs. To use the resource map generator in a test environment, pass the base
# url to the root CN in that environment in the dataone_root parameter.
resource_map_generator = d1_common.resource_map.ResourceMapGenerator()

As is, I get the error:

AttributeError                            Traceback (most recent call last)
Cell In[83], line 1
----> 1 create_package_on_member_node(client, files_in_group)

Cell In[65], line 90, in create_package_on_member_node(client, files_in_group)
     88 package_pid = group_name(files_in_group[0])
     89 pids = [os.path.basename(p) for p in files_in_group]
---> 90 resource_map = create_resource_map_for_pids(package_pid, pids)
     91 sys_meta = generate_system_metadata_for_science_object(
     92     package_pid, RESOURCE_MAP_FORMAT_ID, resource_map
     93 )
     94 client.create(package_pid, io.StringIO(resource_map), sys_meta)

Cell In[65], line 102, in create_resource_map_for_pids(package_pid, pids)
     97 def create_resource_map_for_pids(package_pid, pids):
     98     # Create a resource map generator that will generate resource maps that, by
     99     # default, use the DataONE production environment for resolving the object
    100     # URIs. To use the resource map generator in a test environment, pass the base
    101     # url to the root CN in that environment in the dataone_root parameter.
--> 102     resource_map_generator = d1_common.resource_map.ResourceMapGenerator()
    103     return resource_map_generator.simple_generate_resource_map(
    104         package_pid, pids[0], pids[1:]
    105     )

AttributeError: module 'd1_common.resource_map' has no attribute 'ResourceMapGenerator'

I believe this class has since been removed, and its method simple_generate_resource_map has been replaced by d1_common.resource_map.createSimpleResourceMap.
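
A rough sketch of what the updated helper could look like. It assumes createSimpleResourceMap takes its arguments in the same order as the old simple_generate_resource_map call (package PID, first PID, remaining PIDs); that ordering is an assumption carried over from the original code and should be checked against the installed d1_common version.

```python
def create_resource_map_for_pids(package_pid, pids):
    """Build a resource map from a package PID and the PIDs of its members."""
    # Imported inside the function so this sketch can be loaded without d1_common.
    import d1_common.resource_map

    # Assumption: same argument order as the old simple_generate_resource_map
    # call, with the first PID passed separately from the rest.
    return d1_common.resource_map.createSimpleResourceMap(
        package_pid, pids[0], pids[1:]
    )
```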

- resource_map = create_resource_map_for_pids(package_pid, pids)
+ resource_map = create_resource_map_for_pids(
+     package_pid, pids
+ ).serialize_to_transport()
  sys_meta = generate_system_metadata_for_science_object(
      package_pid, RESOURCE_MAP_FORMAT_ID, resource_map

If we don't serialize the resource map, we get an error here:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[97], line 1
----> 1 create_package_on_member_node(client, files_in_group)

Cell In[96], line 91, in create_package_on_member_node(client, files_in_group)
     89 pids = [os.path.basename(p) for p in files_in_group]
     90 resource_map = create_resource_map_for_pids(package_pid, pids)
---> 91 sys_meta = generate_system_metadata_for_science_object(
     92     package_pid, RESOURCE_MAP_FORMAT_ID, resource_map
     93 )
     94 client.create(package_pid, io.BytesIO(resource_map), sys_meta)

Cell In[96], line 111, in generate_system_metadata_for_science_object(pid, format_id, science_object)
    109 def generate_system_metadata_for_science_object(pid, format_id, science_object):
    110     size = len(science_object)
--> 111     md5 = hashlib.md5(science_object).hexdigest()
    112     now = d1_common.date_time.utc_now()
    113     sys_meta = generate_sys_meta(pid, format_id, size, md5, now)

TypeError: object supporting the buffer API required

This happens because science_object is expected to be a sequence of bytes, which is what hashlib.md5 requires.
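
The same failure can be illustrated with a stand-in object. In this sketch, FakeResourceMap and checksum_and_size are hypothetical names made up for the example, and it assumes serialize_to_transport returns bytes, as the fix in this PR relies on.

```python
import hashlib


class FakeResourceMap:
    """Hypothetical stand-in for a resource map object; not the d1_common class."""

    def serialize_to_transport(self):
        # Assumption: the real method returns the serialized document as bytes.
        return b"<rdf:RDF>...</rdf:RDF>"


def checksum_and_size(science_object):
    # Both values are computed from the raw bytes of the object.
    return hashlib.md5(science_object).hexdigest(), len(science_object)


resource_map = FakeResourceMap()

# Hashing the object itself fails because it does not support the buffer API.
unserialized_error = None
try:
    hashlib.md5(resource_map)
except TypeError as exc:
    unserialized_error = exc

# Hashing the serialized bytes works.
serialized = resource_map.serialize_to_transport()
md5, size = checksum_and_size(serialized)
```

Serializing once and reusing the resulting bytes for both the system metadata and the upload also keeps the checksum and size consistent with what is actually sent.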

4 participants