Update create data package utility #104

Open
wants to merge 6 commits into
base: master
21 changes: 11 additions & 10 deletions utilities/src/d1_util/create_data_packages.py
@@ -171,30 +171,31 @@ def main():
 # once, for the MD5 checksum calculation.
 def create_science_object_on_member_node(client, file_path):
     pid = os.path.basename(file_path)
-    sci_obj = open(file_path, "rb").read()
+    with open(file_path, "rb") as fp:
+        sci_obj = fp.read()
     sys_meta = generate_system_metadata_for_science_object(
         pid, SYSMETA_FORMATID, sci_obj
     )
-    client.create(pid, io.StringIO(sci_obj), sys_meta)
+    client.create(pid, io.BytesIO(sci_obj), sys_meta)
Author

Currently, this line results in the error:

TypeError                                 Traceback (most recent call last)
Cell In[62], line 8
      6 for file_path in files_in_group:
      7     print("  File: {}".format(file_path))
----> 8     create_science_object_on_member_node(client, file_path)

File ~/projects/cib-data-infrastructure/.venv/lib/python3.13/site-packages/d1_util/create_data_packages.py:178, in create_science_object_on_member_node(client, file_path)
    174 sci_obj = open(file_path, "rb").read()
    175 sys_meta = generate_system_metadata_for_science_object(
    176     pid, SYSMETA_FORMATID, sci_obj
    177 )
--> 178 client.create(pid, io.StringIO(sci_obj), sys_meta)

TypeError: initial_value must be str or None, not bytes

That makes sense: the file is opened in binary mode ("rb"), so its contents are bytes and need to be wrapped with io.BytesIO rather than io.StringIO (which expects a decoded str).
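The mismatch can be reproduced without a Member Node. The following is a minimal sketch using only the standard library; the payload bytes are made up for illustration and stand in for whatever is read from file_path.

```python
import io

# Made-up payload standing in for the bytes read from a file opened with "rb".
sci_obj = b"\x00\x01 example science object"

# io.StringIO only accepts str (or None), so bytes are rejected.
try:
    io.StringIO(sci_obj)
    string_io_accepts_bytes = True
except TypeError:
    string_io_accepts_bytes = False

# io.BytesIO wraps the bytes in a file-like object without decoding them.
stream = io.BytesIO(sci_obj)
round_tripped = stream.read()
```

Wrapping in io.BytesIO also avoids guessing an encoding, which matters for science objects that are not text.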



 def create_package_on_member_node(client, files_in_group):
     package_pid = group_name(files_in_group[0])
     pids = [os.path.basename(p) for p in files_in_group]
-    resource_map = create_resource_map_for_pids(package_pid, pids)
+    resource_map = create_resource_map_for_pids(
+        package_pid, pids
+    ).serialize_to_transport()
     sys_meta = generate_system_metadata_for_science_object(
         package_pid, RESOURCE_MAP_FORMAT_ID, resource_map
Author

If we don't serialize the resource map, we get an error here:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[97], line 1
----> 1 create_package_on_member_node(client, files_in_group)

Cell In[96], line 91, in create_package_on_member_node(client, files_in_group)
     89 pids = [os.path.basename(p) for p in files_in_group]
     90 resource_map = create_resource_map_for_pids(package_pid, pids)
---> 91 sys_meta = generate_system_metadata_for_science_object(
     92     package_pid, RESOURCE_MAP_FORMAT_ID, resource_map
     93 )
     94 client.create(package_pid, io.BytesIO(resource_map), sys_meta)

Cell In[96], line 111, in generate_system_metadata_for_science_object(pid, format_id, science_object)
    109 def generate_system_metadata_for_science_object(pid, format_id, science_object):
    110     size = len(science_object)
--> 111     md5 = hashlib.md5(science_object).hexdigest()
    112     now = d1_common.date_time.utc_now()
    113     sys_meta = generate_sys_meta(pid, format_id, size, md5, now)

TypeError: object supporting the buffer API required

This fails because science_object is expected to be a sequence of bytes, and the unserialized resource map is not.
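The same failure can be shown in isolation. This is a rough sketch, not the d1_common implementation: FakeResourceMap is a hypothetical stand-in for the object returned by create_resource_map_for_pids, and its serialize_to_transport method is assumed to return bytes, as the change in this diff relies on.

```python
import hashlib


class FakeResourceMap:
    """Hypothetical stand-in for an unserialized resource map object."""

    def serialize_to_transport(self):
        # Assumption: serialization yields bytes, as the fix relies on.
        return b"<rdf:RDF>placeholder</rdf:RDF>"


resource_map = FakeResourceMap()

# hashlib.md5 needs a bytes-like object, so an arbitrary object is rejected.
try:
    hashlib.md5(resource_map)
    md5_accepts_object = True
except TypeError:
    md5_accepts_object = False

# Serializing first gives bytes that can be both measured and hashed.
payload = resource_map.serialize_to_transport()
size = len(payload)
md5 = hashlib.md5(payload).hexdigest()
```

Serializing once and reusing the same bytes for the size, the checksum, and the upload keeps all three consistent with each other.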

     )
-    client.create(package_pid, io.StringIO(resource_map), sys_meta)
+    client.create(package_pid, io.BytesIO(resource_map), sys_meta)


 def create_resource_map_for_pids(package_pid, pids):
-    # Create a resource map generator that will generate resource maps that, by
-    # default, use the DataONE production environment for resolving the object
-    # URIs. To use the resource map generator in a test environment, pass the base
-    # url to the root CN in that environment in the dataone_root parameter.
-    resource_map_generator = d1_common.resource_map.ResourceMapGenerator()
Author

As is, I get the error:

AttributeError                            Traceback (most recent call last)
Cell In[83], line 1
----> 1 create_package_on_member_node(client, files_in_group)

Cell In[65], line 90, in create_package_on_member_node(client, files_in_group)
     88 package_pid = group_name(files_in_group[0])
     89 pids = [os.path.basename(p) for p in files_in_group]
---> 90 resource_map = create_resource_map_for_pids(package_pid, pids)
     91 sys_meta = generate_system_metadata_for_science_object(
     92     package_pid, RESOURCE_MAP_FORMAT_ID, resource_map
     93 )
     94 client.create(package_pid, io.StringIO(resource_map), sys_meta)

Cell In[65], line 102, in create_resource_map_for_pids(package_pid, pids)
     97 def create_resource_map_for_pids(package_pid, pids):
     98     # Create a resource map generator that will generate resource maps that, by
     99     # default, use the DataONE production environment for resolving the object
    100     # URIs. To use the resource map generator in a test environment, pass the base
    101     # url to the root CN in that environment in the dataone_root parameter.
--> 102     resource_map_generator = d1_common.resource_map.ResourceMapGenerator()
    103     return resource_map_generator.simple_generate_resource_map(
    104         package_pid, pids[0], pids[1:]
    105     )

AttributeError: module 'd1_common.resource_map' has no attribute 'ResourceMapGenerator'

I believe this class has since been removed and its method simple_generate_resource_map replaced with d1_common.resource_map.createSimpleResourceMap.

-    return resource_map_generator.simple_generate_resource_map(
+    # Create a resource map that, by default, uses the DataONE production environment for resolving
+    # the object URIs. To use the resource map generator in a test environment, pass the base url to
+    # the root CN in that environment in the dataone_root parameter.
+    return d1_common.resource_map.createSimpleResourceMap(
         package_pid, pids[0], pids[1:]
     )
