Update create data package utility #104
base: master
Adding some inline comments to explain the changes!
  sys_meta = generate_system_metadata_for_science_object(
      pid, SYSMETA_FORMATID, sci_obj
  )
- client.create(pid, io.StringIO(sci_obj), sys_meta)
+ client.create(pid, io.BytesIO(sci_obj), sys_meta)
Currently, this line results in the error:
TypeError Traceback (most recent call last)
Cell In[62], line 8
6 for file_path in files_in_group:
7 print(" File: {}".format(file_path))
----> 8 create_science_object_on_member_node(client, file_path)
File ~/projects/cib-data-infrastructure/.venv/lib/python3.13/site-packages/d1_util/create_data_packages.py:178, in create_science_object_on_member_node(client, file_path)
174 sci_obj = open(file_path, "rb").read()
175 sys_meta = generate_system_metadata_for_science_object(
176 pid, SYSMETA_FORMATID, sci_obj
177 )
--> 178 client.create(pid, io.StringIO(sci_obj), sys_meta)
TypeError: initial_value must be str or None, not bytes
That makes sense: the file is opened in binary mode ("rb"), so its contents are bytes and need to be wrapped with BytesIO instead of StringIO (which expects a decoded str).
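The difference can be reproduced with the standard library alone. This is only a minimal sketch: the byte string below is a made-up stand-in for the contents read from the file in "rb" mode, and no DataONE client is involved.

```python
import io

# Illustrative stand-in for `open(file_path, "rb").read()`; the payload is made up.
sci_obj = b"col_a,col_b\n1,2\n"

# StringIO only accepts str (or None), so handing it bytes raises TypeError,
# which is the failure shown in the traceback.
try:
    io.StringIO(sci_obj)
    string_io_failed = False
except TypeError:
    string_io_failed = True

# BytesIO wraps the bytes unchanged and exposes them as a file-like object.
stream = io.BytesIO(sci_obj)
round_trip = stream.read()
```

Decoding the bytes first and keeping StringIO would also avoid the TypeError, but that assumes the science object is text in a known encoding, which may not hold for arbitrary data files.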
# default, use the DataONE production environment for resolving the object
# URIs. To use the resource map generator in a test environment, pass the base
# url to the root CN in that environment in the dataone_root parameter.
resource_map_generator = d1_common.resource_map.ResourceMapGenerator()
As is, I get the error:
AttributeError Traceback (most recent call last)
Cell In[83], line 1
----> 1 create_package_on_member_node(client, files_in_group)
Cell In[65], line 90, in create_package_on_member_node(client, files_in_group)
88 package_pid = group_name(files_in_group[0])
89 pids = [os.path.basename(p) for p in files_in_group]
---> 90 resource_map = create_resource_map_for_pids(package_pid, pids)
91 sys_meta = generate_system_metadata_for_science_object(
92 package_pid, RESOURCE_MAP_FORMAT_ID, resource_map
93 )
94 client.create(package_pid, io.StringIO(resource_map), sys_meta)
Cell In[65], line 102, in create_resource_map_for_pids(package_pid, pids)
97 def create_resource_map_for_pids(package_pid, pids):
98 # Create a resource map generator that will generate resource maps that, by
99 # default, use the DataONE production environment for resolving the object
100 # URIs. To use the resource map generator in a test environment, pass the base
101 # url to the root CN in that environment in the dataone_root parameter.
--> 102 resource_map_generator = d1_common.resource_map.ResourceMapGenerator()
103 return resource_map_generator.simple_generate_resource_map(
104 package_pid, pids[0], pids[1:]
105 )
AttributeError: module 'd1_common.resource_map' has no attribute 'ResourceMapGenerator'
I believe this class has since been removed and its method simple_generate_resource_map replaced with d1_common.resource_map.createSimpleResourceMap.
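One way to tolerate both library versions is to check which entry point the module exposes. The sketch below is hypothetical: build_resource_map is a helper name invented here, and it assumes createSimpleResourceMap takes the same three arguments (package PID, first PID, remaining PIDs) that the old method received. The module is passed in as a parameter so the sketch can be exercised with a stand-in object instead of the real d1_common.resource_map.

```python
import types


def build_resource_map(resource_map_module, package_pid, pids):
    """Build a resource map with whichever API the given module exposes.

    Hypothetical helper; the argument order for createSimpleResourceMap is
    an assumption carried over from the old simple_generate_resource_map call.
    """
    create = getattr(resource_map_module, "createSimpleResourceMap", None)
    if create is not None:
        # Newer entry point: a module-level function.
        return create(package_pid, pids[0], pids[1:])
    # Older entry point: the generator class used in the traceback.
    generator = resource_map_module.ResourceMapGenerator()
    return generator.simple_generate_resource_map(package_pid, pids[0], pids[1:])


# Stand-in for a module that only has the newer function (not the real library).
fake_new_module = types.SimpleNamespace(
    createSimpleResourceMap=lambda ore_pid, meta_pid, data_pids: (
        "new",
        ore_pid,
        meta_pid,
        list(data_pids),
    )
)


class FakeGenerator:
    """Stand-in for the removed ResourceMapGenerator class."""

    def simple_generate_resource_map(self, ore_pid, meta_pid, data_pids):
        return ("old", ore_pid, meta_pid, list(data_pids))


fake_old_module = types.SimpleNamespace(ResourceMapGenerator=FakeGenerator)

new_result = build_resource_map(fake_new_module, "pkg", ["meta.xml", "a.csv", "b.csv"])
old_result = build_resource_map(fake_old_module, "pkg", ["meta.xml", "a.csv", "b.csv"])
```

In the actual utility one would pass d1_common.resource_map as the first argument. If only current library versions need to be supported, calling createSimpleResourceMap directly is simpler.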
- resource_map = create_resource_map_for_pids(package_pid, pids)
+ resource_map = create_resource_map_for_pids(
+     package_pid, pids
+ ).serialize_to_transport()
  sys_meta = generate_system_metadata_for_science_object(
      package_pid, RESOURCE_MAP_FORMAT_ID, resource_map
If we don't serialize the resource map, we get an error here:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[97], line 1
----> 1 create_package_on_member_node(client, files_in_group)
Cell In[96], line 91, in create_package_on_member_node(client, files_in_group)
89 pids = [os.path.basename(p) for p in files_in_group]
90 resource_map = create_resource_map_for_pids(package_pid, pids)
---> 91 sys_meta = generate_system_metadata_for_science_object(
92 package_pid, RESOURCE_MAP_FORMAT_ID, resource_map
93 )
94 client.create(package_pid, io.BytesIO(resource_map), sys_meta)
Cell In[96], line 111, in generate_system_metadata_for_science_object(pid, format_id, science_object)
109 def generate_system_metadata_for_science_object(pid, format_id, science_object):
110 size = len(science_object)
--> 111 md5 = hashlib.md5(science_object).hexdigest()
112 now = d1_common.date_time.utc_now()
113 sys_meta = generate_sys_meta(pid, format_id, size, md5, now)
TypeError: object supporting the buffer API required
This happens because hashlib.md5 requires a bytes-like object, so science_object is expected to be a sequence of bytes.
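The requirement is easy to check in isolation. In this minimal sketch, size_and_md5 is a hypothetical helper that mirrors the len and hashlib.md5 calls visible in the traceback, and NotBytes is an invented stand-in for an unserialized resource map object.

```python
import hashlib


def size_and_md5(science_object: bytes):
    """Return (size, md5 hex digest) for a bytes payload."""
    return len(science_object), hashlib.md5(science_object).hexdigest()


# Works for bytes, which is what the serialized resource map is expected to be.
size, digest = size_and_md5(b"abc")


class NotBytes:
    """Invented stand-in for an object that does not support the buffer API."""


# Anything that is not bytes-like is rejected by hashlib with a TypeError.
try:
    hashlib.md5(NotBytes())
    rejected = False
except TypeError:
    rejected = True
```

Serializing before computing the size and checksum also keeps those values consistent with the bytes that are actually uploaded.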
Two changes to the d1_util.create_data_packages utility to let it run correctly. I added some more detail inline in the "Files changed" tab.