-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separate rel_path
from rel_url_path
, use one url
#113
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
- Loading branch information
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -263,17 +263,23 @@ datalogistik_metadata.ini | |
Schema of the table. | ||
|
||
``url`` | ||
Download url in case this is a single-file table. | ||
|
||
``base_url`` | ||
Base download url in case this is a multi-file table. Each file will append | ||
their `rel_path` to this to form the full download url. | ||
Download url for the table. This can be: | ||
* A URL specifying the file to be downloaded for that table (which could be a | ||
single file, or a directory that contains many files to be downloaded) | ||
* A base URL that is concatenated with ``rel_url_path``s in the ``files`` attribute | ||
if the table is a multi-file table and it is preferable to list out the files | ||
|
||
``files`` | ||
A list of files in this table. Each entry in the list has the following properties: | ||
|
||
``rel_path`` | ||
Path to the file, relative to the directory of this table. | ||
Path to the file(s), relative to the directory of this table. This is the the | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. duplicate "the" |
||
location on disk in the cache. | ||
|
||
``rel_url_path`` | ||
URL path to the file(s), relative to the directory of this table. This is used | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this be "relative to the base URL"? |
||
only when downloading the file. This is only necesary when a multi table file has | ||
the files that make up the table listed out individually. | ||
|
||
``file_size`` | ||
Size of the file. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -357,7 +357,6 @@ def validate_table_files(self, table): | |
f"No metadata found for table {table}, could not perform validation (assuming valid)" | ||
) | ||
return True | ||
|
||
new_file_listing = self.create_file_listing(table) | ||
# we can't perform a simple equality check on the whole listing, | ||
# because the orig_file_listing does not contain the metadata file. | ||
|
@@ -372,6 +371,8 @@ def validate_table_files(self, table): | |
found = None | ||
for new_file in new_file_listing: | ||
if new_file["rel_path"] == orig_file["rel_path"]: | ||
# drop the rel_url_path for comparison, because it's not relevant! | ||
orig_file.pop("rel_url_path", None) | ||
Comment on lines
+374
to
+375
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We might want to do the reverse of this some day and only check a subset of specific fields instead of dropping this, but this was the easiest way to working code. |
||
found = new_file | ||
break | ||
|
||
|
@@ -448,7 +449,7 @@ def download(self): | |
msg = ( | ||
"No table entries were found. " | ||
"To download a dataset, at least 1 table entry must exist " | ||
"that has a 'url' or 'base_url' property." | ||
"that has a 'url' property." | ||
) | ||
log.error(msg) | ||
raise ValueError(msg) | ||
|
@@ -460,51 +461,48 @@ def download(self): | |
|
||
# For now, we always download all tables. So we need to loop through each table | ||
for table in self.tables: | ||
|
||
# There are 2 possible types downloads: | ||
# 1 - The table entry has a url property. Either this table is a single-file, or it is a single | ||
# download that will produce multiple files. | ||
# 2 - multi file (either single or multi table). The table entry has a base_url property, | ||
# and each file has a rel_path property. This is appended to the base_url to form | ||
# the download link. The files will be placed in the table directory (generated from the table name). | ||
|
||
# create table dir | ||
table_path = self.ensure_table_loc(table) | ||
|
||
# Type 1 | ||
if table.url: | ||
# Note that table_path will be a file if this is a single-file table, | ||
# and a dir if it is a multi-file table (download_file will produce multiple files) | ||
util.download_file(table.url, output_path=table_path) | ||
util.set_readonly(table_path) | ||
|
||
# Type 2 | ||
elif table.base_url: | ||
if len(table.files) <= 1: | ||
msg = f"Single-file table '{table.table}' has 'base_url' property set. It should only have a 'url'." | ||
for file in table.files: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Don't we want to support repo entries that do not have a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That would be nice, but let's do that in a follow on or later PR. The one (open) question there is: do we want to support supplying the checksum, etc. at a higher level without the files entry. That gets complicated so IMO we probably shouldn't, but let's make an issue for that + discuss there |
||
# Validate that there is a url at all | ||
if not table.url: | ||
msg = ( | ||
f"Could not find a url property for Table '{table.table}'." | ||
) | ||
log.error(msg) | ||
raise ValueError(msg) | ||
for file in table.files: | ||
# contains the suffix for the download url | ||
rel_path = file.get("rel_path") | ||
if not rel_path: | ||
msg = f"Missing rel_path property for multi-file table '{table.table}'." | ||
raise RuntimeError(msg) | ||
|
||
# contains the suffix for the download url | ||
rel_url_path = file.get("rel_url_path") | ||
|
||
# if the filename is not at the end of full_path, join | ||
table_url = table.url | ||
if rel_url_path and not table_url.endswith(rel_url_path): | ||
table_url = table_url + "/" + rel_url_path | ||
|
||
download_path = table_path | ||
|
||
# Set the rel_path | ||
file["rel_path"] = download_path.name | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why overwrite Or did I misunderstand and is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
You're right that it's (almost) always identical, but there are two cases where it's not identical:
|
||
|
||
# if this is a multi file table, then we need to do validate that | ||
# there are rel_paths + append them. All files constituting a table | ||
# must be in a dir with name table.name (created by ensure_table_loc) | ||
# note that the resulting dir structure is not necessarily flat, | ||
# because the table can have multiple levels of partitioning. | ||
if len(table.files) > 1 or table.multi_file: | ||
if not rel_url_path: | ||
msg = f"Missing rel_url_path property for multi-file table '{table.table}'." | ||
log.error(msg) | ||
raise ValueError(msg) | ||
download_path = download_path / rel_url_path | ||
|
||
# All files constituting a table must be in a dir with name table.name (created by ensure_table_loc) | ||
# note that the resulting dir structure is not necessarily flat, | ||
# because the table can have multiple levels of partitioning. | ||
download_path = table_path / rel_path | ||
url = table.base_url + "/" + rel_path | ||
|
||
util.download_file(url, output_path=download_path) | ||
util.set_readonly(download_path) | ||
# but for multi-file tables, we override this with the rel_url_path | ||
file["rel_path"] = rel_url_path | ||
|
||
else: | ||
msg = f"Could not find a url or base_url property for Table '{table.table}'." | ||
log.error(msg) | ||
raise RuntimeError(msg) | ||
util.download_file(table_url, output_path=download_path) | ||
util.set_readonly(download_path) | ||
|
||
# Try validation in case the dataset info contained checksums | ||
if not self.validate_table_files(table): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -299,7 +299,10 @@ def test_compress(comp_string): | |
# Integration-style tests | ||
def test_main(capsys): | ||
# This should be in the cache already, so no conversion needed | ||
exact_dataset = dataset.Dataset(name="fanniemae_sample", format="csv", delim="|") | ||
# import pdb; pdb.set_trace() | ||
exact_dataset = dataset.Dataset( | ||
name="fanniemae_sample", format="csv", delim="|", compression="gzip" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It turns out we were missing the compression here, so this was actually transforming a dataset (oops!). I've added to the test below an assertion that the hash is the same so we catch that if it happens again (with a comment why...) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh wow good catch, it has proved rather difficult to get the default behavior right in all situations! Maybe we should reconsider the |
||
) | ||
|
||
with pytest.raises(SystemExit) as e: | ||
datalogistik.main(exact_dataset) | ||
|
@@ -310,6 +313,11 @@ def test_main(capsys): | |
assert captured["name"] == "fanniemae_sample" | ||
assert captured["format"] == "csv" | ||
assert isinstance(captured["tables"], dict) | ||
# this is the path from the fixtures, if this doesn't match, we've actualy converted and not just found the extant one | ||
assert ( | ||
captured["tables"]["fanniemae_sample"]["path"] | ||
== "tests/fixtures/test_cache/fanniemae_sample/a77e575/fanniemae_sample.csv.gz" | ||
) | ||
|
||
|
||
@pytest.mark.skipif(sys.platform == "win32", reason="windows errors on the cleanup") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure this rst isn't right, please suggest the right syntax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you just need more line breaks (usually the solution to RST woes :) ). Also watch out for the "``s"