Separate rel_path from rel_url_path, use one url #113

Conversation
* A URL specifying the file to be downloaded for that table (which could be a
  single file, or a directory that contains many files to be downloaded)
* A base URL that is concatenated with ``rel_url_path``s in the ``files`` attribute
  if the table is a multi-file table and it is preferable to list out the files
I'm sure this rst isn't right, please suggest the right syntax
I think you just need more line breaks (usually the solution to RST woes :) ). Also watch out for the "``s"
``url``
Download url for the table. This can be:
* A URL specifying the file to be downloaded for that table (which could be a
single file, or a directory that contains many files to be downloaded)
* A base URL that is concatenated with ``rel_url_path`` in the ``files`` attribute
if the table is a multi-file table and it is preferable to list out the files
# drop the rel_url_path for comparison, because it's not relevant!
orig_file.pop("rel_url_path", None)
We might want to do the reverse of this some day and only check a subset of specific fields instead of dropping this, but this was the easiest way to get to working code.
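A minimal sketch of that alternative, assuming the file entries are plain dicts (the field names and helper name below are illustrative, not the real metadata keys):

```python
# Hypothetical alternative: compare file entries on an explicit whitelist of
# fields instead of popping the ones we want to ignore.
FIELDS_TO_COMPARE = ("rel_path", "file_size", "md5")  # illustrative field names


def files_match(orig_file, new_file, fields=FIELDS_TO_COMPARE):
    """Return True if both file entries agree on every whitelisted field."""
    return all(orig_file.get(field) == new_file.get(field) for field in fields)
```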
exact_dataset = dataset.Dataset(name="fanniemae_sample", format="csv", delim="|")
# import pdb; pdb.set_trace()
exact_dataset = dataset.Dataset(
    name="fanniemae_sample", format="csv", delim="|", compression="gzip"
It turns out we were missing the compression here, so this was actually transforming a dataset (oops!). I've added to the test below an assertion that the hash is the same so we catch that if it happens again (with a comment why...)
Oh wow, good catch, it has proved rather difficult to get the default behavior right in all situations! Maybe we should reconsider the `None` versus `"uncompressed"` distinction (where `None` is basically "don't-care"); then our equality operator could just return equal if one of the objects has compression "don't-care".
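A minimal sketch of that idea, assuming `compression` is a plain attribute on the objects being compared (the helper name is made up); the equality operator would call it alongside the other field comparisons:

```python
# Hypothetical helper: treat compression=None as "don't care" when comparing,
# so an unspecified compression matches anything (including "uncompressed").
def compression_matches(a, b):
    if a.compression is None or b.compression is None:
        return True  # None acts as a wildcard
    return a.compression == b.compression
```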
"table_url,rel_url_path,final_url", | ||
[ | ||
( | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet", | ||
None, | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet", | ||
), | ||
# we can append rel_url_path | ||
( | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic", | ||
"chi_traffic_2020_Q1.parquet", | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet", | ||
), | ||
# we don't duplicate rel_url_path when it's there | ||
( | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet", | ||
"chi_traffic_2020_Q1.parquet", | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet", | ||
), | ||
# we can have a tablename that is differnet from rel_url_path | ||
( | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic", | ||
"traffic.parquet", | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/traffic.parquet", | ||
), | ||
], |
This is probably overkill (basically to support someone specifying the full url and the rel_path), but it's a handful of lines of code + some tests to cover a relatively confusing bit. We can pull it out (or turn the fix-up code that removes the filename if it's at the end of the url into validation that tells someone they've messed up their repo code)
I like it this way, because it's nice to still support a simple url (that is not made up by a base + rel_url_path)
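The behaviour those four cases pin down could be implemented with something along these lines (a sketch only; the function name is made up and the actual code in this PR may differ):

```python
# Compose the final download URL from the table url and an optional
# rel_url_path, without duplicating the filename when the url already ends
# with it.
def compose_download_url(table_url, rel_url_path=None):
    if rel_url_path is None:
        return table_url
    base = table_url.rstrip("/")
    if base.endswith("/" + rel_url_path):
        return base  # the url already points at the file
    return base + "/" + rel_url_path
```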
# We probably should move write_metadata out from download, but let's mock it for now
def _fake_write_metadata(self):
    return True

monkeypatch.setattr(Dataset, "write_metadata", _fake_write_metadata)

# This probably also means that set_readonly isn't in the right place, but let's mock it away nonetheless
def _fake_set_readonly(path):
    return True

monkeypatch.setattr("datalogistik.util.set_readonly", _fake_set_readonly)
In working on these tests I've left things in place as much as possible, but that we need to monkeypatch these functions tells me we probably don't have `Dataset.download()` factored like we should. I would propose we have a method that does the downloading and no validation or anything (maybe `Dataset._download()`), and then we wrap that with the validation etc. stuff later (so when we test, we only need to mock the actual downloading bit and not also metadata writing + permissions munging).
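Roughly what that split could look like (a sketch only: `validate` and `cache_location` are placeholder names, not the current API; `write_metadata` and `util.set_readonly` are the pieces already mocked in the tests above):

```python
from datalogistik import util  # set_readonly lives here, per the monkeypatching above


class Dataset:
    def _download(self):
        """Fetch the table files only; no validation or other side effects."""
        raise NotImplementedError  # the actual transfer logic would go here

    def download(self):
        self._download()
        # Keeping the side effects in the wrapper means tests can mock just
        # the transfer step and nothing else.
        self.validate()          # hypothetical: integrity / checksum check
        self.write_metadata()    # currently mocked in the download tests
        util.set_readonly(self.cache_location)  # hypothetical cache path attribute
```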
Nice
# add rel_url_path as if this were coming from a repo file with that.
ds = copy.deepcopy(multi_file_ds)
files = ds.tables[0].files
# add rel_url_path
[fl.update({"rel_url_path": fl["rel_path"]}) for fl in files]
# remove rel_path
[fl.pop("rel_path") for fl in files]
This is a bit funny, but it's because we're using metadata from the fixture cache to then test downloading. So we need to add `rel_url_path` in here as if it were in repo.json. We could also add `rel_url_path` to the fixtures and take out some of this (though it's good that we're testing the `rel_path` setting, which (rightfully) happens during downloading).
We could add `rel_url_path` to the metadata files in the fixture cache; I think these properties will be carried over from the repo file entry, so they would be there in a real metadata file too.
@@ -298,7 +392,8 @@ def _simulate_download_by_copying(url, output_path):
    # Note that if we were not working in a tmp dir, the dir should have been deleted
    # and the exception message would be: "File integrity check for newly downloaded dataset failed."

    # Test a succeeding download

def test_validated_download_dataset(monkeypatch):
I broke these out to make them a bit easier to see which failed when. Happy to put them back if we're 👎 on that
Thanks for this, I posed some questions but feel free to merge
README.rst (outdated)
location on disk in the cache.

``rel_url_path``
    URL path to the file(s), relative to the directory of this table. This is used
Should this be "relative to the base URL"?
elif table.base_url:
    if len(table.files) <= 1:
        msg = f"Single-file table '{table.table}' has 'base_url' property set. It should only have a 'url'."
    for file in table.files:
Don't we want to support repo entries that do not have a `files` entry? A simple entry could just have a `table` entry with a `url`.
That would be nice, but let's do that in a follow-on or later PR. The one (open) question there is: do we want to support supplying the checksum, etc. at a higher level without the files entry? That gets complicated, so IMO we probably shouldn't, but let's make an issue for that + discuss there.
download_path = table_path

# Set the rel_path
file["rel_path"] = download_path.name
Why overwrite `file["rel_path"]` here? Don't we need it to figure out where to place the file? The contents could be different from `download_path.name`, especially if it is placed in a subdir (because of partitioning, for example), and then the validation would fail.
Or did I misunderstand and is `rel_path` now not a user input anymore? It seems I did, but then my question is: if `rel_url_path` is set, it is always identical to `rel_path`, so do we need the distinction?
> it is always identical to rel_path, so do we need the distinction?

You're right that it's (almost) always identical, but there are two cases where it's not identical:
- in the multi-file table scenario below
- when the table name is different from the filename (see the illustrative entry below)
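To illustrate that second case, a file entry could end up looking roughly like this (values made up for illustration only):

```python
# The file is published as "traffic.parquet" under the base url, but the
# table is called "chi_traffic", so the cached copy gets a rel_path that
# differs from its rel_url_path.
file_entry = {
    "rel_url_path": "traffic.parquet",   # appended to the base url for downloading
    "rel_path": "chi_traffic.parquet",   # name the file gets in the on-disk cache
}
```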
README.rst (outdated)
``files``
    A list of files in this table. Each entry in the list has the following properties:

    ``rel_path``
        Path to the file, relative to the directory of this table.
        Path to the file(s), relative to the directory of this table. This is the the
duplicate "the"
"table_url,rel_url_path,final_url", | ||
[ | ||
( | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet", | ||
None, | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet", | ||
), | ||
# we can append rel_url_path | ||
( | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic", | ||
"chi_traffic_2020_Q1.parquet", | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet", | ||
), | ||
# we don't duplicate rel_url_path when it's there | ||
( | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet", | ||
"chi_traffic_2020_Q1.parquet", | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet", | ||
), | ||
# we can have a tablename that is differnet from rel_url_path | ||
( | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic", | ||
"traffic.parquet", | ||
"https://ursa-qa.s3.amazonaws.com/chitraffic/traffic.parquet", | ||
), | ||
], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it this way, because it's nice to still support a simple url (that is not made up by a base + rel_url_path)
Resolves #110 and #111

This introduces a disambiguation of `rel_path` from `rel_url_path`. The idea is that:

- `rel_path` is the path in the cache, on disk
- `rel_url_path` is the path needed for downloading only

This distinction is really only needed for cases where the table name (which is what a single-file table will be named once downloaded) is different from the path that it is stored as. We might not even need this "feature", though we have a table where that is true already. We could also overwrite/change `rel_path` after we download the file, but that doesn't feel great (in case we maybe want to redownload? maybe it's not worth maintaining at all...).

`rel_url_path` is really only needed in cases where there is a multi-file table where the files are listed (e.g. not the case of the taxi s3 bucket; we don't yet have examples of this!). But it can be specified elsewhere if that's helpful.

All of this means that: for the easy path, an author of a new repo entry generally just uses `url` and that's it. But we can save typing if there are multiple files by using `url` as a base with `rel_url_path`s added on. And then we (internally, with code) maintain the `rel_path` and `rel_url_path` if we need those.
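For illustration, the two shapes of repo entry described here might look roughly like this (field layout and values are approximations, not the exact schema):

```python
# Easy path: a single-file table only needs a url.
single_file_table = {
    "table": "chi_traffic",
    "url": "https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
}

# Multi-file table: url acts as a base and each file carries a rel_url_path.
multi_file_table = {
    "table": "taxi",
    "url": "https://example.com/datasets/taxi",  # made-up base URL
    "files": [
        {"rel_url_path": "part-0.parquet"},
        {"rel_url_path": "part-1.parquet"},
    ],
}
```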