Move header line to table #89
Conversation
This ended up becoming uglier than I had hoped, but it works and addresses the two issues.
This is looking good, a few comments.
The biggest one is that we should make sure we're using ensure_table_loc() as the central way to get table locations for referencing them. If we are having issues because ensure_table_loc() is creating files/directories when we don't want it to, we could make two functions: get_table_loc() and ensure_table_loc(), and have ensure_table_loc() call get_table_loc() internally.
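For concreteness, a minimal sketch of what that split could look like (the bodies here are assumptions for illustration, not the final implementation; get_one_table(), get_table_name() and ensure_dataset_loc() are the existing helpers quoted further down):

def get_table_loc(self, table=None):
    # Pure path construction: nothing for the table itself is created on disk
    table = self.get_one_table(table)
    return pathlib.Path(self.ensure_dataset_loc(), self.get_table_name(table))

def ensure_table_loc(self, table=None):
    # Same path, but guarantee the parent directories exist before returning it
    data_path = self.get_table_loc(table)
    data_path.parent.mkdir(parents=True, exist_ok=True)
    return data_path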
datalogistik/dataset.py
Outdated
if self.format == "csv" and self.compression not in [
    None,
    "none",
    "uncompressed",
We shouldn't do this in the PR, but it would be nice to generalize into a utility function since we do it in a few places.
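For example, a tiny helper along these lines (the name is made up for this sketch) could capture that check:

def compression_is_none(compression):
    # Several places treat these three spellings as "no compression"
    return compression in [None, "none", "uncompressed"]

so the check above would become: if self.format == "csv" and not compression_is_none(self.compression):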
datalogistik/dataset.py
Outdated
def get_table_filename(self, table):
    if len(table.files) > 1 or table.multi_file:
        name = name
        name = table.table
    elif self.name in tpc_info.tpc_datasets:
        return self.generate_filename(table)
    else:
        name = name + os.extsep + self.format
        if table.files:
            # If there is a files entry, use it
            name = table.files[0]["file_path"]
        else:
            return self.generate_filename(table)
Should we use this in ensure_table_loc() too?
datalogistik/datalogistik/dataset.py
Lines 181 to 195 in 426d737
def ensure_table_loc(self, table=None, parents_only=False):
    # Defaults to the 0th table, which for single-table datasets is exactly what we want
    table = self.get_one_table(table)
    # TODO: check that this file actually exists?
    data_path = pathlib.Path(self.ensure_dataset_loc(), self.get_table_name(table))
    if parents_only:
        data_path = data_path.parent
    # Make the dir if it's not already extant
    if not data_path.exists():
        data_path.mkdir(parents=True, exist_ok=True)
    return data_path
datalogistik/dataset.py
Outdated
# TODO: check that this file actually exists?
data_path = pathlib.Path(self.ensure_dataset_loc(), self.get_table_name(table))
data_path = pathlib.Path(
    self.ensure_dataset_loc(), self.get_table_filename(table)
)
If we use get_table_filename() in ensure_table_loc(), we should be able to use ensure_table_loc() here instead of constructing one like this.
I'm sure I've missed a few places, but we should strive to generally use ensure_table_loc() to get table locations whenever we possibly can, rather than constructing the paths in different places.
# one file table
table.files = [
    {
        "rel_path": table_loc.relative_to(self.ensure_dataset_loc()),
        "file_path": table_loc.relative_to(self.ensure_dataset_loc()),
Do we want to remove all references to this path being relative? We could do something like rel_file_path if path is ambiguous. It's not the end of the world to make this file_path, but I think having some clear this-is-and-always-should-be-relative marker for this would be great.
Especially since when we output this info we will need to make sure to construct non-relative paths (and we should name those as such)
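For illustration, the output side could then resolve the stored relative entry explicitly and name the result as non-relative (variable names here are illustrative only):

# file_entry is one element of table.files; rel_path stays relative to the
# dataset directory in the metadata and is only resolved when emitting output
abs_file_path = self.ensure_dataset_loc() / file_entry["rel_path"]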
This change is just about consistency with the repo file; for me either is fine, but I prefer using the same keyword in both places.
Agreed on repo ~= dataset elements. Let's do rel_path for both since it's even more important in the repo that people know that path is relative to the URL at the dataset level.
Are we still going to s/file_path/rel_path/g?
datalogistik/dataset.py
Outdated
new_dataset.ensure_dataset_loc(),
new_dataset.get_table_filename(
    new_dataset.get_one_table(old_table.table)
),
We should use ensure_table_loc() here still.
We can't, because that will create a directory with the name of the output file, causing an error in some cases. As you noted we should probably create 2 separate functions.
Cool, yeah, feel free to move to the two-function approach. And when we do, we should audit our other uses and make sure we do ensure_... when we need to and get_... when we need to.
# Conflicts: # datalogistik/table.py # repo.json # tests/test_dataset.py
# Conflicts: # tests/fixtures/test_cache/fanniemae_sample/a77e575/datalogistik_metadata.ini
It will now only create a dir when the table is stored in a dir (as opposed to a single file)
This reverts commit 18609b4.
# Conflicts: # datalogistik/cli.py # datalogistik/dataset.py # tests/test_dataset.py
Thanks for this, it's getting closer.
A few minor comments + a larger suggestion about refactoring some of the filename / extension / path getters.
# now fake a multi-file name:
new_ds = copy.deepcopy(simple_parquet_ds)
new_ds.tables[0].multi_file = True
assert new_ds.get_table_name(new_ds.tables[0]) == "chi_traffic_sample"
I don't think we want to remove this test, do we? We'll need to update it to use get_table_filename(), but we should still test the behavior of a multi-file table with get_table_filename() and confirm that it returns an extension-less filename.
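For example, a sketch adapting the removed test above (fixture and expected name taken from it; the exact behavior of get_table_filename() for multi-file tables is the assumption being tested):

def test_get_table_filename_multi_file(simple_parquet_ds):
    # fake a multi-file table; the returned name should be the bare,
    # extension-less table name since multi-file tables live in a directory
    new_ds = copy.deepcopy(simple_parquet_ds)
    new_ds.tables[0].multi_file = True
    assert new_ds.get_table_filename(new_ds.tables[0]) == "chi_traffic_sample"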
Alternatively: see my comment below about changing get_table_filename() into an extension helper instead of a filename helper. In which case these tests would be quite a bit different.
I added another test for csv without gzip, but I think that should cover the cases. The function is not used for multi-file tables.
``local-creation-date``
``local_creation_date``
Thanks for all the docs cleanups
datalogistik/dataset.py
Outdated
if len(table.files) > 1 and file_name:
    # We only use the file name, because we rely on the convention
    # that all files constituting a table are in dir table.name
    download_path = table_path / pathlib.Path(file_name).name
We might also want to add into this comment for our future selves that we are using file_name directly in this case (rather than overriding it to be table_name.extension like we do for single-file tables), since with multi-file tables the important part is that the name of the directory is the same as the table_name, and then Arrow is pretty flexible about the contents of that directory.
Absolutely, that was the idea behind this comment. We'd probably get into trouble if these names do not match, but in this case (multi-file) it is crucial the files are in the proper dir so I'm overriding it just to be sure. I've changed the comment a bit, hopefully for the better.
# one file table
table.files = [
    {
        "rel_path": table_loc.relative_to(self.ensure_dataset_loc()),
        "file_path": table_loc.relative_to(self.ensure_dataset_loc()),
Are we still going to s/file_path/rel_path/g?
"file_size": table_loc.lstat().st_size, | ||
"md5": util.calculate_checksum(table_loc), | ||
} | ||
] | ||
elif table_loc.is_dir(): | ||
probe_file = next(table_loc.iterdir()) |
Would you mind adding a comment in here for our future-selves about what this is doing + why?
rel_path: yes, but I also did that change in #97, so I left it like this here.
Comment about probe_file: done.
if not new_table.dim:
    # TODO: we should check if these are still valid after conversion
    nrows = table_pads.count_rows()
    ncols = len(table_pads.schema.names)
    new_table.dim = [nrows, ncols]
else:
    nrows = new_table.dim[0]
We don't need to do this in this PR, but let's make an issue to clean this up and wrap it into the fill-metadata-from-files workflow as well (we would probably need to factor it out into a function and use it both here and there)
Ok, #98. Please add any additional thoughts about this.
datalogistik/dataset.py
Outdated
# IFF header_line is False, then add that to the write options
if new_table.header_line is False:
    write_options = dataset_write_format.make_write_options(
        include_header=False
    )
Any reason we've moved this out from under the if/else up above, so it's no longer applied only in cases where the output is a csv? AFAIK, parquet files won't use include_header at all.
We can move that if/else down here if we need to have messed with other properties up above. (We also really should factor that if/else out into a function; we don't need to do it now, but there are lots of little bits in there that will be tricky, and it'll be nice to be able to test them directly.)
And FTR: here's what happens when I tested this in our integration:
https://github.com/voltrondata-labs/arrowbench/actions/runs/3128761468/jobs/5077008768#step:10:341
DEBUG [2022-09-26 15:00:11,610] Cleaning cache directory '/home/runner/.datalogistik_cache/fanniemae_2016Q4/a6a662b'
Traceback (most recent call last):
File "/opt/pipx_bin/datalogistik", line 8, in <module>
sys.exit(main())
File "/opt/pipx/venvs/datalogistik/lib/python3.8/site-packages/datalogistik/datalogistik.py", line 58, in main
new_dataset = close_match.convert(dataset)
File "/opt/pipx/venvs/datalogistik/lib/python3.8/site-packages/datalogistik/dataset.py", line 468, in convert
write_options = dataset_write_format.make_write_options(
File "pyarrow/_dataset_parquet.pyx", line 183, in pyarrow._dataset_parquet.ParquetFileFormat.make_write_options
File "pyarrow/_dataset_parquet.pyx", line 516, in pyarrow._dataset_parquet.ParquetFileWriteOptions.update
TypeError: unexpected parquet write option: include_header
/opt/pipx_bin/datalogistik
Moving it here is part of the original purpose of this PR 😅 because we moved the header_line property from the dataset level down to the table level (so each table can now potentially have a different setting for this property!)
But I see that it should still test for CSV, looking at the error you posted! I'll refactor this into a function.
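Something in this direction, perhaps (a rough sketch only, not the actual code that ended up in the PR):

def get_write_format(self, table):
    # include_header is only a valid write option for csv output; parquet
    # rejects it (see the traceback above)
    if self.format == "parquet":
        return pads.ParquetFileFormat(), None
    if self.format == "csv":
        write_format = pads.CsvFileFormat()
        write_options = write_format.make_write_options(
            include_header=table.header_line is not False
        )
        return write_format, write_options
    raise ValueError(f"unsupported format: {self.format}")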
Ok, this has been refactored into a function get_write_format. Unfortunately, I couldn't add a proper test for it because the writeoptions class cannot perform an equality check (probably due to Cython shenanigans).
OOOH yes, we should submit a Jira to the arrow tracker for equality comparison on some of these objects. But what I did in cases like this is to test individual components that I cared about. It took some feeling around in the docs + a bit of RTFSing to get the exact names right, but here's an example:
datalogistik/tests/test_dataset.py
Lines 348 to 351 in fbe34cc
# it's a pyarrow quirk that the convert_options are under `default_fragment_scan_options`
assert (
    "FANCY_NULL" in spec.default_fragment_scan_options.convert_options.null_values
)
It's going to be very difficult to do this (even adding the functionality to Arrow) because the properties aren't being kept in python, they're just passed on to some C++ structs.
repo.json
Outdated
"url": "https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz", | ||
"homepage": "https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data", | ||
"delim": "|", | ||
"format": "csv", | ||
"compression": "gz", | ||
"tables": [ | ||
{ | ||
"table": "2016Q4", | ||
"header_line": false, | ||
"files": [ | ||
{ | ||
"url": "https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz", |
I don't think we actually want to make this change like this. We should keep url up at the top level, and not repeat the full path again down here at the files level.
In most of the cases here, it's no big deal, but take a look at what's going on with our multi-file, partitioned taxi dataset in arrowbench: https://github.com/voltrondata-labs/arrowbench/blob/main/R/known-sources.R#L196-L203
For that one we don't even need to know what the files are, just that we should take the whole bucket. We should be able to do this in datalogistik as well (we don't have to do it right now in this PR, but let's not make it harder on ourselves to do that in the future).
This code only supports multi-file where each file has its own url. What you're describing is 1 url for downloading multiple files. We can also choose to support that, but then we would need to remove the loop going over all the files and just perform a download command for that 1 url (and then possibly validate all the files that were downloaded).
Do we need to support both cases? That would make the code ugly and would need some careful documentation.
BTW, util.download_file("s3://ursa-labs-taxi-data", pathlib.Path("./tmp")) gives:
urllib3.exceptions.URLSchemeUnknown: Not supported URL scheme s3
It seems this is not supported by urllib3; we need another library to download a bucket directly.
I reverted the change to repo.json and added a check that should support either having a url at the dataset level or urls at the file level.
And I created a PR off of this branch that does a first attempt at supporting S3: #99
We can use arrow for any filesystems it supports: https://arrow.apache.org/docs/python/generated/pyarrow.fs.copy_files.html#pyarrow.fs.copy_files
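For instance, something like this untested sketch (the bucket is the taxi one from the arrowbench link; credentials/region handling is glossed over):

import pyarrow.fs as pafs

# Resolve the filesystem and bucket path from the URI, then mirror it locally
s3_filesystem, bucket_path = pafs.FileSystem.from_uri("s3://ursa-labs-taxi-data")
pafs.copy_files(
    bucket_path,
    "./tmp/taxi",
    source_filesystem=s3_filesystem,
    destination_filesystem=pafs.LocalFileSystem(),
)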
As for what we need to support: the only multi-file dataset that exists in our benchmarks is the one I linked to. So we definitely need to support that.
After that, I think we should support specifying them this way as well (though even this could be deferred until after the MVP; I don't know of any datasets that look like this currently!)
[
  {
    "name": "my_dataset",
    "url": "https://ursa-qa.s3.amazonaws.com/my_dataset/my_table",
    ...
    "tables": [
      {
        "table": "my_table",
        "files": [
          {
            "file_path": "part-0.csv.gz"
          },
          {
            "file_path": "part-1.csv.gz"
          },
          {
            "file_path": "part-2.csv.gz"
          },
          {
            "file_path": "part-3.csv.gz"
          }
        ],
Where our downloader would concatenate the URL from the repo level and then file_path below, e.g. https://ursa-qa.s3.amazonaws.com/my_dataset/my_table/part-0.csv.gz for the downloads.
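The concatenation itself would be simple, e.g. (helper name made up for this sketch):

def resolve_download_url(dataset_url, file_entry):
    # "https://ursa-qa.s3.amazonaws.com/my_dataset/my_table" + "/" + "part-0.csv.gz"
    return dataset_url.rstrip("/") + "/" + file_entry["file_path"]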
s3 download support has been added! It's hard to test, though, because the taxi bucket is very large.
I have a slight preference for keeping it as it is in the current state; either the url is on the dataset level, or there is a url entry for each file entry. Because a base url with suffixes is less flexible (what if they're not all in the same dir on the server?)
testrepo.json
Outdated
@@ -0,0 +1,109 @@
[
Is this file used anywhere? I've tried looking through the code, but can't find it used anywhere.
No, but the datasets are very useful to test downloading multi-table and multi-file datasets.
I could add unit tests for them but maybe these files are a bit large for that?
Since this PR is already pretty big, let's defer this to a new issue about adding a test repo file + testing it. Agreed it's helpful, we might need to upload some of our sample files if we don't have them already, but they are much smaller.
The other option is we could mock that download function and then use the fixture files themselves. But anyway, there are a bunch of little decisions + things to add to this, so let's make an issue and follow up on that later
datalogistik/dataset.py
Outdated
def get_table_filename(self, table):
    name = table.table + os.extsep + self.format
    if self.format == "csv" and self.compression == "gzip":
        name = name + os.extsep + "gz"
    return name
On second thought, now that I've poked through this code a bit more, this function is used really as more of an extension identifier than it is a filename getter in the one place that it's used. What if we leaned into that more, renamed this get_table_extension() (or possibly just get_extension()), had it return the extension there, and then edited ensure_table_loc() to be:
# This function will get the location of a table to be used + passed to a pyarrow
# dataset. It will ensure that all the directories leading up to the files that
# contain the data exist (but will not create the data files themselves, directly).
# This function should be used to get the location of a table rather than
# constructing it oneself.
def ensure_table_loc(self, table=None):
    dataset_path = self.ensure_dataset_loc()
    # Defaults to the 0th table, which for single-table datasets is exactly what we want
    table = self.get_one_table(table)
    if len(table.files) > 1 or table.multi_file:
        table_path = pathlib.Path(dataset_path, table.table)
        table_path.mkdir(exist_ok=True)
    elif self.name not in tpc_info.tpc_datasets and table.files:
        # If there is a files entry, use it
        table_path = pathlib.Path(dataset_path, table.files[0]["file_path"])
    else:
        table_path = pathlib.Path(dataset_path, table.table + self.get_extension())
    return table_path
I've added a bit of a docstring too, which we should add to ensure_table_loc() regardless of the specifics of the change. We should also use dataset_path / table.table instead of pathlib.Path(dataset_path, table.table) where we do that in this function.
Ok, done
Looking good. The one major thing left is to make sure that we can do (raw) csv(.gz) -> csv on our fannie dataset with header_line: False, and then we can merge this.
I noticed some additions on multi-table datasets in this last set of pushes; IMO we should create a separate PR to do those. There are a bunch of little decisions we'll need to make, and getting that right here will mean extending this PR even longer.
@@ -9,6 +9,7 @@
"tables": [
    {
        "table": "2016Q4",
        "header_line": false,
💯
repo.json
Outdated
"format": "parquet", | ||
"tables": [ | ||
{ | ||
"table": "2009" |
The table here is actually taxi_parquet (or probably taxi, since we will be able to change the format now!). This entry is also not totally complete: at a minimum we'll need to specify the partitioning schema.
IMO: we should split this change out into a new PR so that we can merge the other parts of this PR without needing to get this right just yet.
Yes, this is wrong, sorry. I was trying to download only a subset of the bucket but it didn't work.
I removed the entire entry and also reverted the addition of S3 support. That code is in a branch I pushed here https://github.com/conbench/datalogistik/tree/s3-using-arrowfs just to have it around somewhere (although it's a tiny change)
This reverts commit bbbe754. # Conflicts: # repo.json
This change should probably be reverted when adding shortcuts for (de)compression
This reverts commit 1163596.
…om write_metadata())
This reverts commit 4164d71.
…r per-file" This reverts commit c6a7fb9.
… table in wrong dirs" This reverts commit e253c10. # Conflicts: # datalogistik/dataset.py
This reverts commit 70c1ede. # Conflicts: # datalogistik/dataset.py # tests/test_dataset.py
A few minor comments, though looks like our integration tests in arrowbench have an issue still: https://github.com/voltrondata-labs/arrowbench/actions/runs/3136784477/jobs/5109925017
if dataset_from_repo:
    dataset.fill_in_defaults(dataset_from_repo)
Nice way to catch if the dataset isn't in the repo but still keep going.
# Parse back to a pathlib, because we write paths out to JSON as strings
if type(self.cache_location) is str:
    self.cache_location = pathlib.Path(self.cache_location)
We don't currently guard against this, but I believe both cache_location and dir_hash should be removed before we write jsons to disk. They should only really be needed when writing a dataset for the first time; after that, the location and hash are known by where the metadata file is.
Would you mind writing up an issue with ^^^ that we can tackle later and adding a TODO here that basically says "When we actually guard that those aren't saved, we can remove this"?
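A possible shape for that guard, purely as a sketch (the function name and the dict-based serialization are assumptions about the metadata writer):

import json

def write_metadata_dict(metadata_dict, metadata_file_path):
    # cache_location and dir_hash can be derived from where the metadata file
    # lives, so don't persist them to disk
    for transient_field in ("cache_location", "dir_hash"):
        metadata_dict.pop(transient_field, None)
    with open(metadata_file_path, "w") as f:
        json.dump(metadata_dict, f, default=str)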
Yes, however, self.cache_location is used in a bunch of places, so it needs to be filled in at some point. That could also be done by using something like metadata_file_path.parent, of course.
Created #102
@@ -311,6 +295,7 @@ def get_table_dataset(self, table=None):
    schema = table.schema
    if self.format == "parquet":
        dataset_read_format = pads.ParquetFileFormat()
        schema = None
Do we want this here? We don't yet support it, but we will soon support changing schemas, and being able to specify a schema even with parquet is a way to do that. All current datasets in the repo now will be None regardless, but I don't think we should hardcode it here too (unless we absolutely have to for some other reason?)
This is just to prevent an error a couple of lines down, when trying to read the parquet dataset, because it doesn't accept a schema argument. Note that this doesn't throw away the schema in the metadata
datalogistik/dataset.py
Outdated
# IFF header_line is False, then add that to the write options
write_options = dataset_write_format.make_write_options(
    include_header=False if table.header_line is False else True,
    delimiter=self.delim if self.delim else ",",
This is fine for now, but would you mind making a TODO that says we should actually set the delim default during init + use None (which in pyarrow will default to "," itself)?
Looking at it a 2nd time now, the whole if seems superfluous, I'll consider removing it. But the real question is then: do we want to initialize to "," ourselves? Even that isn't really necessary considering pyarrow's default behavior.
nods yeah maybe we can get away with not initializing at all!
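That could look something like this sketch, only forwarding options that were explicitly configured and letting pyarrow supply its own defaults (header on, "," delimiter):

write_kwargs = {}
if table.header_line is False:
    write_kwargs["include_header"] = False
if self.delim:
    write_kwargs["delimiter"] = self.delim
write_options = dataset_write_format.make_write_options(**write_kwargs)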
datalogistik/dataset.py
Outdated
dataset_file_path = cached_dataset_path / (
    table.table + self.get_extension()
)
Do you need an os.extsep here (or whatever the pathlib equivalent of that is)?
I chose to include that in get_extension()
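For reference, a get_extension() that folds the separator in could look roughly like this (adapted from the earlier get_table_filename(); not necessarily the exact code in the PR):

def get_extension(self):
    # os.extsep is included here so callers can simply append the result
    ext = os.extsep + self.format
    if self.format == "csv" and self.compression == "gzip":
        ext = ext + os.extsep + "gz"
    return ext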
It has a different filename from the url
datalogistik/dataset.py
Outdated
elif self.name not in tpc_info.tpc_datasets and table.files:
    # If there is a files entry, use it
    table_path = dataset_path / table.files[0]["file_path"]
I suspect this is a better way to resolve the filename issue that the last commit is working around. Is there a case where we want to rely on the file_path there instead of the table_name + the conventions we've established for those?
You're right, we should never! So it's better to enforce it. The only thing that file_path could do is provide a suffix for the download url. But there's an important detail: the download function needs a full path to a file, not a dir. So for multi-file tables, we need a file_path entry both for the url suffix and the name of the file in the table dir.
Move the header_line field to the table class.
Also, now using only underscores in repo file and metadata file properties.
An additional change is that we are now taking the filename from the repo file instead of generating one using name.format.
Closes #76, closes #72