
Move header line to table #89

Merged: 53 commits into main on Sep 28, 2022

Conversation

joosthooz
Collaborator

@joosthooz joosthooz commented Sep 21, 2022

Move the header_line field to the table class.
Also, repo file and metadata file property names now use only underscores.
An additional change: we now take the filename from the repo file instead of generating one as <name>.<format>.
Closes #76, closes #72

@joosthooz joosthooz requested a review from jonkeane September 21, 2022 14:46
@joosthooz
Collaborator Author

This ended up uglier than I had hoped, but it works and addresses the 2 issues.
ensure_..._loc always creates the path as a dir, which is not always what you want; sometimes it will return a path to a file. When converting to a compressed csv, we don't want the compression extension yet, because we need to compress in a separate step after the conversion.
Also, the code currently does not detect whether a conversion can be done with just a simple (de)compress; it will default to picking a parquet variant if one exists.

Contributor

@jonkeane jonkeane left a comment

This is looking good, a few comments.

The biggest one is that we should make sure we're using ensure_table_loc() as the central way to get table locations when referencing them. If we are having issues because ensure_table_loc() creates files or directories when we don't want it to, we could make two functions: get_table_loc() and ensure_table_loc(), with ensure_table_loc() calling get_table_loc() internally.
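A minimal sketch of that split, assuming the helpers that already appear elsewhere in this PR (get_one_table(), ensure_dataset_loc(), get_table_filename()); this is not the merged code:

# Sketch only: get_table_loc() just resolves the path; ensure_table_loc()
# additionally creates the parent directories (but never the data file itself).
def get_table_loc(self, table=None):
    table = self.get_one_table(table)
    return pathlib.Path(self.ensure_dataset_loc(), self.get_table_filename(table))

def ensure_table_loc(self, table=None):
    data_path = self.get_table_loc(table)
    data_path.parent.mkdir(parents=True, exist_ok=True)
    return data_path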

datalogistik/dataset.py (resolved)
Comment on lines 144 to 147
if self.format == "csv" and self.compression not in [
None,
"none",
"uncompressed",
Contributor

We shouldn't do this in this PR, but it would be nice to generalize this check into a utility function, since we do it in a few places.
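For illustration, the utility could be as small as the following (the name is hypothetical and not part of this PR; the accepted values come from the snippet quoted above):

# Hypothetical helper, not code from this PR
def is_uncompressed(compression):
    return compression in [None, "none", "uncompressed"]

# The check quoted above would then read something like:
# if self.format == "csv" and not is_uncompressed(self.compression): ...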

Comment on lines 152 to 162
def get_table_filename(self, table):
if len(table.files) > 1 or table.multi_file:
name = name
name = table.table
elif self.name in tpc_info.tpc_datasets:
return self.generate_filename(table)
else:
name = name + os.extsep + self.format
if table.files:
# If there is a files entry, use it
name = table.files[0]["file_path"]
else:
return self.generate_filename(table)
Contributor

Should we use this in ensure_table_loc() too?

def ensure_table_loc(self, table=None, parents_only=False):
    # Defaults to the 0th table, which for single-table datasets is exactly what we want
    table = self.get_one_table(table)
    # TODO: check that this file actually exists?
    data_path = pathlib.Path(self.ensure_dataset_loc(), self.get_table_name(table))
    if parents_only:
        data_path = data_path.parent
    # Make the dir if it's not already extant
    if not data_path.exists():
        data_path.mkdir(parents=True, exist_ok=True)
    return data_path

Comment on lines 185 to 194
# TODO: check that this file actually exists?
data_path = pathlib.Path(self.ensure_dataset_loc(), self.get_table_name(table))
data_path = pathlib.Path(
self.ensure_dataset_loc(), self.get_table_filename(table)
)
Contributor

If we use get_table_filename() in ensure_table_loc(), we should be able to use ensure_table_loc() here instead of constructing one like this.

I'm sure I've missed a few places, but we should strive to generally use ensure_table_loc() to get table locations whenever we possibly can, rather than constructing the paths in different places.

# one file table
table.files = [
{
"rel_path": table_loc.relative_to(self.ensure_dataset_loc()),
"file_path": table_loc.relative_to(self.ensure_dataset_loc()),
Contributor

Do we want to remove all references to this path being relative? We could do something like rel_file_path if path alone is ambiguous. It's not the end of the world to make this file_path, but I think having some clear this-is-and-always-should-be-relative marker for it would be great.

Especially since when we output this info we will need to construct non-relative paths (and we should name those as such).
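To illustrate that point, a consumer of the metadata would resolve the stored relative entry against the dataset location, roughly like this (a sketch; dataset and table are assumed to be in scope, and rel_path follows the naming suggested here):

import pathlib

# Sketch: reconstruct a non-relative path from the stored relative entry
abs_path = pathlib.Path(dataset.ensure_dataset_loc()) / table.files[0]["rel_path"]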

Collaborator Author

This change is just about consistency with the repo file; for me either is fine, but I prefer using the same keyword in both places.

Contributor

Agreed that repo entries should roughly mirror dataset elements. Let's do rel_path for both, since it's even more important in the repo that people know the path is relative to the URL at the dataset level.

Contributor

Are we still going to s/file_path/rel_path/g?

Comment on lines 449 to 452
new_dataset.ensure_dataset_loc(),
new_dataset.get_table_filename(
new_dataset.get_one_table(old_table.table)
),
Contributor

We should use ensure_table_loc() here still

Collaborator Author

We can't, because that will create a directory with the name of the output file, causing an error in some cases. As you noted we should probably create 2 separate functions.

Contributor

Cool, yeah, feel free to move to the two-function approach. And when we do, we should audit our other uses and make sure we call ensure_... when we need to and get_... when we need to.

# Conflicts:
#	datalogistik/table.py
#	repo.json
#	tests/test_dataset.py
@joosthooz joosthooz requested a review from jonkeane September 26, 2022 10:03
Contributor

@jonkeane jonkeane left a comment

Thanks for this, it's getting closer.

A few minor comments, plus a larger suggestion about refactoring some of the filename / extension / path getters.

Comment on lines -89 to -92
# now fake a multi-file name:
new_ds = copy.deepcopy(simple_parquet_ds)
new_ds.tables[0].multi_file = True
assert new_ds.get_table_name(new_ds.tables[0]) == "chi_traffic_sample"
Contributor

I don't think we want to remove this test, do we? We'll need to update it to use get_table_filename(), but we should still test the behavior of a multi-file table with get_table_filename() and confirm that it returns an extension-less filename

Contributor

Alternatively: see my comment below about changing get_table_filename() into an extension helper instead of a filename helper. In that case these tests would be quite a bit different.

Collaborator Author

I added another test for csv without gzip, but I think that should cover the cases. The function is not used for multi-file tables.
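Such a test could look roughly like this (a sketch modeled on the existing chi_traffic_sample fixtures; the fixture name is hypothetical, and this is not the test that was actually added):

def test_get_table_filename_uncompressed_csv(simple_uncompressed_csv_ds):
    # hypothetical fixture: a single-table csv dataset with no compression
    table = simple_uncompressed_csv_ds.tables[0]
    assert simple_uncompressed_csv_ds.get_table_filename(table) == "chi_traffic_sample.csv"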

Comment on lines -248 to +244
``local-creation-date``
``local_creation_date``
Contributor

Thanks for all the docs cleanups

Comment on lines 327 to 330
if len(table.files) > 1 and file_name:
    # We only use the file name, because we rely on the convention
    # that all files constituting a table are in dir table.name
    download_path = table_path / pathlib.Path(file_name).name
Contributor

We might also want to add to this comment, for our future selves, that we are using file_name directly in this case (rather than overriding it to be table_name.extension like we do for single-file tables), since with multi-file tables the important part is that the name of the directory is the same as the table_name, and then arrow is pretty flexible about the contents of that directory.

Collaborator Author

Absolutely, that was the idea behind this comment. We'd probably get into trouble if these names did not match, but in this case (multi-file) it is crucial that the files are in the proper dir, so I'm overriding it just to be sure. I've changed the comment a bit, hopefully for the better.

# one file table
table.files = [
{
"rel_path": table_loc.relative_to(self.ensure_dataset_loc()),
"file_path": table_loc.relative_to(self.ensure_dataset_loc()),
Contributor

Are we still going to s/file_path/rel_path/g?

"file_size": table_loc.lstat().st_size,
"md5": util.calculate_checksum(table_loc),
}
]
elif table_loc.is_dir():
probe_file = next(table_loc.iterdir())
Contributor

Would you mind adding a comment in here for our future-selves about what this is doing + why?

Collaborator Author

rel_path: yes, but I also did that change in #97 so I left it like this here.
Comment about probe_file: done

Comment on lines +456 to +462
if not new_table.dim:
    # TODO: we should check if these are still valid after conversion
    nrows = table_pads.count_rows()
    ncols = len(table_pads.schema.names)
    new_table.dim = [nrows, ncols]
else:
    nrows = new_table.dim[0]
Contributor

We don't need to do this in this PR, but let's make an issue to clean this up and wrap it into the fill-metadata-from-files workflow as well (we would probably need to factor it out into a function and use it both here and there)
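The factored-out helper could be as small as this (the name and placement are assumptions; the body just mirrors the snippet above):

# Hypothetical shared helper for convert() and the fill-metadata-from-files workflow
def get_dims(pads_dataset):
    nrows = pads_dataset.count_rows()
    ncols = len(pads_dataset.schema.names)
    return [nrows, ncols]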

Collaborator Author

Ok: #98
Please add any additional thoughts about this there.

Comment on lines 466 to 470
# IFF header_line is False, then add that to the write options
if new_table.header_line is False:
    write_options = dataset_write_format.make_write_options(
        include_header=False
    )
Contributor

Any reason we've moved this out from under the if/else up above, instead of applying it only in cases where the output is a csv? AFAIK, parquet files won't use include_header at all.

We can move that if/else down here if we need to have messed with other properties up above. (We also really should factor that if/else out into a function; we don't need to do it now, but there are lots of little bits in there that will be tricky, and it'll be nice to be able to test them directly.)

Contributor

And FTR: here's what happens when I tested this in our integration:

https://github.com/voltrondata-labs/arrowbench/actions/runs/3128761468/jobs/5077008768#step:10:341

DEBUG [2022-09-26 15:00:11,610] Cleaning cache directory '/home/runner/.datalogistik_cache/fanniemae_2016Q4/a6a662b'
Traceback (most recent call last):
  File "/opt/pipx_bin/datalogistik", line 8, in <module>
    sys.exit(main())
  File "/opt/pipx/venvs/datalogistik/lib/python3.8/site-packages/datalogistik/datalogistik.py", line 58, in main
    new_dataset = close_match.convert(dataset)
  File "/opt/pipx/venvs/datalogistik/lib/python3.8/site-packages/datalogistik/dataset.py", line 468, in convert
    write_options = dataset_write_format.make_write_options(
  File "pyarrow/_dataset_parquet.pyx", line 183, in pyarrow._dataset_parquet.ParquetFileFormat.make_write_options
  File "pyarrow/_dataset_parquet.pyx", line 516, in pyarrow._dataset_parquet.ParquetFileWriteOptions.update
TypeError: unexpected parquet write option: include_header
/opt/pipx_bin/datalogistik

Collaborator Author

Moving it here is part of the original purpose of this PR 😅 because we moved the header_line property from the dataset level down to the table level (so each table can now potentially have a different setting for this property!)
But I see that it should still test for CSV, looking at the error you posted! I'll refactor this into a function.

Collaborator Author

Ok, this has been refactored into a function get_write_format. Unfortunately, I couldn't add a proper test for it because the WriteOptions class cannot perform an equality check (probably due to Cython shenanigans).
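For readers following along, a get_write_format along these lines might look roughly as follows. This is a sketch reconstructed from the thread, not the merged implementation; the key point is that csv-only options such as include_header never reach the parquet path, which is what caused the error above.

import pyarrow.dataset as pads

def get_write_format(self, table):
    # Sketch: only csv gets include_header; parquet must not see that option
    if self.format == "parquet":
        return pads.ParquetFileFormat(), None
    if self.format == "csv":
        write_format = pads.CsvFileFormat()
        write_options = write_format.make_write_options(
            include_header=table.header_line is not False
        )
        return write_format, write_options
    raise ValueError(f"unsupported format: {self.format}")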

Contributor

@jonkeane jonkeane Sep 27, 2022

OOOH yes, we should submit a Jira to the arrow tracker for equality comparison on some of these objects. But what I did in cases like this is to test individual components that I cared about. It took some feeling around in the docs + a bit of RTFSing to get the exact names right, but here's an example:

# it's a pyarrow quirk that the convert_options are under `default_fragment_scan_options`
assert (
"FANCY_NULL" in spec.default_fragment_scan_options.convert_options.null_values
)

Collaborator Author

It's going to be very difficult to do this (even adding the functionality to Arrow), because the properties aren't kept in Python; they're just passed on to some C++ structs.

repo.json Outdated
Comment on lines 4 to 14
"url": "https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz",
"homepage": "https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data",
"delim": "|",
"format": "csv",
"compression": "gz",
"tables": [
{
"table": "2016Q4",
"header_line": false,
"files": [
{
"url": "https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz",
Contributor

I don't think we actually want to make this change like this. We should keep url up at the top level, and not repeat the full path again down here at the files level.

In most of the cases here, it's no big deal, but take a look at what's going on with our multi-file, partitioned taxi dataset in arrowbench: https://github.com/voltrondata-labs/arrowbench/blob/main/R/known-sources.R#L196-L203
For that one we don't even need to know what the files are, just that we should take the whole bucket. We should be able to do this in datalogistik as well (we don't have to do it right now in this PR, but let's not make it harder on ourselves to do that in the future)

Collaborator Author

This code only supports multi-file datasets where each file has its own url. What you're describing is 1 url for downloading multiple files. We can also choose to support that, but then we would need to remove the loop going over all the files and just perform a download of that 1 url (and then possibly validate all the files that were downloaded).
Do we need to support both cases? That would make the code ugly and would need some careful documentation.

Collaborator Author

@joosthooz joosthooz Sep 27, 2022

BTW, util.download_file("s3://ursa-labs-taxi-data", pathlib.Path("./tmp")) gives:
urllib3.exceptions.URLSchemeUnknown: Not supported URL scheme s3
It seems this is not supported by urllib3; we need another library to download a bucket directly.
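One library that can do this is pyarrow's own filesystem layer (the s3-using-arrowfs branch mentioned later in this thread suggests a similar direction). A sketch, where the anonymous access, the region, and the prefix are assumptions:

import pyarrow.fs as pafs

# Sketch: copy an S3 bucket prefix to a local directory via pyarrow's filesystems
s3 = pafs.S3FileSystem(anonymous=True, region="us-east-2")  # region is a guess
pafs.copy_files(
    source="ursa-labs-taxi-data/2019",       # illustrative prefix, not a tested path
    destination="./tmp/taxi_2019",
    source_filesystem=s3,
    destination_filesystem=pafs.LocalFileSystem(),
)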

Collaborator Author

I reverted the change to repo.json and added a check that should support both having a url at the dataset level and urls at the file level.
And I created a PR off of this branch that makes a first attempt at supporting S3: #99

Contributor

As for what we need to support: the only multi-file dataset that exists in our benchmarks is the one I linked to, so we definitely need to support that.

After that, I think we should support specifying them this way as well (though even this could be deferred until after the MVP; I don't know of any datasets that look like this currently!):

[
    {
        "name": "my_dataset",
        "url": "https://ursa-qa.s3.amazonaws.com/my_dataset/my_table",
        ...
        "tables": [
            {
                "table": "my_table",
                "files": [
                    {
                        "file_path": "part-0.csv.gz"
                    },
                    {
                        "file_path": "part-1.csv.gz"
                    },
                    {
                        "file_path": "part-2.csv.gz"
                    },
                    {
                        "file_path": "part-3.csv.gz"
                    }
                ],

Where our downloader would concatenate the URL from the repo level with the file_path below, e.g.
https://ursa-qa.s3.amazonaws.com/my_dataset/my_table/part-0.csv.gz for the downloads.
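A sketch of that concatenation (names taken from the example above; posixpath keeps forward slashes regardless of OS):

import posixpath

base_url = "https://ursa-qa.s3.amazonaws.com/my_dataset/my_table"
files = [{"file_path": "part-0.csv.gz"}, {"file_path": "part-1.csv.gz"}]
# Join the dataset-level url with each per-file file_path
download_urls = [posixpath.join(base_url, f["file_path"]) for f in files]
# -> https://ursa-qa.s3.amazonaws.com/my_dataset/my_table/part-0.csv.gz, ...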

Collaborator Author

S3 download support has been added! It's hard to test, though, because the taxi bucket is very large.
I have a slight preference for keeping it as it is in the current state: either the url is at the dataset level, or there is a url entry for each file entry. A base url with suffixes is less flexible (what if the files are not all in the same dir on the server?).

testrepo.json Outdated
@@ -0,0 +1,109 @@
[
Contributor

Is this file used anywhere? I've tried looking through the code, but can't find it being used anywhere.

Collaborator Author

No, but the datasets in it are very useful for testing the downloading of multi-table and multi-file datasets.
I could add unit tests for them, but maybe these files are a bit large for that?

Contributor

Since this PR is already pretty big, let's defer this to a new issue about adding a test repo file + testing it. Agreed it's helpful; we might need to upload some of our sample files if we don't have them already, but they are much smaller.

The other option is we could mock that download function and then use the fixture files themselves. But anyway, there are a bunch of little decisions + things to add for this, so let's make an issue and follow up on it later.
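The mocking idea could look roughly like this with pytest's monkeypatch (a sketch: only util.download_file comes from this conversation; the import path, fixture directory, and copying logic are assumptions):

import pathlib
import shutil

from datalogistik import util  # assumed module path

def test_download_uses_fixture(monkeypatch, tmp_path):
    # Sketch: replace the real downloader with a copy from local fixture files
    def fake_download(url, output_path):
        fixture = pathlib.Path("tests/fixtures") / pathlib.Path(url).name
        shutil.copy(fixture, output_path)

    monkeypatch.setattr(util, "download_file", fake_download)
    # ... exercise the download path against tmp_path here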

Comment on lines 160 to 164
def get_table_filename(self, table):
    name = table.table + os.extsep + self.format
    if self.format == "csv" and self.compression == "gzip":
        name = name + os.extsep + "gz"
    return name
Contributor

On second thought, now that I've poked through this code a bit more: in the one place it's used, this function really acts more as an extension identifier than a filename getter. What if we leaned into that, renamed it get_table_extension() (or possibly just get_extension), and had it return the extension there?

And then edited ensure_table_loc() to be:

    # this function will get the location of a table to be used + passed to a pyarrow dataset. It will ensure that all the directories leading up to the files that contain the data all exist (but will not create the data files themselves, directly). This function should be used to get the location of a table rather than constructing it oneself
    def ensure_table_loc(self, table=None):
        dataset_path = self.ensure_dataset_loc()
        # Defaults to the 0th table, which for single-table datasets is exactly what we want
        table = self.get_one_table(table)

        if len(table.files) > 1 or table.multi_file:
            table_path = pathlib.Path(dataset_path, table.table)
            table_path.mkdir(exist_ok=True)
        elif self.name not in tpc_info.tpc_datasets and table.files:
            # If there is a files entry, use it
            table_path = pathlib.Path(dataset_path, table.files[0]["file_path"])
        else:
            table_path = pathlib.Path(dataset_path, table.table + self.get_extension())
        return table_path

I've added a bit of a docstring too, which we should add to ensure_table_loc() regardless of the specifics of the change. We should also use dataset_path / table.table instead of pathlib.Path(dataset_path, table.table) where we do that in this function.

Collaborator Author

Ok, done

Contributor

@jonkeane jonkeane left a comment

Looking good. The one major thing left is to make sure that we can do (raw) csv(.gz) -> csv on our fannie dataset with header_line: False, and then we can merge this.

I noticed some additions around multi-table datasets in this last set of pushes. IMO we should create a separate PR for those; there are a bunch of little decisions we'll need to make, and getting that right here would mean extending this PR even longer.

@@ -9,6 +9,7 @@
"tables": [
{
"table": "2016Q4",
"header_line": false,
Contributor

💯

repo.json Outdated
"format": "parquet",
"tables": [
{
"table": "2009"
Contributor

the table here is actually taxi_parquet (or probably taxi since we will be able to change the format now!). This entry is also not totally complete: at a minimum we'll need to specify the partitioning schema.

IMO: we should split this change out into a new PR so that we can merge the other parts of this PR without needing to get this right just yet.

Collaborator Author

Yes, this is wrong, sorry. I was trying to download only a subset of the bucket but it didn't work.

Collaborator Author

I removed the entire entry and also reverted the addition of S3 support. That code is in a branch I pushed here https://github.com/conbench/datalogistik/tree/s3-using-arrowfs just to have it around somewhere (although it's a tiny change)

This reverts commit bbbe754.

# Conflicts:
#	repo.json
This change should probably be reverted when adding shortcuts for (de)compression
… table in wrong dirs"

This reverts commit e253c10.

# Conflicts:
#	datalogistik/dataset.py
This reverts commit 70c1ede.

# Conflicts:
#	datalogistik/dataset.py
#	tests/test_dataset.py
Contributor

@jonkeane jonkeane left a comment

A few minor comments, though looks like our integration tests in arrowbench have an issue still: https://github.com/voltrondata-labs/arrowbench/actions/runs/3136784477/jobs/5109925017

Comment on lines +143 to +144
if dataset_from_repo:
    dataset.fill_in_defaults(dataset_from_repo)
Contributor

Nice way to catch if the dataset isn't in the repo but still carry on.

Comment on lines +99 to +101
# Parse back to a pathlib, because we write paths out to JSON as strings
if type(self.cache_location) is str:
    self.cache_location = pathlib.Path(self.cache_location)
Contributor

We don't currently guard against this, but I believe both cache_location and dir_hash should be removed before we write JSONs to disk. They should only really be needed when writing a dataset for the first time; after that, the location and hash are known from where the metadata file is.

Would you mind writing up an issue with ^^^ that we can tackle later, and adding a TODO here that basically says "When we actually guard that those aren't saved, we can remove this"?
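The guard being asked for could eventually look something like this (a sketch: the field names come from this comment, while the helper name and the assumption that the metadata is already a plain dict are hypothetical):

import json

# Hypothetical guard before serializing dataset metadata
def write_metadata_json(metadata: dict, metadata_file_path):
    # cache_location and dir_hash are derivable from where the metadata file
    # lives, so they should not be persisted
    for field in ("cache_location", "dir_hash"):
        metadata.pop(field, None)
    metadata_file_path.write_text(json.dumps(metadata, default=str))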

Collaborator Author

Yes, however, self.cache_location is used in a bunch of places, so it needs to be filled in at some point. That could also be done by using something like metadata_file_path.parent of course.
Created #102

@@ -311,6 +295,7 @@ def get_table_dataset(self, table=None):
schema = table.schema
if self.format == "parquet":
    dataset_read_format = pads.ParquetFileFormat()
    schema = None
Contributor

Do we want this here? We don't yet support it, but we will soon support changing schemas, and being able to specify a schema even with parquet is a way to do that. All current datasets in the repo now will be None regardless, but I don't think we should hardcode it here too (unless we absolutely have to for some other reason?)

Collaborator Author

This is just to prevent an error a couple of lines down when trying to read the parquet dataset, because it doesn't accept a schema argument. Note that this doesn't throw away the schema in the metadata.

# IFF header_line is False, then add that to the write options
write_options = dataset_write_format.make_write_options(
    include_header=False if table.header_line is False else True,
    delimiter=self.delim if self.delim else ",",
Contributor

This is fine for now, but would you mind making a TODO that says we should actually set the delim default during init + use None (which in pyarrow will default to "," itself)?

Collaborator Author

Looking at it a 2nd time now, the whole if seems superfluous; I'll consider removing it. But the real question then is: do we want to initialize to "," ourselves? Even that isn't really necessary, considering pyarrow's default behavior.

Contributor

nods yeah maybe we can get away with not initializing at all!

Comment on lines 348 to 350
dataset_file_path = cached_dataset_path / (
    table.table + self.get_extension()
)
Contributor

Do you need an os.extsep here (or whatever the pathlib equivalent of that is?)?

Collaborator Author

I chose to include that in get_extension()

Comment on lines 204 to 206
elif self.name not in tpc_info.tpc_datasets and table.files:
    # If there is a files entry, use it
    table_path = dataset_path / table.files[0]["file_path"]
Contributor

Suggested change (delete these lines):
elif self.name not in tpc_info.tpc_datasets and table.files:
    # If there is a files entry, use it
    table_path = dataset_path / table.files[0]["file_path"]

I suspect this is a better way to resolve the filename issue that the last commit is working around. Is there a case where we want to rely on the file_path there instead of the table_name + the conventions we've established for those?

Collaborator Author

You're right, we should never! So it's better to enforce it. The only thing that file_path could do is provide a suffix for the download url. But there's an important detail: the download function needs a full path to a file, not a dir. So for multi-file tables, we need a file_path entry both for the url suffix and the name of the file in the table dir.

Collaborator Author

Done in 5b9be54
Note that we still have PRs open for the details of multi-file/multi-table downloads #101 #47

@jonkeane jonkeane merged commit cd86842 into main Sep 28, 2022
Successfully merging this pull request may close these issues.

Move header_line to Table
Allow for compression in file extensions