Move header line to table #89
Conversation
This ended up becoming uglier than I had hoped, but it works and addresses the two issues.
This is looking good, a few comments.
The biggest one is that we should make sure we're using ensure_table_loc() as the central way to get table locations for referencing them. If we are having issues because ensure_table_loc() is creating files/directories when we don't want it to, we could make two functions: get_table_loc() and ensure_table_loc(), and have ensure_table_loc() call get_table_loc() internally.
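For concreteness, a minimal sketch of what that split could look like (the bodies here are assumptions for illustration, not the final implementation; get_one_table(), get_table_name() and ensure_dataset_loc() are the existing helpers quoted further down):

def get_table_loc(self, table=None):
    # Pure path construction: nothing for the table itself is created on disk
    table = self.get_one_table(table)
    return pathlib.Path(self.ensure_dataset_loc(), self.get_table_name(table))

def ensure_table_loc(self, table=None):
    # Same path, but guarantee the parent directories exist before returning it
    data_path = self.get_table_loc(table)
    data_path.parent.mkdir(parents=True, exist_ok=True)
    return data_path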
datalogistik/dataset.py
Outdated
if self.format == "csv" and self.compression not in [
    None,
    "none",
    "uncompressed",
We shouldn't do this in the PR, but it would be nice to generalize into a utility function since we do it in a few places.
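For example, a tiny helper along these lines (the name is made up for this sketch) could capture that check:

def compression_is_none(compression):
    # Several places treat these three spellings as "no compression"
    return compression in [None, "none", "uncompressed"]

so the check above would become: if self.format == "csv" and not compression_is_none(self.compression):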
datalogistik/dataset.py
Outdated
def get_table_filename(self, table):
    if len(table.files) > 1 or table.multi_file:
        name = name
        name = table.table
    elif self.name in tpc_info.tpc_datasets:
        return self.generate_filename(table)
    else:
        name = name + os.extsep + self.format
        if table.files:
            # If there is a files entry, use it
            name = table.files[0]["file_path"]
        else:
            return self.generate_filename(table)
Should we use this in ensure_table_loc() too?
datalogistik/datalogistik/dataset.py
Lines 181 to 195 in 426d737
def ensure_table_loc(self, table=None, parents_only=False):
    # Defaults to the 0th table, which for single-table datasets is exactly what we want
    table = self.get_one_table(table)
    # TODO: check that this file actually exists?
    data_path = pathlib.Path(self.ensure_dataset_loc(), self.get_table_name(table))
    if parents_only:
        data_path = data_path.parent
    # Make the dir if it's not already extant
    if not data_path.exists():
        data_path.mkdir(parents=True, exist_ok=True)
    return data_path
datalogistik/dataset.py
Outdated
# TODO: check that this file actually exists?
data_path = pathlib.Path(self.ensure_dataset_loc(), self.get_table_name(table))
data_path = pathlib.Path(
    self.ensure_dataset_loc(), self.get_table_filename(table)
)
If we use get_table_filename() in ensure_table_loc(), we should be able to use ensure_table_loc() here instead of constructing one like this.
I'm sure I've missed a few places, but we should strive to generally use ensure_table_loc() to get table locations whenever we possibly can, rather than constructing the paths in different places.
# one file table
table.files = [
    {
        "rel_path": table_loc.relative_to(self.ensure_dataset_loc()),
        "file_path": table_loc.relative_to(self.ensure_dataset_loc()),
Do we want to remove all references to this path being relative? We could do something like rel_file_path if path is ambiguous. It's not the end of the world to make this file_path, but I think having some clear this-is-and-always-should-be-relative marker for this would be great.
Especially since when we output this info we will need to make sure to construct non-relative paths (and we should name those as such)
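For illustration, the output side could then resolve the stored relative entry explicitly and name the result as non-relative (variable names here are illustrative only):

# file_entry is one element of table.files; rel_path stays relative to the
# dataset directory in the metadata and is only resolved when emitting output
abs_file_path = self.ensure_dataset_loc() / file_entry["rel_path"]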
This change is just about consistency with the repo file; for me either is fine, but I prefer using the same keyword in both places.
Agreed on repo ~= dataset elements. Let's do rel_path for both since it's even more important in the repo that people know that path is relative to the URL at the dataset level.
Are we still going to s/file_path/rel_path/g?
datalogistik/dataset.py
Outdated
new_dataset.ensure_dataset_loc(),
new_dataset.get_table_filename(
    new_dataset.get_one_table(old_table.table)
),
We should use ensure_table_loc() here still.
We can't, because that will create a directory with the name of the output file, causing an error in some cases. As you noted we should probably create 2 separate functions.
Cool, yeah, feel free to move to the two-function approach. And when we do, we should audit our other uses and make sure we do ensure_... when we need to and get_... when we need to.
# Conflicts: # datalogistik/table.py # repo.json # tests/test_dataset.py
# Conflicts: # tests/fixtures/test_cache/fanniemae_sample/a77e575/datalogistik_metadata.ini
It will now only create a dir when the table is stored in a dir (as opposed to a single file)
This reverts commit 18609b4.
# Conflicts: # datalogistik/cli.py # datalogistik/dataset.py # tests/test_dataset.py
Thanks for this, it's getting closer.
A few minor comments + a larger suggestion about refactoring some of the filename / extension / path getters.
# now fake a multi-file name:
new_ds = copy.deepcopy(simple_parquet_ds)
new_ds.tables[0].multi_file = True
assert new_ds.get_table_name(new_ds.tables[0]) == "chi_traffic_sample"
I don't think we want to remove this test, do we? We'll need to update it to use get_table_filename(), but we should still test the behavior of a multi-file table with get_table_filename() and confirm that it returns an extension-less filename.
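For example, a sketch adapting the removed test above (fixture and expected name taken from it; the exact behavior of get_table_filename() for multi-file tables is the assumption being tested):

def test_get_table_filename_multi_file(simple_parquet_ds):
    # fake a multi-file table; the returned name should be the bare,
    # extension-less table name since multi-file tables live in a directory
    new_ds = copy.deepcopy(simple_parquet_ds)
    new_ds.tables[0].multi_file = True
    assert new_ds.get_table_filename(new_ds.tables[0]) == "chi_traffic_sample"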
Alternatively: see my comment below about changing get_table_filename() into an extension helper instead of a filename helper. In which case these tests would be quite a bit different.
I added another test for csv without gzip, but I think that should cover the cases. The function is not used for multi-file tables.
``local-creation-date``
``local_creation_date``
Thanks for all the docs cleanups
datalogistik/dataset.py
Outdated
if len(table.files) > 1 and file_name:
    # We only use the file name, because we rely on the convention
    # that all files constituting a table are in dir table.name
    download_path = table_path / pathlib.Path(file_name).name
We might also want to add into this comment for our future selves that we are using file_name directly in this case (rather than overriding it to be table_name.extension like we do for single-file tables), since with multi-file tables the important part is that the name of the directory is the same as the table_name, and then Arrow is pretty flexible about the contents of that directory.
Absolutely, that was the idea behind this comment. We'd probably get into trouble if these names do not match, but in this case (multi-file) it is crucial the files are in the proper dir so I'm overriding it just to be sure. I've changed the comment a bit, hopefully for the better.
# one file table
table.files = [
    {
        "rel_path": table_loc.relative_to(self.ensure_dataset_loc()),
        "file_path": table_loc.relative_to(self.ensure_dataset_loc()),
Are we still going to s/file_path/rel_path/g?
"file_size": table_loc.lstat().st_size, | ||
"md5": util.calculate_checksum(table_loc), | ||
} | ||
] | ||
elif table_loc.is_dir(): | ||
probe_file = next(table_loc.iterdir()) |
Would you mind adding a comment in here for our future-selves about what this is doing + why?
rel_path: yes, but I also did that change in #97, so I left it like this here.
Comment about probe_file: done.
if not new_table.dim:
    # TODO: we should check if these are still valid after conversion
    nrows = table_pads.count_rows()
    ncols = len(table_pads.schema.names)
    new_table.dim = [nrows, ncols]
else:
    nrows = new_table.dim[0]
We don't need to do this in this PR, but let's make an issue to clean this up and wrap it into the fill-metadata-from-files workflow as well (we would probably need to factor it out into a function and use it both here and there)
Ok, #98. Please add any additional thoughts about this.
datalogistik/dataset.py
Outdated
# IFF header_line is False, then add that to the write options
if new_table.header_line is False:
    write_options = dataset_write_format.make_write_options(
        include_header=False
    )
Any reason we've moved this out from under the if/else up above, so it's no longer applied only in cases where the output is a csv? AFAIK, parquet files won't use include_header at all.
We can move that if/else down here if we need to have messed with other properties up above. (We also really should factor that if/else out into a function; we don't need to do it now, but there are lots of little bits in there that will be tricky, and it'll be nice to be able to test them directly.)
And FTR: here's what happens when I tested this in our integration:
https://github.com/voltrondata-labs/arrowbench/actions/runs/3128761468/jobs/5077008768#step:10:341
DEBUG [2022-09-26 15:00:11,610] Cleaning cache directory '/home/runner/.datalogistik_cache/fanniemae_2016Q4/a6a662b'
Traceback (most recent call last):
File "/opt/pipx_bin/datalogistik", line 8, in <module>
sys.exit(main())
File "/opt/pipx/venvs/datalogistik/lib/python3.8/site-packages/datalogistik/datalogistik.py", line 58, in main
new_dataset = close_match.convert(dataset)
File "/opt/pipx/venvs/datalogistik/lib/python3.8/site-packages/datalogistik/dataset.py", line 468, in convert
write_options = dataset_write_format.make_write_options(
File "pyarrow/_dataset_parquet.pyx", line 183, in pyarrow._dataset_parquet.ParquetFileFormat.make_write_options
File "pyarrow/_dataset_parquet.pyx", line 516, in pyarrow._dataset_parquet.ParquetFileWriteOptions.update
TypeError: unexpected parquet write option: include_header
/opt/pipx_bin/datalogistik
Moving it here is part of the original purpose of this PR 😅 because we moved the header_line property from the dataset level down to the table level (so each table can now potentially have a different setting for this property!)
But I see that it should still test for CSV, looking at the error you posted! I'll refactor this into a function.
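Something in this direction, perhaps (a rough sketch only, not the actual code that ended up in the PR):

def get_write_format(self, table):
    # include_header is only a valid write option for csv output; parquet
    # rejects it (see the traceback above)
    if self.format == "parquet":
        return pads.ParquetFileFormat(), None
    if self.format == "csv":
        write_format = pads.CsvFileFormat()
        write_options = write_format.make_write_options(
            include_header=table.header_line is not False
        )
        return write_format, write_options
    raise ValueError(f"unsupported format: {self.format}")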
Ok, this has been refactored into a function get_write_format. Unfortunately, I couldn't add a proper test for it because the writeoptions class cannot perform an equality check (probably due to Cython shenanigans).
OOOH yes, we should submit a Jira to the arrow tracker for equality comparison on some of these objects. But what I did in cases like this is to test individual components that I cared about. It took some feeling around in the docs + a bit of RTFSing to get the exact names right, but here's an example:
datalogistik/tests/test_dataset.py
Lines 348 to 351 in fbe34cc
# it's a pyarrow quirk that the convert_options are under `default_fragment_scan_options`
assert (
    "FANCY_NULL" in spec.default_fragment_scan_options.convert_options.null_values
)
It's going to be very difficult to do this (even adding the functionality to Arrow) because the properties aren't being kept in python, they're just passed on to some C++ structs.
repo.json
Outdated
"url": "https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz", | ||
"homepage": "https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data", | ||
"delim": "|", | ||
"format": "csv", | ||
"compression": "gz", | ||
"tables": [ | ||
{ | ||
"table": "2016Q4", | ||
"header_line": false, | ||
"files": [ | ||
{ | ||
"url": "https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz", |
I don't think we actually want to make this change like this. We should keep url up at the top level, and not repeat the full path again down here at the files level.
In most of the cases here, it's no big deal, but take a look at what's going on with our multi-file, partitioned taxi dataset in arrowbench: https://github.com/voltrondata-labs/arrowbench/blob/main/R/known-sources.R#L196-L203
For that one we don't even need to know what the files are, just that we should take the whole bucket. We should be able to do this in datalogistik as well (we don't have to do it right now in this PR, but let's not make it harder on ourselves to do that in the future).
This code only supports multi-file where each file has its own url. What you're describing is 1 url for downloading multiple files. We can also choose to support that, but then we would need to remove the loop going over all the files and just perform a download command for that 1 url (and then possibly validate all the files that were downloaded).
Do we need to support both cases? That would make the code ugly and would need some careful documentation.
BTW, util.download_file("s3://ursa-labs-taxi-data", pathlib.Path("./tmp")) gives:
urllib3.exceptions.URLSchemeUnknown: Not supported URL scheme s3
It seems this is not supported by urllib3; we need another library to download a bucket directly.
I reverted the change to repo.json and added a check that should support either having a url at the dataset level or urls at the file level.
And I created a PR off of this branch that does a first attempt at supporting S3: #99
We can use arrow for any filesystems it supports: https://arrow.apache.org/docs/python/generated/pyarrow.fs.copy_files.html#pyarrow.fs.copy_files
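For instance, something like this untested sketch (the bucket is the taxi one from the arrowbench link; credentials/region handling is glossed over):

import pyarrow.fs as pafs

# Resolve the filesystem and bucket path from the URI, then mirror it locally
s3_filesystem, bucket_path = pafs.FileSystem.from_uri("s3://ursa-labs-taxi-data")
pafs.copy_files(
    bucket_path,
    "./tmp/taxi",
    source_filesystem=s3_filesystem,
    destination_filesystem=pafs.LocalFileSystem(),
)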
As for what we need to support: the only multi-file dataset that exists in our benchmarks is the one I linked to. So we definitely need to support that.
After that, I think we should support specifying them this way as well (though even this could be deferred until after the MVP; I don't know of any datasets that look like this currently!)
[
  {
    "name": "my_dataset",
    "url": "https://ursa-qa.s3.amazonaws.com/my_dataset/my_table",
    ...
    "tables": [
      {
        "table": "my_table",
        "files": [
          {
            "file_path": "part-0.csv.gz"
          },
          {
            "file_path": "part-1.csv.gz"
          },
          {
            "file_path": "part-2.csv.gz"
          },
          {
            "file_path": "part-3.csv.gz"
          }
        ],
Where our downloader would concatenate the URL from the repo level and then file_path below, e.g. https://ursa-qa.s3.amazonaws.com/my_dataset/my_table/part-0.csv.gz for the downloads.
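The concatenation itself would be simple, e.g. (helper name made up for this sketch):

def resolve_download_url(dataset_url, file_entry):
    # "https://ursa-qa.s3.amazonaws.com/my_dataset/my_table" + "/" + "part-0.csv.gz"
    return dataset_url.rstrip("/") + "/" + file_entry["file_path"]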
s3 download support has been added! It's hard to test, though, because the taxi bucket is very large.
I have a slight preference for keeping it as it is in the current state; either the url is on the dataset level, or there is a url entry for each file entry. Because a base url with suffixes is less flexible (what if they're not all in the same dir on the server?)
testrepo.json
Outdated
@@ -0,0 +1,109 @@
[
Is this file used anywhere? I've tried looking through the code, but can't find it used anywhere.
No, but the datasets are very useful to test downloading multi-table and multi-file datasets.
I could add unit tests for them but maybe these files are a bit large for that?
Since this PR is already pretty big, let's defer this to a new issue about adding a test repo file + testing it. Agreed it's helpful, we might need to upload some of our sample files if we don't have them already, but they are much smaller.
The other option is we could mock that download function and then use the fixture files themselves. But anyway, there are a bunch of little decisions + things to add to this, so let's make an issue and follow up on that later
datalogistik/dataset.py
Outdated
def get_table_filename(self, table):
    name = table.table + os.extsep + self.format
    if self.format == "csv" and self.compression == "gzip":
        name = name + os.extsep + "gz"
    return name
On second thought, now that I've poked through this code a bit more, this function is used really as more of an extension identifier than it is a filename getter in the one place that it's used. What if we leaned into that more, renamed this get_table_extension() (or possibly just get_extension()), had it return the extension there, and then edited ensure_table_loc() to be:
# This function will get the location of a table to be used + passed to a pyarrow
# dataset. It will ensure that all the directories leading up to the files that
# contain the data exist (but will not create the data files themselves, directly).
# This function should be used to get the location of a table rather than
# constructing it oneself.
def ensure_table_loc(self, table=None):
    dataset_path = self.ensure_dataset_loc()
    # Defaults to the 0th table, which for single-table datasets is exactly what we want
    table = self.get_one_table(table)
    if len(table.files) > 1 or table.multi_file:
        table_path = pathlib.Path(dataset_path, table.table)
        table_path.mkdir(exist_ok=True)
    elif self.name not in tpc_info.tpc_datasets and table.files:
        # If there is a files entry, use it
        table_path = pathlib.Path(dataset_path, table.files[0]["file_path"])
    else:
        table_path = pathlib.Path(dataset_path, table.table + self.get_extension())
    return table_path
I've added a bit of a docstring too, which we should add to ensure_table_loc() regardless of the specifics of the change. We should also use dataset_path / table.table instead of pathlib.Path(dataset_path, table.table) where we do that in this function.
Ok, done
Looking good. The one major thing left is to make sure that we can do (raw) csv(.gz) -> csv on our fannie dataset with header_line: False, and then we can merge this.
I noticed some additions on multi-table datasets in this last set of pushes; IMO we should create a separate PR to do those. There are a bunch of little decisions we'll need to make, and getting that right here will mean extending this PR even longer.
@@ -9,6 +9,7 @@
"tables": [
    {
        "table": "2016Q4",
        "header_line": false,
💯
repo.json
Outdated
"format": "parquet", | ||
"tables": [ | ||
{ | ||
"table": "2009" |
The table here is actually taxi_parquet (or probably taxi, since we will be able to change the format now!). This entry is also not totally complete: at a minimum we'll need to specify the partitioning schema.
IMO: we should split this change out into a new PR so that we can merge the other parts of this PR without needing to get this right just yet.
Yes, this is wrong, sorry. I was trying to download only a subset of the bucket but it didn't work.
I removed the entire entry and also reverted the addition of S3 support. That code is in a branch I pushed here https://github.com/conbench/datalogistik/tree/s3-using-arrowfs just to have it around somewhere (although it's a tiny change)
This reverts commit bbbe754. # Conflicts: # repo.json
This change should probably be reverted when adding shortcuts for (de)compression
This reverts commit 1163596.
…om write_metadata())
This reverts commit 4164d71.
…r per-file" This reverts commit c6a7fb9.
… table in wrong dirs" This reverts commit e253c10. # Conflicts: # datalogistik/dataset.py
This reverts commit 70c1ede. # Conflicts: # datalogistik/dataset.py # tests/test_dataset.py
A few minor comments, though looks like our integration tests in arrowbench have an issue still: https://github.com/voltrondata-labs/arrowbench/actions/runs/3136784477/jobs/5109925017
if dataset_from_repo:
    dataset.fill_in_defaults(dataset_from_repo)
Nice way to catch if the dataset isn't in the repo but still keep going.
# Parse back to a pathlib, because we write paths out to JSON as strings
if type(self.cache_location) is str:
    self.cache_location = pathlib.Path(self.cache_location)
We don't currently guard against this, but I believe both cache_location and dir_hash should be removed before we write jsons to disk. They should only really be needed when writing a dataset for the first time; after that, the location and hash are known by where the metadata file is.
Would you mind writing up an issue with ^^^ that we can tackle later and adding a TODO here that basically says "When we actually guard that those aren't saved, we can remove this"?
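A possible shape for that guard, purely as a sketch (the function name and the dict-based serialization are assumptions about the metadata writer):

import json

def write_metadata_dict(metadata_dict, metadata_file_path):
    # cache_location and dir_hash can be derived from where the metadata file
    # lives, so don't persist them to disk
    for transient_field in ("cache_location", "dir_hash"):
        metadata_dict.pop(transient_field, None)
    with open(metadata_file_path, "w") as f:
        json.dump(metadata_dict, f, default=str)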
Yes, however, self.cache_location is used in a bunch of places, so it needs to be filled in at some point. That could also be done by using something like metadata_file_path.parent, of course.
Created #102
@@ -311,6 +295,7 @@ def get_table_dataset(self, table=None):
    schema = table.schema
    if self.format == "parquet":
        dataset_read_format = pads.ParquetFileFormat()
        schema = None
Do we want this here? We don't yet support it, but we will soon support changing schemas, and being able to specify a schema even with parquet is a way to do that. All current datasets in the repo now will be None regardless, but I don't think we should hardcode it here too (unless we absolutely have to for some other reason?)
This is just to prevent an error a couple of lines down, when trying to read the parquet dataset, because it doesn't accept a schema argument. Note that this doesn't throw away the schema in the metadata
datalogistik/dataset.py
Outdated
# IFF header_line is False, then add that to the write options
write_options = dataset_write_format.make_write_options(
    include_header=False if table.header_line is False else True,
    delimiter=self.delim if self.delim else ",",
This is fine for now, but would you mind making a TODO that says we should actually set the delim default during init + use None (which in pyarrow will default to "," itself)?
Looking at it a 2nd time now, the whole if seems superfluous, I'll consider removing it. But the real question is then: do we want to initialize to "," ourselves? Even that isn't really necessary considering pyarrow's default behavior.
nods yeah maybe we can get away with not initializing at all!
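That could look something like this sketch, only forwarding options that were explicitly configured and letting pyarrow supply its own defaults (header on, "," delimiter):

write_kwargs = {}
if table.header_line is False:
    write_kwargs["include_header"] = False
if self.delim:
    write_kwargs["delimiter"] = self.delim
write_options = dataset_write_format.make_write_options(**write_kwargs)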
datalogistik/dataset.py
Outdated
dataset_file_path = cached_dataset_path / (
    table.table + self.get_extension()
)
Do you need an os.extsep here (or whatever the pathlib equivalent of that is)?
I chose to include that in get_extension()
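For reference, a get_extension() that folds the separator in could look roughly like this (adapted from the earlier get_table_filename(); not necessarily the exact code in the PR):

def get_extension(self):
    # os.extsep is included here so callers can simply append the result
    ext = os.extsep + self.format
    if self.format == "csv" and self.compression == "gzip":
        ext = ext + os.extsep + "gz"
    return ext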
It has a different filename from the url
datalogistik/dataset.py
Outdated
elif self.name not in tpc_info.tpc_datasets and table.files:
    # If there is a files entry, use it
    table_path = dataset_path / table.files[0]["file_path"]
I suspect this is a better way to resolve the filename issue that the last commit is working around. Is there a case where we want to rely on the file_path there instead of the table_name + the conventions we've established for those?
You're right, we should never! So it's better to enforce it. The only thing that file_path could do is provide a suffix for the download url. But there's an important detail: the download function needs a full path to a file, not a dir. So for multi-file tables, we need a file_path entry both for the url suffix and the name of the file in the table dir.
Move the header_line field to the table class.
Also, now using only underscores in repo file and metadata file properties.
An additional change is that we are now taking the filename from the repo file instead of generating one using name.format.
Closes #76, closes #72