Why do I need to open a dataset twice? #10

Open
mbauman opened this issue Mar 19, 2021 · 3 comments
Labels
documentation Improvements or additions to documentation

Comments

@mbauman
Member

mbauman commented Mar 19, 2021

This workflow feels funny. Am I doing this wrong?

julia> blob = open(Blob, dataset("us_counties"))
📄 data @ JuliaHub/bcf2ed95-b0a2-40bf-8d62-12a0de4e2a44/v1

julia> df = open(io->CSV.read(io, DataFrame), IO, blob)
1105438×6 DataFrame
...

Is there an easier way to CSV.read a blob?

@c42f
Contributor

c42f commented Mar 23, 2021

You don't need the intermediate stage where it's opened as a Blob; you can open the dataset directly as whichever Julia-native data type you care about. For example, CSV prefers a Vector{UInt8} buffer, so you can do:

open(buf->CSV.read(buf, DataFrame), Vector{UInt8}, dataset("tiny_csv"))

Because CSV.read can also read from an IO, you can also do:

open(io->CSV.read(io, DataFrame), IO, dataset("tiny_csv"))

@c42f
Contributor

c42f commented Mar 23, 2021

In general, the idea here is that a Blob reflects what's on disk (a vector of bytes), and that this can be reflected into Julia in various ways: at least as a stream (IO), an array of bytes (Vector{UInt8}), UTF-8 text (String), or lazily as a Blob.

This is what I mean when I say (somewhere in the docs) that DataSets is a kind of bridge between type systems.

  • Serialized raw data has a kind of ambiguous structural type system
  • The data may be deserialized into various in-program data types
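
As a rough illustration of "the same bytes, several in-program types", here is a sketch in plain Base Julia. It makes no DataSets.jl calls; the `raw` vector and its CSV-like content are made up for the example and just stand in for whatever bytes are on disk.

```julia
# One serialized byte sequence, viewed as three different in-program types.
# Base Julia only; `raw` is a hypothetical stand-in for file content.
raw = Vector{UInt8}("name,count\nalpha,1\nbeta,2\n")

as_bytes  = raw                 # array of bytes (Vector{UInt8})
as_text   = String(copy(raw))   # UTF-8 text; copy first, since String(::Vector{UInt8}) empties its argument
as_stream = IOBuffer(raw)       # stream (IO) over the same bytes

# Each view supports the operations natural to its type.
first_line = readline(as_stream)
header = split(first_line, ',')
```

None of the three views is "the" type of the data; which one is convenient depends on what the consumer wants to do with it.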

@c42f
Contributor

c42f commented Mar 23, 2021

You may be wondering "what's the purpose of Blob at all, then?" There are several reasons:

  • It's a lazy resource to be passed around to functions which can then open it in the way they want internally, rather than the caller needing to know which Julia type the callee needs.
  • You get one from both Blob and BlobTree datasets: as the content of a "single file" dataset, and as the leaves of the tree.
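
To make the "lazy resource" point concrete, here is a minimal sketch using a hypothetical `LazyBlob` type. It is not the DataSets.jl `Blob` implementation; `LazyBlob`, `count_lines`, `byte_size` and `first_word` are invented names. It only mimics the `open(f, T, x)` calling pattern shown earlier in this thread, so that each callee picks the Julia type it wants.

```julia
# Hypothetical stand-in for a lazy blob: it holds a path, not the data.
struct LazyBlob
    path::String
end

# Open the resource as whichever type the callee asks for.
Base.open(f::Function, ::Type{IO}, b::LazyBlob) = open(f, b.path, "r")
Base.open(f::Function, ::Type{Vector{UInt8}}, b::LazyBlob) = f(read(b.path))
Base.open(f::Function, ::Type{String}, b::LazyBlob) = f(read(b.path, String))

# Callees choose their own representation; the caller only passes the blob.
count_lines(b::LazyBlob) = open(io -> countlines(io), IO, b)
byte_size(b::LazyBlob)   = open(length, Vector{UInt8}, b)
first_word(b::LazyBlob)  = open(s -> first(split(s)), String, b)

# Usage: write a small temporary file and hand the same blob to all three.
path, io = mktemp()
write(io, "hello lazy world\nsecond line\n")
close(io)
blob = LazyBlob(path)
```

Nothing is read until a callee calls `open`, and the caller never needs to know whether a function wants a stream, a byte vector, or a string.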

@c42f c42f added the documentation Improvements or additions to documentation label Aug 11, 2021