Thoughts re common Table operations #2

Open
JockLawrie opened this issue Aug 22, 2019 · 2 comments

Comments

@JockLawrie

Hi there,

Not sure if this is the right place, but I've been thinking about the table operations that I typically use.
Seems the Tables interface allows for this in a straightforward way.
I was going to just implement it and release the package, but thought I'd run the idea by people here first to coordinate efforts.

I've boiled them down to the operations in the code snippet below, aiming for:

  • No ambiguity. It should be obvious from the name what the operation does. E.g., select selects columns by convention, but if I've been away from my code for a while I have to relearn this. I'd prefer selectcols (and selectrows instead of filter).
  • Safety. Mutating operations should be visibly clear, and unsafe operations made explicit (as per the previous point).
  • Minimality. There shouldn't be 2 functions that do the same thing. E.g., some systems have both mutate and transform, which I think creates clutter in the API.

So here's what I have in mind for tables, views and split-apply-combine operations.
Suggestions most welcome.

Cheers

Tables:

newtable = SomeTableType(table)  # Convert table to SomeTableType
val = table[i, colname]  # Get
table[i, colname] = val  # Set
newtable = appendrows(table, rows)
newtable = appendcols(table, newcolname => somevector...)
newtable = appendcols(table, newcolname => func(row)...)
newtable = deleterows(table, rows)
newtable = deletecols(table, cols)
table = mutatecol!(table, colname::Symbol => func)
table = sortrows!(table, by)
table = sortrows!(table, colnames, rev)
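
To make the non-mutating operations concrete, here is a minimal sketch of `appendcols` and `deletecols`. It assumes a table is a plain NamedTuple of equal-length vectors (a form Tables.jl accepts as a column table) and uses only Base. The function names are the ones proposed above; they are not existing Tables.jl functions, and the argument types are assumptions.

```julia
# Sketch only: a table is modelled as a NamedTuple of equal-length vectors.
# appendcols/deletecols are the proposed names, not an existing API.

# Return a new table with extra columns; the input table is not modified.
function appendcols(table::NamedTuple, newcols::Pair{Symbol,<:AbstractVector}...)
    n = length(first(table))
    for (name, col) in newcols
        length(col) == n ||
            throw(DimensionMismatch("column $name has length $(length(col)), expected $n"))
        haskey(table, name) && throw(ArgumentError("column $name already exists"))
    end
    extra = NamedTuple{Tuple(first.(newcols))}(Tuple(last.(newcols)))
    return merge(table, extra)
end

# Return a new table without the named columns.
function deletecols(table::NamedTuple, cols::Symbol...)
    keep = Tuple(name for name in keys(table) if !(name in cols))
    return NamedTuple{keep}(table)
end

t  = (a = [1, 2, 3], b = ["x", "y", "z"])
t2 = appendcols(t, :c => [10.0, 20.0, 30.0])   # columns a, b, c
t3 = deletecols(t2, :b)                        # columns a, c
```

Both functions build a new NamedTuple that shares the column vectors with the input rather than copying them, so they are cheap, but a later in-place write to a column is visible through every table that holds it.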

Views:

view = selectcols(table, colnames)
view = selectrows(table, rowindices)
view = selectrows(table, func(row))
val = view[i, colname]  # Get
view[i, colname] = val  # Set. Raise an error if the view function returns false on the resulting row.
unsafe_set!(view, i, colname, val)  # Changes the value and does not raise an error.
newtable = SomeTableType(view)  # Convert view to SomeTableType
view = mutatecol!(view, colname::Symbol => func)  # Raise an error if the view function returns false on any of the resulting rows.
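
The checked versus unchecked write could look roughly like the sketch below. `RowView` and `getrow` are hypothetical helpers introduced for illustration, and the table is again assumed to be a NamedTuple of vectors; only `selectrows` and `unsafe_set!` come from the list above.

```julia
# Hypothetical RowView: the rows of `parent` that satisfied `pred` when the view was built.
struct RowView{T<:NamedTuple,F}
    parent::T
    rows::Vector{Int}
    pred::F
end

# Assemble row i of a column table as a NamedTuple.
getrow(table::NamedTuple, i::Int) = map(col -> col[i], table)

function selectrows(table::NamedTuple, pred)
    n = length(first(table))
    rows = [i for i in 1:n if pred(getrow(table, i))]
    return RowView(table, rows, pred)
end

Base.getindex(v::RowView, i::Int, col::Symbol) = v.parent[col][v.rows[i]]

# Checked set: refuse a write that would make the row fail the view's predicate.
function Base.setindex!(v::RowView, val, i::Int, col::Symbol)
    r = v.rows[i]
    candidate = merge(getrow(v.parent, r), NamedTuple{(col,)}((val,)))
    v.pred(candidate) || throw(ArgumentError("new value would move row $r out of the view"))
    v.parent[col][r] = val
    return v
end

# Unchecked set: write through to the parent without consulting the predicate.
unsafe_set!(v::RowView, i::Int, col::Symbol, val) = (v.parent[col][v.rows[i]] = val; v)

t = (a = [1, 2, 3, 4], b = [10, 20, 30, 40])
v = selectrows(t, row -> row.a > 2)   # view over rows 3 and 4
v[1, :b] = 99                         # allowed: t.b[3] is now 99
unsafe_set!(v, 1, :a, 0)              # allowed without a check: t.a[3] is now 0
```

After the `unsafe_set!` call the view still lists row 3 even though that row no longer satisfies the predicate, which is the inconsistency the checked setter is meant to prevent.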

split-apply-combine:

grptbl = groupby(data, colnames...)
grptbl = groupby(data, rowfunc)

for grp in grptbl    # grp is a view
    for r in rows(grp)
        # do something here
    end
end

reducedtbl = some_empty_table
for grp in grptbl
    push!(reducedtbl, (col1=sum(grp[:col3]), col2=mean(grp[:col4])))
end

val = groupdefinition(grp)  # (colname1=val1, colname2=val2,...) if grp was defined by colnames; or func(grp[1, :]) if grp was defined by a row function
grp = group(grptbl, groupdef)  # Useful for groups accessed via definition.
grp = group(grptbl, groupidx)  # Useful for accessing groups by index and for iterating over groups
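
A rough sketch of grouping by column names followed by an explicit reduction over the groups is given below. It assumes a NamedTuple-of-vectors table and represents each group as a `groupdefinition => row indices` pair instead of a full view type; that representation is an assumption made to keep the example short.

```julia
using Statistics: mean

# Hypothetical groupby: returns `key => row indices` pairs in first-seen order,
# where key is a NamedTuple of the grouping values, e.g. (g = "a",).
function groupby(table::NamedTuple, colnames::Symbol...)
    index  = Dict{Any,Int}()              # key => position in `groups`
    groups = Pair{Any,Vector{Int}}[]
    for i in 1:length(first(table))
        key = NamedTuple{colnames}(map(c -> table[c][i], colnames))
        pos = get!(index, key) do
            push!(groups, key => Int[])
            length(groups)
        end
        push!(last(groups[pos]), i)
    end
    return groups
end

t = (g = ["a", "b", "a", "b"], x = [1, 2, 3, 4], y = [1.0, 2.0, 3.0, 5.0])

# One output row per group; the result is a Vector of NamedTuples (a row table).
reducedtbl = map(groupby(t, :g)) do (key, rows)
    merge(key, (sumx = sum(t.x[rows]), meany = mean(t.y[rows])))
end
# reducedtbl[1] == (g = "a", sumx = 4, meany = 2.0)
# reducedtbl[2] == (g = "b", sumx = 6, meany = 3.5)
```

Because the body of the loop is ordinary Julia code, each output column can depend on any number of input columns.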

For constructing reduced tables DataFrames has an interface similar to

reducedtbl = reduceby(table, colnames, :col1 => (sum, :col3), :col2 => (sum, :col4))  # Short version of the above, though less flexible (cannot operate on multiple columns at once)

But I prefer the version that explicitly iterates over the groups because it adheres to minimality and is more flexible (construction of the new columns can use arbitrary functions of the input view).

@nalimilan
Member

@JockLawrie
Author

Thanks. That's a long thread that didn't conclude anything. Since the set of useful operations will differ across people, use cases and style preferences, it's hard to see any agreement emerging on a single unified API. Nor does it seem necessary.

Base and Tables.jl between them seem to provide the elements required to construct a set of operations, so perhaps it's best to let a number of query-like packages built on Base and Tables.jl emerge organically, which the community will undoubtedly pare down by voting with their feet.

I'm inclined to just put something out there with a view to having it improved or replaced by popular vote.
