For completed items see change-log.
https://github.com/holgerbrandl/krangl/issues
- Better documentation & cheatsheet
- Date (column?) support
- Factor (column?) support + Add factor attribute utilities similar to methods in R package `forcats`
- better spec out NA
  - consider using a `DoubleArray` for double/int columns along with NaN, see https://pandas.pydata.org/pandas-docs/stable/missing_data.html#working-with-missing-data
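
A minimal sketch of the NaN idea, assuming missing values arrive as `null` and are stored as `Double.NaN` in a primitive array. The names `encodeWithNaN` and `meanSkippingNA` are hypothetical and not part of krangl. One caveat of this encoding is that a genuine NaN can no longer be told apart from a missing value.

```kotlin
// Hypothetical helpers for illustration; not krangl API.

// Encodes a nullable column as a primitive DoubleArray, using NaN for missing values.
fun encodeWithNaN(values: List<Double?>): DoubleArray =
    DoubleArray(values.size) { i -> values[i] ?: Double.NaN }

// Computes the mean while skipping NaN-encoded missing values.
// Returns NaN when no non-missing value is left.
fun meanSkippingNA(column: DoubleArray): Double {
    var sum = 0.0
    var count = 0
    for (v in column) {
        if (!v.isNaN()) {
            sum += v
            count++
        }
    }
    return if (count == 0) Double.NaN else sum / count
}

fun main() {
    val column = encodeWithNaN(listOf(1.0, null, 3.0))
    println(meanSkippingNA(column)) // 2.0
}
```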
- add a `pretty_column` helper
- improve `iter.toDataFrame()` to include reference, getters, kotlin property getters
- inconsistently named reader methods
- `krangl.ColumnsKt#map` should have better return type
- use/support compressed columns (https://github.com/lemire/JavaFastPFOR)
- Better lambda receiver contexts
- Performance (indices, avoid list and array copies, compressed columns)
- Use dedicated return type for table formula helpers (like `mean`, `rank`) to reduce runtime errors
- More bindings to other jvm data-science libraries
- `Sequence` vs `Iterable`?
- Pluggable backends like native or SQL
- should `unfold` rather be called `flatten`?
- write chapter about timeseries support
- Add parquet support https://stackoverflow.com/questions/39728854/create-parquet-files-in-java
- more defined behavior/tests needed for grouped dfs that become empty after filtering
require(dplyr)
iris %>% group_by(Species) %>% filter(Sepal.Length>100)
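
The open question can be sketched with plain Kotlin collections: after filtering within groups, a group may either survive as an empty group or disappear. The `Flower` type and both function names below are hypothetical and only illustrate the two candidate behaviors that tests would need to pin down. They do not describe what krangl currently does.

```kotlin
// Hypothetical row type for illustration; not a krangl class.
data class Flower(val species: String, val sepalLength: Double)

// Filters within each group and keeps groups that end up empty.
fun filterKeepingEmptyGroups(rows: List<Flower>, minLength: Double): Map<String, List<Flower>> =
    rows.groupBy { it.species }
        .mapValues { (_, group) -> group.filter { it.sepalLength > minLength } }

// Same filter, but groups without surviving rows are dropped.
fun filterDroppingEmptyGroups(rows: List<Flower>, minLength: Double): Map<String, List<Flower>> =
    filterKeepingEmptyGroups(rows, minLength).filterValues { it.isNotEmpty() }

fun main() {
    val rows = listOf(Flower("setosa", 5.1), Flower("virginica", 6.3))
    println(filterKeepingEmptyGroups(rows, 100.0).keys)  // [setosa, virginica]
    println(filterDroppingEmptyGroups(rows, 100.0).keys) // []
}
```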
- misc consider to use kotlin.collections.ArrayAsCollection
- Set up benchmarking suite
List copy optimization
- use iterable where possible
- misc: consider using `kotlin.collections.ArrayAsCollection` --> get rid of `toList`, which always does a full copy internally.
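
To illustrate the copy-versus-view difference behind this item, here is a small sketch that uses the public stdlib function `Array.asList()` as a stand-in for the `ArrayAsCollection` idea. The wrapper names `asView` and `asCopy` are hypothetical.

```kotlin
// Hypothetical wrappers for illustration; not krangl API.

// Returns a List view backed by the array; no elements are copied.
fun <T> asView(column: Array<T>): List<T> = column.asList()

// Returns an independent list; all elements are copied.
fun <T> asCopy(column: Array<T>): List<T> = column.toList()

fun main() {
    val backing = arrayOf("a", "b", "c")
    val view = asView(backing)
    val copy = asCopy(backing)

    // A later write to the array is visible through the view but not the copy.
    backing[0] = "z"

    println(view[0]) // z
    println(copy[0]) // a
}
```

The view avoids the allocation, but it also means later writes to the array show through, so it only fits where the backing array is treated as immutable.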
- 30% flights HOTSPOT: `krangl/Extensions.kt:275` can we get rid of the array creation?
- `krangl.SimpleDataFrame.addColumn` should avoid `toMutableList`
- More consistent use of List vs using arrays as column datastore (see array vs list). This would avoid the array conversions that are omnipresent in the API at the moment.
- get rid of other `toMutableList` and use view instead
- Analyze benchmark results with kravis/krangl :-)
- use for column indices to speed up access
- fast column storage: https://github.com/lemire/JavaFastPFOR http://fastutil.di.unimi.it/
- benchmarking: https://github.com/mm-mansour/Fast-Pandas
- remove regrouping in core verbs where possible
- consider to use invoke for row access (potentially decouple more arguable extensions in different namespace?)
- provide equivalent for dplyr::summarize_each and dplyr::mutate_each #4
- `krangl.head` should use view instead of copy; also consider to use views for grouped data (see https://softwarecave.org/2014/03/19/views-in-java-collections-framework/)
- koma bindings --> http://koma.kyonifer.com/
- Add a `DataFrame.transpose()` method (in R: `as_tibble(cbind(nms = names(df), t(df)))`)
- Integrate idioms to do enrichment testing with Fisher test from commons-math
- see tablesaw changelog https://jtablesaw.github.io/tablesaw/changes_in_v_0.2
- directly access values with `it["foo"]` and not just column object. For the latter `DataFrame.cols` can be used
  - Not a good idea because all extension functions would then be defined for common lists like List etc. It's more important to keep the namespace clear
Provide adhoc/data class conversion for column model adhoc/data class objects
val dataFrame = object : DataFrame() {
val x = Factor("sdf", "sdf", "sdfd")
val y = DblCol(Double.MAX_VALUE, Double.MIN_VALUE)
val z = y + y
}
val newTable = df.map{ data class Foo(val name:String)}
newTable.newCol
newTable.src.x
--> Can not work because data class is not an expression
- improve benchmarking by avoiding JVM warmup with -XX:CompileThreshold=1 src
--> rather continue with jmh driven benchmarking subproject
- Make use of kotlin.Number to simplify API --> Done by adding NumberCol but unclear how to actually benefit from it