Splitting the data #1

Evizero · 2015-09-04T19:56:32Z

We should probably exchange ideas on data sampling / splitting.

The approach I am currently using is very memory-friendly for huge datasets: shuffle the data matrix in place, then create contiguous array views.

Let X be a 10x10000000 Array{Float64,2}

julia> @time shuffle!(X)
elapsed time: 1.202112921 seconds (80 bytes allocated)

julia> @time train = view(X, :, 1:7000000)
elapsed time: 1.7596e-5 seconds (192 bytes allocated)

julia> @time test = view(X, :, 7000001:10000000)
elapsed time: 1.3097e-5 seconds (192 bytes allocated)
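For reference, here is a minimal sketch of the same idea written for current Julia (1.x), where `view` lives in Base and `shuffle!` in the `Random` standard library. One caveat: in current Julia, `shuffle!` applied to a matrix permutes individual elements rather than whole columns, so the sketch swaps columns explicitly. The helper names `shufflecols!` and `splitcols` are made up for illustration, and a small matrix stands in for the 10x10000000 one.

```julia
using Random

# Hypothetical helper: Fisher-Yates shuffle over the columns (observations)
# of X, done in place so no copy of the matrix is allocated.
function shufflecols!(rng::AbstractRNG, X::Matrix)
    n = size(X, 2)
    for j in n:-1:2
        k = rand(rng, 1:j)
        if k != j
            for i in 1:size(X, 1)
                X[i, j], X[i, k] = X[i, k], X[i, j]
            end
        end
    end
    return X
end

# Hypothetical helper: return (train, test) as views over contiguous
# column ranges; `at` is the fraction of columns that goes to train.
function splitcols(X::Matrix, at::Real)
    n = size(X, 2)
    ntrain = round(Int, at * n)
    return view(X, :, 1:ntrain), view(X, :, ntrain+1:n)
end

X = rand(10, 1000)
shufflecols!(MersenneTwister(42), X)
train, test = splitcols(X, 0.7)
```

Both views share memory with `X`, so writing through `train` or `test` modifies the original matrix.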

I couldn't find a better way so far. It does have its limitations when the sampling should be sensitive to the class distribution.
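One possible way around the class-distribution limitation is to leave the matrix untouched and split column indices per class instead. The following is only a sketch for current Julia (1.x) under the assumption that a label vector `y` with one entry per column of `X` is available; `stratified_indices` is a hypothetical name, not an existing function.

```julia
using Random

# Hypothetical helper: split column indices so that each class keeps
# roughly the same train/test proportion. `y[j]` labels column j of X.
function stratified_indices(rng::AbstractRNG, y::AbstractVector, at::Real)
    train = Int[]
    test = Int[]
    for label in unique(y)
        idx = shuffle!(rng, findall(==(label), y))  # shuffled positions of this class
        ntrain = round(Int, at * length(idx))
        append!(train, idx[1:ntrain])
        append!(test, idx[ntrain+1:end])
    end
    return train, test
end

X = rand(10, 1000)
y = vcat(fill(:a, 900), fill(:b, 100))   # imbalanced toy labels
train_idx, test_idx = stratified_indices(MersenneTwister(42), y, 0.7)
train = view(X, :, train_idx)
test = view(X, :, test_idx)
```

The trade-off is that these views are indexed by integer vectors, so they are no longer contiguous in memory, and the index vectors themselves cost one `Int` per observation.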

tbreloff · 2015-09-04T21:51:44Z

See the nnet/data.jl source file... I'm using the idea of fixed arrays of DataPoint objects, and wrappers which access the arrays in different ways. I haven't benchmarked completely, though.
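To make the wrapper idea concrete, here is a rough sketch of what a fixed array of data points plus an index-based wrapper could look like in current Julia (1.x). This is not taken from nnet/data.jl; the `DataPoint` fields and the `DataSubset` type are assumptions made up for illustration.

```julia
# Hypothetical types; the real definitions in nnet/data.jl may differ.
struct DataPoint
    x::Vector{Float64}   # features
    y::Float64           # target
end

# Wrapper that exposes a subset of a fixed DataPoint array through a
# list of indices, without copying any DataPoint.
struct DataSubset <: AbstractVector{DataPoint}
    points::Vector{DataPoint}
    indices::Vector{Int}
end

Base.size(s::DataSubset) = (length(s.indices),)
Base.getindex(s::DataSubset, i::Int) = s.points[s.indices[i]]

points = [DataPoint(rand(3), float(i)) for i in 1:10]
train = DataSubset(points, collect(1:7))
test = DataSubset(points, collect(8:10))
```

Because `DataSubset` subtypes `AbstractVector`, iteration, `length`, and `for p in train` work without further definitions; a different access pattern (for example a shuffled order) only needs a different `indices` vector.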

