Splitting the data #1

Open
Evizero opened this issue Sep 4, 2015 · 1 comment

Comments

@Evizero

Evizero commented Sep 4, 2015

We should probably exchange ideas on data sampling / splitting.

The approach I currently use is very memory-friendly for huge datasets: shuffle the data matrix in place, then take contiguous array views of it.

Let X be a 10x10000000 Array{Float64,2}

julia> @time shuffle!(X)
elapsed time: 1.202112921 seconds (80 bytes allocated)

julia> @time train = view(X, :, 1:7000000)
elapsed time: 1.7596e-5 seconds (192 bytes allocated)

julia> @time test = view(X, :, 7000001:10000000)
elapsed time: 1.3097e-5 seconds (192 bytes allocated)
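
For reference, here is a minimal sketch of the same idea for current Julia versions. It rests on a few assumptions: each column of `X` is one observation, `shufflecols!` and `splitcols` are hypothetical helper names (not functions from any package), and the views come from Base's `view`. As far as I know, `Random.shuffle!` applied to a matrix permutes individual elements rather than whole columns, which is why the sketch swaps columns explicitly.

```julia
using Random

# Hypothetical helper: Fisher-Yates shuffle over the columns (observations)
# of X. Columns are swapped element by element, so no copy of X is made.
function shufflecols!(rng::AbstractRNG, X::AbstractMatrix)
    n = size(X, 2)
    for i in n:-1:2
        j = rand(rng, 1:i)
        if i != j
            for r in axes(X, 1)
                X[r, i], X[r, j] = X[r, j], X[r, i]
            end
        end
    end
    return X
end

# Hypothetical helper: split the columns into two contiguous views.
# `at` is the fraction of observations that goes into the first view.
function splitcols(X::AbstractMatrix, at::Real)
    n = size(X, 2)
    ntrain = round(Int, at * n)
    return view(X, :, 1:ntrain), view(X, :, (ntrain + 1):n)
end

X = rand(10, 1000)
shufflecols!(Random.default_rng(), X)
train, test = splitcols(X, 0.7)
```

Both views share memory with `X`, so the split itself allocates almost nothing; the cost is that the original column order is lost.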

I haven't found a better way so far. It does have its limitations, though, when the sampling needs to be sensitive to the class distribution.
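
One possible way around the class-distribution limitation, sketched under assumptions: keep a label vector `y` with one entry per column of `X`, draw the split per class, and build the views from index vectors instead of ranges. `stratifiedindices` is a hypothetical helper name, and the 80/20 toy labels are only for illustration.

```julia
using Random

# Hypothetical helper: returns (train_idx, test_idx) such that each class
# contributes roughly the fraction `at` of its observations to training.
function stratifiedindices(rng::AbstractRNG, y::AbstractVector, at::Real)
    train_idx = Int[]
    test_idx = Int[]
    for label in unique(y)
        # Positions of this class, in random order.
        idx = shuffle!(rng, findall(==(label), y))
        ntrain = round(Int, at * length(idx))
        append!(train_idx, idx[1:ntrain])
        append!(test_idx, idx[(ntrain + 1):end])
    end
    return train_idx, test_idx
end

# Toy data: 80 observations of class :a, 20 of class :b.
y = vcat(fill(:a, 80), fill(:b, 20))
X = rand(10, 100)

train_idx, test_idx = stratifiedindices(Random.default_rng(), y, 0.7)
train = view(X, :, train_idx)  # non-contiguous view, still no copy of X
test = view(X, :, test_idx)
```

The trade-off is that these views are no longer contiguous in memory, so iterating over them is likely slower than over range-based views, and the index vectors themselves cost one `Int` per observation.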

@tbreloff
Owner

tbreloff commented Sep 4, 2015

See the nnet/data.jl source file... I'm using the idea of fixed arrays of DataPoint objects, and wrappers which access the arrays in different ways. I haven't benchmarked completely, though.
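
Roughly, the idea looks like the sketch below. This is not the actual contents of nnet/data.jl: the `DataPoint` fields and the `DataSubset` wrapper are made up for illustration, and only show how several wrappers can expose different parts of one fixed array without copying the points.

```julia
using Random

# Hypothetical types for illustration only.
struct DataPoint
    x::Vector{Float64}  # input features
    y::Float64          # target value
end

# A wrapper keeps a reference to the shared array plus the indices it exposes.
struct DataSubset <: AbstractVector{DataPoint}
    data::Vector{DataPoint}
    indices::Vector{Int}
end

# Minimal AbstractVector interface, so the wrapper can be indexed and iterated.
Base.size(s::DataSubset) = (length(s.indices),)
Base.getindex(s::DataSubset, i::Int) = s.data[s.indices[i]]

# One fixed array of points, accessed through two different wrappers.
points = [DataPoint(rand(10), rand()) for _ in 1:100]
perm = randperm(100)
train = DataSubset(points, perm[1:70])
test = DataSubset(points, perm[71:end])
```

Reshuffling or resampling then only touches the index vectors, never the underlying `points` array.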

