Data loading & preprocessing pipeline feature #1282
Right now it is possible to use

# Packages assumed by the snippets below; FormatExpr and format come from Formatting.jl
using Flux, FileIO, Images, Formatting

struct Dataset{T, N} <: AbstractArray{T, N}
    frame_template::FormatExpr
    targets::AbstractArray
end
Dataset{T}(frame_template, targets) where {T} =
    Dataset{T, 1}(frame_template, targets)

Then defining

function Base.getindex(d::Dataset{T}, i::Int) where {T}
    path = format(d.frame_template, i - 1)
    image = path |> FileIO.load |> Images.channelview .|> T
    image, d.targets[[i]]
end

and how we load mini-batch

function Base.getindex(d::Dataset{T}, ids::Array) where {T}
    x, y = d[ids[1]]
    xs_last_dim = ntuple(i -> Colon(), ndims(x))
    ys_last_dim = ntuple(i -> Colon(), ndims(y))
    xs = Array{T}(undef, size(x)..., length(ids))
    ys = Array{T}(undef, size(y)..., length(ids))
    xs[xs_last_dim..., 1] .= x
    ys[ys_last_dim..., 1] .= y
    for (i, id) in enumerate(ids[2:end])
        x, y = d[id]
        xs[xs_last_dim..., i + 1] .= x
        ys[ys_last_dim..., i + 1] .= y
    end
    xs, ys
end

And some helper functions

# Type{<:Dataset} so the method also applies to concrete types such as Dataset{Float32, 1}
Base.IndexStyle(::Type{<:Dataset}) = IndexLinear()
Base.size(d::Dataset) = (length(d.targets),)
Base.length(d::Dataset) = length(d.targets)

This in some sense mimics PyTorch's Dataset

frame_template = FormatExpr(raw".\frames\frame-{:d}.jpg") # Template path for images.
targets = load_from_txt(raw".\speed.txt") # Array of targets
dataset = Dataset{Float32}(frame_template, targets)
loader = Flux.Data.DataLoader(dataset, batchsize=4, shuffle=true)
println("Loader length: $(length(loader))")
for (i, (x, y)) in enumerate(loader)
    i == 10 && break
    println("$i: $(size(x)) $(size(y))")
end

Output of the data from my example:
|
For the time being, we can just document the interface that a "Dataset" should expose in order to be compatible with the DataLoader. @pxl-th a PR in this direction would be very welcome. In the longer run, we should definitely consider reimplementing the DataLoader on top of transducers. Transducers are great and come fully packed with features, as @ageron showed. |
Thanks for your detailed answer @pxl-th . I'm not sure whether my code example really makes sense, but it's the kind of API I would imagine, largely inspired from TF's tf.data API, with a transducers twist. I'm happy to help if you want. |
I wonder if something more idiomatic could be done, like:

# I can call custom now and it will return three objects
@dataset (:train) image,target1,target2 function custom(path_arrays,idx)
    image = # load from the path_arrays[idx] + ...
    target1 = # load from the path_arrays[idx] + ...
    target2 = # load from the path_arrays[idx] + ...
end

or

# add another method to dataset
function dataset(path_arrays,idx, :train)
    image = # load from the path_arrays[idx] + ...
    target1 = # load from the path_arrays[idx] + ...
    target2 = # load from the path_arrays[idx] + ...
    image,target1,target2
end

And another type called I am not a Julia expert, but I could help implementing it 😄. |
Hi everyone! I am trying to write a CNN (U-Net) with the Flux ML library in Julia. The first hurdle I am facing is how to load images from a folder to train the model. I have searched the Internet extensively for a method, but no luck. All the CNN examples for Flux that I found use library functions to import pre-existing datasets such as MNIST. Could someone please tell me how to load a custom image dataset from folders to train a CNN model written in Flux? I would be really grateful! |
You can look at UNet.jl |
|
There's a Most libraries would have a form of |
Can I use the load_batch function to output x_train, y_train, so that both could be fed in the - |
Quite possibly. It's hard to say without knowing how your dataset is structured. The function you mention assumes that the images and labels exist together in a directory, and every directory points to a different class. If you write a function that can generate valid paths to images and their corresponding labels, then you can use load_img too. It might be useful to see how that's written anyway; it's quite easy. |
My Data directory is structured as follows:- |
So you'd use the CSV to generate valid paths to the data and corresponding labels and shoot that to load_img (quite possibly you'll broadcast over a vector of strings or vector of tuple of strings) and load the images and labels like https://github.com/DhairyaLGandhi/UNet.jl/blob/2b8f2393a8bf9895f69bb8493fbae84d7c9f9c35/src/dataloader.jl#L7 You can copy this function and replace this line with one that doesn't append the |
Well, would this function be correct if I want to load all the images at once and then feed x_train, y_train to the DataLoader function?

function load_data(base_dir, n)
    img_dir = "/images/"
    mask_dir = "/masks/"
    x = zeros(Float32, rsize..., 1, n) # []
    y = zeros(Float32, rsize..., 1, n) # []
    for i in train_df[!, "id"]
        img = load(joinpath(base_dir, img_dir, i))
        mask = load(joinpath(base_dir, mask_dir, i))
        #img = imresize(img, rsize...)
        #mask = imresize(mask, rsize...)
        img = channelview(img)
        mask = channelview(mask)
        #img = reshape(img, rsize..., 1)
        #mask = reshape(mask, rsize..., 1)
        x′ = @view x[:,:,:,i]
        x′ .= img
        y′ = @view x[:,:,:,i]
        y′ .= mask
    end
    x, y
end |
But the load_img() function will load one image at a time, right? I want to load all the images at once to x_train, and all the masks to y_train. What should I do in that case? |
I wrote this function based on the load_batch() function. But why does this function only return arrays filled with zeros?

function load_data(base_dir, n, rsize = (101,101))
    img_dir = "images/"
    mask_dir = "masks/"
    x = zeros(Float64,rsize...,3, n) # []
    y = zeros(UInt8, rsize..., 1,n)
    count = 1
    for i in train_df[!, "id"]
        try
            if isfile(string(joinpath(base_dir, img_dir, i), ".png")) && isfile(string(joinpath(base_dir, mask_dir, i), ".png"))
                img = load(string(joinpath(base_dir, img_dir, i), ".png"))
                mask = load(string(joinpath(base_dir, mask_dir, i), ".png"))
                #img = imresize(img, rsize...)
                #mask = imresize(mask, rsize...)
                img = channelview(img)
                mask = channelview(mask)
                #img = reshape(img, rsize..., 1)
                #mask = reshape(mask, rsize..., 1)
                x′ = @view x[:,:,:,count]
                x′ .= img
                y′ = @view y[:,:,:,count]
                y′ .= mask
                append(x,x′)
                append(y, y′)
                count += 1
            end
        catch ArgumentError
        end
    end
    x, y
end |
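For what it's worth, here is a minimal sketch of the preallocate-and-fill pattern from the two `load_data` attempts above, using only Base. `fake_image` is a hypothetical stand-in for `channelview(load(path))`, which for an RGB image returns a channels-first `3×H×W` array; the image size and ids are made up. The sketch indexes the output by position rather than by id, permutes each image to match the `H×W×C×N` layout, and has no `try`/`catch`, so a shape mismatch raises an error instead of silently leaving zeros.

```julia
# Hypothetical stand-in for `channelview(load(path))` on an RGB image:
# a channels-first 3×h×w array whose channel c is filled with c + id.
fake_image(id, h, w) = [Float32(c + id) for c in 1:3, r in 1:h, q in 1:w]

function load_all(ids; rsize=(4, 4))
    x = zeros(Float32, rsize..., 3, length(ids))
    for (k, id) in enumerate(ids)            # k is the slot in the batch, id names the sample
        img = fake_image(id, rsize...)
        # channels-first (C, H, W) -> (H, W, C) so it matches the layout of `x`
        x[:, :, :, k] .= permutedims(img, (2, 3, 1))
    end
    x
end

x = load_all([10, 20]; rsize=(4, 4))
```

Two details of the second attempt are worth noting: `catch ArgumentError` does not filter by exception type, it binds whatever was thrown to a variable named `ArgumentError`, so every failure in the loop body is swallowed; and `append` is not a Base function (`append!` is), so that call would itself throw before `count` is incremented.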
An approach such as #1530 combined with DataSets.jl works well with the synchronised data parallel training. That way we could nest loaders to pass on subsets of the paths to different workers which themselves loaded up samples of data from S3 (or disk) to train with. We were able to amortise the cost of loading from file/ network since the loading was itself happening asynchronously as was the data transfer. None of the samples or the entire data could fit in memory. cc @c42f |
Integrated with DataSets.jl I'm imagining something like:

data = Dataloader(dataset("MNIST"))
for (x,y) in data
    ...
end

and let the under-the-hood implementation be something like:

blobtree = open(dataset("MNIST"))
data = Dataloader(mappedarray(load, blobtree))

but if JuliaComputing/DataSets.jl#17 is implemented then it would be much simplified as

data = Dataloader(open(dataset("MNIST")))

My eventual goal in JuliaML/MLDatasets.jl#73 (comment) is to have every dataset backed by DataSets.jl with some wrapper container (with downloader support; you can find some initial discussion in oxinabox/DataDeps.jl#144) |
It probably doesn't even need to be using MappedArrays, since there may be subtleties we're not capturing here. Small datasets or custom data types or the like come to mind. I want it to be a bit composition based so DataSets.jl kicks in when it's advantageous to do so. |
Just to clarify a bit, batch loading + lazy file loading can be implemented quite easily using MLDataPattern and MappedArrays:

using MLDataPattern, MappedArrays
using FileIO, TestImages
using ImageCore, ImageShow
using Flux # for gpu and Flux.stack

datadir = dirname(testimage("camera", download_only=true))
data = mappedarray(readdir(datadir, join=true)) do filename
    Gray{Float32}.(load(filename)[1:32, 1:32, 1])::Matrix{Gray{Float32}}
end;
dataset = batchview(data, size=2);
dataset[1] # a list of two images
for X in dataset
    X_gpu = gpu(Flux.stack(X, 3)) # stack the current batch rather than always dataset[1]
    # do some training...
end

IMHO the more important purpose of
I'm actually not sure about this. Most Flux networks require the data to be stacked into one big numerical array because the backends (e.g., CuDNN) require it. But for other training patterns that natively support vector-of-arrays layouts, there is no such need. |
Right, the argument is to generalise outside of the computer vision case as well. Doing fetching, loading, preprocessing, moving (to |
Having plug and play loaders is the most idiomatic approach I think. It allows for uses that load large quantities of data or just ingest arguments to use and sample data from it. But this should not come with a function to overload imo since that can limit what may work well with different parts of the pipeline. Not just in terms of implementation but also uncommon data sources and API. I am imagining needing this for non-image data as well, since the pipeline is the same either way, so I would like something that can generalise better than be specialised to images (which I believe can happen once we have a pipeline set up). |
Ideally, this should also not come with strict set API to overload but fall out of the general set of guidelines around iteration and so on. I'd like to not have to write overloads, since different data sets may require different API from one to the next. The case with MLDataPatterns seems like it would require adhering to its rules, and lazy views/ types etc, which may not always scale well, or require creating extra copies with AD. AtomGraphs is the type to try alongside images (from AtomicGraphNets) |
The reality is that without a set of traits/methods to dispatch and overload, there will be ambiguities here and there. Letting |
MLDataPattern is just
There are no rules that say you have to be lazy or use views. Much like the indexing/iteration interface in Base Julia, you only need to implement two functions -- one to access a sample, and another to specify how many samples there could be. |
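To make the two-function idea concrete, here is a minimal sketch using only Base. `LazySquares` and `eachbatch` are made-up names for illustration, not part of MLDataPattern or Flux, and the sketch assumes the consumer only ever asks for one sample by index and for the number of samples.

```julia
# Hypothetical lazy data source: nothing is materialised until a sample is requested.
struct LazySquares
    n::Int
end

# (1) how many samples there are
Base.length(d::LazySquares) = d.n

# (2) how to access one sample; real code would read and decode a file here
Base.getindex(d::LazySquares, i::Int) = (Float32(i)^2, i)

# Hypothetical helper: walk over the data in mini-batches of consecutive indices.
eachbatch(d, batchsize) =
    ([d[i] for i in idxs] for idxs in Iterators.partition(1:length(d), batchsize))

batches = collect(eachbatch(LazySquares(5), 2))
```

Nothing here is lazy by obligation: `getindex` could just as well return a sample from an in-memory array.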
FWIW I have been thinking about refactoring how observation dimensions are specified in MLDataPattern which might reduce the interface down to literally the Base indexing/iteration interface. |
I don't really see it, still. It seems to be rediscovering indexing/ iteration from Base. Some common types we can host, and we do too.
Right, and I'd like to overload the iteration protocol in Base and let that be the interface so to speak. |
We already know the functionality in Base is not sufficient, otherwise https://github.com/FluxML/Flux.jl/blob/master/src/data/dataloader.jl#L105-L119 would never have been written. Where this additional functionality should live is the question. If Python ML libraries have taught us anything, it's that having your own siloed API for data containers is a surefire way to increase fragmentation and redundant work across the ecosystem. |
#1530 goes on to clean up the code there.
Correct, that's why
The discussion is geared towards flexible and general pipelining for a variety of ML workflows, not specific packages. For example one may have to work with something like ArchGDAL and it's easier for users to then use their API directly to handle coordinates depending on how it's laid out. By removing the need for overloading a specific function, you've made it possible to reduce boilerplate. Specific implementations (with their own high level API such as MLDatapattern) can of course coexist with more common image loading cases, which themselves can be specialised as Segmentation/ Localization etc tasks. |
Sure, that removes the need to implement a couple non-base functions for the tradeoff of having to implement a substantial chunk of the AbstractArray API. Or alternatively, to re-implement the guts of |
It's fairly common to update implementations... If the argument is that it's not composable then I'd defer you to help improve that PR. |
First off, let me say that I really appreciate the time and thought you've put into revamping Flux's data APIs. It's certainly much easier to stand off to the side and comment than actually roll up one's sleeves and get working code. That said, the reason I'm not motivated to improve PRs here is because I can, right this moment, run
There are of course sticking points with this approach. It's a third party dependency with a handful of transitive deps Flux doesn't rely on. It implements an interface that is neither ubiquitous nor universally agreed upon across the Julia data ecosystem. However, it is the closest thing we have to an agreed upon interface across different libraries at present. Concerns about having to implement yet another set of functions to make one's data source work are valid, but I'm not sure the remedy is to roll our own version. |
If there are concerns about future maintenance (e.g., breakage, commitment), we could ask if @lorenzoh is willing to host DataLoaders.jl in FluxML so as to better coordinate with Flux and FastAI. I ask because it's not clear to me at the moment what #1530 brings given the presence of DataLoaders.jl. |
I agree with @darsnack that standardizing on the For an example of how the
Since it hasn't been mentioned, I want to point out that DataLoaders.jl already does this for any data container that supports
I'm for it, it's already used extensively in FastAI.jl anyway and I think supports everything people want to do. I'm not super familiar with DataSets.jl but I can imagine using that as a backend for a |
@paritosh5feb |
It may not be the safest approach to create views on iteration. Mutation of the data etc can cause further training over multiple epochs to pollute results. Having said that, it's trivial to change the |
Please help! After updating Flux to the latest version, @pxl-th's brilliant solution stopped working... what should I do? |
What error are you getting? I don't immediately see any reason that it shouldn't work. For reference, just defining:

function Base.getindex(d::Dataset{T}, i::Int) where {T}
    path = format(d.frame_template, i - 1)
    image = path |> FileIO.load |> Images.channelview .|> T
    image, d.targets[[i]]
end
Base.length(d::Dataset) = length(d.targets)

is all that is necessary for it to work. |
After removing Or maybe there were some fixes in the past 17 hours? I am not versioning my manifest so I do not know. |
Seriously... @pxl-th's solution does not work anymore.

using Flux
struct Dataset{T, N} <: AbstractArray{T, N}
    length::Int
end
Dataset{T}(length) where {T} = Dataset{T, 1}(length)
function Base.getindex(d::Dataset{T}, i::Int) where {T}
    image = rand(UInt8, 32, 32) # Just random image.
    target = rand(range(1, 10)) # and random class.
    image, target
end
function Base.getindex(d::Dataset{T}, ids::Array) where {T}
    x, y = d[ids[1]]
    xs_last_dim = ntuple(i -> (), ndims(x))
    ys_last_dim = ntuple(i -> Colon(), ndims(y))
    xs = Array{T}(undef, size(x)..., length(ids))
    ys = Array{T}(undef, size(y)..., length(ids))
    xs[xs_last_dim..., 1] .= x
    ys[ys_last_dim..., 1] .= y
    for (i, id) in enumerate(ids[2:end])
        x, y = d[id]
        xs[xs_last_dim..., i + 1] .= x
        ys[ys_last_dim..., i + 1] .= y
    end
    xs, ys
end
Base.IndexStyle(::Type{Dataset}) = IndexLinear()
Base.size(d::Dataset) = (d.length,)
Base.length(d::Dataset) = d.length
dataset = Dataset{Float32}(10)
loader = Flux.Data.DataLoader(dataset, batchsize=4, shuffle=true)
for (xs, xy) in loader # Fails here!
    println("$i: $(size(xs)) $(size(ys))")
end

Using Flux v0.13.3 this MWE produces the following error.

ERROR: MethodError: Cannot `convert` an object of type Tuple{Matrix{UInt8}, Int64} to an object of type Float32

Seems like the |
Could you post a complete stack trace and the output of Pkg status? |
Sure, sorry. Now I've removed the
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 8
JULIA_PKG_USE_CLI_GIT = true
JULIA_EDITOR = code
(test) pkg> st
Status `~/Dokumenty/Projekty/BSU/test/Project.toml`
[587475ba] Flux v0.13.3
julia> using Flux
struct Dataset{T, N} <: AbstractArray{T, N}
length::Int
end
Dataset{T}(length) where {T} = Dataset{T, 1}(length)
... # same as the code above
dataset = Dataset{Float32}(10)
loader = Flux.Data.DataLoader(dataset, batchsize=4, shuffle=true)
MLUtils.DataLoader{Dataset{Float32, 1}, Random._GLOBAL_RNG}(Float32[(UInt8[0xdc 0x69 … 0xe9 0xdf; 0x94 0xa1 … 0x8a 0x1e; … ; 0x7d 0x73 … 0x6b 0xbc; 0x74 0x5f … 0xcf 0x7e], 6), (UInt8[0x1b 0x95 … 0x65 0xb6; 0xb7 0x4a … 0x84 0x4b; … ; 0xf9 0xde … 0x99 0x75; 0xe1 0x81 … 0xac 0xfd], 7), (UInt8[0xa2 0x35 … 0x7f 0x2a; 0xc5 0x0d … 0xd7 0x87; … ; 0x81 0xd8 … 0x04 0xc6; 0x6f 0xa5 … 0xaa 0xab], 8), (UInt8[0x26 0x75 … 0xf0 0xf6; 0x93 0x7d … 0xe4 0x49; … ; 0x19 0x4d … 0x60 0x91; 0x79 0x48 … 0x39 0xc0], 5), (UInt8[0x3f 0x19 … 0x5e 0x78; 0x77 0x7c … 0x1c 0x12; … ; 0xa8 0x16 … 0x01 0xa6; 0x93 0x68 … 0x37 0x65], 3), (UInt8[0x99 0x25 … 0x3a 0x2a; 0x6c 0x10 … 0x3b 0xbb; … ; 0x5d 0xde … 0xbf 0x16; 0xe8 0xbe … 0x3c 0xfa], 2), (UInt8[0x3e 0xf4 … 0x3d 0x86; 0xda 0xce … 0x29 0x07; … ; 0x17 0xf6 … 0x09 0xc5; 0xf5 0x09 … 0x1c 0xcb], 5), (UInt8[0x94 0xa2 … 0x88 0x87; 0x8d 0xb1 … 0xfa 0x98; … ; 0xc3 0x2d … 0xc9 0x41; 0x50 0x0c … 0x86 0x88], 4), (UInt8[0xb6 0x8d … 0x15 0xe2; 0xd5 0xa5 … 0x01 0x46; … ; 0x45 0xde … 0x19 0xc2; 0xe5 0x3c … 0xe1 0x2d], 5), (UInt8[0x5f 0x69 … 0x9b 0xc5; 0x7d 0x6f … 0xbd 0xd8; … ; 0x75 0xb4 … 0xbb 0xa7; 0x82 0x19 … 0xe5 0x79], 1)], 4, 10, true, true, Random._GLOBAL_RNG())
julia> for (xs, xy) in loader
println("$i: $(size(xs)) $(size(ys))")
end
ERROR: MethodError: Cannot `convert` an object of type Tuple{Matrix{UInt8}, Int64} to an object of type Float32
Closest candidates are:
convert(::Type{T}, ::LLVM.GenericValue, ::LLVM.LLVMType) where T<:AbstractFloat at ~/.julia/packages/LLVM/WjSQG/src/execution.jl:39
convert(::Type{T}, ::LLVM.ConstantFP) where T<:AbstractFloat at ~/.julia/packages/LLVM/WjSQG/src/core/value/constant.jl:111
convert(::Type{T}, ::Static.StaticFloat64) where T<:AbstractFloat at ~/.julia/packages/Static/KC67x/src/float.jl:22
...
Stacktrace:
[1] setindex!(A::Vector{Float32}, x::Tuple{Matrix{UInt8}, Int64}, i1::Int64)
@ Base ./array.jl:903
[2] macro expansion
@ ./multidimensional.jl:867 [inlined]
[3] macro expansion
@ ./cartesian.jl:64 [inlined]
[4] _unsafe_getindex!
@ ./multidimensional.jl:862 [inlined]
[5] _unsafe_getindex
@ ./multidimensional.jl:853 [inlined]
[6] _getindex
@ ./multidimensional.jl:839 [inlined]
[7] getindex
@ ./abstractarray.jl:1218 [inlined]
[8] getobs
@ ~/.julia/packages/MLUtils/OojOS/src/observation.jl:96 [inlined]
[9] getobs(A::MLUtils.BatchView{SubArray{Float32, 1, Dataset{Float32, 1}, Tuple{Vector{Int64}}, false}, SubArray{Float32, 1, Dataset{Float32, 1}, Tuple{Vector{Int64}}, false}}, i::Int64)
@ MLUtils ~/.julia/packages/MLUtils/OojOS/src/batchview.jl:105
[10] (::MLUtils.var"#34#36")(i::Int64)
@ MLUtils ./none:0
[11] iterate
@ ./generator.jl:47 [inlined]
[12] iterate(d::MLUtils.DataLoader{Dataset{Float32, 1}, Random._GLOBAL_RNG})
@ MLUtils ~/.julia/packages/MLUtils/OojOS/src/dataloader.jl:91
[13] top-level scope
@ ~/Dokumenty/Projekty/BSU/test/dataloader.jl:42
julia>

For completeness, I am adding a Manifest and a Project file. |
That's because you made Presumably, you don't want to actually make your dataset type an array type. You don't need to implement the If you do actually want your type to be an array, then you can implement the array interface correctly, and things should work. Furthermore, if you want your array type to be multidimensional, then you can implement multidimensional Last thing, since this thread is originally about out of memory data. MLDatasets.jl includes some experimental types to make this simpler. Based on your use-case, you want to use |
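The `convert` error in the stack trace above can be reproduced without Flux or MLUtils. The sketch below uses only Base; `ArrayLike` and `PlainDataset` are made-up names, and whether a particular loader accepts the plain struct is an assumption that depends on which interface that loader expects. The first type subtypes `AbstractVector{Float32}` yet returns tuples, so Base's generic vector indexing fails; the second makes no element-type promise.

```julia
# Subtyping AbstractVector{Float32} promises that every element is a Float32...
struct ArrayLike <: AbstractVector{Float32}
    n::Int
end
Base.size(d::ArrayLike) = (d.n,)
Base.getindex(d::ArrayLike, i::Int) = (rand(UInt8, 2, 2), i)   # ...but a tuple comes back

# Base's generic indexing with a vector of indices allocates a Vector{Float32}
# and then tries to store the tuples in it.
err = try
    ArrayLike(4)[[1, 2]]
    nothing
catch e
    e
end

# A plain struct makes no such promise, so samples can have any shape.
struct PlainDataset
    n::Int
end
Base.length(d::PlainDataset) = d.n
Base.getindex(d::PlainDataset, i::Int) = (rand(UInt8, 2, 2), i)
Base.getindex(d::PlainDataset, ids::AbstractVector{<:Integer}) = [d[i] for i in ids]

batch = PlainDataset(4)[[1, 2]]   # a vector of (image, target) tuples
```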
I understand this error message. I'm just saying that @pxl-th's solution no longer works and lazy loading has to be done after the batch (containing only metadata of samples) is returned by the loader in the training loop. |
Closing as outdated. Also, the DataLoader has been moved to MLUtils.jl and supports generic datasets through the |
The DataLoader is nice, but if I understand correctly it requires the dataset to fit in memory. For large datasets that don't fit in memory, it would be nice to have an easy way to load & preprocess the data efficiently, similar to TensorFlow's tf.data API. Maybe something like this exists already? If not, perhaps one option would be to provide custom transducers to make it possible to write things like:
This would load records from multiple files (in random file order), pick 4 randomly, interleave their records, preprocess every record, shuffle records using a 100,000 element buffer, and batch the records with batch size 32, and prefetch 1 batch (so the CPU can prepare the next batch while the GPU is working on the previous batch). Then the data could be used for training.
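The batching and prefetching steps described above can be approximated with Base alone. The sketch below is not the transducer API proposed in this issue: `load_record` and `prefetching_batches` are hypothetical names, the records are synthetic, and it shuffles the whole index list up front rather than using a bounded shuffle buffer. A buffered `Channel` filled by a spawned task plays the role of the prefetch queue.

```julia
using Random

# Hypothetical stand-in for reading and preprocessing one record from disk.
load_record(i) = Float32(i) .* ones(Float32, 2)

# Produce shuffled mini-batches on a background task; at most `prefetch`
# finished batches wait in the channel while the consumer is busy.
function prefetching_batches(f, ids; batchsize=32, prefetch=1, rng=Random.default_rng())
    order = shuffle(rng, collect(ids))
    Channel{Matrix{Float32}}(prefetch; spawn=true) do ch
        for chunk in Iterators.partition(order, batchsize)
            put!(ch, reduce(hcat, [f(i) for i in chunk]))   # one column per record
        end
    end
end

batches = collect(prefetching_batches(load_record, 1:10; batchsize=4))
```

In a training loop one would iterate the channel directly (`for x in prefetching_batches(...)`) instead of collecting it, so that loading overlaps with the training step.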