This section covers the main API methods for dataset management and data fetching in the Lua/Torch7 dbcollection package.
The dbcollection package is composed of three main groups:
- dbcollection: dataset manager API.
  - load: load a dataset.
  - download: download a dataset's data to disk.
  - process: process a dataset's metadata and store it to file.
  - add: add a dataset/task to the list of available datasets for loading.
  - remove: remove/delete a dataset from the cache.
  - config_cache: configure the cache file.
  - query: do simple queries to the cache.
  - info: print the cache contents.
- dbcollection.utils: utility functions.
  - string_ascii: module containing methods for converting strings to tensors and tensors to strings.
The third group, the data loading API, contains a few methods for data retrieval and probing:
- get: retrieve data from the dataset's HDF5 metadata file.
- object: retrieve a list of all fields' indexes/values of an object composition.
- size: size of a field.
- list: list all fields' names.
- object_field_id: retrieve the index position of a field in the object_ids list.
- info: print information about the data fields of a set.
This is the main module for managing, loading and processing datasets with the dbcollection package.
The dataset manager API is composed of the methods below. The recommended way to use the package is as follows:
local dbc = require 'dbcollection'
The library is structured as a table. In this documentation we use the above convention to import the module and to call its methods (similar to the other APIs).
loader = dbc.load(name, task, data_dir, verbose, is_test)
Returns a loader instance of the dbcollection.DatasetLoader class, with methods to retrieve/probe data and other information from the selected dataset.
Note: you can pass input arguments in two different ways: (1) positionally, in the order shown above, or (2) as a single table with fields named after the input arguments. The second strategy is preferred here because it is simpler and mirrors the other APIs: you specify only the fields you want to change and let the rest keep their default values (see the short sketch after the argument list below).
- name: Name of the dataset. (type=string)
- task: Name of the task to load. (type=string, default='default')
- data_dir: Directory path to store the downloaded data. (type=string, default='')
- verbose: Displays text information (if true). (type=boolean, default=true)
- is_test: Flag used for tests. (type=boolean, default=false)
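As an illustration of the two calling conventions described in the note above, the following calls are equivalent (both load the mnist dataset with its default task):
>>> loader = dbc.load('mnist', 'default')            -- positional arguments
>>> loader = dbc.load{name='mnist', task='default'}  -- named fields in a table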
You can simply load a dataset by its name, as in the following example.
>>> mnist = dbc.load('mnist')
In cases where you don't have the dataset's data on disk yet (and the selected dataset can be downloaded by the API), you can specify the directory where the dataset's data files should be stored and which task should be loaded.
>>> cifar10 = dbc.load{name='cifar10',
task='classification',
data_dir='<my_home>/datasets/'}
Note: If you don't specify the directory path where the data files should be stored, the files will be stored in the dbcollection/<dataset>/data directory where the metadata files are located.
dbc.download(name, data_dir, extract_data, verbose, is_test)
This method will download a dataset's data files to disk. After download, it updates the cache file with the dataset's name and path where the data is stored.
- name: Name of the dataset. (type=string)
- data_dir: Directory path to store the downloaded data. (type=string, default='default')
- extract_data: Extracts/unpacks the data files (if true). (type=boolean, default=true)
- verbose: Displays text information (if true). (type=boolean, default=true)
- is_test: Flag used for tests. (type=boolean, default=false)
In its simplest form, downloading a dataset requires only the name of the target dataset: its data files are downloaded and then extracted to disk without any further supervision.
>>> dbc.download('cifar10')
It is good practice to specify where the data will be downloaded to by providing an existing directory path via data_dir. (This information is stored in the dbcollection.json file located in your home path.)
>>> dbc.download({name='cifar10', data_dir='<some_dir>'})
In cases where you only want to download the dataset's files without extracting its contents, you can set extract_data=false
and skip the data extraction/unpacking step.
>>> dbc.download{name='cifar10',
data_dir='<some_dir>',
extract_data=false}
Note: this package displays a text progress bar when downloading files from URLs (showing file size, elapsed time, % downloaded, etc.). To disable this feature, set verbose=false.
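For instance, a download without the progress bar (reusing the cifar10 example from above) would look like this:
>>> dbc.download{name='cifar10', verbose=false}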
dbc.process(name, task, verbose, is_test)
Processes a dataset's metadata and stores it to file. The metadata is stored in one HDF5 file per task of the dataset. For more information about a dataset's metadata format, please check the list of available datasets in the docs.
- name: Name of the dataset. (type=string)
- task: Name of the task to process. (type=string, default='all')
- verbose: Displays text information (if true). (type=boolean, default=true)
- is_test: Flag used for tests. (type=boolean, default=false)
To process (or reprocess) a dataset's metadata simply do:
>>> dbc.process('cifar10')
This will process all tasks of the given dataset (default='all'). To process only a specific task, specify the name of the task you want to set up. This is handy when you want to process a single task out of many and speed up the processing/loading stage.
>>> dbc.process({name='cifar10', task='default'}) -- process only the 'default' task
Note: this method allows the user to reset a dataset's metadata file in case of manual or accidental changes to the structure of the data. Most users won't need this functionality for basic usage of the package.
dbc.add(name, task, data_dir, file_path, keywords, is_test)
This method provides an easy way to add a custom dataset/task to the dbcollection.json cache file without having to edit the file manually (although doing it by hand is also quite easy).
- name: Name of the dataset. (type=string)
- task: Name of the task to load. (type=string)
- data_dir: Path of the stored data on disk. (type=string)
- file_path: Path to the metadata HDF5 file. (type=string)
- keywords: Table of strings of keywords that categorize the dataset. (type=table, default={})
- is_test: Flag used for tests. (type=boolean, default=false)
Adding a custom dataset, or a custom task to an existing dataset, requires the user to provide the dataset's name, the task name, the data_dir where the data files are stored, and the metadata's file_path on disk.
>>> dbc.add{name='custom1',
task='default',
data_dir='<data_dir>',
file_path='<metadata_file_path>'}
Note: In cases where no external files are required besides the metadata's data, you can set data_dir="".
dbc.remove(name, task, delete_data, is_test)
This method removes a dataset from the list of available datasets for loading in the cache. It can also delete the dataset's files on disk if desired.
- name: Name of the dataset. (type=string)
- task: Name of the task to delete. (type=string, default='None')
- delete_data: Deletes all of the dataset's data files from disk (if true). (type=boolean, default=false)
- is_test: Flag used for tests. (type=boolean, default=false)
To remove a dataset simply do:
>>> dbc.remove('cifar10')
If you want to remove the dataset completely from disk, you can set the delete_data parameter to true.
>>> dbc.remove{name='cifar10', delete_data=true}
dbc.config_cache(field, value, delete_cache, delete_cache_dir, delete_cache_file, reset_cache, is_test)
Configures the cache file via the API. This method allows you to configure the cache file directly by selecting any data field and (re)setting its value. The user can also manually edit the dbcollection.json cache file in the filesystem (recommended).
To modify any entry in the cache file, simply input the field name you want to change along with the new data you want to insert. This applies to any existing field.
Another available operation is to reset/wipe the entire cache paths/configs from the file. To perform this action, set the reset_cache input arg to true.
There is also an option to completely remove the cache file and folder from disk by setting delete_cache to true. This removes the dbcollection.json cache file and the dbcollection/ folder from disk.
Warning: Misusing this API method may result in tears. Proceed with caution.
- field: Name of the field to update/modify in the cache file. (type=string, default='None')
- value: Value to update the field. (type=string, default='None')
- delete_cache: Delete/remove the dbcollection cache file + directory. (type=boolean, default=false)
- delete_cache_dir: Delete/remove the dbcollection cache directory. (type=boolean, default=false)
- delete_cache_file: Delete/remove the dbcollection.json cache file. (type=boolean, default=false)
- reset_cache: Reset the cache file. (type=boolean, default=false)
- is_test: Flag used for tests. (type=boolean, default=false)
For example, let's change the directory where dbcollection's main metadata folder is stored on disk. This is useful for storing/moving all the metadata files to another disk.
>>> dbc.config_cache{field='default_cache_dir',
value='<home_dir>/new/path/db/'}
If a user wants to remove all files related to the dbcollection package, the config_cache method accomplishes this in a simple way:
>>> dbc.config_cache{delete_cache=true}
or if the user wants to remove only the cache file:
>>> dbc.config_cache{delete_cache_file=true}
or to remove the cache directory where all the metadata files from all datasets are stored (I hope you are sure about this one...):
>>> dbc.config_cache{delete_cache_dir=true}
or to simply reset the cache file contents without removing the file:
>>> dbc.config_cache{reset_cache=true}
dbc.query(pattern, is_test)
Performs simple queries on the cache and displays the results on screen.
- pattern: Field name used to search for a matching pattern in the cache data. (type=string, default='info')
- is_test: Flag used for tests. (type=boolean, default=false)
A simple query for the existence of a dataset:
>>> dbc.query('mnist')
It can also retrieve information by category/keyword. For example, this is useful to see which datasets have the same task.
>>> dbc.query('detection')
Note: this is the same as scanning the dbcollection.json cache file yourself, but it has the advantage of grouping the information about a certain pattern for you.
dbc.info(name, paths_info, datasets_info, categories_info, is_test)
Prints the cache contents to screen. It can also print a list of all datasets available to download/process in the dbcollection package.
- name: Name of the dataset to display information about. (type=string, default='None')
- paths_info: Print the paths info to screen. (type=boolean, default=true)
- datasets_info: Print the datasets info to screen. If a string is provided, only the information for that dataset name is selected. (type=string, default='true')
- categories_info: Print the categories info to screen. If a string is provided, only the information for that category/keyword is selected. (type=string, default='true')
- is_test: Flag used for tests. (type=boolean, default=false)
Print the cache file contents to the screen:
>>> dbc.info()
Display all available datasets to download/process:
>>> dbc.info('all')
TODO: Add more examples for the other options
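As a rough sketch of those options (assuming the table-style arguments behave as described in the argument list above, i.e. hiding the paths section while restricting the output to a single dataset):
>>> dbc.info{name='mnist', paths_info=false}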
The data loading API is a class composed of fields containing information about the dataset and methods to retrieve data.
Loading a dataset with dbc.load() returns an instance of the DatasetLoader class. It contains information about the selected dataset (name, task, set splits, directory of the data files stored on disk, etc.) and methods to easily extract data from the metadata file.
The information of the dataset is stored as attributes of the class:
- name: Name of the selected dataset. (type=string)
- task: Name of the selected task. (type=string)
- data_dir: Directory path where the data files are stored. (type=string)
- cache_path: Path where the metadata file is located. (type=string)
- file: The HDF5 file handler of the dataset's metadata. (type=hdf5.HDF5File)
- root_path: HDF5 root group/path from which the methods retrieve data. (type=string, default='default/')
- sets: A list of available sets for the selected dataset (e.g., train/val/test/etc.). (type=table)
- object_fields: A list of all field names available for each set of the dataset. (type=table)
Note: The list of available sets (set splits) and object_fields (available field names) varies from dataset to dataset.
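For example, after loading a dataset you can inspect these attributes directly on the returned object (the exact values printed depend on your local setup, so treat this as a sketch):
>>> mnist = dbc.load('mnist')
>>> print(mnist.name)      -- 'mnist'
>>> print(mnist.task)      -- 'default'
>>> print(mnist.data_dir)  -- directory where the MNIST data files are stored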
data = loader:get(set_name, field_name, idx)
Retrieves data from a dataset's HDF5 metadata file. This method accesses the HDF5 metadata file and searches for the field field_name in the selected group set_name. If an index or a list of indexes is given, the method returns a slice (rows) of the data tensor. If no index is set, it returns the entire data tensor.
- set_name: Name of the set. (type=string)
- field_name: Name of the data field. (type=string)
- idx: Index number of the field. If the input is a table, it is used as a range of indexes and the data for that range is returned. (type=number or table)
The first, and most common, use of this method is to retrieve a single piece of data from a data field. Let's retrieve the first image+label pair from the MNIST dataset.
>>> mnist = dbc.load('mnist') -- returns a DatasetLoader class
>>> img = mnist:get('train', 'images', 1)
>>> label = mnist:get('train', 'labels', 1)
>>> #img
1
28
28
[torch.LongStorage of size 3]
>>> label
5
[torch.ByteTensor of size 1]
This method can also be used to retrieve a range of data/values.
>>> #mnist:get('train', 'images', {1, 20})
20
28
28
[torch.LongStorage of size 3]
Or all values if desired.
>>> #mnist:get('train', 'images')
60000
28
28
[torch.LongStorage of size 3]
indexes = loader:object(set_name, idx, is_value)
Retrieves a list of all fields' indexes/values of an object composition. If is_value=true, instead of returning a tensor containing the object's field indexes, it returns a table of tensors, one per field.
This method is particularly useful when different fields are linked (like in detection tasks with labeled data) and their contents can be quickly accessed and retrieved in one swoop.
- set_name: Name of the set. (type=string)
- idx: Index number of the field. If it is a table, the data for all the indexes in that list is returned. (type=number or table)
- is_value: Outputs a tensor of indexes (if false) or a table of tensors/values (if true). (type=boolean, default=false)
Fetch all indexes of an object.
>>> mnist = dbc.load('mnist')
>>> mnist:object('train', 1)
1 6
[torch.IntTensor of size 1x2]
Multiple lists can be retrieved just like with the get() method.
>>> objs_idxs = mnist:object('train', {1, 10})
1 6
2 1
3 5
4 2
5 10
6 3
7 2
8 4
9 2
10 5
[torch.IntTensor of size 10x2]
It is also possible to retrieve the values/tensors instead of the indexes.
>>> obj_data = mnist:object('train', 1, true)
>>> #obj_data[1]
1
28
28
[torch.LongStorage of size 3]
>>> obj_data[2]
2
[torch.ByteTensor of size 1]
size = loader:size(set_name, field_name)
Returns the size of a field.
- set_name: Name of the set. (type=string)
- field_name: Name of the data field. (type=string)
Note: if field_name is not provided, the object_ids field is used instead.
Get the size of the images tensor in the MNIST train set.
>>> mnist = dbc.load('mnist')
>>> mnist:size('train', 'images')
{
1 : 60000
2 : 28
3 : 28
}
Get the size of the objects in the MNIST train set.
>>> mnist:size('train', 'object_ids')
{
1 : 60000
2 : 2
}
-- or
>>> mnist:size('train')
{
1 : 60000
2 : 2
}
fields = loader:list(set_name)
Lists all field names in a set.
- set_name: Name of the set. (type=string)
Get all fields available in the MNIST test set.
>>> mnist = dbc.load('mnist')
>>> mnist:list('test')
{
1 : "classes"
2 : "list_images_per_class"
3 : "labels"
4 : "images"
5 : "object_ids"
6 : "object_fields"
}
loader:info(set_name)
Prints information about the data fields of a set. It displays, for every set, information such as the field names, sizes and shapes. If a set_name is provided, it displays only the information for that specific set.
- set_name: Name of the set. (type=string)
Display all field information for the MNIST dataset.
>>> mnist = dbc.load('mnist')
>>> mnist:info()
> Set: test
- classes, shape = {10, 2}, dtype = torch.ByteTensor
- images, shape = {10000, 28, 28}, dtype = torch.ByteTensor, (in 'object_ids', position = 1)
- labels, shape = {10000}, dtype = torch.ByteTensor, (in 'object_ids', position = 2)
- object_fields, shape = {2, 7}, dtype = torch.ByteTensor
- object_ids, shape = {10000, 2}, dtype = torch.IntTensor
(Pre-ordered lists)
- list_images_per_class, shape = {10, 1135}, dtype = torch.IntTensor
> Set: train
- classes, shape = {10, 2}, dtype = torch.ByteTensor
- images, shape = {60000, 28, 28}, dtype = torch.ByteTensor, (in 'object_ids', position = 1)
- labels, shape = {60000}, dtype = torch.ByteTensor, (in 'object_ids', position = 2)
- object_fields, shape = {2, 7}, dtype = torch.ByteTensor
- object_ids, shape = {60000, 2}, dtype = torch.IntTensor
(Pre-ordered lists)
- list_images_per_class, shape = {10, 6742}, dtype = torch.IntTensor
position = loader:object_field_id(set_name, field_name)
Retrieves the index position of a field in the object_ids list. This position points to the field name stored in the object_fields attribute.
- set_name: Name of the set. (type=string)
- field_name: Name of the data field. (type=string)
This example shows how to use this method to retrieve the correct fields from an object index list.
>>> mnist = dbc.load('mnist')
>>> print('object field idx (images): ', mnist:object_field_id('train', 'images'))
object field idx (images): 1
>>> print('object field idx (labels): ', mnist:object_field_id('train', 'labels'))
object field idx (labels): 2
>>> mnist.object_fields['train']
{
1 : "images"
2 : "labels"
}
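Putting it together, here is a small sketch of using these positions to fetch a specific field's data from an object entry (the variable names are illustrative):
>>> mnist = dbc.load('mnist')
>>> obj = mnist:object('train', 1)                      -- 1x2 tensor of field indexes
>>> img_pos = mnist:object_field_id('train', 'images')  -- position of 'images' in object_ids
>>> img_idx = obj[1][img_pos]                           -- index into the 'images' field
>>> img = mnist:get('train', 'images', img_idx)         -- 1x28x28 image tensor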
This module contains some useful utility functions available to the user. Currently it is composed of the string_ascii module, described below.
The string_ascii module contains methods to convert strings to ASCII and vice-versa. These are useful for recovering strings from the metadata file that are encoded as torch.ByteTensors (this is due to a limitation in the HDF5 implementation on Torch7).
Although one typically only needs to convert torch.ByteTensors to strings, this module provides methods for both ASCII-to-string and string-to-ASCII conversion.
tensor = dbc.utils.string_ascii.convert_str_to_ascii(input)
Converts a string or a table of strings to a torch.CharTensor.
- input: String or a table of strings. (type=string or table)
Single string.
>>> str = 'string1'
>>> ascii_tensor = dbc.utils.string_ascii.convert_str_to_ascii(str)
>>> print(ascii_tensor)
115 116 114 105 110 103 49 0
[torch.CharTensor of size 1x8]
Table of strings.
>>> str = {'string1', 'string2', 'string3'}
>>> ascii_tensor = dbc.utils.string_ascii.convert_str_to_ascii(str)
>>> print(ascii_tensor)
115 116 114 105 110 103 49 0
115 116 114 105 110 103 50 0
115 116 114 105 110 103 51 0
[torch.CharTensor of size 3x8]
str = dbc.utils.string_ascii.convert_ascii_to_str(input)
Converts a torch.CharTensor or torch.ByteTensor to a table of strings.
- input: 2D torch tensor. (type=torch.ByteTensor or torch.CharTensor)
Convert a torch.CharTensor to a string.
-- ascii format of 'string1'
>>> tensor = torch.CharTensor({{115, 116, 114, 105, 110, 103, 49, 0}})
>>> str = dbc.utils.string_ascii.convert_ascii_to_str(tensor)
>>> print(str)
{
1 : "string1"
}
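As a final sketch, the two helpers can be combined to round-trip a table of strings through its tensor encoding (using the same function names as in the examples above):
>>> strs = {'string1', 'string2'}
>>> tensor = dbc.utils.string_ascii.convert_str_to_ascii(strs)
>>> recovered = dbc.utils.string_ascii.convert_ascii_to_str(tensor)
>>> print(recovered)
{
  1 : "string1"
  2 : "string2"
}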