Lua/Torch7 API documentation

This section covers the main API methods for dataset management and data fetching in the Lua/Torch7 dbcollection package.

The dbcollection package is composed of three main groups: the dataset manager API (dbcollection), the data loading API (the DatasetLoader class returned by dbc.load()) and the utility functions (dbcollection.utils).

The data loading API contains a few methods for data retrieval/probing:

  • get: retrieve data from the dataset's hdf5 metadata file.
  • object: retrieves a list of all fields' indexes/values of an object composition.
  • size: size of a field.
  • list: lists all fields' names.
  • object_field_id: retrieves the index position of a field in the object_ids list.
  • info: prints information about the data fields of a set.

dbcollection

This is the main module for managing, loading and processing datasets with the dbcollection package. The dataset manager API is composed of the methods described below. The recommended way to use the package is as follows:

local dbc = require 'dbcollection'

The library is structured as a table. In this documentation we use the above convention to import the module and to call its methods (similar to the other APIs).

load

loader = dbc.load(name, task, data_dir, verbose, is_test)

Returns a loader instance of the dbcollection.DatasetLoader class, with methods to retrieve/probe data and other information about the selected dataset.

Note: you can pass input arguments in two different ways: (1) by passing values in the correct positional order, or (2) by passing a single table with fields named after the input args. The second strategy is preferred here because it is simpler and mirrors the other APIs: you only specify the fields you want to change and let the rest keep their default values.
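As a quick illustration of the two conventions (a minimal sketch; 'mnist' and the 'default' task are just example values):

>>> -- (1) positional arguments
>>> mnist = dbc.load('mnist', 'default')
>>> -- (2) a table with named fields (preferred)
>>> mnist = dbc.load{name='mnist', task='default'}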

Parameters

  • name: Name of the dataset. (type=string)
  • task: Name of the task to load. (type=string, default='default')
  • data_dir: Directory path to store the downloaded data. (type=string, default='')
  • verbose: Displays text information (if true). (type=boolean, default=true)
  • is_test: Flag used for tests. (type=boolean, default=false)

Usage examples

You can simply load a dataset by its name, as in the following example.

>>> mnist = dbc.load('mnist')

In cases where you don't have the dataset's data on disk yet (and the selected dataset can be downloaded by the API), you can specify the directory where the dataset's data files should be stored and which task should be loaded for this dataset.

>>> cifar10 = dbc.load{name='cifar10',
                       task='classification',
                       data_dir='<my_home>/datasets/'}

Note: If you don't specify the directory path where the data files should be stored, they will be placed in the dbcollection/<dataset>/data dir, where the metadata files are located.
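To check where the files ended up, you can inspect the loader's data_dir attribute (see the Data loading API section below). A minimal sketch, assuming MNIST was loaded with the defaults:

>>> mnist = dbc.load('mnist')
>>> print(mnist.data_dir)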

download

dbc.download(name, data_dir, extract_data, verbose, is_test)

This method will download a dataset's data files to disk. After download, it updates the cache file with the dataset's name and path where the data is stored.

Parameters

  • name: Name of the dataset. (type=string)
  • data_dir: Directory path to store the downloaded data. (type=string, default='default')
  • extract_data: Extracts/unpacks the data files (if true). (type=boolean, default=true)
  • verbose: Displays text information (if true). (type=boolean, default=true)
  • is_test: Flag used for tests. (type=boolean, default=false)

Usage examples

A simple usage example for downloading a dataset (without providing a storage path for the data) requires only the name of the target dataset; its data files are then downloaded and extracted to disk without any further supervision.

>>> dbc.download('cifar10')

It is good practice to specify where the data will be downloaded to by providing an existing directory path in data_dir. (This information is stored in the dbcollection.json file located in your home path.)

>>> dbc.download({name='cifar10', data_dir='<some_dir>'})

In cases where you only want to download the dataset's files without extracting their contents, you can set extract_data=false and skip the data extraction/unpacking step.

>>> dbc.download{name='cifar10',
                 data_dir='<some_dir>',
                 extract_data=false}

Note: this package displays a text progress bar when downloading files from URLs (file size, elapsed time, download percentage, etc.). To disable this feature, set verbose=false.
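For example, a quiet download without the progress bar (a sketch; the directory path is a placeholder):

>>> dbc.download{name='cifar10',
                 data_dir='<some_dir>',
                 verbose=false}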

process

dbc.process(name, task, verbose, is_test)

Processes a dataset's metadata and stores it to file. The metadata is stored in one HDF5 file per task of the dataset. For more information about a dataset's metadata format, please check the list of available datasets in the docs.

Parameters

  • name: Name of the dataset. (type=string)
  • task: Name of the task to process. (type=string, default='all')
  • verbose: Displays text information (if true). (type=boolean, default=true)
  • is_test: Flag used for tests. (type=boolean, default=false)

Usage examples

To process (or reprocess) a dataset's metadata simply do:

>>> dbc.process('cifar10')

This will process all tasks of a given dataset (default='all'). To process only a specific task, specify the name of the task you want to set up. This is handy when you want to process a single task out of several and speed up the processing/loading stage.

>>> dbc.process({name='cifar10', task='default'}) -- process only the 'default' task

Note: this method allows the user to reset a dataset's metadata file in case of manual or accidental changes to the structure of the data. Most users won't need this functionality in their basic usage of this package.
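Putting the manager methods together, a typical first-time setup chains download, process and load. A minimal sketch (the storage directory is a placeholder):

>>> dbc.download{name='cifar10', data_dir='<my_home>/datasets/'}
>>> dbc.process('cifar10')
>>> cifar10 = dbc.load('cifar10')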

add

dbc.add(name, task, data_dir, file_path, keywords, is_test)

This method provides an easy way to add a custom dataset/task to the dbcollection.json cache file without having to edit it by hand (although doing so manually is easy and also recommended!).

Parameters

  • name: Name of the dataset. (type=string)
  • task: Name of the task to load. (type=string)
  • data_dir: Path of the stored data on disk. (type=string)
  • file_path: Path to the metadata HDF5 file. (type=string)
  • keywords: Table of strings of keywords that categorize the dataset. (type=table, default={})
  • is_test: Flag used for tests. (type=boolean, default=false)

Usage examples

Adding a custom dataset, or a custom task to an existing dataset, requires the user to provide the dataset's name, the task name, the data_dir where the data files are stored, and the metadata's file_path on disk.

>>> dbc.add{name='custom1',
            task='default',
            data_dir='<data_dir>',
            file_path='<metadata_file_path>'}

Note: In cases where no external files are required besides the metadata's data, you can set data_dir="".
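You can also attach keywords when registering a custom dataset and then load it like any other cache entry. A sketch, where the paths and the 'classification' keyword are placeholders:

>>> dbc.add{name='custom1',
            task='default',
            data_dir='<data_dir>',
            file_path='<metadata_file_path>',
            keywords={'classification'}}
>>> custom1 = dbc.load('custom1')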

remove

dbc.remove(name, task, delete_data, is_test)

This method removes a dataset from the list of datasets available for loading in the cache. It can also delete the dataset's files from disk if desired.

Parameters

  • name: Name of the dataset. (type=string)
  • task: Name of the task to delete. (type=string, default='None')
  • delete_data: Deletes all of this dataset's data files from disk (if true). (type=boolean, default=false)
  • is_test: Flag used for tests. (type=boolean, default=false)

Usage examples

To remove a dataset simply do:

>>> dbc.remove('cifar10')

If you want to remove the dataset completely from disk, you can set the delete_data parameter to true.

>>> dbc.remove{name='cifar10', delete_data=true}
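To drop only a single task from the cache while keeping the rest of the dataset registered, you can also pass the task name (a sketch, assuming a 'classification' task exists for this dataset):

>>> dbc.remove{name='cifar10', task='classification'}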

config_cache

dbc.config_cache(field, value, delete_cache, delete_cache_dir, delete_cache_file, reset_cache, is_test)

Configures the cache file via the API. This method allows you to configure the cache file directly by selecting any data field and (re)setting its value. The user can also manually edit the dbcollection.json cache file in the filesystem (recommended).

To modify any entry in the cache file, simply input the field name you want to change along with the new data you want to insert. This applies to any existing field.

Another available operation is resetting/wiping the entire set of cache paths/configs in the file. To perform this action, set the reset_cache input arg to true.

There is also an option to completely remove the cache file and folder from disk by setting delete_cache to true. This removes the cache file dbcollection.json and the dbcollection/ folder from disk.

Warning: Misusing this API method may result in tears. Proceed with caution.

Parameters

  • field: Name of the field to update/modify in the cache file. (type=string, default='None')
  • value: Value to update the field. (type=string, default='None')
  • delete_cache: Delete/remove the dbcollection cache file + directory. (type=boolean, default=false)
  • delete_cache_dir: Delete/remove the dbcollection cache directory. (type=boolean, default=false)
  • delete_cache_file: Delete/remove the dbcollection.json cache file. (type=boolean, default=false)
  • reset_cache: Reset the cache file. (type=boolean, default=false)
  • is_test: Flag used for tests. (type=boolean, default=false)

Usage examples

For example, let's change the directory where dbcollection's main metadata folder is stored on disk. This is useful to store/move all the metadata files to another disk.

>>> dbc.config_cache{field='default_cache_dir',
                     value='<home_dir>/new/path/db/'}

If a user wants to remove all files related to the dbcollection package, config_cache can accomplish this in a simple way:

>>> dbc.config_cache{delete_cache=true}

or if the user wants to remove only the cache file:

>>> dbc.config_cache{delete_cache_file=true}

or to remove the cache directory where all the metadata files from all datasets are stored (I hope you are sure about this one...):

>>> dbc.config_cache{delete_cache_dir=true}

or to simply reset the cache file contents without removing the file:

>>> dbc.config_cache{reset_cache=true}

query

dbc.query(pattern, is_test)

Performs simple queries on the cache and displays the results on screen.

Parameters

  • pattern: Field name used to search for a matching pattern in cache data. (type=string, default='info')
  • is_test: Flag used for tests. (type=boolean, default=false)

Usage examples

Simple query about the existence of a dataset.

>>> dbc.query('mnist')

It can also retrieve information by category/keyword. This is useful, for example, to see which datasets share the same task.

>>> dbc.query('detection')

Note: this is the same as scanning the dbcollection.json cache file yourself, but it has the advantage of grouping information about a certain pattern for you.

info

dbc.info(name, paths_info, datasets_info, categories_info, is_test)

Prints the cache contents to screen. It can also print a list of all available datasets to download/process in the dbcollection package.

Parameters

  • name: Name of the dataset to display information. (type=string, default='None')
  • paths_info: Print the paths info to screen. (type=boolean, default=true)
  • datasets_info: Print the datasets info to screen. If a string is provided, it selects only the information for that dataset (by name). (type=string, default='true')
  • categories_info: Print the categories info to screen. If a string is provided, it selects only the information for that category (keyword). (type=string, default='true')
  • is_test: Flag used for tests. (type=boolean, default=false)

Usage examples

Print the cache file contents to the screen:

>>> dbc.info()

Display all available datasets to download/process:

>>> dbc.info('all')

TODO: Add more examples for the other options
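In the meantime, here is a sketch of how the other options might be combined, based only on the parameter descriptions above:

>>> -- show only the datasets info for 'mnist', skipping the paths section
>>> dbc.info{paths_info=false, datasets_info='mnist'}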

Data loading API

The data loading API is a class composed of fields containing information about the dataset and methods to retrieve data.

Loading a dataset with dbc.load() returns an instance of the DatasetLoader class. It contains information about the selected dataset (name, task, set splits, directory of the data files stored on disk, etc.) and methods to easily extract data from the metadata file.

The information of the dataset is stored as attributes of the class:

  • name: Name of the selected dataset. (type = string)
  • task: Name of the selected task. (type = string)
  • data_dir: Directory path where the data files are stored. (type = string)
  • cache_path: Path where the metadata file is located. (type = string)
  • file: The HDF5 file handler of the dataset's metadata. (type = hdf5.HDF5File)
  • root_path: HDF5 root group/path which the methods retrieve data from. (type = string, default='default/')
  • sets: A list of available sets for the selected dataset (e.g., train/val/test/etc.). (type = table)
  • object_fields: A list of all field names available for each set of the dataset. (type = table)

Note: The list of available sets (set splits) and object_fields (available field names) varies from dataset to dataset.
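For instance, a quick way to inspect these attributes after loading a dataset (a minimal sketch using MNIST):

>>> mnist = dbc.load('mnist')
>>> print(mnist.name, mnist.task)
>>> print(mnist.sets)           -- available set splits
>>> print(mnist.object_fields)  -- field names per set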

get

data = loader:get(set_name, field_name, idx)

Retrieves data from a dataset's HDF5 metadata file. This method accesses the HDF5 metadata file and searches for the field field_name in the selected group set_name. If an index or a list of indexes is given, the method returns a slice (rows) of the data tensor. If no index is set, it returns the entire data tensor.

Parameters

  • set_name: Name of the set. (type=string)
  • field_name: Name of the data field. (type=string)
  • idx: Index number of the field. If the input is a table, it uses it as a range of indexes and returns the data for that range. (type=number or table)

Usage examples

The first, and most common, use of this method is to retrieve a single piece of data from a data field. Let's retrieve the first image+label pair from the MNIST dataset.

>>> mnist = dbc.load('mnist')  -- returns a DatasetLoader class
>>> img = mnist:get('train', 'images', 1)
>>> label = mnist:get('train', 'labels', 1)
>>> #img
  1
 28
 28
[torch.LongStorage of size 3]

>>> label
 5
[torch.ByteTensor of size 1]

This method can also be used to retrieve a range of data/values.

>>> #mnist:get('train', 'images', {1, 20})
 20
 28
 28
[torch.LongStorage of size 3]

Or all values if desired.

>>> #mnist:get('train', 'images')
 60000
    28
    28
[torch.LongStorage of size 3]

object

indexes = loader:object(set_name, idx, is_value)

Retrieves a list of all fields' indexes/values of an object composition. If is_value=true, instead of returning a tensor containing the object's field indexes, it returns a table with one tensor/value per field.

This method is particularly useful when different fields are linked (like in detection tasks with labeled data) and their contents can be quickly accessed and retrieved in one swoop.

Parameters

  • set_name: Name of the set. (type=string)
  • idx: Index number of the object. If it is a table, returns the data for all the indexes in that list. (type=number or table)
  • is_value: Outputs a tensor of indexes (if false) or a table of tensors/values (if true). (type=boolean, default=false)

Usage examples

Fetch all indexes of an object.

>>> mnist = dbc.load('mnist')
>>> mnist:object('train', 1)
 1  6
[torch.IntTensor of size 1x2]

Multiple lists can be retrieved just like with the get() method.

>>> objs_idxs = mnist:object('train', {1, 10})
  1   6
  2   1
  3   5
  4   2
  5  10
  6   3
  7   2
  8   4
  9   2
 10   5
[torch.IntTensor of size 10x2]

It is also possible to retrieve the values/tensors instead of the indexes.

>>> obj_data = mnist:object('train', 1, true)
>>> #obj_data[1]
  1
 28
 28
[torch.LongStorage of size 3]

>>> obj_data[2]
 2
[torch.ByteTensor of size 1]

size

size = loader:size(set_name, field_name)

Returns the size of a field.

Parameters

  • set_name: Name of the set. (type=string)
  • field_name: Name of the data field. (type=string)

Note: if field_name is not provided, it uses the object_ids field instead.

Usage examples

Get the size of the images tensor in MNIST train set.

>>> mnist = dbc.load('mnist')
>>> mnist:size('train', 'images')
{
  1 : 60000
  2 : 28
  3 : 28
}

Get the size of the objects in MNIST train set.

>>> mnist:size('train', 'object_ids')
{
  1 : 60000
  2 : 2
}

-- or
>>> mnist:size('train')
{
  1 : 60000
  2 : 2
}

list

fields = loader:list(set_name)

Lists all field names in a set.

Parameters

  • set_name: Name of the set. (type=string)

Usage examples

Get all fields available in the MNIST test set.

>>> mnist = dbc.load('mnist')
>>> mnist:list('test')
{
  1 : "classes"
  2 : "list_images_per_class"
  3 : "labels"
  4 : "images"
  5 : "object_ids"
  6 : "object_fields"
}

info

loader:info(set_name)

Prints information about the data fields of a set.

Displays information about all available fields (name, size, shape) for every set. If a set_name is provided, it displays only the information for that specific set.

Parameters

  • set_name: Name of the set. (type=string)

Usage examples

Display all field information for the MNIST dataset.

>>> mnist = dbc.load('mnist')
>>> mnist:info()

> Set: test
   - classes,         shape = {10, 2},           dtype = torch.ByteTensor
   - images,          shape = {10000, 28, 28},   dtype = torch.ByteTensor,   (in 'object_ids', position = 1)
   - labels,          shape = {10000},           dtype = torch.ByteTensor,   (in 'object_ids', position = 2)
   - object_fields,   shape = {2, 7},            dtype = torch.ByteTensor
   - object_ids,      shape = {10000, 2},        dtype = torch.IntTensor

   (Pre-ordered lists)
   - list_images_per_class,   shape = {10, 1135},   dtype = torch.IntTensor

> Set: train
   - classes,         shape = {10, 2},           dtype = torch.ByteTensor
   - images,          shape = {60000, 28, 28},   dtype = torch.ByteTensor,   (in 'object_ids', position = 1)
   - labels,          shape = {60000},           dtype = torch.ByteTensor,   (in 'object_ids', position = 2)
   - object_fields,   shape = {2, 7},            dtype = torch.ByteTensor
   - object_ids,      shape = {60000, 2},        dtype = torch.IntTensor

   (Pre-ordered lists)
   - list_images_per_class,   shape = {10, 6742},   dtype = torch.IntTensor

object_field_id

position = loader:object_field_id(set_name, field_name)

Retrieves the index position of a field in the object_ids list. This position points to the field name stored in the object_fields attribute.

Parameters

  • set_name: Name of the set. (type=string)
  • field_name: Name of the data field. (type=string)

Usage examples

This example shows how to use this method in order to retrieve the correct fields from an object index list.

>>> mnist = dbc.load('mnist')
>>> print('object field idx (images): ', mnist:object_field_id('train', 'images'))
object field idx (images): 	1

>>> print('object field idx (labels): ', mnist:object_field_id('train', 'labels'))
object field idx (labels): 	2

>>> mnist.object_fields['train']
{
  1 : "images"
  2 : "labels"
}
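Combining object(), object_field_id() and get(), an object's indexes can be resolved into the actual data. A minimal sketch using the values above:

>>> idxs = mnist:object('train', 1)                       -- 1x2 IntTensor of field indexes
>>> img_pos = mnist:object_field_id('train', 'images')    -- position of 'images' (1)
>>> img = mnist:get('train', 'images', idxs[1][img_pos])  -- fetch the image data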

Utils

This module contains some useful utility functions available to the user.

It is composed of the following modules:

dbcollection.utils.string_ascii

This module contains methods to convert strings to ASCII and vice-versa. These are useful when recovering strings from the metadata file that are encoded as torch.ByteTensors (due to a limitation of the HDF5 implementation in Torch7).

Although one typically only needs to convert torch.ByteTensors to strings, this module provides methods for both ascii-to-string and string-to-ascii conversion.

convert_str_to_ascii

tensor = dbc.utils.string_ascii.convert_str_to_ascii(input)

Converts a string or a table of strings to a torch.CharTensor.

Parameters

  • input: String or a table of strings. (type=string or table)

Usage examples

Single string.

>>> str = 'string1'
>>> ascii_tensor = dbc.utils.string_ascii.convert_str_to_ascii(str)
>>> print(ascii_tensor)
 115  116  114  105  110  103   49    0
[torch.CharTensor of size 1x8]

Table of strings.

>>> str = {'string1', 'string2', 'string3'}
>>> ascii_tensor = dbc.utils.string_ascii.convert_str_to_ascii(str)
>>> print(ascii_tensor)
 115  116  114  105  110  103   49    0
 115  116  114  105  110  103   50    0
 115  116  114  105  110  103   51    0
[torch.CharTensor of size 3x8]

convert_ascii_to_str

str = dbc.utils.string_ascii.convert_ascii_to_str(input)

Converts a torch.CharTensor or torch.ByteTensor to a table of strings.

Parameters

  • input: 2D torch tensor. (type=torch.ByteTensor or torch.CharTensor)

Usage examples

Convert a torch.CharTensor to string.

-- ascii format of 'string1'
>>> tensor = torch.CharTensor({{115, 116, 114, 105, 110, 103, 49, 0}})
>>> str = dbc.utils.string_ascii.convert_ascii_to_str(tensor)
>>> print(str)
{
  1 : "string1"
}
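The main use case is decoding string fields stored in the metadata file. A hedged sketch, assuming MNIST's classes field holds ASCII-encoded class names (see the info() output above):

>>> mnist = dbc.load('mnist')
>>> classes = mnist:get('train', 'classes')  -- torch.ByteTensor of size 10x2
>>> class_names = dbc.utils.string_ascii.convert_ascii_to_str(classes)
>>> print(class_names[1])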