Version 2016.08.04.0
MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.
We're happy to announce the immediate availability of MLDB [version 2016.08.04.0](https://github.com/mldbai/mldb/releases/tag/v2016.08.04.0).
This release contains 161 new commits, modifies 290 files and fixes 82 issues. On top of many bug fixes and performance improvements, here are some of the highlights of this release:
New `DISTINCT ON` clause
The `DISTINCT ON` clause can be used to filter out duplicate rows based on the value of an expression. The syntax is as follows:
SELECT DISTINCT ON (algorithm, project) algorithm, project, date
FROM ml_experiments
ORDER BY algorithm, project
This will return one row for each unique combination of values in the columns `algorithm` and `project`.
See the Select Expression documentation for more details.
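To make the filtering rule concrete, here is a minimal pure-Python sketch of the idea behind `DISTINCT ON`. It is not MLDB code. The sample rows are invented for illustration, and keeping the first row in sort order for each key is an assumption, since these notes do not say which duplicate is retained.

```python
def distinct_on(rows, keys, order_by):
    """Keep one row per distinct combination of `keys`.

    Assumption: the first row in `order_by` order wins for each key.
    """
    seen = set()
    result = []
    # sorted() is stable, so rows with equal sort keys keep their input order.
    for row in sorted(rows, key=lambda r: tuple(r[c] for c in order_by)):
        key = tuple(row[c] for c in keys)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Invented sample data standing in for the ml_experiments dataset.
ml_experiments = [
    {"algorithm": "svm", "project": "churn", "date": "2016-07-01"},
    {"algorithm": "svm", "project": "churn", "date": "2016-07-15"},
    {"algorithm": "bagged_trees", "project": "churn", "date": "2016-07-03"},
    {"algorithm": "svm", "project": "fraud", "date": "2016-07-20"},
]

unique_rows = distinct_on(
    ml_experiments,
    keys=["algorithm", "project"],
    order_by=["algorithm", "project"],
)
for row in unique_rows:
    print(row)
```

The two `svm`/`churn` rows collapse into one, so three rows remain.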
New `try` builtin function
When an error occurs while processing a query, the whole query fails and no result is returned, even if only a single line caused the error. The new `try` function is meant to handle this type of situation. The first argument is the expression to try to apply. The optional second argument is what will be returned if an error is encountered.
In the example below, since the string foo will not parse as valid JSON, the row expression `{'error': 1}` will be returned instead:
SELECT try(parse_json('foo'), {'error': 1}) AS *
Check out the `try` function documentation for more details.
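For readers more at home in Python, the behaviour described above is roughly that of a guarded evaluation with a fallback value. The sketch below is an analogy only, not MLDB's implementation. The helper name `try_` is hypothetical, and the fallback is a required argument here because these notes do not say what is returned when the second argument is omitted.

```python
import json


def try_(expression, fallback):
    """Evaluate `expression` (a zero-argument callable).

    Returns its value, or `fallback` if evaluating it raises an error.
    """
    try:
        return expression()
    except Exception:
        return fallback


# 'foo' is not valid JSON, so the fallback is returned.
failed = try_(lambda: json.loads("foo"), {"error": 1})

# Valid JSON parses normally and the fallback is ignored.
parsed = try_(lambda: json.loads('{"a": 2}'), {"error": 1})

print(failed)
print(parsed)
```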
Deep learning
Added support for NVIDIA cuDNN, improving the performance of MLDB's TensorFlow integration on GPUs. This is another step in making MLDB the easiest platform for running TensorFlow graphs.
Updated pymldb to version 0.7.0
The pymldb library is an open-source, pure-Python wrapper that makes it easy to work with MLDB from Python. Version 0.7.0 adds support for passing a JSON payload in GET requests. This is necessary when passing large feature vectors to MLDB functions.
Check out the Using pymldb Tutorial notebook for more info.
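At the HTTP level, "a JSON payload in a GET request" means the data travels in the request body rather than in the query string, which avoids very long URLs for large feature vectors. The sketch below uses only the Python standard library and does not show pymldb's own API. The host, function name and payload shape are hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

# Hypothetical host, function name and input; adjust to your MLDB instance.
url = "http://localhost:8080/v1/functions/my_scorer/application"
payload = {"input": {"features": {"f%d" % i: float(i) for i in range(1000)}}}

body = json.dumps(payload).encode("utf-8")
request = urllib.request.Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="GET",  # without this, urllib defaults to POST when data is set
)

# Sending is left out on purpose, since
# urllib.request.urlopen(request) would need a live server.
print(request.get_method(), len(request.data), "bytes in the body")
```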
Internal hashing is now done using HighwayHash
MLDB's hash functions now use the Highway Tree Hash, which is claimed to be both likely secure and very fast. This will improve the speed of working with large numbers of columns.
Other changes and fixes
- New aggregators: `vertical_stddev` (alias of `stddev`) and `vertical_variance` (alias of `variance`).
- The classifier.experiment procedure now returns the ID of the scorer function it creates for each fold. This makes it easier to reuse the functions in later steps of a script.
- The `runOnCreation` argument, present for all procedures, now defaults to `True`, which was the value used by the vast majority of users.
- The `export.csv` procedure has a new `skipDuplicateCells` option which, when set to `True`, will skip rows that contain cells with multiple values. This is necessary because the CSV format cannot represent multiple values per cell the way MLDB datasets can by using the time dimension. More information is available in the [export.csv procedure's documentation](https://docs.mldb.ai/doc/#builtin/procedures/CsvExportProcedure.md.html).
- Fixed a loss of precision for floats when using MLDB's Python layer.
- The arguments of the tokenize function, the import.text procedure and the tokensplit function are now all camel-case.
- The tabular dataset is more efficient in storing numbers and timestamps, leading to a reduction in memory usage.
- The speed at which the sparse.mutable dataset can record rows has been improved.
- Fixed a memory leak in the `levenshtein_distance()` built-in function.
- Fixed NULL propagation for math operators. Example: `5+NULL = NULL`.
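The NULL propagation rule in the last item can be illustrated with a small Python sketch in which `None` plays the role of NULL. This is an analogy for the SQL semantics, not MLDB code, and the helper name `null_propagating` is hypothetical.

```python
import operator


def null_propagating(op):
    """Wrap a binary operator so that a None (NULL) operand yields None."""
    def wrapped(a, b):
        if a is None or b is None:
            return None
        return op(a, b)
    return wrapped


add = null_propagating(operator.add)
mul = null_propagating(operator.mul)

print(add(5, None))  # NULL in, NULL out
print(add(5, 2))     # ordinary arithmetic is unaffected
```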