Releases: mldbai/mldb

Version 2017.04.17.0

25 May 19:52

MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.

We're happy to announce the immediate availability of MLDB version 2017.04.17.0.

Here are some of the highlights of this release:

New features

  • In addition to shared libraries, MLDB now auto-loads JavaScript and Python plugins.
  • Added functions:
  • MLDB now supports Azure Blob Storage through the azureblob:// URI protocol.
  • Added Fast Text support to MLDB's classifier algorithms. Note that our version of the Fast Text classifier only supports feature counts, and currently does not support regression. Feature counts refer to a bag-of-words representation like the one returned by the tokenize() function.
  • Added the POST /redirect/get route, which gives clients that are unable to attach a body to GET requests a way to get the same functionality through a POST call.
  • Added a new ignoreExtraColumns parameter to the import.text procedure to ignore extra columns that are not in the header instead of failing.
  • classifier.train, classifier.experiment and classifier.test now support multi-label classification mode.
  • Function reshape now handles row expressions.
  • Function import.word2vec now supports the named parameter.
  • Function fetcher now supports a concurrency limit, which is particularly useful to avoid overwhelming a server with requests.
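
The "feature counts" input mentioned for the Fast Text classifier is a mapping from token to number of occurrences. The following is a minimal Python sketch of that representation, written outside MLDB. The `bag_of_words` helper and its simple separator-based splitting are illustrative assumptions, not the implementation or options of MLDB's tokenize() function.

```python
from collections import Counter

def bag_of_words(text, separator=" "):
    """Count how many times each token occurs in `text`.

    Hypothetical helper: it only mimics the shape of a token -> count
    mapping, not the behaviour of MLDB's tokenize() function.
    """
    tokens = [t for t in text.lower().split(separator) if t]
    return dict(Counter(tokens))

counts = bag_of_words("the cat saw the other cat")
print(counts)
```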

Changes

  • Procedure export.csv automatically flattens structures. For example, if you have a dataset “ds” with column “x” containing a structure ‘{“a” : “foo”}’, a simple export with the query “SELECT x AS x FROM ds” now works and would output the column name as “x.a”. (MLDB-2126)
  • When you create unnamed datasets, functions or procedures, the auto-generated IDs now use underscores instead of dashes.
  • Newly trained models with an always null feature will work if they are tested against the same feature having a string value.
  • Enhanced progress report for the transform procedure.
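
To illustrate the flattening described for export.csv, here is a minimal Python sketch that turns a nested structure into dotted column names. The `flatten` helper is hypothetical and only mirrors the naming in the example above ("x" containing {"a": "foo"} becomes "x.a"). It is not MLDB's implementation.

```python
def flatten(row, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted column names."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the structure, extending the dotted name.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

flat_row = flatten({"x": {"a": "foo"}})
print(flat_row)
```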

Fixes

  • Fixed a deadlock occurring when MLDB's REST API was exposed to a high number of queries per second.
  • Function fetcher no longer follows HTTP redirects forever, fixing related hanging queries.
  • Function fetcher properly returns the underlying HTTP errors rather than a generic message.

Version 2017.01.24.0

26 Jan 17:04

We're happy to announce the immediate availability of MLDB version 2017.01.24.0.

This release contains 123 new commits and modified 1653 files. On top of many bug fixes and performance improvements, here are some of the highlights of this release:

  • Big optimizations in query execution time by evaluating all const expressions at binding time instead of doing it for each row.
  • Added the `blob_length(x)` function that returns the length (in bytes) of the blob `x`.
  • Added the `parse_exif(blob)` function that takes a JPEG image blob and returns basic EXIF information from it.
  • Added the `split_part(str, splitChars)` function that splits the string `str` and returns an embedding of all tokens as separated by the provided `splitChars` parameter.
  • The `fetcher()` function now works with UTF-8 paths.
  • Fixed wrong error code returned by the `fetcher()` function when it should return a 404.
  • Fixed an issue with the `fetcher()` function that could make MLDB hang for a long time.
  • The number of lines returned by the `/logs/mldb` endpoint has been increased from 1024 to 8192 lines.
  • Improved the error message returned by the `columnPathElement()` function when using an out of bounds index.
  • It is now possible to do a transpose of a `row_dataset`. It is also now possible to merge two `row_dataset` together.
  • Fixed an issue where the `WHERE` clause would not be properly applied when used with a dataset of type `UNION`.
  • When running MLDB in a Docker container, if the `mldb_runner` process exits with a non-zero exit code, the `docker run` command will also exit with a non-zero exit code.
  • When [executing a query](https://docs.mldb.ai/doc/#builtin/sql/QueryAPI.md.html), the `atom` return format was added. It returns a single atomic value, without the row name or the column name. The query will fail if anything other than a single row / column is returned. This is available when using the `/v1/query` endpoint or `pymldb`.
  • Logging improvements
  • Improved the handling of CUDA launch failures when using Tensorflow by modifying the default behaviour from `assert()` to throw so that they are recoverable.
  • Lowered the curl connect(2) timeout from 300s to 20s. This allows for ~3 SYN retransmits on a default Linux config. The idea is to avoid being stuck in connect(2) for too long while still having a chance of success when going through flaky networks.
  • Updated svdlibc to version 1.4 which includes bug fixes.
  • Fixed an issue with Python plugins where POSTing to a route that returns no data and 200 code would be returned as a 404 by MLDB.
  • Fixed an issue when running a `transform` procedure with no input.
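
The contract of the `atom` return format described above (exactly one row and one column, otherwise the query fails) can be sketched in plain Python. The `to_atom` helper and the list-of-dicts result shape are assumptions made for illustration. They are not part of MLDB or `pymldb`.

```python
def to_atom(rows):
    """Return the single value of a one-row, one-column result.

    `rows` is a hypothetical in-memory result: a list with one dict per
    row, each mapping column name -> value.
    """
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError("atom format requires exactly one row and one column")
    (value,) = rows[0].values()
    return value

answer = to_atom([{"count": 42}])
print(answer)
```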

Version 2016.12.16.0

16 Dec 20:55

We're happy to announce the immediate availability of MLDB version 2016.12.16.0.

This release contains 228 new commits and modified 823 files. On top of many bug fixes and performance improvements, here are some of the highlights of this release:

  • It's now possible to make multiple predictions per REST call by using the /v1/functions/<function>/batch REST route.
  • New Identifying Biased Features Tutorial
  • New signal processing functions. This includes the fft(data [,direction='forward' [,type='real']]) function that performs a fast Fourier transform on the given data.
  • Added the devices configuration argument to the tensorflow.graph function to specify on which device the graph is allowed to run.
  • MLDB now contains CUDA kernels for shader model 5.2 (Kepler), 5.3 (Maxwell), 6.0 (P100) and 6.1 (Titan X)
  • Improved support for aarch64 and ARM architectures. CUDA is now supported on the Jetson TX1.
  • It is now possible to track the progress of long-running procedures, as well as interrupt them. Check the Intro to Procedures page for more details.
  • New numerical functions:
    • sin(x), cos(x) and tan(x) are the normal trigonometric functions
    • asin(x), acos(x) and atan(x) are the normal inverse trigonometric functions
    • atan2(x, y) returns the two-argument arctangent of x and y, in other words the angle (in radians) of the point through x and y from the origin with respect to the positive x axis
    • sinh(x), cosh(x) and tanh(x) are the normal hyperbolic functions
    • asinh(x), acosh(x) and atanh(x) are the normal inverse hyperbolic functions
    • pi() returns the value of pi, the ratio of a circle's circumference to its diameter, as a double precision floating point number.
    • e() returns the value of e, the base of natural logarithms, as a double precision floating point number.
  • New concat(x, ...) function that takes several embeddings with identical sizes in all but their last dimension and joins them together on the last dimension.
  • The import.json procedure now supports the arrays configuration argument to specify how arrays should be encoded in the JSON output.
  • The import.text procedure now returns a rowCount field representing the number of rows that were imported, just as the import.json procedure does.
  • Fixes to the import.text procedure:
    • Fixed an issue where trailing whitespace in a CSV file with numbers in the last column made MLDB treat that column as strings (because it kept the trailing whitespace)
    • Fixed a crash with a "cannot seek" exception when attempting to open a file with autoGenerateHeaders
  • The reshape() function now has a 3 argument form. reshape(val, shape, newel) is similar to the two argument version of reshape, but allows for the number of elements to be different. If the number of elements increases, new elements will be filled in with the newel parameter.
  • Updated the uap-core library to the latest version improving user agent patterns used by the http.useragent function.
  • Fixed wide rows causing data corruption in tabular datasets
  • Many fixes to the JOIN operators
  • The merge() dataset function now accepts a single dataset
  • Random forest speedups and improvements
  • The svd.train procedure now supports all select expressions to specify its input data, instead of the restricted form of select statements.
  • Fixed regexes being recompiled for every row when using a LIKE operator
  • The user function infrastructure has been modified to be more like built-in functions. In particular, inputs and outputs no longer need to be rows.
  • The row_dataset has been modified to return one row per column, and an atom_dataset construct has been added with semantics similar to the original. The types of these datasets have been improved: the value type is now inferred, and the column type is now path, not string.
  • The sql.expression object has been improved to allow raw and autoInput parameters to be passed, bypassing the requirement for a row on output and input respectively.
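
As an illustration of the concat(x, ...) semantics listed above, here is a minimal Python sketch that joins nested-list embeddings along their last dimension. Representing embeddings as non-empty nested lists, and raising an error on mismatched leading dimensions, are assumptions of this sketch rather than documented MLDB behaviour.

```python
def concat(*embeddings):
    """Join nested-list embeddings along their last dimension."""
    first = embeddings[0]
    if not isinstance(first[0], list):
        # Innermost level: append the 1-dimensional vectors end to end.
        return [x for emb in embeddings for x in emb]
    # Every dimension except the last must match across inputs.
    if any(len(emb) != len(first) for emb in embeddings):
        raise ValueError("embeddings differ in a dimension other than the last")
    return [concat(*parts) for parts in zip(*embeddings)]

joined = concat([[1, 2], [3, 4]], [[5], [6]])
print(joined)
```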

Version 2016.10.05.0

05 Oct 19:59

We're happy to announce the immediate availability of MLDB version 2016.10.05.0.

This release contains 141 new commits and modified 903 files. On top of many bug fixes and performance improvements, here are some of the highlights of this release:

New MongoDB interface

A big new feature is support for importing and exporting data to and from MongoDB, a popular NoSQL database. Although MongoDB can be very useful for certain use cases, it doesn't have any machine learning capabilities. We want to make it as easy as possible for our users to get their data in MLDB. So we have added the following new MLDB entities that make it easy to interface with MongoDB:

Updated TensorFlow to 0.10.0

We updated the TensorFlow version shipped with MLDB to version 0.10.0. The new version includes many bug fixes and performance improvements. We're now also shipping MLDB with different TensorFlow kernels, each optimized for different instruction sets. So for instance, the kernel with AVX2 instructions will be used if the processor on which MLDB is run supports it.

If you're interested in deep learning, make sure to check out the Tensorflow Image Recognition Tutorial and the Transfer Learning with Tensorflow demo to see how easy it is to run trained models with MLDB.

Updated V8 to Release 5.0

We have updated V8, the JavaScript engine used in MLDB, to Release 5.0. This brings in a lot of improvements and new features, like improved ECMAScript 2015 (ES6) support, as well as better performance. It now also compiles for the ARM architecture, which is an important step as we're working towards having MLDB run on embedded architectures.

One feature that benefits from this is the jseval function, which makes it possible to execute arbitrary JavaScript code inline in an SQL query. Check out the Executing JavaScript Code Directly in SQL Queries Using the jseval Function Tutorial for great examples of how jseval can be used.

Fixes and improvements to import procedures

The SELECT statement of the import.text procedure has been improved to support the CASE keyword. This adds extra flexibility to process data as it is being imported.

We also fixed a bug when using the NAMED clause with the import.json procedure that could cause undesired behaviour.

Updates to the classifier configuration

We have improved the user experience around configuring supervised algorithms in two ways.

First, we have clarified the documentation by creating a new Classifier configuration section that contains the information related to the configuration of supervised models. When using one of the two procedures that can be used to train models, classifier.train and classifier.experiment, all the information you need to configure your algorithm now lives in one place.

Second, we have made the training more robust to configuration errors by having better validation of elements meant to control hyper-parameters. Incorrect parameters will now trigger errors.

New vector space functions

We added two new vector space functions:

First, the new reshape(val, shape) function takes an n-dimensional embedding and reinterprets it as an N-dimensional embedding of the provided shape containing all of the elements. This allows, for example, a 1-dimensional vector to be reinterpreted as a 2-dimensional array. The shape argument is an embedding containing the size of each dimension.

Second, the new shape(val) takes an n-dimensional embedding and returns the size of each dimension as an array.
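
The relationship between these two functions can be sketched in plain Python with nested lists standing in for embeddings. Row-major element order, a non-empty shape with positive sizes, and the helper names below are assumptions of this sketch, not MLDB's implementation.

```python
def shape(val):
    """Return the size of each dimension of a nested-list embedding."""
    dims = []
    while isinstance(val, list):
        dims.append(len(val))
        val = val[0] if val else None
    return dims

def flatten(val):
    """Return all elements of a nested list in row-major order."""
    if not isinstance(val, list):
        return [val]
    return [x for item in val for x in flatten(item)]

def reshape(val, new_shape):
    """Reinterpret `val` with dimensions `new_shape`, keeping all elements."""
    flat = flatten(val)
    expected = 1
    for size in new_shape:
        expected *= size
    if expected != len(flat):
        raise ValueError("new shape must contain the same number of elements")

    def build(items, dims):
        if len(dims) == 1:
            return items
        step = len(items) // dims[0]
        return [build(items[i * step:(i + 1) * step], dims[1:])
                for i in range(dims[0])]

    return build(flat, list(new_shape))

matrix = reshape([1, 2, 3, 4, 5, 6], [2, 3])
print(matrix, shape(matrix))
```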

Other changes and fixes

  • The COLUMN EXPR expression now supports the STRUCTURED keyword. By default, COLUMN EXPR returns a flattened representation. Adding the STRUCTURED keyword will return the structured representation.
  • The tsne.train procedure now has a learningRate configuration option.
  • Improved speed and fixes to JOIN operations.
  • Columns and rows can now be named with an empty string.
  • When evaluating a model using the classifier.test or the classifier.experiment procedure, the F1-score was returned in a key named f. The key has been renamed to f1Score.
  • The HTTP layer now correctly handles the HTTP 1.1 100 CONTINUE request header.
  • Fixed the ordering of paths when mixing Unicode and digits.
  • The user function fetcher is now available as a built-in function fetch.
  • Fixed a bug with the levenshtein_distance() function where it did not work properly with UTF-8 characters.
  • Fixed an issue with the Javascript plugin's serveStaticFolder() function where the path to serve would not be considered relative to the plugin's installation directory.
  • Added the optional argument sortField to the string_agg(expr, separator [, sortField]) function, which allows sorting the returned values by sortField.

Version 2016.08.31.0

01 Sep 16:05

We're happy to announce the immediate availability of MLDB version 2016.08.31.0.

This release contains 114 new commits and modified 366 files. On top of many bug fixes and performance improvements, here are some of the highlights of this release:

MLPaint: the Real-Time Handwritten Digit Recognizer plugin

We're very excited to present MLPaint, the Real-Time Handwritten Digit Recognizer, a web app that runs on MLDB. It was made by Jonathan, the awesome intern we had with us this summer. Check out the video demo here: https://www.youtube.com/embed/WGdLCXDiDSo

The two demos below go into the technical details of how this plugin was built. The plugin is hosted on Github if you want to check out the implementation.

New demos

Classifier testing procedure now fully supports weights

Weighting examples correctly is a crucial part of training machine learning models that will generalize well. It can be used to compensate for sampling bias, class imbalance, etc. This is well supported for training in two ways:

  • specifying the weight for each example by using the weight column in the trainingData query
  • using the equalizationFactor parameter that specifies the amount by which to adjust weights so that all classes have an equal total weight

Weights can also be useful for testing. For instance, the cost of making mistakes for certain examples can be much less than for others. Having the metrics take that into consideration will help deliver a clearer picture of the performance expectations you can have for the model.

All the metrics reported by the classifier.test procedure now fully take the weight of each example into account. You can specify the weight of each example by using the weight column in the testingData query.
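
To show what taking weights into account means for a metric, here is a minimal Python sketch of weighted precision and recall for a boolean classifier. The sample labels, predictions and weights are made up for illustration, and the function is a generic textbook formulation, not the code used by classifier.test.

```python
def weighted_precision_recall(labels, predictions, weights):
    """Compute precision and recall where each example counts as its weight."""
    tp = fp = fn = 0.0
    for label, pred, w in zip(labels, predictions, weights):
        if pred and label:
            tp += w          # weighted true positive
        elif pred and not label:
            fp += w          # weighted false positive
        elif label and not pred:
            fn += w          # weighted false negative
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = [True, True, False, False]
predictions = [True, False, True, False]
weights = [3.0, 1.0, 1.0, 1.0]  # the first example matters three times as much

precision, recall = weighted_precision_recall(labels, predictions, weights)
unweighted = weighted_precision_recall(labels, predictions, [1.0] * 4)
print(precision, recall, unweighted)
```

With equal weights both metrics are 0.5. Giving the correctly classified positive example a weight of 3 raises both to 0.75.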

Credentials

MLDB makes it very easy to access secured resources using a variety of protocols like http, sftp or even s3. MLDB can store credentials and supply them transparently whenever required when accessing protected files.

We fixed an issue that occurred when a credentials file was loaded from a remote resource while launching MLDB from the command line with the add-credentials-from-url flag. This flag is mostly used in production scenarios. Error messages related to the handling of credentials files were also improved so they're clearer.

Updated pymldb to version 0.7.1

The pymldb library is an open-source pure-Python module which provides a wrapper library that makes it easy to work with MLDB from Python. Version 0.7.1 is a minor update that changes the way the query function sends requests to MLDB. Instead of passing the query using the query string, it now sends it in the JSON payload. This makes it possible to send big feature vectors without hitting the query-string size limit.

Check out the Using pymldb Tutorial notebook for more info.

Improvements for C++ plugin developers

MLDB allows its functionality to be extended with plugins. While we often showcase Python plugins, like MLPaint mentioned at the top of this post, it's also possible to write plugins in C++.

And so C++ plugin developers rejoice! It is now easier to take advantage of MLDB's powerful SQL engine from C++ by using the new eval_sql function. It makes running queries easier and faster.

You can now also specify built-in functions by using SQL from C++. This allows for much more compact code and less boilerplate.

Shout out: Golang interface

We'd like to shout out to ZzEeKkAa who developed a very nice Golang interface for MLDB. Check it out if you're into Golang!

If you created a plugin or library that works with MLDB, make sure to reach out!

Exciting and upcoming

We have been hard at work on a new LiDAR MLDB plugin. This enables MLDB to process 3D point cloud data and do voxel rendering. It makes it possible to visualize raw and voxelized data from any point of view. Combined with our existing Tensorflow integration, it opens the door to solving cutting-edge deep learning image recognition problems with MLDB.

It is also now possible to build MLDB on 32- and 64-bit ARM architectures. This will enable us to target a wider range of hardware. Think of smartphones, Raspberry Pi, or even Nvidia's Jetson TX1. This is a stepping stone in having MLDB run on-device.

Other changes and fixes

  • It is now possible to compile MLDB with clang
  • Cleanup of the logarithm functions:
    • ln(dp or numeric) : natural logarithm
    • log(dp or numeric) : base 10 logarithm
    • log(b numeric, x numeric) : logarithm to base b
    • Calling the following functions is now valid: sqrt(-1)=nan, log(0)=-inf, log(-1)=nan.
  • Fixed default timestamps coherence
  • Improved speed and fixes to JOIN operations
  • Fixed a bug that prevented credentials from being deleted
  • In Python script error messages, proper file paths are now returned
  • The dataset-specific query route (/v1/datasets/<dataset_name>/query) has been removed. Use /v1/query instead.
  • We've unified the way GET endpoints accept parameters. Some routes will take parameters from either the query-string or as a JSON payload. We now enforce that all parameters should be sent one way or the other, not a mix of both.
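
The logarithm cleanup listed above can be mirrored in a short Python sketch. Python's own math functions raise exceptions for zero or negative input, so the wrappers below return nan or -inf instead to match the values quoted in the list. The wrapper names simply reuse the SQL function names, and this is an illustration only, not MLDB code.

```python
import math

def sqrt(x):
    """Square root that yields nan for negative input instead of raising."""
    return math.sqrt(x) if x >= 0 else math.nan

def ln(x):
    """Natural logarithm: -inf at 0, nan below 0."""
    if x < 0:
        return math.nan
    if x == 0:
        return -math.inf
    return math.log(x)

def log(*args):
    """log(x) is base 10; log(b, x) is the logarithm of x to base b."""
    if len(args) == 1:
        (x,) = args
        if x < 0:
            return math.nan
        return -math.inf if x == 0 else math.log10(x)
    b, x = args
    # A base of 1 is not handled in this sketch (it would divide by zero).
    return ln(x) / ln(b)

print(sqrt(-1), log(0), log(-1), log(100), log(2, 8))
```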

Version 2016.08.04.0

04 Aug 19:35

We're happy to announce the immediate availability of MLDB [version 2016.08.04.0](https://github.com/mldbai/mldb/releases/tag/v2016.08.04.0).

This release contains 161 new commits, modified 290 files and fixes 82 issues. On top of many bug fixes and performance improvements, here are some of the highlights of this release:

New DISTINCT ON clause

The DISTINCT ON clause can be used to filter out duplicate rows based on the value of an expression. The syntax is as follows:

SELECT DISTINCT ON (algorithm, project) algorithm, project, date
FROM ml_experiments
ORDER BY algorithm, project

This will return one row per unique value of the columns algorithm and project.
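
The effect of DISTINCT ON can be sketched in plain Python: after ordering, keep the first row seen for each distinct key. The sample rows below are made up, and keeping the first row in sort order is an assumption of this sketch about which duplicate survives.

```python
def distinct_on(rows, keys):
    """Keep the first row seen for each distinct combination of `keys`."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

# Made-up rows standing in for the ml_experiments dataset.
experiments = [
    {"algorithm": "bdt", "project": "spam", "date": "2016-07-01"},
    {"algorithm": "bdt", "project": "spam", "date": "2016-07-15"},
    {"algorithm": "glz", "project": "spam", "date": "2016-07-02"},
]

# ORDER BY algorithm, project (Python's sort is stable, so ties keep input order).
ordered = sorted(experiments, key=lambda r: (r["algorithm"], r["project"]))
unique = distinct_on(ordered, ["algorithm", "project"])
print(unique)
```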

See the Select Expression documentation for more details.

New try builtin function

When an error occurs when processing a query, the whole query fails and no result is returned, even if only a single line caused the error. The new try function is meant to handle this type of situation. The first argument is the expression to try to apply. The optional second argument is what will be returned if an error is encountered.

In the example below, since the string foo will not parse as valid JSON, the row expression {'error': 1} will be returned instead:

SELECT try(parse_json('foo'), {'error': 1}) AS *

Check out the try function documentation for more details.
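
For readers more familiar with exception handling in general-purpose languages, the following Python sketch shows the same fallback pattern with the standard json module. The `try_parse_json` helper is hypothetical and only mirrors the shape of the SQL example above.

```python
import json

def try_parse_json(text, fallback=None):
    """Parse `text` as JSON, returning `fallback` if parsing fails."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return fallback

result = try_parse_json("foo", {"error": 1})
print(result)
```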

Deep learning

Added support for NVIDIA CUDNN, improving the performance of MLDB's Tensorflow integration on GPUs. This is another step in making MLDB the easiest platform to use to run Tensorflow graphs.

Updated pymldb to version 0.7.0

The pymldb library is an open-source pure-Python module which provides a wrapper library that makes it easy to work with MLDB from Python. Version 0.7.0 adds support for passing in a JSON payload in GET requests. This is necessary when passing in big feature vectors to MLDB functions.

Check out the Using pymldb Tutorial notebook for more info.

Internal hashing is now done using HighwayHash

MLDB's hash functions now use the Highway Tree Hash, which is claimed to be both likely secure and very fast. This will improve the speed of working with large numbers of columns.

Other changes and fixes

  • New aggregators: vertical_stddev (alias of stddev) and vertical_variance (alias of variance).
  • The classifier.experiment procedure now returns the ID of the scorer function it creates for each fold. This makes it easier to reuse the functions in later steps of a script.
  • The runOnCreation argument, present for all procedures, now defaults to True, which was the value used by the vast majority of users.
  • The export.csv procedure has a new skipDuplicateCells argument which, when set to True, will skip rows that contain cells with many values. This is necessary because the CSV format cannot represent many values per cell the way MLDB datasets can by using the time dimension. More information is available in the [export.csv procedure's documentation](https://docs.mldb.ai/doc/#builtin/procedures/CsvExportProcedure.md.html).
  • Fixed a loss of precision for floats when using MLDB's Python layer.
  • The arguments of the tokenize function, the import.text procedure and the tokensplit function are now all camel-case.
  • The tabular dataset is more efficient in storing numbers and timestamps, leading to a reduction in memory usage.
  • The speed at which the sparse.mutable dataset can record rows has been improved.
  • Fixed a memory leak in the levenshtein_distance() built-in function.
  • Fixed NULL propagation for math operators. Example: 5+NULL = NULL.

Version 2016.07.12.0

12 Jul 21:04

We're happy to announce the immediate availability of MLDB version 2016.07.12.0. Since the latest release, we've been working on many exciting projects. For example, we've started applying MLDB to LiDAR data and header bidding. We're also gearing up for more projects on image classification and deep learning, which means MLDB's support for Tensorflow will keep on improving over the coming weeks and months.

This release contains 135 new commits, modified 283 files and fixes 47 issues. On top of many bug fixes and performance improvements, here are some of the highlights of this release:

New tutorial

The Selecting Columns Programmatically Using Column Expressions Tutorial explains how column expressions can be used to programmatically choose which columns are returned by the SELECT statement. This is an example of the powerful additions MLDB made to standard SQL to make it possible to work efficiently with schema-free sparse datasets made up of millions of columns.

Improvements to import procedures

Support for the SELECT and NAMED arguments in the import.json procedure

Following the addition of the WHERE argument to the import.json procedure in MLDB's last release, the procedure now supports the SELECT and NAMED arguments.

The import.json procedure allows a user to import a dataset made up of JSON blobs. The SELECT argument allows a user to select which keys to import, while the NAMED argument allows a user to name each row by potentially using values from the JSON blob.

Given the following file that contains these two lines:

{"a": "b1", "c": {"d": 1}, "e": [0, 1]}
{"a": "b2", "c": {"d": 2}, "e": [0, 5]}

if we use the new arguments in the following way:

  • SELECT: c.d
  • NAMED: a

the resulting dataset will look like this:

_rowName  c.d
b1        1
b2        2
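
The same selection and naming can be sketched in plain Python to make the semantics concrete. The `select_path` helper and the dict-of-rows result are assumptions made for illustration. Inside MLDB this work is done by the import.json procedure itself.

```python
import json

def select_path(record, path):
    """Follow a dotted path such as "c.d" through nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value

lines = [
    '{"a": "b1", "c": {"d": 1}, "e": [0, 1]}',
    '{"a": "b2", "c": {"d": 2}, "e": [0, 5]}',
]

dataset = {}
for line in lines:
    record = json.loads(line)
    row_name = select_path(record, "a")                       # NAMED: a
    dataset[row_name] = {"c.d": select_path(record, "c.d")}   # SELECT: c.d

print(dataset)
```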

Added rowHash() function to the import.text procedure

In the import.text procedure, certain functions are available in the SELECT, NAMED, WHERE and TIMESTAMP expressions.

This release adds the rowHash() function, which is mainly useful in the WHERE argument to do random sampling. For example, when importing a huge file in MLDB, if we know beforehand that we only want to load a random 10% sample of the data, we can now simply use WHERE: rowHash() % 10 = 0. This will only keep the required sample as we're streaming through the data, saving both time and memory.
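
The sampling idea can be sketched in Python: hash each row deterministically and keep only rows whose hash falls in one bucket out of ten. Using zlib.crc32 as the hash and the generated sample lines are assumptions of this sketch. The actual hash behind rowHash() is not specified here.

```python
import zlib

def row_hash(line):
    """Stand-in for rowHash(): a deterministic hash of the raw line."""
    return zlib.crc32(line.encode("utf-8"))

def sample_lines(lines, modulus=10):
    """Yield only the lines whose hash falls in bucket 0 of `modulus`."""
    for line in lines:
        if row_hash(line) % modulus == 0:
            yield line

lines = [f"row-{i},value-{i}" for i in range(10000)]
sample = list(sample_lines(lines))
print(len(sample) / len(lines))  # fraction of rows kept
```

Because the decision depends only on the row's content, the same rows are selected on every run, and rows can be dropped while streaming without loading the whole file.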

Improvements to the SFTP protocol handler

In MLDB's last release, we introduced the sftp:// protocol handler. In this release, we made it more robust with the support of non-standard SSH ports as well as improving how it handles an unexpected loss of connection.

Machine learning

New summary.statistics procedure

As a first step in the modelling process, a data scientist usually wants to get a feel for the data. Looking at summary statistics is a great way to do this since they provide a high-level summary of the data.

The new summary.statistics procedure calculates summary statistics for the different columns in a dataset, and works for both numerical and categorical data.

The procedure calculates the number of unique and of null values, and the most frequent items for both numerical and categorical data. In addition to those, the procedure calculates the mean, minimum and maximum values as well as the 1st quartile, median and 3rd quartile for columns containing only numerical data.

This is an example of the statistics for numerical columns on the dataset used in the Predicting Titanic Survival Demo:

New feature_hasher feature generator function

Feature hashing, also known as the hashing trick, is a way to turn a potentially very large feature vector into a much smaller fixed-length vector. It works by applying a hash function to each feature in the original vector and using the result as an index in the smaller vector.

The feature_hasher function offers this functionality and can operate in two modes: on columns only or on the union of columns and values. This gives it the flexibility to deal with both sparse and dense data respectively.
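
A minimal Python sketch of the hashing trick follows. The mode names, the use of zlib.crc32, the bucket count and the way values are accumulated are all assumptions made for illustration. They are not taken from the feature_hasher implementation.

```python
import zlib

def feature_hasher(row, num_buckets=8, mode="columns"):
    """Hash a {column: value} row into a fixed-length list of floats.

    mode="columns": hash the column name and add its numeric value.
    mode="columnsAndValues": hash "column=value" and add 1.
    Both mode names and the accumulation rule are assumptions of this sketch.
    """
    vector = [0.0] * num_buckets
    for column, value in row.items():
        if mode == "columns":
            key, amount = column, float(value)
        else:
            key, amount = f"{column}={value}", 1.0
        # The hash of the feature picks the index in the smaller vector.
        index = zlib.crc32(key.encode("utf-8")) % num_buckets
        vector[index] += amount
    return vector

sparse_vec = feature_hasher({"word_cat": 2, "word_dog": 1})
dense_vec = feature_hasher({"color": "red", "size": "L"}, mode="columnsAndValues")
print(sparse_vec, dense_vec)
```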

Updated Tensorflow to version 0.9.0

We updated the Tensorflow version shipped with MLDB to version 0.9.0. The new version includes many bug fixes and performance improvements.

This new Tensorflow release also includes contributions from a member of the MLDB team!

New functions

  • hash(expr): this function returns the hash of the value in expr.
  • extract_domain(str, {removeSubdomain: false}): this function extracts the domain name from a URL. It can be very useful when dealing with web data.

Version 2016.06.28.1

28 Jun 21:13

We're happy to announce the immediate availability of MLDB version 2016.06.28.1. We've been hard at work using MLDB for several customer-facing projects and building internal features on it. We've also added a team member, Jonathan, who will be spending the summer building tutorials on MLDB. Welcome!

This release contains 112 new commits, modified 114 files and fixes 41 issues. On top of many bug fixes and performance improvements, here are some of the highlights of this release:

New demo

  • The Investigating the Panama Papers demo shows off MLDB's SQL engine by exploring the raw data from the Offshore Leaks Database (or "Panama Papers" as they were called in the media). MLDB is a great tool to understand the basic structure of the dataset and to start to identify the predictive power of some of the attributes.

New tutorials

  • The [Executing JavaScript Code Directly in SQL Queries Using the jseval Function Tutorial](https://docs.mldb.ai/ipy/notebooks/_tutorials/_latest/Executing%20JavaScript%20Code%20Directly%20in%20SQL%20Queries%20Using%20the%20jseval%20Function%20Tutorial.html) showcases a unique feature of MLDB: the ability to embed JavaScript directly inside SQL queries in an extremely performant, multithreaded manner.
  • The Virtual Manipulation of Datasets Tutorial shows how to use the datasets of type sampled and merged. These are useful for splitting a dataset into testing and training sets and recombining them.
  • The Loading Data From An HTTP Server Tutorial shows how to load data from a public web server. Since MLDB is batteries included and much of machine learning is done over public datasets from the web, it's important to highlight how easy MLDB makes it to get started with one.

Improvements to import procedures

New autoGenerateHeaders option in import.text procedure

MLDB is a great way to deal with datasets that have lots of columns. The preferred way to import data is by using the import.text procedure. The procedure has lots of options to provide as much flexibility as possible when importing raw data.

However, if the imported file's first line did not contain the header (names of all the columns), it had to be provided as a list to the procedure. This added an extra, unnecessary step and slowed down the workflow. The new autoGenerateHeaders option solves this problem by automatically generating column names 0, 1, 2, etc.
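
The naming scheme can be illustrated with Python's csv module: when a file has no header line, each column is simply named after its position. The `read_without_header` helper is hypothetical and only demonstrates the 0, 1, 2 naming described above.

```python
import csv
import io

def read_without_header(text):
    """Parse CSV text with no header line, naming columns "0", "1", "2", ..."""
    rows = []
    for values in csv.reader(io.StringIO(text)):
        rows.append({str(i): v for i, v in enumerate(values)})
    return rows

rows = read_without_header("a,b,c\nd,e,f\n")
print(rows)
```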

Support for the WHERE argument in import.json procedure

When dealing with large real-life datasets, we don't always need all the rows of our raw data. Being able to filter unnecessary rows during the import stage can save both processing time and memory.

The addition of the WHERE argument to the import.json procedure allows a user to filter rows on the values inside the JSON blobs being imported, using an SQL expression. Here is a quick example:

Given the following file that contains these two lines:

{"a": "b1", "c": {"d": 1}}
{"a": "b2", "c": {"d": 2}}

we can now filter like this: WHERE: "c.d = 1". The resulting dataset will only contain the first line:

a   c.d
b1  1
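
The filtering step can be sketched in plain Python to show that rows are dropped while streaming through the input, before anything is stored. The predicate function below stands in for the SQL expression c.d = 1 and is an illustration only.

```python
import json

lines = [
    '{"a": "b1", "c": {"d": 1}}',
    '{"a": "b2", "c": {"d": 2}}',
]

def where(record):
    """Python equivalent of the filter c.d = 1 for this example."""
    return record["c"]["d"] == 1

# Parse lazily and keep only matching records, flattening c.d as in the table above.
rows = [
    {"a": record["a"], "c.d": record["c"]["d"]}
    for record in map(json.loads, lines)
    if where(record)
]
print(rows)
```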

Machine Learning Improvements

The classifier.experiment procedure (as well as the classifier.test procedure) now reports accuracy when evaluating boolean or categorical classifiers. This complements the metrics already reported: precision, recall, F1-score and Area Under the Curve.

A configuration preset for the Naive Bayes classifier has also been added to the classifier.train procedure.

Added JSON payload support to the /query endpoint and user functions

The /query endpoint and the application of functions via REST now support arguments passed in the JSON payload. Previously, they could only receive arguments in the query-string. Since the query-string has a limit in size, it could be problematic for very large queries, for example when querying MLDB with the data extracted from an image. Supporting JSON payloads solves this shortcoming.

New SFTP protocol handler

It's easy to load data from a variety of sources with MLDB because of the different [protocols it can handle](https://docs.mldb.ai/doc/#builtin/Url.md.html). This release adds the SSH File Transfer Protocol (SFTP) to the list of supported protocols.

New functions

http.useragent

For users dealing with web data, the http.useragent function provides very easy parsing of [user agent strings](https://en.wikipedia.org/wiki/User_agent). After having instantiated a ua_parser function, a user can make the following call:

SELECT ua_parser({ua: 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3'}) as *

which will return:

rowName  browser.family  browser.version  device.brand  device.model  isSpider  os.family  os.version
result   Mobile Safari   5.1.0            Apple         iPhone        0         iOS        5.1.1

sign

Adding to the varied list of numerical functions available, the sign(x) function has been added. It returns the sign of x (-1, 0, +1).
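
For reference, the same three-valued behaviour can be written in one line of Python. This is a generic sketch of a sign function, not MLDB's implementation, and it does not cover how MLDB treats NULL or NaN input.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

print(sign(-2.5), sign(0), sign(7))
```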

Performance Improvements

The previous release of MLDB included internal refactorings to allow structured paths for row and column names. These fixed many problems with modelling data with MLDB, but came at a significant runtime cost. This release of MLDB reworks and optimizes those structured path names, allowing for reduced memory usage and much faster manipulation. In most cases, MLDB should be as performant as previous releases, or more so.

Version 2016.06.08.0

08 Jun 17:41

Version 2016.06.02.0

02 Jun 19:49
  • Data model change: row and column names are now stringified versions of row and column paths
  • The dot/period character (i.e. .) is now a path-element indirection operator, and so may no longer appear unquoted in identifiers
  • New function-type: embedding.neighbors
  • New builtin functions:
    • geo_distance(), levenshtein_distance(), jaccard_index()
    • rowPath(), rowPathElement(), path_element(), stringify_path(), parse_path()
    • isnan(), isinf(), isfinite(), replace_nan(), replace_inf(), replace_null(), replace_not_finite(), clamp()
    • count_distinct() (aggregator)
  • New demo Notebook: Enron Spam Filtering
  • Credentials management: credentials daemon now part of MLDB process, and routes have been moved from /v1/creds/rules to /v1/credentials
  • Renamed inputs for classifier.experiment:
    • trainingData->inputData, testingData->testingDataOverride
    • training_where->trainingWhere, testing_where->testingWhere
    • orderBy->trainingOrderBy/testingOrderBy
  • Performance, stability, documentation and packaging improvements