Releases: mldbai/mldb
Version 2017.04.17.0
MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.
We're happy to announce the immediate availability of MLDB version 2017.04.17.0.
Here are some of the highlights of this release:
New features
- In addition to shared libraries, MLDB now auto-loads JavaScript and Python plugins.
- Added functions (see the sketch after this list):
  - `remove_prefix(string, prefix)` returns the string with the specified prefix removed, if present.
  - `remove_suffix(string, suffix)` returns the string with the specified suffix removed, if present.
  - `mime_type(x)` returns the MIME type of the blob `x`.
- MLDB now supports Azure Blob Storage through the `azureblob://` URI protocol.
- Added fastText support to MLDB's classifier algorithms. Note that our version of the fastText classifier only supports feature counts, and currently does not support regression. Feature counts refers to a bag-of-words representation like what is returned by the `tokenize()` function.
- Added the `POST /redirect/get` route to give APIs that are unable to attach a body to GET requests an alternate way to attain the same functionality through a POST call.
- Added a new parameter, `ignoreExtraColumns`, to the `import.text` procedure to ignore extra columns that weren't in the header instead of failing.
- `classifier.train`, `classifier.experiment` and `classifier.test` now support a multi-label classification mode.
- The `reshape` function now handles row expressions.
- `import.word2vec` now supports the `named` parameter.
- The `fetcher` function now supports a concurrency limit, which is particularly useful to avoid overwhelming a server with requests.
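A minimal sketch of the new string helpers via pymldb, assuming an MLDB instance is reachable on localhost; the example strings are made up for illustration:

```python
from pymldb import Connection  # pip install pymldb

mldb = Connection("http://localhost")

# remove_prefix / remove_suffix strip a literal prefix or suffix when present,
# and return the string unchanged otherwise.
df = mldb.query(
    "SELECT remove_prefix('mldb_release_notes', 'mldb_') AS no_prefix, "
    "remove_suffix('mldb_release_notes', '_notes') AS no_suffix"
)
print(df)  # expected: no_prefix = 'release_notes', no_suffix = 'mldb_release'
```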
Changes
- The `export.csv` procedure now automatically flattens structures. For example, if you have a dataset `ds` with a column `x` containing the structure `{"a": "foo"}`, a simple export with the query `SELECT x AS x FROM ds` now works and outputs the column name as `x.a`. (MLDB-2126)
- When you create unnamed datasets, functions or procedures, the auto-generated ids now use underscores instead of dashes.
- Newly trained models that have an always-null feature now work when tested against data where that same feature has a string value.
- Enhanced progress report for the transform procedure.
Fixes
Version 2017.01.24.0
MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.
We're happy to announce the immediate availability of MLDB version 2017.01.24.0.
This release contains 123 new commits and modified 1653 files. On top of many bug fixes and performance improvements, here are some of the highlights of this release:
- We have added new features to provide added visibility into long-running procedures. Most procedures now report their progress. Version 8.1 of pymldb brings progress bars to Jupyter so that it's easier to track the progress of procedures when working in notebooks. See the Using `pymldb`'s Progress Bar and Cancel Button Tutorial for more details.
Version 2016.12.16.0
MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.
We're happy to announce the immediate availability of MLDB version 2016.12.16.0.
This release contains 228 new commits and modified 823 files. On top of many bug fixes and performance improvements, here are some of the highlights of this release:
- It's now possible to make multiple predictions per REST call by using the `/v1/functions/<function>/batch` REST route.
- New Identifying Biased Features Tutorial.
- New signal processing functions. This includes the `fft(data [,direction='forward' [,type='real']])` function that performs a fast Fourier transform on the given data.
- Added the `devices` configuration argument to the `tensorflow.graph` function to specify on which devices the graph is allowed to run.
- MLDB now contains CUDA kernels for shader model 5.2 (Kepler), 5.3 (Maxwell), 6.0 (P100) and 6.1 (Titan X).
- Improved support for aarch64 and ARM architectures. CUDA is now supported on the Jetson TX1.
- It is now possible to track the progress of long-running procedures, as well as interrupt them. Check the Intro to Procedures page for more details.
- New numerical functions (see the sketch after this list):
  - `sin(x)`, `cos(x)` and `tan(x)` are the normal trigonometric functions.
  - `asin(x)`, `acos(x)` and `atan(x)` are the normal inverse trigonometric functions.
  - `atan2(x, y)` returns the two-argument arctangent of `x` and `y`, in other words the angle (in radians) of the point through `x` and `y` from the origin with respect to the positive `x` axis.
  - `sinh(x)`, `cosh(x)` and `tanh(x)` are the normal hyperbolic functions.
  - `asinh(x)`, `acosh(x)` and `atanh(x)` are the normal inverse hyperbolic functions.
  - `pi()` returns the value of pi, the ratio of a circle's circumference to its diameter, as a double precision floating point number.
  - `e()` returns the value of e, the base of natural logarithms, as a double precision floating point number.
- New `concat(x, ...)` function that takes several embeddings with identical sizes in all but their last dimension and joins them together on the last dimension.
- The `import.json` procedure now supports the `arrays` configuration argument to specify how arrays should be encoded in the JSON output.
- The `import.text` procedure now returns a `rowCount` field representing the number of rows that were imported, just as the `import.json` procedure does.
- Fixes to the `import.text` procedure:
  - Fixed trailing whitespace on a CSV file that contains numbers in the last column making MLDB treat those columns as strings (as it kept the trailing whitespace).
  - Fixed MLDB crashing with a "cannot seek" exception when attempting to open a file with `autoGenerateHeaders`.
- The `reshape()` function now has a 3-argument form. `reshape(val, shape, newel)` is similar to the two-argument version of reshape, but allows the number of elements to be different. If the number of elements increases, new elements will be filled in with the `newel` parameter.
- Updated the `uap-core` library to the latest version, improving the user agent patterns used by the `http.useragent` function.
- Fixed wide rows causing data corruption in tabular datasets.
- Many fixes to the `JOIN` operators.
- The `merge()` dataset function now accepts a single dataset.
- Random forest speedups and improvements.
- The `svd.train` procedure now supports all select expressions to specify its input data, instead of the restricted form of select statements.
- Fixed regexes being recompiled for every row when using a `LIKE` operator.
- The user function infrastructure has been modified to be more like built-in functions. In particular, inputs and outputs no longer need to be rows.
- The `row_dataset` construct has been modified to return one row per column, and an `atom_dataset` construct added with semantics similar to the original. The types of these datasets have been improved, with inference of the `value` type, and the `column` type is now a path, not a string.
- The `sql.expression` object has been improved to allow `raw` and `autoInput` parameters to be passed, bypassing the requirement for a row on output and input respectively.
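A minimal sketch exercising a few of the new numerical functions via pymldb, assuming an MLDB instance is reachable on localhost (the connection URL is an assumption):

```python
from pymldb import Connection  # pip install pymldb

mldb = Connection("http://localhost")

# sin/pi/atan2/tanh/e are plain SQL builtins, so a bare SELECT is enough.
df = mldb.query(
    "SELECT sin(pi() / 2) AS sin_half_pi, "
    "atan2(1, 1) AS angle_rad, "
    "tanh(0) AS tanh_zero, "
    "e() AS euler"
)
# expected: sin_half_pi = 1.0, angle_rad = pi/4 (~0.785), tanh_zero = 0.0, euler = 2.718...
print(df)
```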
Version 2016.10.05.0
MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.
We're happy to announce the immediate availability of MLDB version 2016.10.05.0.
This release contains 141 new commits and modified 903 files. On top of many bug fixes and performance improvements, here are some of the highlights of this release:
New MongoDB interface
A big new feature is support for importing and exporting data to and from MongoDB, a popular NoSQL database. Although MongoDB can be very useful for certain use cases, it doesn't have any machine learning capabilities. We want to make it as easy as possible for our users to get their data into MLDB. So we have added the following new MLDB entities that make it easy to interface with MongoDB:
- mongodb.import procedure: used to import a MongoDB collection into an MLDB dataset
- mongodb.dataset dataset: read only MLDB dataset based on a MongoDB collection
- mongodb.record dataset: write-only MLDB dataset that writes to a MongoDB collection
- mongodb.query function: function to perform an MLDB SQL query against a MongoDB collection
Updated TensorFlow to 0.10.0
We updated the TensorFlow version shipped with MLDB to version 0.10.0. The new version includes many bug fixes and performance improvements. We're now also shipping MLDB with different TensorFlow kernels, each optimized for a different instruction set. So for instance, the kernel with AVX2 instructions will be used if the processor on which MLDB is running supports it.
If you're interested in deep learning, make sure to check out the Tensorflow Image Recognition Tutorial and the Transfer Learning with Tensorflow demo to see how easy it is to run trained models with MLDB.
Updated V8 to Release 5.0
We have updated V8, the Javascript engine used in MLDB, to Release 5.0. This brings in a lot of improvements and new features, like improved ECMAScript 2015 (ES6) support, as well as increased performance. It now also compiles for the ARM architecture, which is an important step as we're working towards having MLDB run on embedded architectures.
One example of what benefits from this is the `jseval` function, which makes it possible to execute arbitrary JavaScript code inline in an SQL query.
Check out the Executing JavaScript Code Directly in SQL Queries Using the jseval Function Tutorial for great examples of how `jseval` can be used.
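A minimal sketch of `jseval` via pymldb, assuming an MLDB instance on localhost; the call passes the JavaScript body first, then a comma-separated list of argument names, then the argument values:

```python
from pymldb import Connection  # pip install pymldb

mldb = Connection("http://localhost")

# The JavaScript snippet sees 'a' and 'b' as variables bound to the last two arguments.
df = mldb.query(
    "SELECT jseval('return a.toUpperCase() + b;', 'a,b', 'mldb', 42) AS tagged"
)
print(df)  # expected: tagged = 'MLDB42'
```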
Fixes and improvements to import procedures
The `SELECT` statement of the `import.text` procedure has been improved to support the `CASE` keyword. This adds extra flexibility to process data as it is being imported.
We also fixed a bug when using the `NAMED` clause with the `import.json` procedure that could cause undesired behaviour.
Updates to the classifier configuration
We have improved the user experience around configuring supervised algorithms in two ways.
First, we have clarified the documentation by creating a new Classifier configuration section that contains the information related to the configuration of supervised models. When using either of the two procedures that train models, classifier.train and classifier.experiment, all the information you need to configure your algorithm now lives in one place.
Second, we have made the training more robust to configuration errors by having better validation of elements meant to control hyper-parameters. Incorrect parameters will now trigger errors.
New vector space functions
We added two new vector space functions:
First, the new `reshape(val, shape)` function takes an n-dimensional embedding and reinterprets it as an embedding of the provided shape containing all of the elements. This allows, for example, a 1-dimensional vector to be reinterpreted as a 2-dimensional array. The shape argument is an embedding containing the size of each dimension.
Second, the new `shape(val)` function takes an n-dimensional embedding and returns the size of each dimension as an array.
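A minimal sketch of `reshape()` and `shape()` via pymldb, assuming an MLDB instance on localhost and using embedding literals:

```python
from pymldb import Connection  # pip install pymldb

mldb = Connection("http://localhost")

# A flat 6-element vector is reinterpreted as a 2x3 embedding; shape() reports
# the size of each dimension of the original vector.
df = mldb.query(
    "SELECT reshape([1, 2, 3, 4, 5, 6], [2, 3]) AS matrix, "
    "shape([1, 2, 3, 4, 5, 6]) AS dims"
)
print(df)  # dims should be [6]; matrix holds the same six values laid out as 2 rows of 3
```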
Other changes and fixes
- The `COLUMN EXPR` expression now supports the `STRUCTURED` keyword. By default, `COLUMN EXPR` returns a flattened representation. Adding the `STRUCTURED` keyword will return the structured representation.
- The `tsne.train` procedure now has a `learningRate` configuration option.
- Improved speed and fixes to JOIN operations.
- Columns and rows can now be named with an empty string.
- When evaluating a model using the classifier.test or the classifier.experiment procedure, the F1-score was returned in a key named `f`. The key has been renamed to `f1Score`.
- The HTTP layer now correctly handles the `HTTP 1.1 100 CONTINUE` request header.
- Fixed the ordering of paths when mixing Unicode and digits.
- The user function `fetcher` is now available as a built-in function, `fetch`.
- Fixed a bug with the `levenshtein_distance()` function where it did not work properly with UTF-8 characters.
- Fixed an issue with the JavaScript plugin's `serveStaticFolder()` function where the path to serve would not be considered relative to the plugin's installation directory.
- Added the optional argument `sortField` to the `string_agg(expr, separator [, sortField])` function, which allows the aggregated values to be sorted by `sortField` (see the sketch after this list).
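A hedged sketch of `string_agg` with the new `sortField` argument via pymldb, assuming an MLDB instance on localhost; the dataset and column names (`tags`, `tag`, `rank`) are made up for illustration:

```python
from pymldb import Connection  # pip install pymldb

mldb = Connection("http://localhost")

# Build a tiny mutable dataset with three rows to aggregate over.
mldb.put("/v1/datasets/tags", {"type": "sparse.mutable"})
for row, (tag, rank) in {"r1": ("ml", 2), "r2": ("db", 1), "r3": ("sql", 3)}.items():
    mldb.post("/v1/datasets/tags/rows",
              {"rowName": row, "columns": [["tag", tag, 0], ["rank", rank, 0]]})
mldb.post("/v1/datasets/tags/commit", {})

# Aggregate the tag column into one string, ordering the values by the rank column.
df = mldb.query("SELECT string_agg(tag, ',', rank) AS tags FROM tags")
print(df)  # expected: tags = 'db,ml,sql'
```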
Version 2016.08.31.0
MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.
We're happy to announce the immediate availability of MLDB version 2016.08.31.0.
This release contains 114 new commits and modified 366 files. On top of many bug fixes and performance improvements, here are some of the highlights of this release:
MLPaint: the Real-Time Handwritten Digit Recognizer plugin
We're very excited to present MLPaint, the Real-Time Handwritten Digit Recognizer, a web app that runs on MLDB. It was made by Jonathan, the awesome intern we had with us this summer. Check out the video demo here: https://www.youtube.com/embed/WGdLCXDiDSo
The two demos below go into the technical details of how this plugin was built. The plugin is hosted on Github if you want to check out the implementation.
New demos
- The Image Processing with Convolutions demo explains what convolutions are and shows different ways of doing them with MLDB, including using TensorFlow's 2D convolution operator directly in SQL.
- The Recognizing Handwritten Digits demo explains the machine learning steps that went into creating the MLPaint plugin.
Classifier testing procedure now fully supports weights
Weighting examples correctly is a crucial part of training machine learning models that will generalize well. It can be used to compensate for sampling bias, class imbalance, etc. This is well supported for training in two ways:
- specifying the weight for each example by using the `weight` column in the `trainingData` query
- using the `equalizationFactor` parameter, which specifies the amount by which to adjust weights so that all classes have an equal total weight
Weights can also be useful for testing. For instance, the cost of making mistakes for certain examples can be much less than for others. Having the metrics take that into consideration will help deliver a clearer picture of the performance expectations you can have for the model.
All the metrics reported by the `classifier.test` procedure now fully take the weight of each example into account. You can specify the weight of each example by using the `weight` column in the `testingData` query.
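A hedged sketch of a weighted evaluation via pymldb, assuming an MLDB instance on localhost; the dataset and column names (`my_test_set`, `score`, `label`, `importance`) are hypothetical:

```python
from pymldb import Connection  # pip install pymldb

mldb = Connection("http://localhost")

# The weight column scales each example's contribution to the reported metrics.
mldb.put("/v1/procedures/weighted_eval", {
    "type": "classifier.test",
    "params": {
        "testingData": """
            SELECT score, label, importance AS weight
            FROM my_test_set
        """,
        "mode": "boolean",
        "runOnCreation": True
    }
})
```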
Credentials
MLDB makes it very easy to access secured resources using a variety of protocols like `http`, `sftp` or even `s3`. MLDB can store credentials and supply them transparently whenever required when accessing protected files.
We fixed an issue that caused a problem when credentials files were loaded from a remote resource when launching MLDB from the command line with the `add-credentials-from-url` flag. This is mostly used in production scenarios. Error messages related to the handling of credentials files were also improved so they're clearer.
Updated pymldb to version 0.7.1
The pymldb library is an open-source pure-Python module which provides a wrapper library that makes it easy to work with MLDB from Python. Version 0.7.1 is a minor update that changes the way the `query` function sends requests to MLDB. Instead of passing the query using the query string, it now sends it in the JSON payload. This makes it possible to send big feature vectors without hitting the query-string size limit.
Check out the Using pymldb Tutorial notebook for more info.
Improvements for c++ plugin developers
MLDB allows its functionality to be extended with plugins. While we often showcase Python plugins, like MLPaint mentioned at the top of this post, it's also possible to write plugins in C++.
And so C++ plugin developers rejoice! It is now easier to take advantage of MLDB's powerful SQL engine from C++ by using the new `eval_sql` function. It makes running queries easier and faster.
You can now also define built-in functions using SQL from C++. This allows for much more compact code and less boilerplate.
Shout out: Golang interface
We'd like to shout out to ZzEeKkAa who developed a very nice Golang interface for MLDB. Check it out if you're into Golang!
If you created a plugin or library that works with MLDB, make sure to reach out!
Exciting and upcoming
We have been hard at work on a new LiDAR MLDB plugin. This enables MLDB to process 3D point cloud data and do voxel rendering. It makes it possible to visualize raw and voxelized data from any point of view. Combined with our existing Tensorflow integration, it opens the door to solving cutting-edge deep learning image recognition problems with MLDB.
It is also now possible to build MLDB on 32 and 64 bit ARM architectures. This will enable us to target a wider range of hardware. Think of smartphones, Raspberry Pi, or even Nvidia's Jetson TX1. This is a stepping stone in having MLDB run on-device.
Other changes and fixes
- It is now possible to compile MLDB with clang.
- Cleanup of the logarithm functions:
  - `ln(dp or numeric)`: natural logarithm
  - `log(dp or numeric)`: base 10 logarithm
  - `log(b numeric, x numeric)`: logarithm to base b
  - Calling the following functions is now valid: `sqrt(-1)=nan`, `log(0)=-inf`, `log(-1)=nan`.
- Fixed default timestamps coherence.
- Improved speed and fixes to JOIN operations.
- Fixed a bug that prevented credentials from being deleted.
- In Python script error messages, proper file paths are now returned.
- The dataset-specific query route (`/v1/datasets/<dataset_name>/query`) has been removed. Use the `/v1/query` route instead.
- We've unified the way GET endpoints accept parameters. Some routes would take parameters either in the query string or as a JSON payload; we now enforce that all parameters be sent one way or the other, not a mix of both.
Version 2016.08.04.0
MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.
We're happy to announce the immediate availability of MLDB [version 2016.08.04.0](https://github.com/mldbai/mldb/releases/tag/v2016.08.04.0).
This release contains 161 new commits, modified 290 files and fixes 82 issues. On top of many bug fixes and performance improvements, here are some of the highlights of this release:
New `DISTINCT ON` clause
The `DISTINCT ON` clause can be used to filter out duplicate rows based on the value of an expression. The syntax is as follows:
SELECT DISTINCT ON (algorithm, project) algorithm, project, date
FROM ml_experiments
ORDER BY algorithm, project
This will return one row per unique value of the columns `algorithm` and `project`.
See the Select Expression documentation for more details.
New `try` builtin function
When an error occurs while processing a query, the whole query fails and no result is returned, even if only a single line caused the error. The new `try` function is meant to handle this type of situation. The first argument is the expression to try to apply. The optional second argument is what will be returned if an error is encountered.
In the example below, since the string 'foo' will not parse as valid JSON, the row expression `{'error': 1}` will be returned instead:
SELECT try(parse_json('foo'), {'error': 1}) AS *
Check out the `try` function documentation for more details.
Deep learning
Added support for NVIDIA CUDNN, improving the performance of MLDB's Tensorflow integration on GPUs. This is another step in making MLDB the easiest platform to use to run Tensorflow graphs.
Updated pymldb to version 0.7.0
The pymldb library is an open-source pure-Python module which provides a wrapper library that makes it easy to work with MLDB from Python. Version 0.7.0 adds support for passing in a JSON payload in GET requests. This is necessary when passing in big feature vectors to MLDB functions.
Check out the Using pymldb Tutorial notebook for more info.
Internal hashing is now done using HighwayHash
MLDB's hash functions now use the Highway Tree Hash, which is claimed to be both likely secure and very fast. This will improve the speed of working with large numbers of columns.
Other changes and fixes
- New aggregators: `vertical_stddev` (alias of `stddev`) and `vertical_variance` (alias of `variance`).
- The classifier.experiment procedure now returns the ID of the scorer function it creates for each fold. This makes it easier to reuse the functions in later steps of a script.
- The `runOnCreation` argument present for all procedures now defaults to `True`, which was the value used by the vast majority of users.
- The `export.csv` procedure has a new `skipDuplicateCells` option which, when set to `True`, will skip rows that contain cells with many values. This is necessary because the CSV format cannot represent many values per cell the way MLDB datasets can by using the time dimension. More information is available in the [export.csv procedure's documentation](https://docs.mldb.ai/doc/#builtin/procedures/CsvExportProcedure.md.html).
- Fixed a loss of precision for floats when using MLDB's Python layer.
- The arguments of the tokenize function, the import.text procedure and the tokensplit function are now all camel-case.
- The tabular dataset is more efficient at storing numbers and timestamps, leading to a reduction in memory usage.
- The speed at which the sparse.mutable dataset can record rows has been improved.
- Fixed a memory leak in the `levenshtein_distance()` built-in function.
- Fixed NULL propagation for math operators. Example: `5 + NULL = NULL`.
Version 2016.07.12.0
MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.
We're happy to announce the immediate availability of MLDB version 2016.07.12.0. Since the latest release, we've been working on many exciting projects. For example, we've started applying MLDB to LiDAR data and header bidding. We're also gearing up for more projects on image classification and deep learning, which means MLDB's support for Tensorflow will keep on improving over the coming weeks and months.
This release contains 135 new commits, modified 283 files and fixes 47 issues. On top of many bug fixes and performance improvements, here are some of the highlights of this release:
New tutorial
The Selecting Columns Programmatically Using Column Expressions Tutorial explains how column expressions can be used to programmatically choose which columns are returned by the SELECT statement. This is an example of the powerful additions MLDB has made to standard SQL to make it possible to work efficiently with schema-free sparse datasets made up of millions of columns.
Improvements to import procedures
Support for the `SELECT` and `NAMED` arguments in the `import.json` procedure
Following the addition of the WHERE argument to the `import.json` procedure in MLDB's last release, the procedure now supports the SELECT and NAMED arguments.
The `import.json` procedure allows a user to import a dataset made up of JSON blobs. The SELECT argument allows a user to select which keys to import, while the NAMED argument allows a user to name each row, potentially using values from the JSON blob.
Given the following file that contains these two lines:
{"a": "b1", "c": {"d": 1}, "e": [0, 1]}
{"a": "b2", "c": {"d": 2}}, "e": [0, 5]}
if we use the new arguments in the following way:
SELECT: c.d
NAMED: a
the resulting dataset will look like this:
_rowName | c.d |
---|---|
b1 | 1 |
b2 | 2 |
Added the `rowHash()` function to the `import.text` procedure
In the `import.text` procedure, certain functions are available in the SELECT, NAMED, WHERE and TIMESTAMP expressions.
This release adds the `rowHash()` function, which should be mainly useful in the WHERE argument to do random sampling. For example, when importing a huge file into MLDB, if we know beforehand that we only want to load a random 10% sample of the data, we can now simply use `WHERE: rowHash() % 10 = 0`. This will keep only the required sample as we stream through the data, saving both time and memory.
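A minimal sketch of sampling at import time via pymldb, assuming an MLDB instance on localhost; the file URL and dataset name are made up for illustration:

```python
from pymldb import Connection  # pip install pymldb

mldb = Connection("http://localhost")

mldb.put("/v1/procedures/import_sample", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "https://example.com/huge_file.csv",  # hypothetical file
        "outputDataset": "sampled_rows",
        "where": "rowHash() % 10 = 0",  # keep roughly 10% of the rows while streaming
        "runOnCreation": True
    }
})
print(mldb.query("SELECT count(*) AS rows FROM sampled_rows"))
```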
Improvements to the SFTP protocol handler
In MLDB's last release, we introduced the `sftp://` protocol handler. In this release, we made it more robust with support for non-standard SSH ports, as well as improving how it handles an unexpected loss of connection.
Machine learning
New `summary.statistics` procedure
As a first step in the modelling process, a data scientist usually wants to get a feel for the data. Looking at summary statistics is a great way to do this since they provide a high-level summary of the data.
The new `summary.statistics` procedure calculates summary statistics for the different columns in a dataset, and works for both numerical and categorical data.
The procedure calculates the number of unique values, the number of null values, and the most frequent items for both numerical and categorical data. In addition, for columns containing only numerical data it calculates the mean, minimum and maximum values as well as the 1st quartile, median and 3rd quartile.
The Predicting Titanic Survival Demo's dataset provides a good example of the statistics computed for numerical columns.
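A hedged sketch of running the procedure via pymldb, assuming an MLDB instance on localhost; the dataset names are hypothetical and the parameter names (`inputData`, `outputDataset`) follow the usual MLDB procedure convention rather than being confirmed here:

```python
from pymldb import Connection  # pip install pymldb

mldb = Connection("http://localhost")

mldb.put("/v1/procedures/stats", {
    "type": "summary.statistics",
    "params": {
        "inputData": "SELECT * FROM titanic_train",  # hypothetical input dataset
        "outputDataset": "titanic_stats",
        "runOnCreation": True
    }
})
print(mldb.query("SELECT * FROM titanic_stats"))
```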
New `feature_hasher` feature generator function
Feature hashing, also known as the hashing trick, is a way to turn a potentially very large feature vector into a much smaller fixed-length vector. It works by applying a hash function to each feature in the original vector and using the result as an index in the smaller vector.
The `feature_hasher` function offers this functionality and can operate in two modes: on columns only, or on the union of columns and values. This gives it the flexibility to deal with both sparse and dense data respectively.
Updated Tensorflow to version 0.9.0
We updated the Tensorflow version shipped with MLDB to version 0.9.0. The new version includes many bug fixes and performance improvements.
This new Tensorflow release also includes contributions from a member of the MLDB team!
New functions
- `hash(expr)`: this function returns the hash of the value in `expr`.
- `extract_domain(str, {removeSubdomain: false})`: this function extracts the domain name from a URL. It can be very useful when dealing with web data.
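A minimal sketch of the two new functions via pymldb, assuming an MLDB instance on localhost:

```python
from pymldb import Connection  # pip install pymldb

mldb = Connection("http://localhost")

df = mldb.query(
    "SELECT hash('some value') AS h, "
    "extract_domain('https://docs.mldb.ai/doc/index.html') AS domain, "
    "extract_domain('https://docs.mldb.ai/doc/index.html', {removeSubdomain: true}) AS root"
)
print(df)  # domain should be 'docs.mldb.ai'; root should be 'mldb.ai'
```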
Version 2016.06.28.1
MLDB is the Machine Learning Database. It's the best way to get machine learning or AI into your applications or personal projects. Head on over to MLDB.ai to try it right now or see Running MLDB for installation details.
We're happy to announce the immediate availability of MLDB version 2016.06.28.1. We've been hard at work using MLDB for several customer-facing projects and building internal features on it. We've also added a team member, Jonathan, who will be spending the summer building tutorials on MLDB. Welcome!
This release contains 112 new commits, modified 114 files and fixes 41 issues. On top of many bug fixes and performance improvements, here are some of the highlights of this release:
New demo
- The Investigating the Panama Papers demo shows off MLDB's SQL engine by exploring the raw data from the Offshore Leaks Database (or "Panama Papers" as they were called in the media). MLDB is a great tool to understand the basic structure of the dataset and to start to identify the predictive power of some of the attributes.
New tutorials
- The [Executing JavaScript Code Directly in SQL Queries Using the jseval Function Tutorial](https://docs.mldb.ai/ipy/notebooks/_tutorials/_latest/Executing%20JavaScript%20Code%20Directly%20in%20SQL%20Queries%20Using%20the%20jseval%20Function%20Tutorial.html) showcases a very unique feature of MLDB: the ability to embed Javascript directly inside SQL queries in an extremely performant, multithreaded manner.
- The Virtual Manipulation of Datasets Tutorial shows how to use the datasets of type `sampled` and `merged`. These are useful for splitting a dataset into testing and training sets and recombining them.
- The Loading Data From An HTTP Server Tutorial shows how to load data from a public web server. Since MLDB is batteries-included and much of machine learning is done over public datasets from the web, it's important to highlight how easy MLDB makes it to get started with one.
Improvements to import procedures
New `autoGenerateHeaders` option in the `import.text` procedure
MLDB is a great way to deal with datasets that have lots of columns. The preferred way to import data is by using the `import.text` procedure. The procedure has lots of options to provide as much flexibility as possible when importing raw data.
However, if the imported file's first line did not contain the header (the names of all the columns), the header had to be provided as a list to the procedure. This added an extra, unnecessary step and slowed down the workflow. The new `autoGenerateHeaders` option solves this problem by automatically generating the column names `0`, `1`, `2`, etc.
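A minimal sketch of importing a headerless CSV via pymldb, assuming an MLDB instance on localhost; the file URL and dataset name are made up for illustration:

```python
from pymldb import Connection  # pip install pymldb

mldb = Connection("http://localhost")

mldb.put("/v1/procedures/import_headerless", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "https://example.com/no_header.csv",  # hypothetical file
        "outputDataset": "no_header",
        "autoGenerateHeaders": True,
        "runOnCreation": True
    }
})

# Generated column names are 0, 1, 2, ... so they are quoted as identifiers in SQL.
print(mldb.query('SELECT "0", "1" FROM no_header LIMIT 5'))
```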
Support for the `WHERE` argument in the `import.json` procedure
When dealing with large real-life datasets, we don't always need all the rows of our raw data. Being able to filter unnecessary rows during the import stage can save both processing time and memory.
The addition of the `WHERE` argument to the `import.json` procedure allows a user to filter rows based on the values inside the JSON blobs she or he is importing, using an SQL expression. Here is a quick example:
Given the following file that contains these two lines:
{"a": "b1", "c": {"d": 1}}
{"a": "b2", "c": {"d": 2}}
we can now filter like this: `WHERE: "c.d = 1"`. The resulting dataset will only contain the first line:
a | c.d |
---|---|
b1 | 1 |
Machine Learning Improvements
The `classifier.experiment` procedure (as well as the `classifier.test` procedure) now reports the accuracy when evaluating `boolean` or `categorical` classifiers. This complements the metrics already reported: precision, recall, F1-score and Area Under the Curve.
A configuration preset for the Naive Bayes classifier has also been added to the `classifier.train` procedure.
Added JSON payload support to the `/query` endpoint and user functions
The `/query` endpoint and the application of functions via REST now support arguments passed in the JSON payload. Previously, they could only receive arguments in the query string. Since the query string has a size limit, this could be problematic for very large queries, for example when querying MLDB with data extracted from an image. Supporting JSON payloads solves this shortcoming.
New SFTP protocol handler
It's easy to load data from a variety of sources with MLDB because of the different [protocols it can handle](https://docs.mldb.ai/doc/#builtin/Url.md.html). This release adds the SSH File Transfer Protocol (SFTP) to the list of supported protocols.
New functions
http.useragent
For users dealing with web data, the `http.useragent` function provides very easy parsing of [user agent strings](https://en.wikipedia.org/wiki/User_agent). After having instantiated a `ua_parser` function, a user can make the following call:
SELECT ua_parser({ua: 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3'}) as *
which will return:
rowName | browser.family | browser.version | device.brand | device.model | isSpider | os.family | os.version |
---|---|---|---|---|---|---|---|
result | Mobile Safari | 5.1.0 | Apple | iPhone | 0 | iOS | 5.1.1 |
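For completeness, a minimal sketch via pymldb of creating the `ua_parser` function used above, assuming an MLDB instance on localhost:

```python
from pymldb import Connection  # pip install pymldb

mldb = Connection("http://localhost")

# Instantiate a function of type http.useragent under the name ua_parser.
mldb.put("/v1/functions/ua_parser", {"type": "http.useragent"})

ua = ("Mozilla/5.0 (iPhone; CPU iPhone OS 5_1_1 like Mac OS X) "
      "AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3")
print(mldb.query("SELECT ua_parser({ua: '%s'}) AS *" % ua))
```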
sign
Adding to the varied list of numerical functions available, the `sign(x)` function has been added. It returns the sign of x (-1, 0 or +1).
Performance Improvements
The previous release of MLDB included internal refactorings to allow structured paths for row and column names. These fixed many problems with modelling data with MLDB, but came at a significant runtime cost. This release of MLDB adds a lot of rework and optimization of those structured path names, allowing for reduced memory usage and much faster manipulation. In most cases, MLDB should be as performant as, or more performant than, previous releases.
Version 2016.06.08.0
- New demo Notebook:
- Performance, stability, documentation and packaging improvements
Version 2016.06.02.0
- Data model change: row and column names are now stringified versions of row and column paths
- The dot/period character (i.e. `.`) is now a path-element indirection operator, and so may no longer appear unquoted in identifiers
- New function type: `embedding.neighbors`
- New builtin functions:
  - `geo_distance()`, `levenshtein_distance()`, `jaccard_index()`
  - `rowPath()`, `rowPathElement()`, `path_element()`, `stringify_path()`, `parse_path()`
  - `isnan()`, `isinf()`, `isfinite()`, `replace_nan()`, `replace_inf()`, `replace_null()`, `replace_not_finite()`, `clamp()`
  - `count_distinct()` (aggregator)
- New demo Notebook: Enron Spam Filtering
- Credentials management: the credentials daemon is now part of the MLDB process, and routes have been moved from `/v1/creds/rules` to `/v1/credentials`
- Renamed inputs for `classifier.experiment`:
  - `trainingData` -> `inputData`, `testingData` -> `testingDataOverride`
  - `training_where` -> `trainingWhere`, `testing_where` -> `testingWhere`
  - `orderBy` -> `trainingOrderBy` / `testingOrderBy`
- Performance, stability, documentation and packaging improvements