
Add mechanism to check analysis before it's executed #215

Merged: 29 commits merged into master from 214-prechecking on Nov 10, 2016

Conversation

@jgoizueta (Contributor) commented Oct 27, 2016

See #214

Open questions:

  • We set the status to fail and keep the error message in the catalog for nodes over the limits, but this does not prevent the limit checks from being performed again if, for example, the map is re-entered.
  • Should we pass a QueryRunner instead of a DatabaseService to the Requirements constructor?
  • We're using ANALYZE for a fast estimation of the number of rows of a query, and we're doing nothing if this fails (e.g. for lack of stats). We should handle that case or revert to using COUNT (see a6aae8d); a sketch of this kind of estimation follows this list.
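For context, a minimal sketch of this kind of planner-based row estimation, assuming a hypothetical runQuery(sql, callback) helper in the node-postgres style (the real DatabaseService/QueryRunner API may differ, and this is not the code from a6aae8d):

// Sketch only: estimate the rows a query would return from the planner's
// estimate instead of a full COUNT(*). Requires reasonably fresh statistics
// (e.g. after ANALYZE); otherwise the estimate can be far off or missing.
function estimateNumberOfRows(runQuery, query, callback) {
    runQuery('EXPLAIN (FORMAT JSON) ' + query, function(err, result) {
        if (err) {
            return callback(err);
        }
        // node-postgres parses the json column, so this is already an object
        var plan = result.rows[0]['QUERY PLAN'][0].Plan;
        return callback(null, plan['Plan Rows']);
    });
}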

Pending tasks

  • Determine which requirements to estimate/check for all the nodes (number of columns, ...), or whether we instead let each node check whatever it decides to check. We'll start with number of rows only.
  • Identify which analyses we need to limit and implement computeRequirements for them.
  • Tests
  • Better UI integration, see Handle Invalid analysis (and other analysis creation errors) cartodb#10391

Some things to look at at a later time:

  • Are we using node ids consistently (e.g. to save status)? Do we need to pay attention to the filters argument of id()? We'll look at this later.
  • Must we handle node filters?

Implement basic checking for source and filter-category nodes for
testing purposes.
@jgoizueta jgoizueta added this to the Entrepinos - Dataservices milestone Oct 27, 2016
@jgoizueta jgoizueta self-assigned this Oct 27, 2016
Now creation of an analysis accesses the database to check requirements,
and the corresponding tables must exist
Style fixes
rename some variables to camelCase
use configuration limits for computeRequirements
jshint doesn't like ==
the semantics change a little, but for our purposes === will work
@jgoizueta (Contributor, Author) commented Oct 28, 2016

A note about aliased nodes

Let's consider this analysis definition:

{
    id: 'a2',
    type: 'point-in-polygon',
    params: {
        points_source: {
          id: 'airbnb_rooms',
          type: 'source',
          params: {
              query: 'SELECT * FROM airbnb_rooms'
          }
        },
        polygons_source: {
            id: 'a1',
            type: 'trade-area',
            params: {
                source: {
                  id: 'a0',
                  type: 'source',
                  params: {
                      query: 'SELECT * FROM airbnb_rooms'
                  }
                },
                kind: 'car',
                time: 100,
                isolines: 1,
                dissolved: false
            }
        }
    }
};

The ids that appear in the definition are the external names (e.g. generated by the UI) of the analysis nodes, and must not be confused with the internal ids which are a hash of the attributes that define the node uniquely.

In this case we have two nodes, a0 and airbnb_rooms, which have an identical definition and so will have the same internal id; they both represent the same node of the DAG.

If Node objects are created with a workflow Factory for this analysis, it will create 4 Nodes; actually: new Analysis(rootNode).getNodes().length === 4.

But if we get the topologically sorted nodes we will have new Analysis(rootNode).getSortedNodes().length === 3, because the sorted list includes each distinct node only once. This means that in an analysis we may have multiple Node instances representing the same graph node.

In Requirements, since we iterate through the sorted nodes, we visit all the graph nodes, but we may not visit all the Node objects.

If we only assign requirements to the visited Nodes, a node such as a2 above could try to get the requirements from its "points_source" and find it has no requirements (because we visited a0 but not airbnb_rooms).

This is fixed in 9808d6c by replicating the requirements to all Nodes having the same id.

Note that the naive solution in that commit introduces O(n²) time complexity in the total number of nodes n;
this is probably not a problem since analyses should not have lots of nodes (otherwise we could index allNodes by their id()s, as sketched below).
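For illustration, a linear-time variant of that replication could look roughly like this (a sketch using the estimatedRequirements/limits attributes and id() described in this thread; names and structure are otherwise hypothetical):

// Sketch only: propagate requirements to every Node instance that aliases the
// same graph node, indexing by id() to avoid the O(n^2) pairwise scan.
function replicateRequirements(allNodes) {
    var requirementsById = {};
    // First pass: remember what was computed for each distinct graph node
    allNodes.forEach(function(node) {
        if (node.estimatedRequirements) {
            requirementsById[node.id()] = {
                estimatedRequirements: node.estimatedRequirements,
                limits: node.limits
            };
        }
    });
    // Second pass: copy it onto every aliased Node sharing that id
    allNodes.forEach(function(node) {
        var shared = requirementsById[node.id()];
        if (shared) {
            node.estimatedRequirements = shared.estimatedRequirements;
            node.limits = shared.limits;
        }
    });
}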

@jgoizueta (Contributor, Author) commented Oct 28, 2016

How to define per-node limits to be checked before performing an analysis.

Currently we have introduced only one limit: maximumNumberOfRows, the maximum number of rows a node can produce. For each limit, the computeRequirements functions must store the value of the limit in a limits attribute of the node and the estimated value in estimatedRequirements, for example:

this.estimatedRequirements = { numberOfRows: 1000 };
this.limits = { maximumNumberOfRows: 1000000 };

In addition, the requirementMessages function of Node must compare the stored values of all the limits and generate error messages for any limits exceeded. If necessary, this function can be redefined for specific node classes (so it can use any additional information stored for that kind of node in estimatedRequirements and limits).
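As an illustration, a default requirementMessages along these lines could look as follows (a sketch based on the attributes just described, not necessarily the actual implementation):

// Sketch only: compare estimated values against their limits and collect one
// message per exceeded limit. A falsy limit value means "no limit".
Node.prototype.requirementMessages = function() {
    var messages = [];
    var estimated = this.estimatedRequirements || {};
    var limits = this.limits || {};
    if (limits.maximumNumberOfRows && estimated.numberOfRows > limits.maximumNumberOfRows) {
        messages.push(
            'too many result rows: ' + estimated.numberOfRows +
            ' (limit: ' + limits.maximumNumberOfRows + ')'
        );
    }
    return messages;
};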

We have a default computeRequirements in Node which currently computes the maximum number of rows of all its inputs. This function can be overridden in the classes corresponding to the types of analysis that require it.

The function computeRequirements receives a DatabaseService object which can be used to access the user database. It also receives a limits object that, when used from Windshaft-CartoDB, will come from the Redis key limits:analyses in database 5.

In the limits object, global limits that apply to all analyses can be passed, as well as limits for specific analysis types. For example:

limits: {
    analyses: {
        maximumNumberOfRows: 100000, // general limit
        'trade-area': {
            maximumNumberOfRows: 10000 // specific limit
        }
    }
}

The computeRequirements functions should use the passed limits object and provide a default value for any limit. Limits can be fetched using the Node.getNodeLimit function.

For an example, take a look at how checking has been added for line-creating analyses: 700a116
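For orientation, the general shape of such an override is roughly the following (a sketch, not the code from 700a116; SomeAnalysis, the row-estimation helper, and the sql() call are stand-ins for whatever the real node type exposes):

// Sketch only: a per-type computeRequirements that estimates the output rows
// and fetches the configured limit, falling back to a 1M-row default.
SomeAnalysis.prototype.computeRequirements = function(databaseService, limits, callback) {
    var self = this;
    // hypothetical helper; the real DatabaseService API may expose this differently
    databaseService.estimateNumberOfRows(this.sql(), function(err, numberOfRows) {
        if (err) {
            return callback(err);
        }
        self.estimatedRequirements = { numberOfRows: numberOfRows };
        self.limits = {
            maximumNumberOfRows: getNodeLimit(limits, self.getType(), 'maximumNumberOfRows', 1000000)
        };
        return callback(null);
    });
};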

@jgoizueta (Contributor, Author):

Hey @dgaubert can you look at this?

cc/ @rafatower @ethervoid


AggregateIntersection.prototype.computeRequirements = function(databaseService, limits, callback) {
    // we estimate the maximum possible number of rows of the result
    var product = this.source.estimatedRequirements.numberOfRows *
@jgoizueta (Contributor, Author):

I guess this is not considered good practice; should I define an accessor and use this.source.getEstimatedRequirements().numberOfRows instead?
(your opinion will be welcome, @dgaubert)

Contributor:

We have several places where we don't follow the getter/setter pattern. IMHO it is not useful in JavaScript: you can always access a member property directly with the dot operator. My advice is to avoid the Cannot read property 'wadus' of null errors by using default values and so on.
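In other words, something like this (illustrative only):

// Default to an empty object so a missing estimation doesn't throw
// "Cannot read property 'numberOfRows' of undefined"
var sourceRequirements = this.source.estimatedRequirements || {};
var numberOfRows = sourceRequirements.numberOfRows || 0;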

numberOfRows: maxRows
};
this.limits = {
maximumNumberOfRows: getNodeLimit(limits, this.getType(), 'maximumNumberOfRows', 1000000)
@jgoizueta (Contributor, Author):

Since limit values should be definable in Redis, maybe we should use snake_case instead of camelCase?

Contributor:

you mean maximumNumberOfRows?

@jgoizueta (Contributor, Author):

Yes, having maximum_number_of_rows instead.

@rafatower (Contributor):

@dgaubert did you have a chance to take a further look? I'm more or less familiar with the code so I can walk you through the changes if you want.

@dgaubert (Contributor) commented Nov 2, 2016

@rafatower I'm dealing with some weird test stuff in SQL-API. Let's take a look at it this afternoon. Sorry!

@rafatower (Contributor):

sure, no problem... and thanks a lot :)

@jgoizueta (Contributor, Author):

We have decided to release this ASAP, so I'll take care of the most relevant issues and will leave the rest for a second iteration.

@dgaubert (Contributor) commented Nov 4, 2016

Let me know when you feel ready to review again!

@jgoizueta (Contributor, Author):

Hey @dgaubert, I'll be testing this in staging, but you can already review the changes whenever it suits you

So that errors are properly shown in the UI
Also change default number of max rows to unlimited for source nodes,
but keep a limit of 1M as a default for other kinds of nodes.
General limits are removed; only analyses with specific limits will be limited.
Avoid the number-of-categories estimation and go for a conservative
estimate (potentially all input points could end up on the same line).
@jgoizueta (Contributor, Author) commented Nov 8, 2016

I think it's ready to be deployed; @dgaubert, can you take a last look at it?

The following kinds of analysis are now limited; the default limits can be overridden in Redis (an illustrative override is sketched after the lists below):

  • line-sequential: the input table cannot have more than 100000 rows (previously 1M), even if we aggregate by some column; initially I tried to adjust for the aggregation but removed that because I think it is safer this way.
  • line-source-to-target: the product of the number of rows of the input tables cannot be greater than 1M.
  • georeference-country: the input cannot have more than 500 rows.
  • georeference-admin-region: the input cannot have more than 1000 rows.
  • georeference-postal-code: if creating polygons, the input cannot have more than 10000 rows (otherwise the limit is 1M)

I have placed some limits on other analyses not initially considered:

  • buffer: if it will be dissolved, no more than 10000 input rows (otherwise 1M)
  • aggregate-intersection: product of input rows <= 1M
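As an illustration (reusing the limits structure shown earlier in this thread; the values here are made up), overriding some of these defaults through the limits object stored in Redis would look roughly like this:

limits: {
    analyses: {
        'line-sequential': {
            maximumNumberOfRows: 200000 // raise the 100000-row default
        },
        'georeference-country': {
            maximumNumberOfRows: 1000 // raise the 500-row default
        }
    }
}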

cc @javisantana @rafatower

@@ -80,6 +82,11 @@ AnalysisFactory.prototype.create = function(configuration, definition, callback)
return done(err, analysis);
});
},
function analysis$checkLimits(analysis, done) {
Contributor:

Why do we check limits after registering the analysis? We are going to have a wrong status in cdb_analysis_catalog if the analysis reaches some limit.

@jgoizueta (Contributor, Author):

oops, I meant to do it before registration... I'm fixing it

@jgoizueta (Contributor, Author):

@rochoa please take a look at this

@@ -468,3 +471,44 @@ function validate(validator, params, expectedParamName) {

return param;
}

Node.prototype.computeRequirements = function(databaseService, limits, callback) {
var default_limit = null; // no limit by default
Contributor:

Why don't we use something like Infinity here?

var messages = [];

// Note that any falsy value (null, 0, '', ...) means no limits:
if (this.requirements.getLimit('maximumNumberOfRows')) {
Contributor:

I would hide these details in requirements.

@@ -1,3 +1,4 @@

Contributor:

🙈

@rochoa (Contributor) commented Nov 10, 2016

Hey, don't consider my small comments as a blocker for this.

As we will work on #220, I'm pretty much OK with this first iteration. The limitations and caveats are well defined, so let's :shipit:.

BTW, good catch about A note about aliased nodes, I would like to address this problem at some point because it's super confusing when you need to use getNodes vs getSortedNodes.

Thanks a lot for working on this one!

@jgoizueta jgoizueta merged commit 74205da into master Nov 10, 2016
jgoizueta added a commit that referenced this pull request Nov 10, 2016
This release introduces analysis pre-checking, see #215
This is the first attempt at this, see #220 for intended changes.
@javisantana (Contributor):

Could we add a "limits" section in the camshaft reference so the UI can use them to show warnings? like

"aggregate-intersection": {
            "params": {
                "source": {
                    "type": "node",
                    "geometry": [
                        "*"
                    ]
                    limits: {
                      max_input_rows: 100000
                    }
                },
                ...
            }

cc @rochoa

@rochoa (Contributor) commented Nov 10, 2016

The problem with integrating this directly in the reference is that it is something dynamic: you might want to limit an analysis to 100k rows in some cases but be fine with more in others, depending on your hardware, for instance.

@rochoa rochoa deleted the 214-prechecking branch December 15, 2016 17:09