
Add mechanism to check analysis before it's executed #215

Merged: 29 commits merged into master from 214-prechecking on Nov 10, 2016

Conversation

@jgoizueta (Contributor) commented Oct 27, 2016

See #214

Open questions:

  • We set the status to fail and keep the error message in the catalog for nodes over the limits, but this does not prevent the limit checks from being performed again if, for example, the map is re-entered.
  • Should we pass a QueryRunner instead of a DatabaseService to the Requirements constructor?
  • We're using ANALYZE for a fast estimation of the number of rows of a query, and we're doing nothing if this fails (e.g. for lack of stats). We should handle that case or revert to using COUNT (see a6aae8d); a sketch of this kind of estimation follows this list.
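For context, a minimal sketch of this kind of planner-based row estimation, assuming a hypothetical runQuery(sql, callback) helper in the node-postgres style (the real DatabaseService/QueryRunner API may differ, and this is not the code from a6aae8d):

// Sketch only: estimate the rows a query would return from the planner's
// estimate instead of a full COUNT(*). Requires reasonably fresh statistics
// (e.g. after ANALYZE); otherwise the estimate can be far off or missing.
function estimateNumberOfRows(runQuery, query, callback) {
    runQuery('EXPLAIN (FORMAT JSON) ' + query, function(err, result) {
        if (err) {
            return callback(err);
        }
        // node-postgres parses the json column, so this is already an object
        var plan = result.rows[0]['QUERY PLAN'][0].Plan;
        return callback(null, plan['Plan Rows']);
    });
}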

Pending tasks

  • Determine which requirements to estimate/check for all the nodes (number of columns, ...), or whether we instead let each node check whatever it decides to check. We'll start with number of rows only.
  • Identify which analyses we need to limit and implement computeRequirements for them.
  • Tests
  • Better UI integration, see Handle Invalid analysis (and other analysis creation errors) cartodb#10391

Some things to look at at a later time:

  • Are we using node ids consistently (e.g. to save status)? Do we need to pay attention to the filters argument of id()? We'll look at this later.
  • Must we handle node filters?

Implement basic checking for source and filter-category nodes for
testing purposes.
@jgoizueta jgoizueta added this to the Entrepinos - Dataservices milestone Oct 27, 2016
@jgoizueta jgoizueta self-assigned this Oct 27, 2016
Now creation of an analysis accesses the database to check requirements,
and the corresponding tables must exist
Style fixes
rename some variables to camelCase
use configuration limits for computeRequirements
jshint doesn't like ==
the semantics change a little, but for our purposes === will work
@jgoizueta (Contributor, Author) commented Oct 28, 2016

A note about aliased nodes

Let's consider this analysis definition:

{
    id: 'a2',
    type: 'point-in-polygon',
    params: {
        points_source: {
          id: 'airbnb_rooms',
          type: 'source',
          params: {
              query: 'SELECT * FROM airbnb_rooms'
          }
        },
        polygons_source: {
            id: 'a1',
            type: 'trade-area',
            params: {
                source: {
                  id: 'a0',
                  type: 'source',
                  params: {
                      query: 'SELECT * FROM airbnb_rooms'
                  }
                },
                kind: 'car',
                time: 100,
                isolines: 1,
                dissolved: false
            }
        }
    }
};

The ids that appear in the definition are the external names (e.g. generated by the UI) of the analysis nodes, and must not be confused with the internal ids which are a hash of the attributes that define the node uniquely.

In this case we have two nodes, a0 and airbnb_rooms, which have an identical definition and so will have the same internal id; they both represent the same node of the DAG.

If Node objects are created with a workflow Factory for this analysis, it will create 4 Nodes; actually: new Analysis(rootNode).getNodes().length === 4.

But if we get the topologically sorted nodes we will have new Analysis(rootNode).getSortedNodes().length === 3, because the sorted list includes each distinct node only once. This means that in an analysis we may have multiple Node instances representing the same graph node.

In Requirements, since we iterate through the sorted nodes, we visit all the graph nodes, but we may not visit all the Node objects.

If we only assign requirements to the visited Nodes, a node such as a2 above could try to get the requirements from its "points_source" and find it has no requirements (because we visited a0 but not airbnb_rooms).

This is fixed in 9808d6c by replicating the requirements to all Nodes having the same id.

Note that the naive solution in that commit introduces O(n²) time complexity in the total number of nodes n;
this is probably not a problem since analyses should not have lots of nodes (otherwise we could index allNodes by their id()s, as sketched below).
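For illustration, a linear-time variant of that replication could look roughly like this (a sketch using the estimatedRequirements/limits attributes and id() described in this thread; names and structure are otherwise hypothetical):

// Sketch only: propagate requirements to every Node instance that aliases the
// same graph node, indexing by id() to avoid the O(n^2) pairwise scan.
function replicateRequirements(allNodes) {
    var requirementsById = {};
    // First pass: remember what was computed for each distinct graph node
    allNodes.forEach(function(node) {
        if (node.estimatedRequirements) {
            requirementsById[node.id()] = {
                estimatedRequirements: node.estimatedRequirements,
                limits: node.limits
            };
        }
    });
    // Second pass: copy it onto every aliased Node sharing that id
    allNodes.forEach(function(node) {
        var shared = requirementsById[node.id()];
        if (shared) {
            node.estimatedRequirements = shared.estimatedRequirements;
            node.limits = shared.limits;
        }
    });
}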

@jgoizueta (Contributor, Author) commented Oct 28, 2016

How to define per-node limits to be checked before performing an analysis.

Currently we have introduced only one limit: maximumNumberOfRows, the maximum number of rows a node can produce. For each limit, the computeRequirements functions must store the value of the limit in a limits attribute of the node and the estimated value in estimatedRequirements, for example:

this.estimatedRequirements = { numberOfRows: 1000 };
this.limits = { maximumNumberOfRows: 1000000 };

In addition, the requirementMessages function of Node must compare the stored values of all the limits and generate error messages for any limits exceeded. If necessary, this function can be redefined for specific node classes (so it can use any additional information stored for that kind of node in estimatedRequirements and limits).
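As an illustration, a default requirementMessages along these lines could look as follows (a sketch based on the attributes just described, not necessarily the actual implementation):

// Sketch only: compare estimated values against their limits and collect one
// message per exceeded limit. A falsy limit value means "no limit".
Node.prototype.requirementMessages = function() {
    var messages = [];
    var estimated = this.estimatedRequirements || {};
    var limits = this.limits || {};
    if (limits.maximumNumberOfRows && estimated.numberOfRows > limits.maximumNumberOfRows) {
        messages.push(
            'too many result rows: ' + estimated.numberOfRows +
            ' (limit: ' + limits.maximumNumberOfRows + ')'
        );
    }
    return messages;
};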

We have a default computeRequirements in Node which currently computes the maximum number of rows of all its inputs. This function can be overridden in the classes corresponding to the types of analysis that require it.

The function computeRequirements receives a DatabaseService object which can be used to access the user database. It also receives a limits object that, when used from Windshaft-CartoDB, will come from the Redis key limits:analyses in database 5.

In the limits object, global limits that apply to all analyses can be passed, as well as limits for specific analysis types. For example:

limits: {
    analyses: {
        maximumNumberOfRows: 100000, // general limit
        'trade-area': {
            maximumNumberOfRows: 10000 // specific limit
        }
    }
}

The computeRequirements functions should use the passed limits object and provide a default value for any limit. Limits can be fetched using the Node.getNodeLimit function.

For an example, take a look at how checking has been added for line-creating analyses: 700a116
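For orientation, the general shape of such an override is roughly the following (a sketch, not the code from 700a116; SomeAnalysis, the row-estimation helper, and the sql() call are stand-ins for whatever the real node type exposes):

// Sketch only: a per-type computeRequirements that estimates the output rows
// and fetches the configured limit, falling back to a 1M-row default.
SomeAnalysis.prototype.computeRequirements = function(databaseService, limits, callback) {
    var self = this;
    // hypothetical helper; the real DatabaseService API may expose this differently
    databaseService.estimateNumberOfRows(this.sql(), function(err, numberOfRows) {
        if (err) {
            return callback(err);
        }
        self.estimatedRequirements = { numberOfRows: numberOfRows };
        self.limits = {
            maximumNumberOfRows: getNodeLimit(limits, self.getType(), 'maximumNumberOfRows', 1000000)
        };
        return callback(null);
    });
};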

@jgoizueta (Contributor, Author):

Hey @dgaubert can you look at this?

cc/ @rafatower @ethervoid


AggregateIntersection.prototype.computeRequirements = function(databaseService, limits, callback) {
    // we estimate the maximum possible number of rows of the result
    var product = this.source.estimatedRequirements.numberOfRows *
@jgoizueta (Contributor, Author):

I guess this is not considered good practice; should I define an accessor and use this.source.getEstimatedRequirements().numberOfRows instead?
(your opinion will be welcome, @dgaubert)

Contributor:

We have several places where we don't follow the getter/setter pattern. IMHO it is not useful in JavaScript: you can always access a member property directly with the dot operator. My advice is to avoid the Cannot read property 'wadus' of null errors by using default values and so on.
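In other words, something like this (illustrative only):

// Default to an empty object so a missing estimation doesn't throw
// "Cannot read property 'numberOfRows' of undefined"
var sourceRequirements = this.source.estimatedRequirements || {};
var numberOfRows = sourceRequirements.numberOfRows || 0;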

numberOfRows: maxRows
};
this.limits = {
maximumNumberOfRows: getNodeLimit(limits, this.getType(), 'maximumNumberOfRows', 1000000)
@jgoizueta (Contributor, Author):

Since limit values should be definable in Redis, maybe we should use snake_case instead of camelCase?

Contributor:

you mean maximumNumberOfRows?

@jgoizueta (Contributor, Author):

Yes, having maximum_number_of_rows instead.

@rafatower (Contributor):

@dgaubert did you have a chance to take a further look? I'm more or less familiar with the code so I can walk you through the changes if you want.

@dgaubert (Contributor) commented Nov 2, 2016

@rafatower I'm dealing with some weird test stuff in SQL-API. Let's take a look at it this afternoon. Sorry!

@rafatower (Contributor):

sure, no problem... and thanks a lot :)

@jgoizueta (Contributor, Author):

We have decided to release this ASAP, so I'll take care of the most relevant issues and will leave the rest for a second iteration.

@dgaubert (Contributor) commented Nov 4, 2016

Let me know when you feel ready to review again!

@jgoizueta (Contributor, Author):

Hey @dgaubert, I'll be testing this in staging, but you can already review the changes whenever it suits you

So that errors are properly shown in the UI
Also change default number of max rows to unlimited for source nodes,
but keep a limit of 1M as a default for other kinds of nodes.
General limits are removed; only analyses with specific limits will be limited.
Avoid the number-of-categories estimation and go for a conservative
estimate (potentially all input points could end up on the same line).
@jgoizueta (Contributor, Author) commented Nov 8, 2016

I think it's ready to be deployed; @dgaubert, can you take a last look at it?

The following kinds of analysis are now limited; the default limits can be overridden in Redis (an illustrative override is sketched after the lists below):

  • line-sequential: the input table cannot have more than 100000 rows (previously 1M), even if we aggregate by some column; initially I tried to adjust for the aggregation but removed that because I think it is safer this way.
  • line-source-to-target: the product of the number of rows of the input tables cannot be greater than 1M.
  • georeference-country: the input cannot have more than 500 rows.
  • georeference-admin-region: the input cannot have more than 1000 rows.
  • georeference-postal-code: if creating polygons, the input cannot have more than 10000 rows (otherwise the limit is 1M)

I have placed some limits on other analyses not initially considered:

  • buffer: if it will be dissolved, no more than 10000 input rows (otherwise 1M)
  • aggregate-intersection: product of input rows <= 1M
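As an illustration (reusing the limits structure shown earlier in this thread; the values here are made up), overriding some of these defaults through the limits object stored in Redis would look roughly like this:

limits: {
    analyses: {
        'line-sequential': {
            maximumNumberOfRows: 200000 // raise the 100000-row default
        },
        'georeference-country': {
            maximumNumberOfRows: 1000 // raise the 500-row default
        }
    }
}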

cc @javisantana @rafatower

@@ -80,6 +82,11 @@ AnalysisFactory.prototype.create = function(configuration, definition, callback)
return done(err, analysis);
});
},
function analysis$checkLimits(analysis, done) {
Contributor:

Why do we check limits after registering the analysis? We are going to have a wrong status in cdb_analysis_catalog if the analysis reaches some limit.

@jgoizueta (Contributor, Author):

oops, I meant to do it before registration... I'm fixing it

@jgoizueta (Contributor, Author):

@rochoa please take a look at this

@@ -468,3 +471,44 @@ function validate(validator, params, expectedParamName) {

return param;
}

Node.prototype.computeRequirements = function(databaseService, limits, callback) {
var default_limit = null; // no limit by default
Contributor:

Why don't we use something like Infinity here?

var messages = [];

// Note that any falsy value (null, 0, '', ...) means no limits:
if (this.requirements.getLimit('maximumNumberOfRows')) {
Contributor:

I would hide these details in requirements.

@@ -1,3 +1,4 @@

Contributor:

🙈

@rochoa (Contributor) commented Nov 10, 2016

Hey, don't consider my small comments as a blocker for this.

As we will work on #220, I'm pretty much OK with this first iteration. The limitations and caveats are well defined, so let's :shipit:.

BTW, good catch about A note about aliased nodes, I would like to address this problem at some point because it's super confusing when you need to use getNodes vs getSortedNodes.

Thanks a lot for working on this one!

@jgoizueta jgoizueta merged commit 74205da into master Nov 10, 2016
jgoizueta added a commit that referenced this pull request Nov 10, 2016
This release introduces analysis pre-checking, see #215
This is the first attempt at this, see #220 for intended changes.
@javisantana (Contributor):

Could we add a "limits" section in the camshaft reference so the UI can use them to show warnings? like

"aggregate-intersection": {
            "params": {
                "source": {
                    "type": "node",
                    "geometry": [
                        "*"
                    ]
                    limits: {
                      max_input_rows: 100000
                    }
                },
                ...
            }

cc @rochoa

@rochoa (Contributor) commented Nov 10, 2016

The problem with integrating this directly in the reference is that it is something dynamic: you might want to limit an analysis to 100k rows in some cases but be fine with more in others, depending on your hardware, for instance.

@rochoa rochoa deleted the 214-prechecking branch December 15, 2016 17:09