Question about columns names assignment #213

Open
jgoizueta opened this issue Oct 21, 2016 · 3 comments

@jgoizueta
Contributor

I have a question. I'm thinking about ways to have each node generate its requirements from its input nodes' requirements, and I was looking at how the column names for each node are computed.

I see here that columns are obtained for each node, and that the queries to obtain the columns will be executed asynchronously.

Since the columns of a node may depend on the columns of its input nodes (imagine the node generates a query that involves its input column names), I would have thought that computing the columns needs to be performed sequentially.

I'm probably missing something (or many things!), but couldn't this cause a problem if the columns for a node are requested before the nodes it depends on have had their columns assigned?

@jgoizueta
Contributor Author

I'm not sure if my concerns were valid, but I guess e4be2ab would solve it anyway.

/cc @dgaubert

@dgaubert
Contributor

Actually, nodes that belong to the same graph are processed sequentially (see the create function) using recursion (asynchronous recursion?).
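To illustrate what "sequential via asynchronous recursion" could look like, here is a minimal sketch. It is not the project's actual create function: the node shape, `queryColumns`, `assignColumns`, and the column names are all hypothetical. It only shows that a node's columns are requested after the columns of its inputs have been assigned.

```javascript
// Hypothetical node shape: { id, inputs: [...], deriveColumns(inputColumns) }.
// Stand-in for an asynchronous column-names query against the database.
function queryColumns(node, inputColumns) {
  return new Promise((resolve) => {
    setImmediate(() => resolve(node.deriveColumns(inputColumns)));
  });
}

// Resolve every input first (recursively, one at a time), then the node itself.
// `order` records the sequence in which nodes had their columns assigned.
async function assignColumns(node, order = []) {
  const inputColumns = [];
  for (const input of node.inputs) {
    await assignColumns(input, order);
    inputColumns.push(input.columns);
  }
  node.columns = await queryColumns(node, inputColumns);
  order.push(node.id);
  return order;
}

// Made-up example graph: kmeans <- source.
const source = {
  id: 'source',
  inputs: [],
  deriveColumns: () => ['cartodb_id', 'the_geom'],
};
const kmeans = {
  id: 'kmeans',
  inputs: [source],
  // The output columns depend on the input columns, as in the question above.
  deriveColumns: ([cols]) => cols.concat('cluster_no'),
};
```

Calling `assignColumns(kmeans)` resolves `source` before `kmeans`, so `kmeans.deriveColumns` always sees the columns of its input.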

e4be2ab solves a different problem: regenerating a cached node when that node is used in two graphs (for example, a map with two layers whose analyses share the same node).

@dgaubert
Contributor

dgaubert commented Oct 27, 2016

Hey @jgoizueta! I was totally wrong in my previous explanation about e4be2ab:

The problem appeared when an analysis graph has a node that needs two inputs, and those child nodes share the same input node(s) that need an intermediate (cached) table. Child nodes were processed in parallel, and this sometimes caused a race condition while the intermediate table was being recreated. For instance:

                                 (child-1: source) <- [ C1: kmeans ] <- [ C0: source ] 
[ B2: line-source-to-target ] <-
                                 (child-2: target) <- [ A2: weighted-centroid ] <- [ C1: kmeans ] <- [ C0: source ]

Both child-1 and child-2 were processed in parallel using async.map(), and C1 is a cached node. When C0 changes, the intermediate table needs to be recreated, and in some scenarios child-1 was getting its input columns while child-2 was recreating the cached table, which raised a DB error. So I've used async.mapSeries() to process a node's children in series: by the time child-2 is processed, C1 has already been recalculated, which avoids the race condition.
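The difference between the two strategies can be reproduced with a small simulation. This is only a sketch under assumptions: it uses plain Promises rather than the async library, and `createCachedTable`, `getInputColumns`, `mapParallel`, `mapSeries`, `demo`, and the column names are hypothetical stand-ins, not code from the project. Which child hits the error in a real run depends on timing; here it is deterministic.

```javascript
// Hypothetical stand-in for the cached intermediate table (C1 in the diagram).
function createCachedTable() {
  return { rebuilding: false, upToDate: false };
}

// Yield to the event loop, simulating a round trip to the database.
const tick = () => new Promise((resolve) => setImmediate(resolve));

// Each child needs the columns of the shared cached node.
// If the table is stale, the child recreates it before querying.
async function getInputColumns(table, childName, log) {
  if (table.rebuilding) {
    // Another child is in the middle of recreating the table: the query fails.
    log.push(`${childName}: DB error (table is being recreated)`);
    return null;
  }
  if (!table.upToDate) {
    table.rebuilding = true;
    await tick(); // simulated drop and recreate of the intermediate table
    table.rebuilding = false;
    table.upToDate = true;
    log.push(`${childName}: recreated cached table`);
  }
  await tick(); // simulated column-names query
  log.push(`${childName}: got columns`);
  return ['cartodb_id', 'the_geom', 'cluster_no']; // made-up column names
}

// In the spirit of async.map(): every child starts immediately.
function mapParallel(items, fn) {
  return Promise.all(items.map((item) => fn(item)));
}

// In the spirit of async.mapSeries(): one child at a time, in order.
async function mapSeries(items, fn) {
  const results = [];
  for (const item of items) {
    results.push(await fn(item));
  }
  return results;
}

// Run both children against a fresh, stale table with the given strategy.
async function demo(mapper) {
  const table = createCachedTable();
  const log = [];
  const results = await mapper(['child-1', 'child-2'], (name) =>
    getInputColumns(table, name, log)
  );
  return { results, log };
}
```

With `demo(mapParallel)` one child finds the table mid-rebuild and gets `null`; with `demo(mapSeries)` the first child recreates the table and the second one simply reads its columns.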

Hope this helps.
