Bulk insert aggregated results #225

rafamanzo · 2016-07-22T20:19:08Z

This is mostly @danielkza's great work, for which I've split some commits into atomic ones.

To further improve the effect of the bulk insertions, walking the tree by levels, bottom-up, has been added in this same PR.

As this stores a lot of data in memory, I think it would be sensible to add memory statistics to the profiling before accepting this.

Aggregation performance test results below:

| Metric Count | Branch | Wall Time (s) | Process Time (s) |
|---|---|---|---|
| 1 | v1.3.2 | 34.16732797622681 | 29.564079139199997 |
| 1 | optimize_aggregation_reviewed | 4.813459873199463 | 6.0212057172 |
| 2 | v1.3.2 | 71.57923197746277 | 63.06931912939999 |
| 2 | optimize_aggregation_reviewed | 7.97909255027771 | 9.9699096154 |
| 4 | v1.3.2 | 154.1164906024933 | 147.8050995992 |
| 4 | optimize_aggregation_reviewed | 14.562432861328125 | 17.795247482599997 |
| 8 | v1.3.2 | 333.3348692417145 | 311.43665788299995 |
| 8 | optimize_aggregation_reviewed | 33.7912938117981 | 35.3004300856 |

Almost a 10x speedup! Nicely done!
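
For reference, the wall-time ratios behind that figure can be recomputed from the numbers reported in this PR. This is only a quick sketch: `WALL_TIMES` and `speedups` are illustrative names, not part of the code base. With these figures the ratio ranges from roughly 7x (1 metric) to about 10.6x (4 metrics).

```ruby
# Wall times in seconds, copied from the results reported in this PR.
# The constant and method names are illustrative only.
WALL_TIMES = {
  1 => { baseline: 34.16732797622681, optimized: 4.813459873199463 },
  2 => { baseline: 71.57923197746277, optimized: 7.97909255027771 },
  4 => { baseline: 154.1164906024933, optimized: 14.562432861328125 },
  8 => { baseline: 333.3348692417145, optimized: 33.7912938117981 }
}.freeze

# Returns a hash mapping metric count to the speedup factor
# (v1.3.2 wall time divided by the optimized branch wall time).
def speedups(times)
  times.transform_values { |t| t[:baseline] / t[:optimized] }
end

speedups(WALL_TIMES).each do |metric_count, ratio|
  puts format('%d metric(s): %.1fx faster', metric_count, ratio)
end
```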

This is part of #207.

After accepting this, please create an issue so we do not forget to rewrite the MetricResult#pre_order unit tests to take full advantage of the objects that already exist in the context.

danielkza · 2016-07-22T21:21:30Z

I've run some tests and memory usage is actually better, probably because we no longer fetch multiple copies of the same entities repeatedly. Gist

Previously, aggregation worked by traversing the module tree in pre-order. But to ensure that children are aggregated before their parents, we can relax that order a bit and just process all the results on the same level before those on the level above it (the topmost level consisting of the root). This allows fetching many more results at once and significantly reduces the number of trips to the database: from a number proportional to the number of nodes to no more than the maximum depth of the tree.

It also makes it much easier to accumulate the new tree metric results so that they can be created all at once, which saves a huge number of trips to the database as well.

Using the aggregation performance tests on my development machine, the average time, combined with the indexing changes that were previously made, went from around 200s to <20s.

Regarding tests: a complete refactor was necessary, and was made possible by the module results tree factory. The tests ended up much cleaner and arguably better, as they can verify the actual values being aggregated while mocking only the necessary data accesses.
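
The level-by-level idea can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `ROWS`, `FakeStore`, `levels_from` and `aggregate` are hypothetical names, the "database" is an in-memory array, and summing the children's values is just an assumed aggregation form. It shows one query per level and a bottom-up pass that accumulates the results before anything is written.

```ruby
# Hypothetical flat table of module results (not the real schema):
# leaves carry a measured value, inner nodes have none yet.
ROWS = [
  { id: 1, parent_id: nil, value: nil },
  { id: 2, parent_id: 1,   value: nil },
  { id: 3, parent_id: 1,   value: 4 },
  { id: 4, parent_id: 2,   value: 1 },
  { id: 5, parent_id: 2,   value: 2 }
].freeze

# In-memory stand-in for the database that counts round trips.
class FakeStore
  attr_reader :query_count

  def initialize(rows)
    @rows = rows
    @query_count = 0
  end

  # One "query" fetching the children of every given parent at once.
  def children_of(parent_ids)
    @query_count += 1
    @rows.select { |row| parent_ids.include?(row[:parent_id]) }
  end
end

# Builds the list of levels, topmost first, issuing one query per level.
def levels_from(store, root)
  levels = [[root]]
  loop do
    children = store.children_of(levels.last.map { |row| row[:id] })
    break if children.empty?

    levels << children
  end
  levels
end

# Walks the levels bottom-up so that children are always handled before
# their parents. Aggregated values are only accumulated in memory here,
# ready to be written in bulk afterwards.
def aggregate(levels)
  totals = Hash.new(0) # parent id => sum of its children's values
  values = {}          # node id => measured or aggregated value
  levels.reverse_each do |level|
    level.each do |row|
      values[row[:id]] = row[:value] || totals[row[:id]]
      totals[row[:parent_id]] += values[row[:id]]
    end
  end
  values
end

store = FakeStore.new(ROWS)
levels = levels_from(store, ROWS.first)
puts aggregate(levels).inspect
puts "queries issued: #{store.query_count}"
```

In this toy tree the number of queries follows the number of levels rather than the number of nodes, which is the property the commit message relies on.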

Using `import!` with no batch limit can theoretically offer the best performance, but generates obscenely large queries that are very hard to deal with in logs. A batch size of 100 wrapped in a transaction seems to have virtually no performance penalty, but makes things much easier to manage.
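
A minimal sketch of that trade-off, assuming nothing about the real persistence layer: it does not call `import!` itself, and `FakeConnection`, `insert_rows` and `insert_in_batches` are hypothetical stand-ins used only to show rows being sliced into batches of 100 inside a single transaction.

```ruby
BATCH_SIZE = 100

# Hypothetical stand-in for a database connection; it only records the
# statements it would have sent.
class FakeConnection
  attr_reader :statements

  def initialize
    @statements = []
  end

  # Wraps the block in BEGIN/COMMIT, rolling back if the block raises.
  def transaction
    @statements << 'BEGIN'
    yield
    @statements << 'COMMIT'
  rescue StandardError
    @statements << 'ROLLBACK'
    raise
  end

  # Pretends to send one multi-row INSERT statement.
  def insert_rows(rows)
    @statements << "INSERT (#{rows.size} rows)"
  end
end

# Sends the rows in slices of batch_size, all within one transaction, so
# each individual statement stays small enough to read in the logs.
def insert_in_batches(connection, rows, batch_size: BATCH_SIZE)
  connection.transaction do
    rows.each_slice(batch_size) { |batch| connection.insert_rows(batch) }
  end
end

connection = FakeConnection.new
insert_in_batches(connection, Array.new(250) { |i| { value: i } })
puts connection.statements
```

Because every batch shares one transaction, a failure in any batch rolls back the whole set, which keeps the all-or-nothing behaviour of a single large insert.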

diegoamc · 2016-07-26T01:28:45Z

CHANGELOG.rdoc

@@ -4,6 +4,9 @@ KalibroProcessor is the processing web service for Mezuro.

 == Unreleased

+* Insert in one query all aggregated MetricResults


I don't mean to be too meticulous here, but now we are batching the queries, right?

@diegoamc Right. I personally prefer changelog entries that describe changes as they would affect a user reading it, and not exactly what changed. In this case it doesn't really matter whether we batch, insert at once, or whatever else, only that the aggregation got faster.

diegoamc · 2016-07-27T18:28:07Z

app/models/module_result.rb

@@ -30,8 +30,27 @@ def pre_order
    @pre_order ||= pre_order_traverse(root).to_a
  end

+  def level_order


Is this method actually called somewhere?

Nice catch, it isn't used. It was part of an earlier version of the PR. It might be useful for other processing steps, but we can add it whenever we actually need it.


diegoamc · 2016-07-27T18:39:19Z

Congratulations, this is a very fine piece of work! 👏

I think we should remove #level_order along with its tests before merging. What do you think?

danielkza · 2016-07-27T18:53:47Z

Sounds good to me!


danielkza · 2016-07-27T21:39:25Z

Done.

diegoamc · 2016-07-27T22:10:10Z

Sweet 🍬

rafamanzo mentioned this pull request Jul 22, 2016

Rewrite aggregation processing to be more efficient #223

Closed

rafamanzo added the in progress label Jul 22, 2016

This was referenced Jul 23, 2016

Optimize TreeMetricResult#descendant_values #224

Merged

Refactor ModuleResult#pre_order #222

Open

danielkza and others added 6 commits July 23, 2016 14:10

Update ModuleResult tree-walking methods

1c7932b

Add level order, which should be faster when fetching from the database because it makes one query per level of the tree instead of possibly one per node.

Signed-off-by: Rafael Reggiani Manzo <[email protected]>

Fix tree level walking methods unit tests mocking

75e8247

The mocking was too loose for `level_order`, which relies on `descendants_by_level`, and too tight for the latter: mocking the protected method `fetch_all_children` left it uncovered by tests.

Remove unecessary usage of let!

232dc5e

This is not necessary for the given ModuleResult#find_by_module_and_processing unit test.

Refactor ModuleResult tree walking unit tests

50e83b8

Places all the tree walking related methods within the same describe statement enabling setup sharing. This is based on the work of @danielkza on commit 805c7c1.

Remove unused MetricResultAggregator class

bbd513d

It is no longer used since the new aggregation processing collects the tree metric results itself, making the auxiliary logic to find out whether a node already has a result for a metric unnecessary.

rafamanzo force-pushed the optimize_aggregation_reviewed branch from 8a7847d to bbd513d Compare July 23, 2016 17:13

danielkza force-pushed the optimize_aggregation_reviewed branch from cc5feeb to eb4176d Compare July 23, 2016 23:13

diegoamc reviewed Jul 26, 2016
View reviewed changes

danielkza mentioned this pull request Jul 26, 2016

Add metric extraction utility methods to Context #227

Open

rafamanzo mentioned this pull request Jul 27, 2016

Remove unnecessary timestamp columns from models #219

Open

diegoamc reviewed Jul 27, 2016
View reviewed changes

Remove unused MetricResult#level_order method

d52ee8d

diegoamc merged commit 1746695 into master Jul 27, 2016

diegoamc deleted the optimize_aggregation_reviewed branch July 27, 2016 22:10

diegoamc removed the in progress label Jul 27, 2016

rafamanzo mentioned this pull request Jul 27, 2016

Slow processings #207

Closed

5 tasks
