Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk insert aggregated results #225

Merged
merged 8 commits into from
Jul 27, 2016
Merged

Conversation

rafamanzo
Copy link
Member

@rafamanzo rafamanzo commented Jul 22, 2016

This is mostly @danielkza a great work for which I've split some commits into atomic ones.

To improve even more the bulk insertions effect, walking by levels bottom up has been added alongside this same PR.

As this stores a lot of data in memory, I think adding memory statistics to the profiling is sane before accepting this.

Aggregation performance results:

Aggregation performance test results below:

Metric Count Branch Wall Time (s) Process Time (s)
1 v1.3.2 34.16732797622681 29.564079139199997
1 optimize_aggregation_reviewed 4.813459873199463 6.0212057172
2 v1.3.2 71.57923197746277 63.06931912939999
2 optimize_aggregation_reviewed 7.97909255027771 9.9699096154
4 v1.3.2 154.1164906024933 147.8050995992
4 optimize_aggregation_reviewed 14.562432861328125 17.795247482599997
8 v1.3.2 333.3348692417145 311.43665788299995
8 optimize_aggregation_reviewed 33.7912938117981 35.3004300856

Almost a 10x speedup! Nicely done!

This is part of #207.

After accepting this, please create an issue so we do not forget to rewrite MetricResult#pre_order unit tests to take full advantage of the context already existent objects.

@danielkza
Copy link
Contributor

I've ran some tests and memory usage is actually better, probably due to not fetching multiple copies of the same entities repeatedly. Gist

danielkza and others added 6 commits July 23, 2016 14:10
Add level order, which should be faster when fetching from the database,
by making one query per-level of the tree, instead of possibly one
per-node.

Signed-off-by: Rafael Reggiani Manzo <[email protected]>
It was too loose for `level_order` relying on `descendants_by_level` and
too tight for this last one mocking the protected method
`fetch_all_children` leaving it uncovered by tests.
This is not necessary for the given
ModuleResult#find_by_module_and_processing unit test.
Places all the tree walking related methods within the same describe
statement enabling setup sharing.

This is based on the work of @danielkza on commit
805c7c1.
Previously aggregation worked by traversing the modules tree in
pre-order. But to ensure that children are aggregated before their
parents, we can relax that order a bit to just processes all the results
on the same level before all of those on a level above it (the topmost
level consisting of the root).

This allows fetching much more results at once and significantly reduce
t he number of trips to the database - from a number proportional to the
number of nodes, to exactly and no more than the maximum depth of the
tree.

It also makes it much easier to accumulate the created tree metric
results to be created all at once. That also saves a huge number of
trips to the database.

Using the aggregation performance tests, in my development machine the
average time - combined with the indexing changes that were previously
made - went from around 200s to <20s.

Regarding tests: a complete refactor was necessary, and made possible by
the module results tree factory. The tests ended up much cleaner and
arguably better, as they can verify the actual values being aggregated
while mocking only the necessary data accesses.
It is no longer used since the new aggregation processing collects the
tree metric results itself, making the auxiliary logic to find out
whether a node already has a result for a metric unnecessary.
@rafamanzo rafamanzo force-pushed the optimize_aggregation_reviewed branch from 8a7847d to bbd513d Compare July 23, 2016 17:13
Using `import!` with no batch limit can theoretically offer the best
performance, but generates obscenely large queries that are very hard
to deal with in logs. A batch size of 100 wrapped in a transaction
seems to have virtually no performance penalty, but makes things much
easier to manage.
@danielkza danielkza force-pushed the optimize_aggregation_reviewed branch from cc5feeb to eb4176d Compare July 23, 2016 23:13
@@ -4,6 +4,9 @@ KalibroProcessor is the processing web service for Mezuro.

== Unreleased

* Insert in one query all aggregated MetricResults
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean to be too meticulous here, but now we are batching the queries, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diegoamc Right. I personally prefer changelog entries that describe changes as they would affect a user reading it, and not exactly what changed. In this case it doesn't really matter whether we batch, insert at once, or whatever else, only that the aggregation got faster.

@@ -30,8 +30,27 @@ def pre_order
@pre_order ||= pre_order_traverse(root).to_a
end

def level_order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method actually called somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, It isn't used. it was part of an earlier version of the PR. It
might be useful for other processing steps, but we can add it whenever we
actually need it.

Em qua, 27 de jul de 2016 15:28, Diego de Araújo Martinez Camarinha <
[email protected]> escreveu:

In app/models/module_result.rb
#225 (comment)
:

@@ -30,8 +30,27 @@ def pre_order
@pre_order ||= pre_order_traverse(root).to_a
end

  • def level_order

Is this method actually called somewhere?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/mezuro/kalibro_processor/pull/225/files/eb4176d9b85a5550c3830c82d5831086525cfc63#r72495614,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAtnToL6_XMIcirJL-cE7_b2sPZhqGpuks5qZ6M4gaJpZM4JTGJy
.

@diegoamc
Copy link
Contributor

Congratulations, this is a very fine piece of work! 👏

I think we should remove #level_order along with its tests before merging. What do you think?

@danielkza
Copy link
Contributor

Good to me!

Em qua, 27 de jul de 2016 15:39, Diego de Araújo Martinez Camarinha <
[email protected]> escreveu:

Congratulations, this is a very fine piece of work! 👏

I think we should remove #level_order along with its tests before
merging. What do you think?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#225 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAtnTlvbhJFriNk2crsfQpK2KR5COL-1ks5qZ6XXgaJpZM4JTGJy
.

@danielkza
Copy link
Contributor

Done.

@diegoamc
Copy link
Contributor

Sweet 🍬

@diegoamc diegoamc merged commit 1746695 into master Jul 27, 2016
@diegoamc diegoamc deleted the optimize_aggregation_reviewed branch July 27, 2016 22:10
@rafamanzo rafamanzo mentioned this pull request Jul 27, 2016
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants