Overhaul database schema #215
base: master
Conversation
Force-pushed from 980194e to 90a710d
Have you tested this? Could you pull 4d0982e out into a new PR? It'd be nice because I'm thinking about creating a new branch (`dev`) for these changes.
Looking at the changes, I have to ask you the same for these three changes:
Does it fix #212?
I don't think that's necessary. We can use a tag to mark the scenarios that require processing and set their cleaner strategy, but they are actually the majority, so I just switched the default. I haven't tested it yet.
Sure.
AFAIK none of the changes should be breaking. I made sure that any invalid data is cleaned up in the migration, and that data couldn't be accessed anyway: either it was improperly referenced, or it contained duplicates where the code assumes only one record exists (and acts accordingly). I would prefer to fork the current master into a v1 branch instead of making the dev branch the fork; otherwise we'll make outdated code the most prominent version.
It does, AFAIK. I actually had to fix a bunch of integrity violations, some of which were order-dependent.
I'm being conservative about calling those changes breaking because of their side effects on current installations, not the API: there is a small risk of data loss, or of updates that require human action. I agree about leaving the up-to-date code on master and forking it to v1. About #212: when testing the separate PRs I'll also try to figure out at which point it got fixed. Great work and suggestions. Thanks.
About splitting this up more: the cleaning, foreign key, and index changes are somewhat tied together. The cleaning migration only exists to allow the foreign keys to be created; some of the indexes are in fact unique, and others exist to provide the opposite side of some foreign key. It will also be quite complicated to ensure the tests are reliable at each separate step: fixing them all was quite a bit of work in itself, and I never tested with any schema other than the final version. Also, considering we will need to test with the production DB backups, doing it piecemeal seems like more work than it's really worth.
OK, that mostly makes sense. Can we at least extract the timestamp removal? As one final try, is it possible to extract the new indexes related to the performance issues? I'd like to test them separately.
Also, giving the changes a second look, could you be so kind as to make the tree-walking method modifications into a separate PR as well? Sorry about the annoyance.
Finally, this branch now conflicts with master.
Should be fine.
I think the easiest way to do that is probably to just edit the migrations and comment out the foreign keys.
These actually were not meant to be there; I think I accidentally added them by being careless with git-stash. Very observant, thanks! 👀
I'll rebase and make the other changes in one go.
Force-pushed from 4f1cc05 to c0f4936
Rebased onto master, updated with the split of the timestamp changes (#219) and removal of the accidental module result change. I tested the migrations again.
I have cherry-picked b95d15f into master and compared the test run times. I'm still unsure if the difference is significant given how often we run acceptance tests and that KalibroProcessor does not seem to be growing much in the short/mid term. How hard would it be to turn this truncation strategy into a tag used only where needed? That would ensure minimal impact on test run time and let the application grow, if necessary, without this becoming an issue. And probably make that another PR :/ What do you think?
Using a tag to select the strategy seems like a good idea, but I think splitting up the strategy change may not be necessary - it doesn't matter until the integrity constraints are added. |
I implemented the selective usage of the truncation strategy. Any further comments, @mezuro/core?
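For reference, a minimal sketch of what that selective setup can look like with Cucumber hooks (the file path and tag name are illustrative, not necessarily what the PR uses):

```ruby
# features/support/database_cleaner.rb
require 'database_cleaner'

# Fast default: wrap each scenario in a rolled-back transaction.
Before do
  DatabaseCleaner.strategy = :transaction
end

# Scenarios tagged @truncation spawn workers outside the test transaction,
# so their data must actually be committed; truncate tables afterwards.
Before('@truncation') do
  DatabaseCleaner.strategy = :truncation
end

Before { DatabaseCleaner.start }
After  { DatabaseCleaner.clean }
```

Cucumber runs `Before` hooks in definition order, so the tagged hook overrides the default only for tagged scenarios.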
Force-pushed from 02a9fc3 to 402e0cb
```ruby
def self.up
  # Unset project reference for repositories with non-existing projects
  execute <<-SQL
    UPDATE repositories AS r
    SET project_id = NULL
    WHERE r.project_id IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM projects AS p WHERE p.id = r.project_id)
  SQL
```
Could this generate inconsistencies in Prezento's ownerships? Should we do something similar there?
I don't see how it could matter considering this only unsets the references to records that do not exist. If Prezento had those same references they would already be broken if anyone tried to actually use them for anything.
IMO we should have proper database constraints everywhere, but adding them after a long time is quite a bit harder. I don't know whether it's worth the extra work. I won't have the free time to do the same for the other services.
```diff
@@ -0,0 +1,73 @@
+class CleanInconsistencies < ActiveRecord::Migration
```
Should this script be a migration? We are not changing the database structure, just removing records.
We are semantically changing the database structure: from allowing inconsistencies to not allowing them. We can run this manually, but that will mean anyone using the software has to do the same; otherwise the next migrations won't work.
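For concreteness, here is a sketch of the kind of follow-up migration that depends on this cleanup, reusing the repositories/projects pair from the hunk above and assuming `add_foreign_key` is available (Rails 4.2+ or the foreigner gem):

```ruby
class AddRepositoriesProjectForeignKey < ActiveRecord::Migration
  def change
    # This raises if any repositories.project_id still points at a missing
    # project, which is exactly why the cleanup has to run first.
    add_foreign_key :repositories, :projects
  end
end
```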
I don't think this script is what prevents us from creating inconsistencies. It merely cleans things up so that other scripts (migrations) can change the structure (by adding indexes and foreign keys, for example).
Yes, the next scripts add the constraints that prevent the inconsistencies, but they would fail if inconsistencies already exist. If you can suggest a better method to handle it, I'll be happy to use it :)
Maybe a rake task could achieve similar goals, with warning messages for those running the migrations. Skipping the migrations if they would fail would be good too.
That way, we don't force anyone to erase data without warning them first.
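Something like this hypothetical task is what I have in mind (the task name and messages are only illustrative):

```ruby
# lib/tasks/clean_inconsistencies.rake
namespace :db do
  desc 'Delete records that would violate the new integrity constraints'
  task clean_inconsistencies: :environment do
    puts 'WARNING: this task deletes orphaned and duplicate records.'
    print 'Type "yes" to continue: '
    abort 'Aborted; nothing was deleted.' unless STDIN.gets.to_s.strip == 'yes'
    # ... the cleanup queries from the migration would go here ...
  end
end
```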
Any data that the migration erases was already unusable and/or unreachable anyway. It existed purely due to bugs and poor validations, and can't even be retrieved through the APIs (such as multiple metric results for the same metric in the same module result).
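To illustrate that duplicate case, the cleanup boils down to something like this (the column names are assumptions on my part, not copied from the migration):

```ruby
execute <<-SQL
  -- Keep only the oldest metric result per (metric configuration, module result)
  DELETE FROM metric_results AS mr
  USING metric_results AS keep
  WHERE keep.metric_configuration_id = mr.metric_configuration_id
    AND keep.module_result_id = mr.module_result_id
    AND keep.id < mr.id;
SQL
```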
That's true from KalibroClient's point of view. If we were mining our database for statistics or some kind of study, suddenly losing data would make us desperate! xD
Notice that in those cases it may make sense not to erase a processing just because it is not associated with a repository, for example.
I think that because we are changing data, not structure, this should not be a migration. Adding the extra step is surely more bureaucratic, but I think it's good practice to at least warn anyone who uses this software before erasing anything. Maybe a rake task is not the best option, but I don't think a migration is either (even though the Rails guides say migrations can be used to add or modify data).
Anyway, if I'm the only one not happy with this, we can move on. 😄
> That's true from KalibroClient's point of view. If we were mining our database for statistics or some kind of study, suddenly losing data would make us desperate! xD
It's true from the point of view of the model that the database is an expression of. If we had made a more faithful conversion, we would have had foreign keys from the start, and the data being deleted now would never have existed. The only reason it does exist is that we failed to keep garbage out. The fact that it got in doesn't make it not garbage.
What meaningful statistics can you gather from data that violates the semantic constraints of your model and should never have existed? In the particular cases this migration deletes, you can't even reference the context that would allow you to make any sense of them.
What worth are process times if you can't even know what was processed? What worth is a kalibro module with no associated module result? What worth is a metric result that will never be looked at, and that is possibly complete garbage because we generated the wrong name and associated it with the wrong module result?
> I think that because we are changing data, not structure, this should not be a migration.
I don't see why that would be the case. Migrations are a mechanism for evolving a database progressively without throwing it away; nothing makes them specific to structure. Also, changing structure often entails changing data in order to do it properly. This is just another case of that, and it's ugly because the situation is ugly.
> Maybe a rake task is not the best option, but I don't think a migration is either
The migration is the least bad option: it is at least run in a controlled environment, and it is part of the known upgrade steps that any admin must run. How would someone upgrading find out that they need to run the rake task, and how to run it? Could we make it work in an automated process without requiring manual intervention?
If you processed a repository and then deleted it, all the relevant data would still be there. Suppose we had 500 repositories processed and then all of them were deleted. If you looked at their process times and saw that, say, the aggregation time was surprisingly high, would you dismiss the data just because you can't see the associated repository?
I agree that changing structure many times entails changing data. But I think it's just wrong to delete data without even notifying the users. It's not a good policy, because the data is not yours, even though you may think it's garbage. In my opinion, we have to make users aware of it, even if it means the process is not entirely automatic. I would be happier if we could at least make the migration require some user input, like "This migration is going to delete this kind of data, OK?".
I've come to the conclusion that the migration is not a bad option. However, deleting data without informing the users, in my opinion, definitely is.
@diegoamc I understand your concerns, but I am confident that everything this migration deletes is data that could not be accessed at all, or that doesn't even make sense. In your particular example, it's hard to judge whether a processing took long when the repository used is unknown. The fact that a processing does not make sense without its repository is encoded in the model: these changes just enforce, at the database level, what was already supposed to happen.
If there is any data being deleted in a way that is not specified in the model, we should absolutely fix the migration or the model (whichever is wrong).
> But I think it's just wrong to delete data without even notifying the users. It's not a good policy, because the data is not yours
We have to keep in mind that we had the responsibility to create valid data, and we failed at it. We have the knowledge to separate what is garbage (according to the model) from what is not, such that the remaining data can be trusted from now on. I'm not proposing to delete old data or anything like that, but only data that should never have existed at all. No value judgements can be made about its contents, because they can be absolute nonsense.
> I would be happier if we could at least make the migration require some user input, like "This migration is going to delete this kind of data, OK?".
I don't know of a good way to do it, since we have no guarantee that the migration is interactive - it can be running in an automated way.
I understood your point too, but I can't accept the PR. I wish I had the time right now to sit for a while, study and propose another solution.
If nobody else is unhappy about it, please accept it :)
I don't understand why you closed this. There is no need for a consensus to accept a PR. As I said, if I'm the only one unhappy, please accept it!
I guess sometimes we need a little pushback to spur creativity. I changed the migration to take backups of all the deleted data. Tell me what you think, @diegoamc.
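Roughly, each destructive statement in the migration is now preceded by a snapshot along these lines (the backup table name and join columns are illustrative, not the migration's actual ones):

```ruby
execute <<-SQL
  -- Snapshot the rows about to be deleted so they can be inspected later.
  CREATE TABLE orphan_metric_results_backup AS
    SELECT mr.*
    FROM metric_results AS mr
    LEFT JOIN module_results AS mod_r ON mod_r.id = mr.module_result_id
    WHERE mod_r.id IS NULL;
SQL
```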
The database schema was lacking many integrity checks and indexes. Correct it by first applying a migration that removes all old/stale data, then creating those indexes. The driving reason for this is the very slow performance of processing (especially aggregation) on the new mezuro.org servers. It will hopefully remove (or at least heavily reduce) the superlinear slowdown as the number of metrics rises, as observed in #207.
Hardcoded IDs only worked because we had no referential integrity constraints. Remove them and clean up the whole model creation code.
The transaction strategy was only working because we didn't have referential integrity checks. The job workers are not guaranteed to stay in the same transaction as the tests - the PHPMD tests, for example, spawned a new thread - and therefore we must use truncation instead, so that all the models become visible to the workers in time.
PostgreSQL does not create these indexes automatically, but we want them when deletions cascade to other tables - otherwise we'd have to scan the whole referencing table at that time. Some of them are also used for searching, such as the processing_id in process_times, or the project_id in repositories.
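In migration terms, that amounts to index additions like these (using the two examples just mentioned):

```ruby
# Index the referencing columns so cascaded deletes and frequent lookups
# don't force sequential scans of the whole table.
add_index :process_times, :processing_id
add_index :repositories, :project_id
```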
A Processing is created unconditionally, and therefore, always needs to be mocked, even when an error will immediately be thrown.
Forgot to rebase to master, done now.
I've tried to add just the indexes. I think we should hold this until we have the proper discussions about the data consistency restrictions (#228) and the performance gains from the indexes (#229).
Overhaul the database schema by adding indexes and foreign key constraints, cleaning up stale data, and fixing tests that relied on hardcoded/non-existing IDs.
The changes are a bit long but relatively straightforward. I ended up changing the runner steps layout a lot because it was very hard to understand what was going on and how the models were related, and understanding that proved necessary to remove the hardcoding issues.
The constraints also showed that running tests with the transaction DatabaseCleaner strategy was prone to failure: the processing jobs don't run in the same transaction as Cucumber and won't see much of the created data. Switching to the truncation strategy fixes it.
I tested the data cleanup migration quite extensively, using the database dump from production at mezuro.org. It would be wise to perform a backup anyway and be ready to restore it if anything goes wrong.