remove_after no longer respects task dependencies #129

Comments
The old behaviour already did not work when there were too many jobs in the queue, unfortunately. But I don't quite understand your use case here. We only delete …
I use Minion for flows, and use the task tree to render an overview of how far the process is, like this:

flowchart LR
    task_a["Task A"] --> task_b["Task B"]
    task_b --> task_c["Task C"]
    task_c --> task_d["Task D"]
If any task in the middle finishes before … In essence, I only want …
Job …
Not that it matters much, the old query falls apart with a few million jobs in the queue. We couldn't bring it back even if we wanted to.
At my company we are running Minion with a throughput of roughly ~500k tasks a day, and ~1M tasks in the table at most times. For some reason we never did a PR for the fix we made back on 10.22, when we also noticed problems with a large amount of jobs in the queue. Our solution still works:

WITH interesting_jobs AS (
SELECT
parents,
state
FROM
minion_jobs
WHERE
cardinality(parents) > 0
AND state != 'finished' )
DELETE FROM minion_jobs AS j
WHERE (finished <= NOW() - INTERVAL '1 second' * 172800
AND state = 'finished'
AND NOT EXISTS ( SELECT 1 FROM interesting_jobs
WHERE parents @> ARRAY[j.id] AND state != 'finished'))
OR (expires <= NOW() AND state = 'inactive');

It's fairly fast as well:

EXPLAIN ANALYZE WITH interesting_jobs AS (
SELECT
parents,
state
FROM
minion_jobs
WHERE
cardinality(parents) > 0
AND state != 'finished' )
SELECT FROM minion_jobs AS j
WHERE (finished <= NOW() - INTERVAL '1 second' * 172800
AND state = 'finished'
AND NOT EXISTS ( SELECT 1 FROM interesting_jobs
WHERE parents @> ARRAY[j.id] AND state != 'finished'))
OR (expires <= NOW() AND state = 'inactive');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on minion_jobs j (cost=23066.17..346802608.51 rows=67548 width=0) (actual time=463.859..28426.737 rows=11952 loops=1)
Recheck Cond: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state)) OR (state = 'inactive'::minion_state))
Rows Removed by Index Recheck: 46175
Filter: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state) AND (NOT (SubPlan 1))) OR ((expires <= now()) AND (state = 'inactive'::minion_state)))
Rows Removed by Filter: 26650
Heap Blocks: exact=33423
-> BitmapOr (cost=23066.17..23066.17 rows=135096 width=0) (actual time=90.466..90.467 rows=0 loops=1)
-> BitmapAnd (cost=23044.85..23044.85 rows=135096 width=0) (actual time=89.573..89.573 rows=0 loops=1)
-> Bitmap Index Scan on minion_jobs_finished (cost=0.00..2833.77 rows=153245 width=0) (actual time=30.673..30.673 rows=152942 loops=1)
Index Cond: (finished <= (now() - '48:00:00'::interval))
-> Bitmap Index Scan on minion_jobs_state_priority_id_idx (cost=0.00..20177.06 rows=879551 width=0) (actual time=56.709..56.709 rows=883147 loops=1)
Index Cond: (state = 'finished'::minion_state)
-> Bitmap Index Scan on minion_jobs_state_priority_id_idx (cost=0.00..4.43 rows=1 width=0) (actual time=0.893..0.893 rows=3 loops=1)
Index Cond: (state = 'inactive'::minion_state)
SubPlan 1
-> Bitmap Heap Scan on minion_jobs (cost=289.45..6250.75 rows=23 width=0) (actual time=0.731..0.731 rows=1 loops=38599)
Recheck Cond: ((parents @> ARRAY[j.id]) AND (cardinality(parents) > 0))
Filter: ((state <> 'finished'::minion_state) AND (state <> 'finished'::minion_state))
Rows Removed by Filter: 0
Heap Blocks: exact=26678
-> BitmapAnd (cost=289.45..289.45 rows=1663 width=0) (actual time=0.730..0.730 rows=0 loops=38599)
-> Bitmap Index Scan on minion_jobs_parents_idx (cost=0.00..41.90 rows=4989 width=0) (actual time=0.001..0.001 rows=1 loops=38599)
Index Cond: (parents @> ARRAY[j.id])
-> Bitmap Index Scan on minion_jobs_cardinality_parents (cost=0.00..247.30 rows=332571 width=0) (actual time=1.046..1.046 rows=28013 loops=26647)
Planning Time: 0.401 ms
Execution Time: 28427.121 ms
(26 rows)

We are currently moving to pg16, and found something interesting. The above query finishes just a tiny bit faster, BUT the most surprising thing is that the original code in Minion 10.22 actually runs fast. Running the query on the same data as above:

EXPLAIN ANALYZE SELECT FROM minion_jobs AS j
WHERE (finished <= NOW() - INTERVAL '1 second' * 172800 AND NOT EXISTS (
SELECT 1 FROM minion_jobs WHERE parents @> ARRAY[j.id] AND state != 'finished'
) AND state = 'finished') OR (expires <= NOW() AND state = 'inactive');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on minion_jobs j (cost=23079.63..44015254.24 rows=67855 width=0) (actual time=106.526..240.670 rows=12766 loops=1)
Recheck Cond: (((finished <= (now() - '48:00:00'::interval)) AND (state = 'finished'::minion_state)) OR (state = 'inactive'::minion_state))
Rows Removed by Index Recheck: 46175
Filter: (((finished <= (now() - '48:00:00'::interval)) AND (NOT (SubPlan 1)) AND (state = 'finished'::minion_state)) OR ((expires <= now()) AND (state = 'inactive'::minion_state)))
Rows Removed by Filter: 26650
Heap Blocks: exact=33512
-> BitmapOr (cost=23079.63..23079.63 rows=135711 width=0) (actual time=100.451..100.452 rows=0 loops=1)
-> BitmapAnd (cost=23058.24..23058.24 rows=135711 width=0) (actual time=99.631..99.632 rows=0 loops=1)
-> Bitmap Index Scan on minion_jobs_finished (cost=0.00..2847.00 rows=153943 width=0) (actual time=32.276..32.276 rows=153756 loops=1)
Index Cond: (finished <= (now() - '48:00:00'::interval))
-> Bitmap Index Scan on minion_jobs_state_priority_id_idx (cost=0.00..20177.06 rows=879551 width=0) (actual time=64.983..64.983 rows=883147 loops=1)
Index Cond: (state = 'finished'::minion_state)
-> Bitmap Index Scan on minion_jobs_state_priority_id_idx (cost=0.00..4.43 rows=1 width=0) (actual time=0.819..0.819 rows=3 loops=1)
Index Cond: (state = 'inactive'::minion_state)
SubPlan 1
-> Bitmap Heap Scan on minion_jobs (cost=42.04..16158.52 rows=591 width=0) (actual time=0.001..0.001 rows=1 loops=39416)
Recheck Cond: (parents @> ARRAY[j.id])
Filter: (state <> 'finished'::minion_state)
Rows Removed by Filter: 0
Heap Blocks: exact=26678
-> Bitmap Index Scan on minion_jobs_parents_idx (cost=0.00..41.90 rows=4989 width=0) (actual time=0.001..0.001 rows=1 loops=39416)
Index Cond: (parents @> ARRAY[j.id])
Planning Time: 0.322 ms
Execution Time: 241.096 ms
(24 rows)

Ok, while collecting this information and writing, another solution presented itself:

EXPLAIN ANALYZE SELECT FROM minion_jobs
WHERE id IN (
(
SELECT id FROM minion_jobs WHERE state = 'finished' AND finished <= NOW() - INTERVAL '1 second' * 172800
EXCEPT
SELECT unnest(parents) AS id FROM minion_jobs WHERE state != 'finished'
)
UNION
SELECT id FROM minion_jobs WHERE state = 'inactive' AND expires <= NOW());
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=256494.17..285054.46 rows=138181 width=0) (actual time=427.742..482.616 rows=15610 loops=1)
Hash Cond: (minion_jobs.id = "*SELECT* 1".id)
-> Index Only Scan using minion_jobs_pkey on minion_jobs (cost=0.42..25941.71 rows=997714 width=8) (actual time=0.043..35.708 rows=999174 loops=1)
Heap Fetches: 11
-> Hash (cost=254766.48..254766.48 rows=138181 width=8) (actual time=417.175..417.176 rows=15610 loops=1)
Buckets: 262144 Batches: 1 Memory Usage: 2658kB
-> HashAggregate (cost=252305.13..254766.48 rows=138181 width=8) (actual time=415.115..415.903 rows=15610 loops=1)
Group Key: "*SELECT* 1".id
Planned Partitions: 4 Batches: 1 Memory Usage: 3857kB
-> Append (cost=0.00..244877.90 rows=138181 width=8) (actual time=411.209..413.750 rows=15610 loops=1)
-> Result (cost=0.00..244178.55 rows=138180 width=8) (actual time=411.208..413.247 rows=15610 loops=1)
-> HashSetOp Except (cost=0.00..242796.75 rows=138180 width=12) (actual time=411.207..412.678 rows=15610 loops=1)
-> Append (cost=0.00..239497.23 rows=1319810 width=12) (actual time=0.210..398.082 rows=69663 loops=1)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..111576.08 rows=138180 width=12) (actual time=0.210..224.810 rows=42257 loops=1)
-> Seq Scan on minion_jobs minion_jobs_1 (cost=0.00..110194.28 rows=138180 width=8) (actual time=0.209..222.820 rows=42257 loops=1)
Filter: ((state = 'finished'::minion_state) AND (finished <= (now() - '48:00:00'::interval)))
Rows Removed by Filter: 956917
-> Subquery Scan on "*SELECT* 2" (cost=0.00..121322.10 rows=1181630 width=12) (actual time=0.046..170.768 rows=27406 loops=1)
-> ProjectSet (cost=0.00..109505.80 rows=1181630 width=8) (actual time=0.046..169.658 rows=27406 loops=1)
-> Seq Scan on minion_jobs minion_jobs_2 (cost=0.00..102711.43 rows=118163 width=13) (actual time=0.040..157.409 rows=116027 loops=1)
Filter: (state <> 'finished'::minion_state)
Rows Removed by Filter: 883147
-> Index Scan using minion_jobs_state_priority_id_idx on minion_jobs minion_jobs_3 (cost=0.42..8.45 rows=1 width=8) (actual time=0.012..0.012 rows=0 loops=1)
Index Cond: (state = 'inactive'::minion_state)
Filter: (expires <= now())
Rows Removed by Filter: 3
Planning Time: 0.520 ms
Execution Time: 484.704 ms
(28 rows)

This runs fast on both pg13 and pg16 (the above is from pg16). If interested, I will make a proper PR for the latest solution we have.
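For completeness: the statement above is the SELECT used for EXPLAIN ANALYZE, and the actual cleanup would be its DELETE form. The PR itself is not quoted in this thread, so the following is only a sketch of what it would presumably look like:

-- Hypothetical DELETE form of the EXCEPT/UNION approach above;
-- the actual PR query may differ.
DELETE FROM minion_jobs
WHERE id IN (
  (
    SELECT id FROM minion_jobs
    WHERE state = 'finished' AND finished <= NOW() - INTERVAL '1 second' * 172800
    EXCEPT
    SELECT unnest(parents) AS id FROM minion_jobs WHERE state != 'finished'
  )
  UNION
  SELECT id FROM minion_jobs WHERE state = 'inactive' AND expires <= NOW());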
I'm not opposed to a PR if there really are no performance issues with the solution. But I'm still quite sceptical about the use cases. Do we really need this? If we treat this as a new feature, can we find two unique use cases for it?
For the first use case I will count debugging the whole chain of jobs that led to a failure somewhere in the middle. Do we have a second?
I once built a provisioning flow for a telco where we had automated the whole process of moving a cellphone number between providers and becoming a new customer using Minion. The whole process could take anywhere from 1 day to several months to complete, depending on notice periods. Having the whole flow stored allowed us to render it in the "service agent UI", making it quite easy for everyone to see when the different tasks were performed and how many tasks were left before the flow is done. So the second use case would be the possibility of a "flow overview" (for non-technical people). This together with the debugging case are my two favourite things about Minion.
At one client of ours it could also be pretty catastrophic if parts of long-running flows were deleted, as the later parts of a flow would compile an end dataset based on data stored on the individual jobs in the flow. Missing parts of a chain would thus represent incomplete data. Naturally these datasets could be stored elsewhere, but that has not been required, since Minion has made this storage of job results so easy and convenient.
One thing to consider: the query in 10.22, the different query in 10.23, and both versions of the query that @HEM42 posted are equivalent to each other if we ignore the performance issues. The code introduced in 10.27 thus breaks POLA (the principle of least astonishment), hard.
I suppose that's good enough for two use cases. Now we just need a solution that 100% won't cause performance issues.
I was just coming to submit a bug report on this. The "old" routine with the UNION ALL causes our entire queue to get into a broken state and locks the entire queue.
Just as an aside to the OP's request, I had temporarily fixed this by simply creating a loop to delete with a limit. This allowed full deletion in "chunks". The following was a quick hack to get my queue back in working order, but could probably be refined some:
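The hack itself did not survive in this copy. A minimal sketch of the chunked-delete idea, assuming Minion's stock Pg schema and the default two-day retention window, might look like the following, run in a loop until no rows are deleted:

-- Sketch only (the original snippet was not preserved): delete expired
-- finished jobs in chunks; the 10000 batch size is an arbitrary assumption.
DELETE FROM minion_jobs
WHERE id IN (
  SELECT id FROM minion_jobs
  WHERE state = 'finished'
    AND finished <= NOW() - INTERVAL '1 second' * 172800
  LIMIT 10000
);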
As an aside, and a quick question: why does remove_after not delete failed jobs?
Speaking from our own situation at …
That's certainly understandable, though I would think critical failures would be handled within a reasonable amount of time, and the data at some point would just be cruft hanging around in the table forever. Maybe this discussion belongs in another place, but I think it should be addressable in the framework somehow.
Discussion is always good, and yes, perhaps this could be a discussion on IRC, but I don't really mind having it here, as it'll then be discoverable for others at a later time.

As to your point about a "reasonable amount of time", I absolutely agree. The problem is that what a reasonable time is does vary quite a bit. For some of our workloads a failed minion-job should prompt immediate action, as in "drop whatever you are doing, all hands on deck"-type immediate. But I do know of at least one entity that has minion-jobs that take multiple hours to complete, sometimes fail, and (due to some business domain-specific weirdness) cannot be retried more than once a week. The latter is a fringe case, yes, but there are users who use Minion for that kind of thing. For us at …

Your point about "cruft hanging around in the table forever" is something that I also absolutely agree would be bad. Though I would say that if you allowed that to happen then you have bad data hygiene. Just accumulating a lot of stuff (failed or successful) is generally something I consider bad practice unless you have a specific use for that data later.

I guess what I am saying is that failed jobs are something that in my mind should be addressed specifically. Exactly how this is done could of course vary from company to company. Some might do like us and have technical personnel investigate and either remove, retry immediately or schedule a delayed retry. Others might have workloads that are expected to fail, often do, and thus have scheduled tasks that analyze the current failed tasks and purge those that are considered "failed but OK".

All that being said, I am very curious to hear how other people/orgs use Minion and how they deal with these kinds of issues.
Are there even multiple use cases for failed jobs that should just vanish that couldn't be handled by an …
As I'm on-boarding this, in my scenario I'm mainly using it for temporary visibility. I'm using Minion to handle task automation in multiple environments running against about 12,000 servers globally. The busiest environment is ~8000-9000 servers and is taking ~150,000 jobs per hour. The finalized job states and results get stored in a long-term database and associated with different applications. Any job failures are handled in an eval/try-catch block and incidents are generated in another system. The sheer number of potential job failures due to one reason or another is too much to monitor individually or at the time of failure. For my use case I set up a separate clean-up routine that runs periodically in my scheduler and validates states between the Minion queue and my job schedule tables in another database. At the same time I run the following:
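The query itself did not survive in this copy; from the description below, it was presumably the standard remove_after cleanup extended to the failed state, along the lines of this sketch (Minion's Pg backend sets the finished timestamp for failed jobs too, so the same window applies):

-- Sketch, not the poster's literal query: purge failed jobs past the
-- retention window in addition to finished and expired ones.
DELETE FROM minion_jobs
WHERE (state IN ('finished', 'failed')
       AND finished <= NOW() - INTERVAL '1 second' * 172800)
   OR (expires <= NOW() AND state = 'inactive');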
This is the same query as in the repair routine, except it also takes care of failed jobs; it suits my needs. I feel like this should be the default, with an option to retain failed records with something like …
If an exception hasn't been handled or extracted into a different workflow in the 2-day window, is it really a critical problem?
A breaking change to clip failed tasks is completely unacceptable. Maybe an opt-in feature, but if you really don't care about the tasks failing, then why don't you just catch the errors and make the jobs succeed?
Weekends happen. This is not about critical errors, but errors that require user intervention.
Understood. I didn't consider it being a breaking change; it didn't cross my mind.

In my use case, I report the errors outside the queue for our Operations Center and product "owners" to review, and I review/follow the queue. Anything I would miss in the queue failures that is a critical application problem is reported to the application. Most things I report as failures in the queue are issues I would need to handle as the Minion owner. I have considered just marking everything as finished, but I've opted not to because I like to do some reporting off those statistics, and it gives me a quick visual of what's happening. If I see a sudden uptick of failures on the queue UI, it's probably indicative of a network issue or some other global issue. For instance, this week I had a slew of DNS issues where some SSH tasks were not able to make connections. I can see the errors spike from the queue, and if it's a long-term thing they are also (in most cases) reported off somewhere else as well.

I was bringing this up because I have been working on some capacity management in at least one of my queues, where at differing levels of items in the minion_jobs table the performance of the queue begins to degrade. The issue is closed, but I would like to suggest reviewing my last post (Minion becomes slow, admin unusable with processed jobs between 2-3 million records) that I made this morning concerning the stats query from the backend, which seems to be directly related to how many items are in the table. I've added functionality to take care of this for me.

I don't advocate for a breaking change. Perhaps an option like $minion->repair({ discard => 'all' }) or something similar would be a better solution if one were to be adopted.
After the performance update in v10.27, Minion no longer respects its dependencies as stated in the documentation.
This means that a tree of dependent tasks might be severed if tasks in the "middle" finish before the top and bottom of the tree.
I guess this is on purpose, but it totally breaks the excellent "flow" use case that Minion used to support.
Would it be possible to add a flag that toggles between the new and the old behaviour?
Expected behavior
This is what was done previously in Pg.pm:
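The code block did not survive in this copy. Based on the 10.22 query quoted earlier in the thread (shown there in EXPLAIN form), it was presumably along these lines, with 172800 seconds standing in for the configured remove_after value:

-- Reconstructed from the 10.22 query quoted above; not a verbatim copy of Pg.pm.
DELETE FROM minion_jobs AS j
WHERE (finished <= NOW() - INTERVAL '1 second' * 172800 AND NOT EXISTS (
  SELECT 1 FROM minion_jobs WHERE parents @> ARRAY[j.id] AND state != 'finished'
) AND state = 'finished') OR (expires <= NOW() AND state = 'inactive');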
Actual behavior
This is the current code in Pg.pm, which deletes all finished jobs:
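That snippet is also missing here; going by the description, it presumably drops the NOT EXISTS dependency check, roughly:

-- Sketch of the post-10.27 behaviour as described (no check for unfinished
-- dependents); not a verbatim copy of Pg.pm.
DELETE FROM minion_jobs
WHERE (finished <= NOW() - INTERVAL '1 second' * 172800 AND state = 'finished')
   OR (expires <= NOW() AND state = 'inactive');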