You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This race condition occured while running the integration tests here.
At first glance it seems like the celery worker correctly threw a DB exception in the compute_filter_similarity task, leading to the callback throwing an exception... so far this is fine. I note that the celery worker then tries to write to the runs table again to mark the run as failed:
UPDATE runs SET
state ='error',
time_completed = now()
WHERE ...
Over in the flask app we queue a task to delete any similarity files then start a db transaction to delete the run. It is conceivable that the delete files task finishes before a file is written.
Lastly in the flask app, we don't catch the DB integrity error and end up giving the user a 500 response :-/
Perhaps this could be addressed more like project deletion - by setting the run as deleted in the database instead of deleting the row.
Note I didn't get to see the high memory worker logs - see #418
Logs
Integration test failure:
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-3.5.1, py-1.8.0, pluggy-0.6.0
rootdir: /var/www, inifile:
plugins: celery-4.3.0
collected 371 items
entityservice/tests/test_admin.py .. [ 0%]
entityservice/tests/test_deletes.py ..............F
=================================== FAILURES ===================================
______ test_delete_runs_after_completed_run[result_type_number_parties2] _______
requests = <requests.sessions.Session object at 0x7ff4d5ebdfd0>
result_type_number_parties = ('permutations', 2)
def test_delete_runs_after_completed_run(requests, result_type_number_parties):
project, run_id = _create_data_linkage_run(requests, result_type_number_parties)
result_token = project['result_token']
wait_for_run_completion(requests, project, run_id, result_token, timeout=20)
> delete_run(requests, project['project_id'], run_id, result_token)
entityservice/tests/test_deletes.py:23:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
entityservice/tests/util.py:219: in delete_run
_check_delete_response(r)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
r = <Response [500]>
def _check_delete_response(r):
# Note we allow for a 403 because the project may already have been deleted
> assert r.status_code in {204, 403}, 'I received this instead: {}'.format(r.text)
E AssertionError: I received this instead: {
E "detail": "The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.",
E "status": 500,
E "title": "Internal Server Error",
E "type": "about:blank"
E }
entityservice/tests/util.py:201: AssertionError
--------------------- generated xml file: /mnt/results.xml ---------------------
===================== 1 failed, 16 passed in 46.97 seconds =====================
Flask:
2019-09-05T06:29:37.944268Z [debug ] Run status authorized using result_token token [entityservice.views.run.status] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:37.946954Z [debug ] Connecting to redis [entityservice.cache]
2019-09-05T06:29:37.948791Z [debug ] Looking up read only redis slave using sentinel protocol [entityservice.cache]
2019-09-05T06:29:38.686373Z [debug ] request run status [entityservice.views.run.status] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:38.708630Z [debug ] checking if provided authorization is a results_token [entityservice.views.auth_checks]
2019-09-05T06:29:38.727345Z [debug ] Run status authorized using result_token token [entityservice.views.run.status] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:38.954044Z [debug ] request to delete run [entityservice.views.run.description] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:38.956113Z [debug ] checking authorization of 'result_token' [entityservice.views.auth_checks]
2019-09-05T06:29:38.957657Z [debug ] approved request to delete run [entityservice.views.run.description] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:38.957912Z [debug ] Retrieving run details from database [entityservice.views.run.description] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:38.958936Z [debug ] Queuing task to remove similarities file from object store [entityservice.views.run.description] pid=796049f3ca46af151d53da399b119cd8370118a5c1139a9f rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
2019-09-05T06:29:39.040080Z [debug ] start db transaction to delete run [entityservice.database.deletion] rid=17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d
Exception on /projects/796049f3ca46af151d53da399b119cd8370118a5c1139a9f/runs/17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d [DELETE]
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
response = self.full_dispatch_request()
File "/usr/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/usr/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
raise value
File "/usr/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/usr/lib/python3.7/site-packages/connexion/decorators/decorator.py", line 73, in wrapper
response = function(request)
File "/usr/lib/python3.7/site-packages/connexion/decorators/validation.py", line 293, in wrapper
return function(request)
File "/usr/lib/python3.7/site-packages/connexion/decorators/response.py", line 104, in wrapper
response = function(request)
File "/usr/lib/python3.7/site-packages/connexion/decorators/decorator.py", line 44, in wrapper
response = function(request)
File "/usr/lib/python3.7/site-packages/connexion/decorators/parameter.py", line 219, in wrapper
return function(**kwargs)
File "/var/www/entityservice/views/run/description.py", line 38, in delete
delete_run_data(conn, run_id)
File "/var/www/entityservice/database/deletion.py", line 23, in delete_run_data
""", [run_id])
psycopg2.IntegrityError: update or delete on table "runs" violates foreign key constraint "permutation_masks_run_fkey" on table "permutation_masks"
DETAIL: Key (run_id)=(17a0b4d366e08d7eb199b87e51e40262caa388ba1342cc6d) is still referenced from table "permutation_masks".
Worker:
[2019-09-05 06:30:37,655: INFO/ForkPoolWorker-4] 2019-09-05T06:30:37.655109Z [info ] Counting contributing parties [entityservice.utils]
[2019-09-05 06:30:37,721: INFO/ForkPoolWorker-4] 2019-09-05T06:30:37.721443Z [info ] Parties where data is ready: 2 [entityservice.utils]
[2019-09-05 06:30:37,722: INFO/ForkPoolWorker-4] 2019-09-05T06:30:37.722682Z [info ] 2/2 parties have contributed clks [entityservice.utils]
[2019-09-05 06:30:37,729: INFO/ForkPoolWorker-4] Task entityservice.tasks.pre_run_check.check_for_executable_runs[84af2b70-1dc5-45aa-9acf-99557b71004d] succeeded in 0.07580971880815923s: None
[2019-09-05 06:30:37,937: INFO/MainProcess] Received task: entityservice.tasks.run.prerun_check[6889779e-8007-45b9-8fbf-68d1ca3571a4]
[2019-09-05 06:30:38,029: INFO/ForkPoolWorker-6] Task entityservice.tasks.run.prerun_check[6889779e-8007-45b9-8fbf-68d1ca3571a4] succeeded in 0.09161759214475751s: None
[2019-09-05 06:30:38,046: INFO/MainProcess] Received task: entityservice.tasks.comparing.compute_filter_similarity[5937a35e-1470-4b36-ad39-1c463c2240b1]
[2019-09-05 06:30:38,048: INFO/MainProcess] Received task: entityservice.tasks.base_task.celery_bug_fix[b580b54e-0d08-4b23-a646-70aae4900711]
[2019-09-05 06:30:38,051: INFO/ForkPoolWorker-2] Task entityservice.tasks.base_task.celery_bug_fix[b580b54e-0d08-4b23-a646-70aae4900711] succeeded in 0.0018211179412901402s: [0, None, None]
[2019-09-05 06:30:38,056: ERROR/ForkPoolWorker-8] Chord callback for 'a4e9c9a2-1d50-4a58-97b1-dfe0a001c179' raised: ChordError("Dependency 5937a35e-1470-4b36-ad39-1c463c2240b1 raised DBResourceMissing('unknown unknown project or run not found in database not found in database not found in database')")
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/celery/app/trace.py", line 385, in trace_task
R = retval = fun(*args, **kwargs)
File "/var/www/entityservice/tasks/base_task.py", line 87, in __call__
return super(TracedTask, self).__call__(*args, **kwargs)
File "/usr/lib/python3.7/site-packages/celery/app/trace.py", line 648, in __protected_call__
return self.run(*args, **kwargs)
File "/var/www/entityservice/tasks/comparing.py", line 123, in compute_filter_similarity
raise DBResourceMissing("project or run not found in database")
entityservice.errors.DBResourceMissing: unknown project or run not found in database not found in database
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/celery/backends/redis.py", line 399, in on_chord_part_return
callback.delay([unpack(tup, decode) for tup in resl])
File "/usr/lib/python3.7/site-packages/celery/backends/redis.py", line 399, in <listcomp>
callback.delay([unpack(tup, decode) for tup in resl])
File "/usr/lib/python3.7/site-packages/celery/backends/redis.py", line 352, in _unpack_chord_result
raise ChordError('Dependency {0} raised {1!r}'.format(tid, retval))
celery.exceptions.ChordError: Dependency 5937a35e-1470-4b36-ad39-1c463c2240b1 raised DBResourceMissing('unknown unknown project or run not found in database not found in database not found in database')
[2019-09-05 06:30:38,060: INFO/MainProcess] Received task: entityservice.tasks.base_task.on_chord_error[d21e2ea8-e1c0-4148-b058-75275446eac8]
[2019-09-05 06:30:38,060: INFO/ForkPoolWorker-8] Task entityservice.tasks.comparing.compute_filter_similarity[5937a35e-1470-4b36-ad39-1c463c2240b1] raised expected: DBResourceMissing('unknown project or run not found in database not found in database')
[2019-09-05 06:30:38,063: WARNING/ForkPoolWorker-5] 2019-09-05T06:30:38.063739Z [warning ] Marked run as failure [entityservice.tasks]
[2019-09-05 06:30:38,064: INFO/ForkPoolWorker-5] Task entityservice.tasks.base_task.on_chord_error[d21e2ea8-e1c0-4148-b058-75275446eac8] succeeded in 0.0030970340594649315s: None
The text was updated successfully, but these errors were encountered:
Given the pr for supporting run deletion introduced a cascading delete on the permutation mask table, which is on the flask error, it could be the database for the integration tests of the develop branch wasn't cleanly initiated eg due to a persistent volume.
This race condition occured while running the integration tests here.
At first glance it seems like the celery worker correctly threw a DB exception in the
compute_filter_similarity
task, leading to the callback throwing an exception... so far this is fine. I note that the celery worker then tries to write to the runs table again to mark the run as failed:Over in the flask app we queue a task to delete any similarity files then start a db transaction to delete the run. It is conceivable that the delete files task finishes before a file is written.
Lastly in the flask app, we don't catch the DB integrity error and end up giving the user a 500 response :-/
Perhaps this could be addressed more like project deletion - by setting the run as
deleted
in the database instead of deleting the row.Note I didn't get to see the high memory worker logs - see #418
Logs
Integration test failure:
Flask:
Worker:
The text was updated successfully, but these errors were encountered: