First computation on a new cluster is extremely slow #126
Comments
Very interesting question for investigation. I have seen a very weird ~20 sec slowdown, but only for the first service bus enqueue call.
This is probably just some Azure weirdness, right? E.g. taking time to put VMs in the right places or something?
I think so too. I think we need to test whether this slowdown also occurs when the first computation is a simple hello world.
Ok, I think I know what's going on here. On a fresh Azure cluster, the following computation
cluster.CreateProcess(Cloud.ParallelEverywhere(cloud { return 42 }))
takes 30 seconds on the first run and 2 seconds on the second. Also, running Isaac's example after that results in normal execution times. I think this is attributable to AppDomain isolation in worker instances: worker AppDomains are lazily instantiated, and they can take quite a bit of time to start up. You can observe this by looking at individual worker logs with
cluster.Workers.[0].ShowSystemLogs()
which upon inspection reveal:
This indicates that the first work item dequeue can take up to 20 seconds before it actually completes, even if it is a trivial
Any way we can force each worker to eagerly start up and instantiate the AppDomain?
I'm not sure this would help much. Thespian, which uses the same underlying implementation, does not demonstrate the same degree of delay when sending a first computation. So it seems likely that this is caused by the time it takes to download the initial dependency set, or perhaps by some other Azure issue?
Either that, or the configuration of the CLR on Azure machines (e.g. the 64-bit JIT) has some issue which should be handled differently. I don't know. It would be nice if we could know for sure what the issue is, even if it's something that can't be fixed at this time.
How can I progress this? If it means going to the Azure team somewhere, fine. What's the next step in closing this issue out?
I'm not entirely convinced this is necessarily an Azure issue. As indicated in my earlier measurements, the delay seems to be happening during the initial assembly/data download (the TopicMonitor log outputs come from an independent entity and are not related to that particular dequeue).
I agree with Eirik's remarks. Appdomain initialization is the big delay factor.
OK. So my original question was: can we not somehow eagerly initialize the AppDomain, e.g. send a dummy computation immediately to all nodes as part of the initialization process? Also, why don't we see the same behaviour with Thespian?
I think it is ad hoc to eagerly initialize AppDomains with dummy computations. By the way, Hadoop/Spark also exhibit the same warm-up behaviour.
@palladin why ad hoc? And maybe Spark/Hadoop exhibit this, but so what :-)
I've observed this behaviour and it's reproducible. I don't know (yet) if it only occurs with CloudFlow, but here's the repro.
The first call is significantly slower than the second. I can repeat this consistently, but only on a brand new cluster. After that, resetting FSI makes no difference.
So if you do it three times, you'll see timings like this:
On one occasion I even saw a different number of work items for the "first" computation than the second and third calls. This is on MBrace.Azure (16.1) using the Starter Kit.
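For anyone wanting to measure the first-call versus later-call difference described above, here is a minimal timing sketch in F#. It does not talk to a cluster: `workerEnvironment` and `runWorkItem` are hypothetical stand-ins for a lazily initialized worker, and the 200 ms sleep is an arbitrary simulated start-up cost, not a measured value. Only `timed` is meant to be reused as-is.

```fsharp
open System
open System.Diagnostics
open System.Threading

/// Runs a thunk and returns its result together with the elapsed wall-clock time.
let timed (f: unit -> 'T) : 'T * TimeSpan =
    let sw = Stopwatch.StartNew()
    let result = f ()
    sw.Stop()
    result, sw.Elapsed

// Hypothetical stand-in for a worker whose execution environment is created lazily.
// The sleep simulates a one-off start-up cost (assumption, not a measurement).
let workerEnvironment = lazy (Thread.Sleep 200; "ready")

/// Simulated work item: forces the lazy environment, then returns a trivial result.
let runWorkItem () =
    workerEnvironment.Force() |> ignore
    42

// Only the first call pays the initialization cost; the second reuses the environment.
let first, firstElapsed = timed runWorkItem
let second, secondElapsed = timed runWorkItem

printfn "first run:  %d in %O" first firstElapsed
printfn "second run: %d in %O" second secondElapsed
```

Against a real cluster, the thunk passed to `timed` would instead wrap the `cluster.CreateProcess(...)` call quoted earlier in the thread, plus whatever is needed to wait for its result. Running it two or three times on a brand new cluster should show whether the extra time is confined to the first call.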
Probably we need some further investigation. It could be something like: