Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First computation on a new cluster is extremely slow #126

Open
isaacabraham opened this issue Nov 10, 2015 · 13 comments
Open

First computation on a new cluster is extremely slow #126

isaacabraham opened this issue Nov 10, 2015 · 13 comments
Labels

Comments

@isaacabraham
Copy link
Contributor

I've observed this behaviour and it's reproduceable. I don't know (yet) if it only occurs with CloudFlow, but here's the repro.

  1. Execute this code
[ "http://publicdata.landregistry.gov.uk/market-trend-data/price-paid-data/b/pp-2013-part1.csv" ]
|> CloudFlow.OfHttpFileByLine
|> CloudFlow.length
  1. Observe timing
  2. Execute it again
  3. Observe timing

The first call is significantly slower than the second. I can repeat this consistently, but only on a brand new cluster. After that, resetting FSI makes no difference.

So if you do it three times, you'll see timings like this: -

       806b043b-90a0-4750-87d5-56a4845e6852  Completed  00:00:33.8204630    0 /   0 /   3 /   3  int64                     10/11/2015 22:57:48  10/11/2015 22:58:22 
       8c041273-edf1-4f49-af4f-5b0073b72064  Completed  00:00:04.5669221    0 /   0 /   3 /   3  int64                     10/11/2015 22:59:05  10/11/2015 22:59:10 
       e6d89094-6a2c-45d7-b707-c055a5923ea7  Completed  00:00:04.8601043    0 /   0 /   3 /   3  int64                     10/11/2015 22:59:25  10/11/2015 22:59:29 

On one occasion I even saw a different number of work items for the "first" computation than the second and third calls. This is on MBrace.Azure (16.1) using the Starter Kit.

Probably we need some further investigation - it could be something like: -

  1. Azure "warming up" the IO of the cluster the first time?
  2. Something with the first computation (or CloudFlow) on MBrace?
  3. Something to do with network caching for subsequent calls (I tend to discount this though).
  4. Something else?
@palladin
Copy link
Member

Very interesting question for investigation. I have seen some very weird ~20 sec slowdown only for the first service bus enqueue call.

@dsyme
Copy link
Contributor

dsyme commented Nov 11, 2015

This is probably just some Azure weirdness right? e.g. taking time to put VMs in the right places or something?

@eiriktsarpalis
Copy link
Member

I think so too. I think we need to test whether this slowdown also occurs when the first computation is a simple hello world.

@eiriktsarpalis
Copy link
Member

Ok, I think I know what's going on here. On a fresh Azure cluster, the following computation

cluster.CreateProcess(Cloud.ParallelEverywhere(cloud { return 42}))

Takes 30 seconds on the first run and 2 seconds on the second. Also, running Isaac's example after that results in normal execution times. I think that this is attributable to AppDomain isolation in worker instances. Worker AppDomains are lazily instantiated and these can take quite a bit of time to start up. You can actually observe this by looking at individual worker logs:

cluster.Workers.[0].ShowSystemLogs()

which upon inspection reveal:

[2015-11-12 13:29:54] INFO : Dequeued work item '9ce2f7b1-9b99-4405-b164-7c6dd944e869'.
[2015-11-12 13:29:54] INFO : Concurrent work item count increased to 1/32.
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.Azure.Common.NetFramework, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management.Compute, Version=12.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.Azure.Common, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Hyak.Common, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Streams, Version=0.4.1.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'MBrace.Azure.Management, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'FSharp.Compiler.Interactive.Settings, Version=4.4.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'
[2015-11-12 13:29:54] INFO : Downloading 'FSI-ASSEMBLY_72050b76-f0d5-4379-9cb2-80368711bae3_1, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management.Storage, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'MBrace.Flow, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management.ServiceBus, Version=0.9.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.Threading.Tasks, Version=1.0.12.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'
[2015-11-12 13:29:57] INFO : TopicMonitor : starting topic maintenance.
[2015-11-12 13:29:57] INFO : Initializing Application Domain "a24819cc-4311-412e-be04-f48249618a63"
[2015-11-12 13:29:58] INFO : TopicMonitor : maintenance complete.
[2015-11-12 13:30:04] INFO : Initializing AppDomain ClusterManager
[2015-11-12 13:30:08] INFO : TopicMonitor : starting topic maintenance.
[2015-11-12 13:30:08] INFO : TopicMonitor : maintenance complete.
[2015-11-12 13:30:11] INFO : Starting cloud process '3a9d0577-24f6-4bad-8381-42d47948a658' of type 'int'.
[2015-11-12 13:30:13] INFO : Completed work item '9ce2f7b1-9b99-4405-b164-7c6dd944e869' after 00:00:01.1328427

This indicates that the first work item dequeue can take up to 20 seconds before it actually completes, even if it is a trivial cloud { return 42 }. You can also see that the actual execution took more than 1 second to complete. If I'm not mistaken, worker roles use the legacy 64-bit JIT compiler, which is known to be slow.

@isaacabraham
Copy link
Contributor Author

Any way we can force each worker to eagerly start up and instantiate the appdomain?

@eiriktsarpalis
Copy link
Member

I'm not sure this would help much. Thespian, which uses the same underlying implementation, does not demonstrate the same degree of delay when sending a first computation. So it seems likely that this is caused by the time it takes to download the initial dependency set or perhaps some other Azure issue?

@isaacabraham
Copy link
Contributor Author

Either that, or the configuration of the CLR on Azure machines e.g. 64-bit JIT - has some issue which should be handled differently. I don't know. It would be nice if we could know for sure what the issue is - even if it's something that can't be fixed at this time.

@isaacabraham
Copy link
Contributor Author

How can I progress this - if it means going to the Azure team somewhere, fine - what's the next step in closing this issue out?

@eiriktsarpalis
Copy link
Member

I'm not entirely convinced this is necessarily an Azure issue. As indicated in my earlier measurements the delay seems to be happening during the initial assembly/data download (TopicMonitor log outputs come from an independent entity and not related to that particular dequeue).

@palladin
Copy link
Member

I agree with Eirik's remarks. Appdomain initialization is the big delay factor.

[2015-11-12 13:29:57] INFO : Initializing Application Domain "a24819cc-4311-412e-be04-f48249618a63"
[2015-11-12 13:30:04] INFO : Initializing AppDomain ClusterManager
[2015-11-12 13:30:11] INFO : Starting cloud process '3a9d0577-24f6-4bad-8381-42d47948a658' of type 'int'.
[2015-11-12 13:30:13] INFO : Completed work item '9ce2f7b1-9b99-4405-b164-7c6dd944e869' after 00:00:01.1328427

@isaacabraham
Copy link
Contributor Author

OK. So my original question was - can we not somehow eagerly initialize the app domain e.g. send a dummy computation immediately to all nodes as part of the initialization process? Also - why don't we see the same behaviour with Thespian?

@palladin
Copy link
Member

I think it is adhoc to eagerly initialiaze appdomains with dummy computations. Btw Hadoop/Spark exhibit also the same warm-up behaviour.

@isaacabraham
Copy link
Contributor Author

@palladin why ad-hoc. And - maybe spark / hadoop exhibits this, but so what :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants