First computation on a new cluster is extremely slow #126

isaacabraham · 2015-11-10T23:16:52Z

I've observed this behaviour and it's reproduceable. I don't know (yet) if it only occurs with CloudFlow, but here's the repro.

Execute this code

[ "http://publicdata.landregistry.gov.uk/market-trend-data/price-paid-data/b/pp-2013-part1.csv" ]
|> CloudFlow.OfHttpFileByLine
|> CloudFlow.length

Observe timing
Execute it again
Observe timing

The first call is significantly slower than the second. I can repeat this consistently, but only on a brand new cluster. After that, resetting FSI makes no difference.

So if you do it three times, you'll see timings like this: -

       806b043b-90a0-4750-87d5-56a4845e6852  Completed  00:00:33.8204630    0 /   0 /   3 /   3  int64                     10/11/2015 22:57:48  10/11/2015 22:58:22 
       8c041273-edf1-4f49-af4f-5b0073b72064  Completed  00:00:04.5669221    0 /   0 /   3 /   3  int64                     10/11/2015 22:59:05  10/11/2015 22:59:10 
       e6d89094-6a2c-45d7-b707-c055a5923ea7  Completed  00:00:04.8601043    0 /   0 /   3 /   3  int64                     10/11/2015 22:59:25  10/11/2015 22:59:29

On one occasion I even saw a different number of work items for the "first" computation than the second and third calls. This is on MBrace.Azure (16.1) using the Starter Kit.

Probably we need some further investigation - it could be something like: -

Azure "warming up" the IO of the cluster the first time?
Something with the first computation (or CloudFlow) on MBrace?
Something to do with network caching for subsequent calls (I tend to discount this though).
Something else?

The text was updated successfully, but these errors were encountered:

palladin · 2015-11-11T11:31:29Z

Very interesting question for investigation. I have seen some very weird ~20 sec slowdown only for the first service bus enqueue call.

dsyme · 2015-11-11T13:37:57Z

This is probably just some Azure weirdness right? e.g. taking time to put VMs in the right places or something?

eiriktsarpalis · 2015-11-11T13:40:18Z

I think so too. I think we need to test whether this slowdown also occurs when the first computation is a simple hello world.

eiriktsarpalis · 2015-11-12T11:45:29Z

Ok, I think I know what's going on here. On a fresh Azure cluster, the following computation

cluster.CreateProcess(Cloud.ParallelEverywhere(cloud { return 42}))

Takes 30 seconds on the first run and 2 seconds on the second. Also, running Isaac's example after that results in normal execution times. I think that this is attributable to AppDomain isolation in worker instances. Worker AppDomains are lazily instantiated and these can take quite a bit of time to start up. You can actually observe this by looking at individual worker logs:

cluster.Workers.[0].ShowSystemLogs()

which upon inspection reveal:

[2015-11-12 13:29:54] INFO : Dequeued work item '9ce2f7b1-9b99-4405-b164-7c6dd944e869'.
[2015-11-12 13:29:54] INFO : Concurrent work item count increased to 1/32.
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.Azure.Common.NetFramework, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management.Compute, Version=12.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.Azure.Common, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Hyak.Common, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Streams, Version=0.4.1.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'MBrace.Azure.Management, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'FSharp.Compiler.Interactive.Settings, Version=4.4.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'
[2015-11-12 13:29:54] INFO : Downloading 'FSI-ASSEMBLY_72050b76-f0d5-4379-9cb2-80368711bae3_1, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management.Storage, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'MBrace.Flow, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.WindowsAzure.Management.ServiceBus, Version=0.9.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
[2015-11-12 13:29:54] INFO : Downloading 'Microsoft.Threading.Tasks, Version=1.0.12.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'
[2015-11-12 13:29:57] INFO : TopicMonitor : starting topic maintenance.
[2015-11-12 13:29:57] INFO : Initializing Application Domain "a24819cc-4311-412e-be04-f48249618a63"
[2015-11-12 13:29:58] INFO : TopicMonitor : maintenance complete.
[2015-11-12 13:30:04] INFO : Initializing AppDomain ClusterManager
[2015-11-12 13:30:08] INFO : TopicMonitor : starting topic maintenance.
[2015-11-12 13:30:08] INFO : TopicMonitor : maintenance complete.
[2015-11-12 13:30:11] INFO : Starting cloud process '3a9d0577-24f6-4bad-8381-42d47948a658' of type 'int'.
[2015-11-12 13:30:13] INFO : Completed work item '9ce2f7b1-9b99-4405-b164-7c6dd944e869' after 00:00:01.1328427

This indicates that the first work item dequeue can take up to 20 seconds before it actually completes, even if it is a trivial cloud { return 42 }. You can also see that the actual execution took more than 1 second to complete. If I'm not mistaken, worker roles use the legacy 64-bit JIT compiler, which is known to be slow.

isaacabraham · 2015-11-13T22:48:23Z

Any way we can force each worker to eagerly start up and instantiate the appdomain?

eiriktsarpalis · 2015-11-30T22:45:40Z

I'm not sure this would help much. Thespian, which uses the same underlying implementation, does not demonstrate the same degree of delay when sending a first computation. So it seems likely that this is caused by the time it takes to download the initial dependency set or perhaps some other Azure issue?

isaacabraham · 2015-11-30T23:23:44Z

Either that, or the configuration of the CLR on Azure machines e.g. 64-bit JIT - has some issue which should be handled differently. I don't know. It would be nice if we could know for sure what the issue is - even if it's something that can't be fixed at this time.

isaacabraham · 2016-02-12T13:06:38Z

How can I progress this - if it means going to the Azure team somewhere, fine - what's the next step in closing this issue out?

eiriktsarpalis · 2016-02-12T14:25:11Z

I'm not entirely convinced this is necessarily an Azure issue. As indicated in my earlier measurements the delay seems to be happening during the initial assembly/data download (TopicMonitor log outputs come from an independent entity and not related to that particular dequeue).

palladin · 2016-02-12T14:40:33Z

I agree with Eirik's remarks. Appdomain initialization is the big delay factor.

[2015-11-12 13:29:57] INFO : Initializing Application Domain "a24819cc-4311-412e-be04-f48249618a63"
[2015-11-12 13:30:04] INFO : Initializing AppDomain ClusterManager
[2015-11-12 13:30:11] INFO : Starting cloud process '3a9d0577-24f6-4bad-8381-42d47948a658' of type 'int'.
[2015-11-12 13:30:13] INFO : Completed work item '9ce2f7b1-9b99-4405-b164-7c6dd944e869' after 00:00:01.1328427

isaacabraham · 2016-02-12T14:44:39Z

OK. So my original question was - can we not somehow eagerly initialize the app domain e.g. send a dummy computation immediately to all nodes as part of the initialization process? Also - why don't we see the same behaviour with Thespian?

palladin · 2016-02-12T14:48:10Z

I think it is adhoc to eagerly initialiaze appdomains with dummy computations. Btw Hadoop/Spark exhibit also the same warm-up behaviour.

isaacabraham · 2016-02-17T22:21:16Z

@palladin why ad-hoc. And - maybe spark / hadoop exhibits this, but so what :-)

isaacabraham added the question label Nov 10, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

First computation on a new cluster is extremely slow #126

First computation on a new cluster is extremely slow #126

isaacabraham commented Nov 10, 2015

palladin commented Nov 11, 2015

dsyme commented Nov 11, 2015

eiriktsarpalis commented Nov 11, 2015

eiriktsarpalis commented Nov 12, 2015

isaacabraham commented Nov 13, 2015

eiriktsarpalis commented Nov 30, 2015

isaacabraham commented Nov 30, 2015

isaacabraham commented Feb 12, 2016

eiriktsarpalis commented Feb 12, 2016

palladin commented Feb 12, 2016

isaacabraham commented Feb 12, 2016

palladin commented Feb 12, 2016

isaacabraham commented Feb 17, 2016

First computation on a new cluster is extremely slow #126

First computation on a new cluster is extremely slow #126

Comments

isaacabraham commented Nov 10, 2015

palladin commented Nov 11, 2015

dsyme commented Nov 11, 2015

eiriktsarpalis commented Nov 11, 2015

eiriktsarpalis commented Nov 12, 2015

isaacabraham commented Nov 13, 2015

eiriktsarpalis commented Nov 30, 2015

isaacabraham commented Nov 30, 2015

isaacabraham commented Feb 12, 2016

eiriktsarpalis commented Feb 12, 2016

palladin commented Feb 12, 2016

isaacabraham commented Feb 12, 2016

palladin commented Feb 12, 2016

isaacabraham commented Feb 17, 2016