-
Hi. I'm using vHIve to study cold start latency in snapshot-enabled serverless framework. To verify that vHive works correctly in my setup, I tried to measure cold start latency when snapshot is created . However, I found out that the cold start latency is measured much greater than that reported in ASPLOS paper Fig. 2 consistently(Paper: 232 ms, My Case: 1500-2000 ms). I wonder if I'm the only one having this issue. For cold start latency measurement, I went through the following steps. I'm testing on a bare-metal server with Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz and 128GB memory, and the root filesystem is mounted on a high speed SSD whose peak read TP is up to 2GB/s. First I deployed helloworld.yaml using the example deployer to vHive with snapshot enabled. Then I invoked helloworld for snapshot creation and waited for helloworld's active pod to be destroyed. After cleaning up page cache, I invoked helloworld again, and I considered latency the invoker reports in that invocation as the cold start latency. From logs, I discovered function invocation takes unexpectedly long time in vHive CRI and other SW stack. For debugging, I first added prints right before and after SayHello invocation in the example invoker(example/invoker/client.go).
The vHive log during that period is:
This shows that creating and starting a fresh container takes more than 1 s after the user's invoke request, which is unexpectedly long. I would be really grateful if you could share your cold-start latency measurements for comparison!
-
Hi @sosson97, thank you for bringing this up. The logs indicate that there are delays in the control plane that require further investigation. Could you please attach the complete logs of containerd and firecracker-containerd (as files)? Note that a VM loads from its snapshot in just 24 ms:

For the paper evaluation, we didn't run Kubernetes/Knative. To reproduce the paper results, please follow the instructions in the artifact. Meanwhile, we will investigate these control-plane delays; please send us the logs.