
Function app with 'Warmup' function #10169

Open
pragnagopa opened this issue May 16, 2024 · 9 comments

Labels
area: flex-consumption Items related to Flex Consumption support

Comments

@pragnagopa
Member

If the Functions runtime cannot fully specialize, it can get stuck with the placeholder 'WarmUp' function.

Sample pod - in the case below, specialization succeeded once and loaded the actual customer functions. The Functions runtime then restarted, re-specialization failed (because the container start context mount was unavailable in this case), and the pod was left with the 'WarmUp' function.

Note1 - The exact thing to address: we should never return the WarmUp function when a customer lists functions (i.e. the portal shouldn't display a 'WarmUp' function).
Note2 - In the pod below the failure occurred during re-specialization, but this can happen during initial specialization as well.
Note3 - The exact cause here is a different bug (missing context mount path), but even if we fix that, there may be other reasons for failures, so this work item is to explore whether we can:
a) not switch to specialized mode unless specialization is fully complete (a sketch illustrating this option follows the query results below), or
b) if we cannot do (a), revert back to placeholder mode to prevent the instance from causing confusion.

Execute against https://wawseus.kusto.windows.net/wawsprod:

FunctionsLogs
| where PreciseTimeStamp >= datetime(2024-05-14T15:57:00.0000000Z) - 3h
| where PreciseTimeStamp <= datetime(2024-05-15T01:04:00.0000000Z) + 3h
| where SourceNamespace == "FunctionsLegion"
| extend PodName = RoleInstance
| where PodName == "3ea83371-4d4d-4d4e-bf18-ccf6aefc9798"
| project PreciseTimeStamp, Summary
| order by PreciseTimeStamp asc
| where Summary contains "Found the following functions:" or Summary contains "will attempt"

 

 

| PreciseTimeStamp | Summary |
| -- | -- |
| 2024-05-14 22:47:28.0664600 | Found the following functions: Host.Functions.WarmUp |
| 2024-05-14 22:47:31.0712436 | Found the following functions: Host.Functions.WarmUp |
| 2024-05-15 00:56:52.1614246 | Found the following functions: Host.Functions.HelloWorld Host.Functions.HtmlParser Host.Functions.LoaderIO |
| 2024-05-15 01:03:05.1607510 | WebHost has shut down. Will attempt restart. |
| 2024-05-15 01:03:30.1589962 | Found the following functions: Host.Functions.WarmUp |
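To make option (a) from Note3 concrete, here is a minimal sketch; the names SpecializationState and FunctionCatalog are invented for illustration and are not the actual azure-functions-host types. The idea is that the host only reports itself as specialized after every specialization step has completed, so a failed or partial (re-)specialization surfaces an error rather than the WarmUp placeholder:

```csharp
// Hypothetical sketch only; type and member names are invented for this issue
// and do not correspond to the real azure-functions-host implementation.
using System;
using System.Collections.Generic;
using System.Linq;

public enum SpecializationState
{
    Placeholder,   // only the WarmUp placeholder function is loaded
    Specializing,  // the customer payload is being applied
    Specialized,   // every specialization step completed successfully
    Failed         // one or more specialization steps failed
}

public sealed class FunctionCatalog
{
    private readonly List<string> _functions = new() { "Host.Functions.WarmUp" };

    public SpecializationState State { get; private set; } = SpecializationState.Placeholder;

    public void BeginSpecialization() => State = SpecializationState.Specializing;

    // Option (a): flip to Specialized only once the full customer payload is loaded.
    public void CompleteSpecialization(IReadOnlyCollection<string> customerFunctions)
    {
        _functions.Clear();
        _functions.AddRange(customerFunctions);
        State = SpecializationState.Specialized;
    }

    public void FailSpecialization() => State = SpecializationState.Failed;

    // Note1: never return the WarmUp placeholder when a customer lists functions.
    public IReadOnlyList<string> ListFunctionsForCustomer()
    {
        if (State != SpecializationState.Specialized)
            throw new InvalidOperationException(
                "Host is not fully specialized; refusing to list functions rather than returning WarmUp.");

        return _functions.Where(f => f != "Host.Functions.WarmUp").ToList();
    }
}
```

Under this model, the restart plus failed re-specialization sequence in the log above would leave the catalog in the Failed state instead of re-exposing Host.Functions.WarmUp.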

 

 



@pragnagopa
Member Author

cc @balag0

@fabiocav please triage

pragnagopa added the area: flex-consumption label on Jun 25, 2024
@pragnagopa
Member Author

@fabiocav please triage

@fabiocav
Member

Can you please clarify what you mean by re-specialization?

@fabiocav
Member

fabiocav commented Aug 2, 2024

@pragnagopa isn't this an issue caused by the platform failing to persist the site context file on disk, combined with a change to the host lifecycle management, where it is restarted in the same container?

For the following:

Note2 - In the pod below the failure occurred during re-specialization, but this can happen during initial specialization as well.

For the default (initial) specialization failures, are you including host startup and/or indexing failures? Those would lead to health metrics emitted by the host.

Note3 - The exact cause here is a different bug (missing context mount path), but even if we fix that, there may be other reasons for failures, so this work item is to explore whether we can:
a) not switch to specialized mode unless specialization is fully complete, or
b) if we cannot do (a), revert back to placeholder mode to prevent the instance from causing confusion.

For a, if specialization fails, the host will be left in a bad state and we need to ensure the platform is notified. The end result if the platform doesn't act would be similar (invocations not properly handled), which is undesirable behavior.
For the restart flow, once the issue with the platform is addressed, what behavior would you expect from the host and how should it notify the platform that specialization failed? We need to ensure that, in both cases, the instance is removed when that occurs. The host would be able to emit health metrics or force a restart if those are honored.

For b, can you also clarify your expectation when you state it should revert to placeholder mode? If the host sets its state to match the placeholder state, unless the platform is aware and no longer routing traffic, any request coming in for the app would result in a 404 for HTTP cases, as no application functions would be indexed. Host APIs would also return information that would reflect a placeholder state, so any request (including portal) coming to that instance would result in responses that do not reflect the application payload.

@pragnagopa
Member Author

@fabiocav thanks for the follow up.

isn't this an issue caused by the platform failing to persist the site context file on disk, combined with a change to the host lifecycle management, where it is restarted in the same container?

Yes, this aligns with Note3

For a

The host would be able to emit health metrics or force a restart if those are honored.

This is a good direction. I would like to discuss this further with @balag0 in the context of "AdminOnly" mode.

For b

can you also clarify your expectation when you state it should revert to placeholder mode?

I do not think we should revert to placeholder mode. It is more important for the platform to get the right signals so the pod can be recycled. Flagging this for discussion with @balag0 as well.

@balag0
Contributor

balag0 commented Aug 6, 2024

I think there are 2 issues:

Issue 1) There are code paths where the host will consider itself fully specialized and report that status even if specialization failed, because some specialization tasks are performed in the background:

// start the specialization process in the background

When this happens, since the host reports itself as 'specialized':
a) the platform doesn't take any recovery actions
b) the host responds to API calls and performs actions that should only be done in specialized mode; one example is returning the WarmUp function when api/functions is called.

What we want to do:
For a - yes, one idea is to signal the platform, possibly using the health events channel.
For b - in this case, it would be more suitable for the host to return a failure response rather than an incorrect response presented as a success. Similar reasoning applies to other API calls such as secrets, synctriggers, etc. (not listing all the scenarios explicitly here, but you get the idea :)).
How the host achieves this is flexible. One way is to not switch out of placeholder mode, or to revert back to placeholder mode, so that API calls are rejected. It is not the goal to reuse this host and attempt specialization again (this host is dirty and will be recycled). If reverting back to placeholder mode has side effects, other implementations that achieve this are also fine. The main goal is to not start invocations or return valid-looking responses that could be incorrect while the host is being removed, once we have emitted a failure health event.
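A rough sketch of the "return a failure response" idea for point b; AdminApiGuard and ISpecializationStatus are invented names, not the host's real controller pipeline. Calls such as api/functions, secrets or synctriggers get an explicit failure status whenever the host is not fully specialized, instead of a success response built from placeholder state:

```csharp
// Hypothetical sketch; the guard type, status interface and status codes are illustrative only.
using System;
using System.Net;

public interface ISpecializationStatus
{
    bool IsFullySpecialized { get; }
}

public sealed class AdminApiGuard
{
    private readonly ISpecializationStatus _status;

    public AdminApiGuard(ISpecializationStatus status) => _status = status;

    // Wraps handlers for calls that only make sense in specialized mode
    // (api/functions, secrets, synctriggers, ...).
    public (HttpStatusCode Status, string Body) Handle(string route, Func<string> handler)
    {
        if (!_status.IsFullySpecialized)
        {
            // Fail fast: an explicit 503 is preferable to a "successful" response
            // that reflects the WarmUp placeholder rather than the customer app.
            return (HttpStatusCode.ServiceUnavailable,
                    $"Host is not fully specialized; rejecting '{route}'.");
        }

        return (HttpStatusCode.OK, handler());
    }
}
```

Whether this is done by staying in placeholder mode, reverting to it, or via a separate check is an implementation detail; the point is that callers see a failure rather than WarmUp data while the instance waits to be recycled.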

Issue 2) There are cases where a once-specialized host can restart and lose its specialization status. There are many reasons this can happen (not listing them here), but we know we cannot always avoid it, so we are considering mitigation steps for when it does.
Current design - a backup will be used to trigger re-specialization. If this succeeds, all good and we are done.
If re-specialization fails - because the backup is missing entirely, a re-specialization step fails, or some other transient error occurs - we get into issue 2.

In this case the host could be partially specialized or not specialized at all, while the platform thinks the host is fully specialized. So we are in the exact same scenario as issue 1, and any solution we implement for issue 1 should help here as well.

In summary, two things would help:
- Signal a health event to request a recycle of the host (see the sketch after the note below).
- Do not perform specialized-state activities if the host is not specialized.

Note: Issue 2 is definitely likely to be the more common case, since the failures leading to issue 1 are very rare given how few moving pieces are involved compared to issue 2.
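And a sketch of the first summary point; IHealthEventChannel and HealthEvent are invented here to stand in for whatever channel the platform actually exposes. On any specialization or re-specialization failure the host publishes a failure event so the platform can recycle the instance, and it should not serve specialized-mode work while that recycle is pending:

```csharp
// Hypothetical sketch; this is not the real host/platform health-events API.
using System;
using System.Threading.Tasks;

public enum HealthEventKind { SpecializationFailed }

public sealed record HealthEvent(HealthEventKind Kind, string Details, DateTimeOffset Timestamp);

public interface IHealthEventChannel
{
    Task PublishAsync(HealthEvent healthEvent);
}

public sealed class SpecializationFailureReporter
{
    private readonly IHealthEventChannel _channel;

    public SpecializationFailureReporter(IHealthEventChannel channel) => _channel = channel;

    // Called whenever a specialization or re-specialization step fails.
    public async Task ReportAsync(Exception failure)
    {
        var healthEvent = new HealthEvent(
            HealthEventKind.SpecializationFailed,
            failure.Message,
            DateTimeOffset.UtcNow);

        // Ask the platform to recycle this instance; until it does, the host should
        // reject specialized-mode activities (see the guard sketch above).
        await _channel.PublishAsync(healthEvent);
    }
}
```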

@frasermclean

I think I am experiencing this issue. I've just published a previously working app to a Flex Consumption plan. The app is a .NET 8 isolated worker. The only function showing is this "WarmUp" function.

@davidobrien1985
Contributor

I have the same problem here: Azure/functions-action#245

@frasermclean

Are there any updates on this? Just trying this again with a new .NET 9 app on a Flex Consumption plan, and getting the same result.
