
Function app with 'Warmup' function #10169

Open
pragnagopa opened this issue May 16, 2024 · 9 comments

Labels
area: flex-consumption Items related to Flex Consumption support

Comments

@pragnagopa
Member

If the Functions runtime cannot fully specialize, it can get stuck with the placeholder 'WarmUp' function.

Sample pod - in the case below, specialization succeeded once and loaded the actual customer functions. The Functions runtime then restarted, re-specialization failed (because the container start context mount was unavailable in this case), and the pod was left with the 'WarmUp' function.

Note1 - The exact thing to address: we should never return the WarmUp function when a customer lists functions (i.e. the portal shouldn't display a 'WarmUp' function).
Note2 - In the pod below the failure occurred during re-specialization, but this can happen during initial specialization as well.
Note3 - The exact cause here is a different bug (missing context mount path), but even if we fix that, there may be other reasons for failures, so this work item is to explore whether we can:
a) not switch to specialized mode unless specialization is fully complete (a sketch illustrating this option follows the query results below), or
b) if we cannot do (a), revert back to placeholder mode to prevent the instance from causing confusion.

Execute against https://wawseus.kusto.windows.net/wawsprod:

FunctionsLogs
| where PreciseTimeStamp >= datetime(2024-05-14T15:57:00.0000000Z) - 3h
| where PreciseTimeStamp <= datetime(2024-05-15T01:04:00.0000000Z) + 3h
| where SourceNamespace == "FunctionsLegion"
| extend PodName = RoleInstance
| where PodName == "3ea83371-4d4d-4d4e-bf18-ccf6aefc9798"
| project PreciseTimeStamp, Summary
| order by PreciseTimeStamp asc
| where Summary contains "Found the following functions:" or Summary contains "will attempt"

 

 

| PreciseTimeStamp | Summary |
| -- | -- |
| 2024-05-14 22:47:28.0664600 | Found the following functions: Host.Functions.WarmUp |
| 2024-05-14 22:47:31.0712436 | Found the following functions: Host.Functions.WarmUp |
| 2024-05-15 00:56:52.1614246 | Found the following functions: Host.Functions.HelloWorld Host.Functions.HtmlParser Host.Functions.LoaderIO |
| 2024-05-15 01:03:05.1607510 | WebHost has shut down. Will attempt restart. |
| 2024-05-15 01:03:30.1589962 | Found the following functions: Host.Functions.WarmUp |
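To make option (a) from Note3 concrete, here is a minimal sketch; the names SpecializationState and FunctionCatalog are invented for illustration and are not the actual azure-functions-host types. The idea is that the host only reports itself as specialized after every specialization step has completed, so a failed or partial (re-)specialization surfaces an error rather than the WarmUp placeholder:

```csharp
// Hypothetical sketch only; type and member names are invented for this issue
// and do not correspond to the real azure-functions-host implementation.
using System;
using System.Collections.Generic;
using System.Linq;

public enum SpecializationState
{
    Placeholder,   // only the WarmUp placeholder function is loaded
    Specializing,  // the customer payload is being applied
    Specialized,   // every specialization step completed successfully
    Failed         // one or more specialization steps failed
}

public sealed class FunctionCatalog
{
    private readonly List<string> _functions = new() { "Host.Functions.WarmUp" };

    public SpecializationState State { get; private set; } = SpecializationState.Placeholder;

    public void BeginSpecialization() => State = SpecializationState.Specializing;

    // Option (a): flip to Specialized only once the full customer payload is loaded.
    public void CompleteSpecialization(IReadOnlyCollection<string> customerFunctions)
    {
        _functions.Clear();
        _functions.AddRange(customerFunctions);
        State = SpecializationState.Specialized;
    }

    public void FailSpecialization() => State = SpecializationState.Failed;

    // Note1: never return the WarmUp placeholder when a customer lists functions.
    public IReadOnlyList<string> ListFunctionsForCustomer()
    {
        if (State != SpecializationState.Specialized)
            throw new InvalidOperationException(
                "Host is not fully specialized; refusing to list functions rather than returning WarmUp.");

        return _functions.Where(f => f != "Host.Functions.WarmUp").ToList();
    }
}
```

Under this model, the restart plus failed re-specialization sequence in the log above would leave the catalog in the Failed state instead of re-exposing Host.Functions.WarmUp.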

 

 



@pragnagopa
Member Author

cc @balag0

@fabiocav please triage

pragnagopa added the area: flex-consumption label on Jun 25, 2024
@pragnagopa
Member Author

@fabiocav please triage

@fabiocav
Member

Can you please clarify what you mean by re-specialization?

@fabiocav
Member

fabiocav commented Aug 2, 2024

@pragnagopa isn't this an issue caused by the platform failing to persist the site context file on disk, combined with a change to the host lifecycle management, where it is restarted in the same container?

For the following:

Note2 - In the pod below the failure occurred during re-specialization, but this can happen during initial specialization as well.

For the default (initial) specialization failures, are you including host startup and/or indexing failures? Those would lead to health metrics emitted by the host.

Note3 - The exact cause here is a different bug (missing context mount path), but even if we fix that, there may be other reasons for failures, so this work item is to explore whether we can:
a) not switch to specialized mode unless specialization is fully complete, or
b) if we cannot do (a), revert back to placeholder mode to prevent the instance from causing confusion.

For a, if specialization fails, the host will be left in a bad state and we need to ensure the platform is notified. The end result if the platform doesn't act would be similar (invocations not properly handled), which is undesirable behavior.
For the restart flow, once the issue with the platform is addressed, what behavior would you expect from the host and how should it notify the platform that specialization failed? We need to ensure that, in both cases, the instance is removed when that occurs. The host would be able to emit health metrics or force a restart if those are honored.

For b, can you also clarify your expectation when you state it should revert to placeholder mode? If the host sets its state to match the placeholder state, unless the platform is aware and no longer routing traffic, any request coming in for the app would result in a 404 for HTTP cases, as no application functions would be indexed. Host APIs would also return information that would reflect a placeholder state, so any request (including portal) coming to that instance would result in responses that do not reflect the application payload.

@pragnagopa
Member Author

@fabiocav thanks for the follow up.

isn't this an issue caused by the platform failing to persist the site context file on disk, combined with a change to the host lifecycle management, where it is restarted in the same container?

Yes, this aligns with Note3

For a

The host would be able to emit health metrics or force a restart if those are honored.

This is a good direction. I would like to discuss this further with @balag0 in the context of "AdminOnly" mode.

For b

can you also clarify your expectation when you state it should revert to placeholder mode?

I do not think we should revert to placeholder mode. It is more important for the platform to get the right signals so the pod can be recycled. Flagging this for discussion with @balag0 as well.

@balag0
Contributor

balag0 commented Aug 6, 2024

I think there are 2 issues:

Issue 1) There are code paths where the host will consider itself fully specialized and report that status even if specialization failed, because some specialization tasks are performed in the background:

// start the specialization process in the background

When this happens, since the host reports itself as 'specialized':
a) the platform doesn't take any recovery actions
b) the host responds to API calls and performs actions that should only be done in specialized mode; one example is returning the WarmUp function when api/functions is called.

What we want to do:
For a - yes, one idea is to signal the platform, possibly using the health events channel.
For b - in this case, it would be more suitable for the host to return a failure response rather than an incorrect response presented as a success. Similar reasoning applies to other API calls such as secrets, synctriggers, etc. (not listing all the scenarios explicitly here, but you get the idea :)).
How the host achieves this is flexible. One way is to not switch out of placeholder mode, or to revert back to placeholder mode, so that API calls are rejected. It is not the goal to reuse this host and attempt specialization again (this host is dirty and will be recycled). If reverting back to placeholder mode has side effects, other implementations that achieve this are also fine. The main goal is to not start invocations or return valid-looking responses that could be incorrect while the host is being removed, once we have emitted a failure health event.
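A rough sketch of the "return a failure response" idea for point b; AdminApiGuard and ISpecializationStatus are invented names, not the host's real controller pipeline. Calls such as api/functions, secrets or synctriggers get an explicit failure status whenever the host is not fully specialized, instead of a success response built from placeholder state:

```csharp
// Hypothetical sketch; the guard type, status interface and status codes are illustrative only.
using System;
using System.Net;

public interface ISpecializationStatus
{
    bool IsFullySpecialized { get; }
}

public sealed class AdminApiGuard
{
    private readonly ISpecializationStatus _status;

    public AdminApiGuard(ISpecializationStatus status) => _status = status;

    // Wraps handlers for calls that only make sense in specialized mode
    // (api/functions, secrets, synctriggers, ...).
    public (HttpStatusCode Status, string Body) Handle(string route, Func<string> handler)
    {
        if (!_status.IsFullySpecialized)
        {
            // Fail fast: an explicit 503 is preferable to a "successful" response
            // that reflects the WarmUp placeholder rather than the customer app.
            return (HttpStatusCode.ServiceUnavailable,
                    $"Host is not fully specialized; rejecting '{route}'.");
        }

        return (HttpStatusCode.OK, handler());
    }
}
```

Whether this is done by staying in placeholder mode, reverting to it, or via a separate check is an implementation detail; the point is that callers see a failure rather than WarmUp data while the instance waits to be recycled.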

Issue 2) There are cases where a once-specialized host can restart and lose its specialization status. There are many reasons this can happen (not listing them here), but we know we cannot always avoid it, so we are considering mitigation steps for when it does.
Current design - a backup will be used to trigger re-specialization. If this succeeds, all good and we are done.
If re-specialization fails - because the backup is missing entirely, a re-specialization step fails, or some other transient error occurs - we get into issue 2.

In this case the host could be partially specialized or not specialized at all, while the platform thinks the host is fully specialized. So we are in the exact same scenario as issue 1, and any solution we implement for issue 1 should help here as well.

In summary, two things would help:
- Signal a health event to request a recycle of the host (see the sketch after the note below).
- Do not perform specialized-state activities if the host is not specialized.

Note: Issue 2 is definitely likely to be the more common case, since the failures leading to issue 1 are very rare given how few moving pieces are involved compared to issue 2.
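And a sketch of the first summary point; IHealthEventChannel and HealthEvent are invented here to stand in for whatever channel the platform actually exposes. On any specialization or re-specialization failure the host publishes a failure event so the platform can recycle the instance, and it should not serve specialized-mode work while that recycle is pending:

```csharp
// Hypothetical sketch; this is not the real host/platform health-events API.
using System;
using System.Threading.Tasks;

public enum HealthEventKind { SpecializationFailed }

public sealed record HealthEvent(HealthEventKind Kind, string Details, DateTimeOffset Timestamp);

public interface IHealthEventChannel
{
    Task PublishAsync(HealthEvent healthEvent);
}

public sealed class SpecializationFailureReporter
{
    private readonly IHealthEventChannel _channel;

    public SpecializationFailureReporter(IHealthEventChannel channel) => _channel = channel;

    // Called whenever a specialization or re-specialization step fails.
    public async Task ReportAsync(Exception failure)
    {
        var healthEvent = new HealthEvent(
            HealthEventKind.SpecializationFailed,
            failure.Message,
            DateTimeOffset.UtcNow);

        // Ask the platform to recycle this instance; until it does, the host should
        // reject specialized-mode activities (see the guard sketch above).
        await _channel.PublishAsync(healthEvent);
    }
}
```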

@frasermclean

I think I am experiencing this issue. I've just published a previously working app to a Flex Consumption plan. The app is a .NET 8 isolated worker. The only function showing is this "WarmUp" function.

@davidobrien1985
Contributor

I have the same problem here: Azure/functions-action#245

@frasermclean

Are there any updates on this? Just trying this again with a new .NET 9 app on a Flex Consumption plan, and getting the same result.
