Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ASP.NET CORE API running on ServiceFabric does not have service annotations when remoting #117

Open
AlkisFortuneFish opened this issue Jun 8, 2021 · 16 comments

Comments

@AlkisFortuneFish
Copy link

AlkisFortuneFish commented Jun 8, 2021

We are running a couple of Service Fabric clusters on Azure VMSS, have integrated Application Insights instrumentation for performance data and have run into a technical issue with our integration. The APIs for our applications are implemented as ASP.NET CORE APIs served by Kestrel. They then make backend calls to the other microservices on the Service Fabric application using SF service remoting.

We integrated both the Application Insights .Net SDK and the Application Insights ServiceFabric SDK and they initialise correctly. However, we are ending up with a problem. While calls to the API are annotated correctly as coming from the ServiceFabric application on the correct nodes and everything, the Service Fabric remoting calls that occur on the API are detected as entirely separate calls and are not annotated correctly, so they appear to be coming from the default source named after the AppInsights instance.

Moreover, remoting calls seem to be correctly annotated with the target service but they do not resolve to the actual service on the Application Map, so they are not tracked past the actual remoting call itself. For example, if there is a call to to an API that makes a remoting call to service A, which calls service B, which calls service A, it looks like:
ApplicationInsightsInstance (missing the service and application annotation) -> A Remoting
A -> B Remoting
B -> A Remoting
A

image

Initialisation is done as follows:
In the service constructors:

m_TelemetryConfiguration.TelemetryInitializers.Add(FabricTelemetryInitializerExtension.CreateFabricTelemetryInitializer(context));
m_TelemetryConfiguration.TelemetryInitializers.Add(new OperationCorrelationTelemetryInitializer());
m_TelemetryConfiguration.TelemetryInitializers.Add(new HttpDependenciesParsingTelemetryInitializer());
m_TelemetryConfiguration.TelemetryInitializers.Add(new CodePackageVersionTelemetryInitializer());
var dependencyTrackingModule = new DependencyTrackingTelemetryModule();
var requestTrackingModule = new ServiceRemotingRequestTrackingTelemetryModule();
var remotingDependencyTrackingModule = new ServiceRemotingDependencyTrackingTelemetryModule();
dependencyTrackingModule.Initialize(m_TelemetryConfiguration);
requestTrackingModule.Initialize(m_TelemetryConfiguration);
remotingDependencyTrackingModule.Initialize(m_TelemetryConfiguration);

In the ASP.NET Core services:
In the constructor:

m_TelemetryConfiguration = TelemetryConfiguration.CreateDefault();
m_TelemetryConfiguration.InstrumentationKey = m_AppInsightsKey;

Further initialisation happens in ConfigureServices:

services.AddSingleton<ITelemetryInitializer>(provider => FabricTelemetryInitializerExtension.CreateFabricTelemetryInitializer(serviceContext));
services.AddSingleton<ITelemetryInitializer>(provider => new CodePackageVersionTelemetryInitializer());
services.AddSingleton<ITelemetryModule>(serviceProvider =>
{
	var requestTrackingModule = new ServiceRemotingRequestTrackingTelemetryModule();
	requestTrackingModule.Initialize(m_TelemetryConfiguration);
	return requestTrackingModule;
});
services.AddSingleton<ITelemetryModule>(serviceProvider =>
{
	var dependencyTrackingModule = new ServiceRemotingDependencyTrackingTelemetryModule();
	dependencyTrackingModule.Initialize(m_TelemetryConfiguration);
	return dependencyTrackingModule;
});
services.AddApplicationInsightsTelemetry(options =>
{
	options.InstrumentationKey = m_TelemetryConfiguration.InstrumentationKey;
	options.EnableActiveTelemetryConfigurationSetup = true;
});

These evolved multiple times before they ended up as they are, as I have been chasing this issue for a while and I have left out some details that I feel are not relevant (such as configuration actually being done by methods that are also called by the configuration package changed event etc.). I was directed to submit an issue here by the AppInsights support team.

I can point at the relevant resource and provide any more information as needed.

@yantang-msft
Copy link
Contributor

@AlkisFortuneFish did you add ServiceRemotingRequestTrackingTelemetryModule and ServiceRemotingDependencyTrackingTelemetryModule to all services? i.e, the API Service, Service A and Service B.
Here is an example project that might help with the initialization: https://github.com/yantang-msft/service-fabric-application-insights-example

@AlkisFortuneFish
Copy link
Author

Yes, as is visible in the initialisation code I pasted above. The example cannot be copied directly because it uses TelemetryConfiguration.Active, which is deprecated, hence the explicit initialisation of the telemetry configuration.

@AlkisFortuneFish
Copy link
Author

Originally I had the OperationCorrelationTelemetryInitializer there too, and it does appear to have been lost in the "throwing things at the wall to see what sticks" phase of trying to fix this. I've just re-added it to see if it makes any difference to the outcome, from memory, the issue was there with that too.

@yantang-msft
Copy link
Contributor

If the 2 modules has been added to all services, then I can't think of why it doesn't work for you.
You mean the Service Remoting call is detected as a dependency telemetry, it's just that it can't be correlated correctly, isn't it? This is usually means there is only the dependency telemetry but not matching request telemetry. You can use the transaction view to see if that's the case.

@AlkisFortuneFish
Copy link
Author

Looking at the data here, I'm a bit confused:
Request dd2ed206271c0a49 POST GetPendingCompetitionRewards/Post operationId == operation_ParentId == f5b34abd513e174b8fedeb7fde937574
Dependency 87495a9ca4af5441 GetPendingCompetitionRewards operationId == f5b34abd513e174b8fedeb7fde937574 && operation_ParentId == dd2ed206271c0a49 - as expected
Backend: 17be469bd121cf4a GetPendingCompetitionRewards - operation_ParentId == 87495a9ca4af5441 && operationId == aeb475dae1369a4ba0538b3386dc70b4

Request:
timestamp [UTC] 2021-06-08T10:26:08.4407244Z
id dd2ed206271c0a49
name POST GetPendingCompetitionRewards/Post
url https://devcluster.fortunefish.co.uk:8530/api/GetPendingCompetitionRewards
success True
resultCode 200
duration 6.5316
performanceBucket <250ms
itemType request
customDimensions {"_MS.ProcessedByMetricExtractors":"(Name:'Requests', Ver:'1.1')","AspNetCoreEnvironment":"Production","ServiceFabric.ApplicationTypeName":"FashionARServerType","ServiceFabric.ServiceTypeName":"APIFrontendType","ServiceFabric.ApplicationName":"fabric:/FashionARServer","ServiceFabric.ServiceName":"fabric:/FashionARServer/APIFrontend","ServiceFabric.PartitionId":"ef37365d-3be4-4d47-b0bc-3b6150f49d68","ServiceFabric.NodeName":"_nt1vm_1","ServiceFabric.InstanceId":"132676178159853282"}
operation_Name POST GetPendingCompetitionRewards/Post
operation_Id f5b34abd513e174b8fedeb7fde937574 
operation_ParentId f5b34abd513e174b8fedeb7fde937574
application_Version 9.9.23
client_Type PC
client_IP 0.0.0.0
client_CountryOrRegion United Kingdom
cloud_RoleName fabric:/FashionARServer/APIFrontend
cloud_RoleInstance 132676178159853282
appId d08126b9-7b82-4a9a-a4c0-1322a5105100
appName DevClusterAppInsights
iKey ef4d71c4-301a-440c-ad14-a08b7a2a2965
sdkVersion aspnet5c:2.17.0+c9d95e701e2474b7eb3b46ae7953b6c7570356ab
itemId fc00b189-c843-11eb-8848-cdd4efeaa105
itemCount 4
Dependency:
timestamp [UTC] 2021-06-08T10:26:08.4433374Z
id 87495a9ca4af5441
target fabric:/FashionARServer/UserActorService
type ServiceFabricServiceRemoting
name GetPendingCompetitionRewards
data fabric:/FashionARServer/UserActorService/GetPendingCompetitionRewards
success True
duration 2
performanceBucket <250ms
itemType dependency
operation_Name GetPendingCompetitionRewards
operation_Id f5b34abd513e174b8fedeb7fde937574
operation_ParentId dd2ed206271c0a49
client_Type PC
client_IP 0.0.0.0
client_City Cardiff
client_StateOrProvince Cardiff
client_CountryOrRegion United Kingdom
cloud_RoleInstance nt1vm000001
appId d08126b9-7b82-4a9a-a4c0-1322a5105100
appName DevClusterAppInsights
iKey ef4d71c4-301a-440c-ad14-a08b7a2a2965
sdkVersion rddsr:2.3.1-140
itemId fc9e0301-c843-11eb-8663-f7253130639b
itemCount 1
Backend Call:	
timestamp [UTC] 2021-06-08T10:26:08.4446195Z
id 17be469bd121cf4a
name GetPendingCompetitionRewards
success True
resultCode Not Applicable
duration 0.1078
performanceBucket <250ms
itemType request
customDimensions {"ServiceFabric.ApplicationTypeName":"FashionARServerType","ServiceFabric.ServiceTypeName":"UserActorServiceType","ServiceFabric.ApplicationName":"fabric:/FashionARServer","ServiceFabric.ServiceName":"fabric:/FashionARServer/UserActorService","ServiceFabric.PartitionId":"2796a814-b606-4381-ab1c-405152522cd7","ai_legacyRootId":"87495a9ca4af5441","ServiceFabric.NodeName":"_nt1vm_0","ServiceFabric.ReplicaId":"132645366371879734"}
operation_Name GetPendingCompetitionRewards
operation_Id aeb475dae1369a4ba0538b3386dc70b4
operation_ParentId 87495a9ca4af5441
application_Version 9.9.23
client_Type PC
client_IP 0.0.0.0
client_City Cardiff
client_StateOrProvince Cardiff
client_CountryOrRegion United Kingdom
cloud_RoleName fabric:/FashionARServer/UserActorService
cloud_RoleInstance 132645366371879734
appId d08126b9-7b82-4a9a-a4c0-1322a5105100
appName DevClusterAppInsights
iKey ef4d71c4-301a-440c-ad14-a08b7a2a2965
sdkVersion serviceremoting:2.3.1-140
itemId f6de2618-c843-11eb-b802-492c4f3bff35
itemCount 1

So it would appear that the backend call is correctly annotated with an operation_ParentId that matches the remoting call that triggerred it but it has an entirely separate operation_Id.

What could cause that?

@yantang-msft
Copy link
Contributor

The telemetry operationId definitely needs to be the same for correlation.
Make sure you didn't touch anything regarding System.Diagnostics.Activity (you can touch but if it's not done correctly then the correlation will be broken).
Also, I noticed the "Backend call" is from an actor service. The initialization of Actor Service has some trickiness, please check out this example project, hopefully that helps: https://github.com/yantang-msft/service-fabric-application-insights-example

@AlkisFortuneFish
Copy link
Author

The initialisation I pasted above is from one of the ActorServices. All services have the initialisation I pasted above (which I updated just now). I am not sure what trickiness you are referring to, as far as I can see it is the same as initialisation of any other non-ASP.NET SF service, we already have explicitly derived ActorService classes for all our actor services for other reasons anyway.

I am not sure why you keep pasting the example, I have seen the example, that is where I based my integration on, which you can clearly see if you read the code I pasted above.

@yantang-msft
Copy link
Contributor

The trickiness for Actor is that you need to initialize the code in ActorBackendService, not in Program.cs or ActorBackend.cs. If you are already doing this and didn't touch any System.Diagnostics.Activity, then I have no idea why the operationId changed.

@AlkisFortuneFish
Copy link
Author

Yeah, we're not touching System.Diagnostics.Activity anywhere as far as I can see. We need this sorted however, so what are the next steps to debug it? I have been directed here from a support ticket, so I'm not sure where I could take it from here otherwise.

@yantang-msft
Copy link
Contributor

You can generate the build yourself and try debug.
This is where the "dependency" telemetry is generated, and this is where the request telemetry is captured by "Backend call".
You can check System.Diagnostics.Activity and see whether the operationId is consistent/altered.

@AlkisFortuneFish
Copy link
Author

AlkisFortuneFish commented Jun 14, 2021

Now I've had a bit of time allocated to look into this and managed to make the remote debugger work for the first time ever (why on earth is it considered acceptable not to be able to debug if you have multiple instances of the same SF application on the cluster and to silently fail if you do try!?), I have some findings.

The activity Id is correctly passed from caller to receiver, the ParentId is deserialised and set.
The Request-Context is set.
The I don't see a correlation context being set on the headers, there is no Baggage on the activity, I don't know if there should be.

I am not hugely familiar with the overall pipeline, so any guidance as to what to look at in particular?

@AlkisFortuneFish
Copy link
Author

Poke. If you do not have the time to look into this let me know to re-engage with the support request and get some Microsoft resources that way, I just need to know.

@yantang-msft
Copy link
Contributor

@AlkisFortuneFish so the operation_parentId is set, but is it set correctly to match the id of the parent? If not, you can reengage the support or file issue under this repo: https://github.com/microsoft/ApplicationInsights-dotnet
The OperationContext is set when StartOperation gets called here, and StartOperation is implemented in this repo: https://github.com/microsoft/ApplicationInsights-dotnet

@AlkisFortuneFish
Copy link
Author

Ok, I think I've got to the bottom of this. There are a few different issues here, which is what made this difficult to track down.

  1. My initialization on the ASPNetCore side is not correct. Creating a TelemetryConfiguration, feeding it into the tracking module initialization and then calling AddApplicationInsightsTelemetry() causes a disconnect between request tracking and dependency tracking. I had to change the initialisation on that end to look more like the original example (albeit with the updated APIs).
  2. The Ids generated are not in the W3C format. The change of the default Id format to W3C makes the current (2.17.0) version of StartOperation() create a new Id and assign the old one to a legacy property. This does not work. I had to force Diagnostics.Activity.DefaultIdFormat back to Hierarchical for the Ids not to not change.
  3. TelemetryConfiguration.CreateDefault() now already adds the OperationCorrelationTelemetryInitializer, so creating a new one (as per the example) actually breaks correlation altogether.

So the fundamental issue is that this package, documentation, and examples are due an update to actually work with the latest ApplicationInsights package and .Net Core. I will update our integration to force the old format but I do not know if there are any plans to drop support for Hierarchical at some point, breaking the integration again in future.

@yantang-msft
Copy link
Contributor

I'm glad that you found where it went wrong and get it fixed.
BTW, for # 2, I would imagine the id format is gracefully handled. i.e., the operationId/operation_ParentId are all in new/old format and can be correlated correctly
for # 3, I think the TelemetryInitializer precedence strategy is first one wins, so # 3 might not be a real problem.

Anyways, this project is in maintenance mode, if the old format is a real problem and will be dropped someday, then this library won't work.

@AlkisFortuneFish
Copy link
Author

I'd have hoped that it's gracefully handled, as the documentation would suggest, but it does not appear to be entirely the case, if the Id format is W3C and the incoming Id is not, StartOperation in 2.17.0 generates a new Id instead of propagating what was passed in.
Number 3 doesn't sound like it should be problematic but I did test it and it is. I explicitly tested with and without the extra initialiser and with it correlation was broken. I did not have the time to check exactly what happened to the telemetry as it was nearly midnight at that point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants