Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datadogfleetautomation extension POC #5272

Open
wants to merge 1 commit into
base: prod
Choose a base branch
from

Conversation

jackgopack4
Copy link

@jackgopack4 jackgopack4 commented Mar 3, 2025

Description

initial POC for Datadog Fleet Automation Extension

Link to tracking issue

Fixes Jira OTEL-2445

Testing

Full unit testing (todo: will open a card to write a few integration tests)

Documentation

Have only written a barebones README.md to pass checks; figured that would make sense to do after we make changes or add/remove/adjust functionality.

Additional notes

Does not yet send the new, third payload type. I am working on a separate branch to add that payload and am planning to open a second PR after this one is accepted to our fork

@jackgopack4 jackgopack4 requested review from dineshg13 and mx-psi March 3, 2025 14:42
GetModuleInfos() service.ModuleInfos
}

if host, ok := host.(exportModules); ok {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Code Quality Violation

Avoid modifying function parameters (...read more)

Assigning new values to function parameters exhibits several bad coding practices and should be avoided for several reasons:

  1. Redefining parameters: The code redefines a parameter within the function body by assigning a new value. This is considered a bad practice because it can lead to confusion and make the code harder to understand. Modifying a function parameter in this manner breaks the expected behavior and can cause unexpected side effects. It is generally best to treat function parameters as read-only values and avoid reassigning them.
  2. Shadowing variables: The code further exacerbates the issue by using the short variable declaration := to define a new variable within the function body. This shadows the original parameter, making it inaccessible within the function. Shadowing variables can cause confusion and make the code harder to reason about. It is better to use distinct variable names to maintain clarity and avoid any unintended side effects.

To write more maintainable and understandable code, it is advisable to adhere to the following practices:

  • Avoid redefining function parameters.
  • Use descriptive and unambiguous variable names.
  • Avoid shadowing variables.
  • Maintain consistency in variable references.

By following these best practices, the code becomes more readable and easier to manage and avoids introducing unnecessary complexity and confusion.

View in Datadog  Leave us feedback  Documentation

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: update to GetModulesInfo() in service/hostcapabilities

@jackgopack4
Copy link
Author

Addressing the golint violations

Comment on lines +51 to +80
extensionConfig *Config
extensionID component.ID
telemetry component.TelemetrySettings
collectorConfig *confmap.Conf
collectorConfigStringMap map[string]any
ticker *time.Ticker
done chan bool
mu sync.RWMutex
uuid uuid.UUID

buildInfo component.BuildInfo
moduleInfo service.ModuleInfos
moduleInfoJSON *moduleInfoJSON
activeComponentsJSON *activeComponentsJSON
version string

forwarder defaultForwarderInterface
compressor *compression.Compressor
serializer serializer.MetricSerializer

agentMetadataPayload AgentMetadata
otelMetadataPayload OtelMetadata

httpServer *http.Server
healthCheckV2Enabled bool
healthCheckV2ID *component.ID // currently first healthcheckv2 extension found; could expand to multiple health checks if needed
healthCheckV2Config map[string]any
componentStatus map[string]any // retrieved from healthcheckv2 extension, if enabled/configured

hostnameProvider source.Provider
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it is doing many things. Can we split this multiple internal packages. Like http server can be its own package

Comment on lines +4 to +15
package datadogfleetautomationextension // import "github.com/open-telemetry/opentelemetry-collector-contrib/extension/datadogfleetautomationextension"

import (
"encoding/json"
"fmt"

"github.com/DataDog/datadog-agent/pkg/serializer/marshaler"
"go.opentelemetry.io/collector/component"
)

type OtelCollector struct {
HostKey string `json:"host_key"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move it to internal/payload package.

// Copyright The OpenTelemetry Authors
// SPDX-License-Identifier: Apache-2.0

package datadogfleetautomationextension // import "github.com/open-telemetry/opentelemetry-collector-contrib/extension/datadogfleetautomationextension"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be name it datadogfaextension ?

Comment on lines +50 to +64
type defaultComponentChecker struct {
extension *fleetAutomationExtension
}

func (d *defaultComponentChecker) isComponentConfigured(typ string, componentsKind string) (bool, *component.ID) {
return d.extension.isComponentConfigured(typ, componentsKind)
}

func (d *defaultComponentChecker) isModuleAvailable(componentType string, componentKind string) bool {
return d.extension.isModuleAvailable(componentType, componentKind)
}

func (d *defaultComponentChecker) isHealthCheckV2Enabled() (bool, error) {
return d.extension.isHealthCheckV2Enabled()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we decouple component checker and extension. may be make component checker its own module

Comment on lines +141 to +149
if componentsKind == extensionsKind {
if componentStatus, ok := componentsConfig[componentsKind].(map[string]any); ok {
if receiversStatus, ok := componentStatus["components"].(map[string]any); ok {
componentNameInPipeline := componentKind + ":" + id
if componentStatus, ok := receiversStatus[componentNameInPipeline].(map[string]any); ok {
result = componentStatus
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try to avoid arrow pattern. see https://blog.codinghorror.com/flattening-arrow-code/
try changing to

if componentsKind != extensionsKind {
continue 
}

Comment on lines +156 to +174
if pipelineMap, ok := value.(map[string]any); ok {
if components, ok := pipelineMap["components"].(map[string]any); ok {
for component, status := range components {
componentParts := strings.Split(component, ":")
if len(componentParts) != 2 {
continue
}
kind := componentParts[0]
fullID := componentParts[1]
if kind == componentKind && id == fullID {
if componentStatus, ok := status.(map[string]any); ok {
result[key] = map[string]any{
component: componentStatus,
}
}
}
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Comment on lines +95 to +97
e.mu.Lock()
defer e.mu.Unlock()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need mutex here ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think too hard about this one; just knew that other extensions that used ConfigWatcher also used a mutex, and I also am not too clear on how multiple instances of the same extension would work. maybe not necessary for this.

Comment on lines +19 to +24
// ErrUnsetAPIKey is returned when the API key is not set.
ErrUnsetAPIKey = datadogconfig.ErrUnsetAPIKey
// ErrEmptyEndpoint is returned when endpoint is empty
ErrEmptyEndpoint = datadogconfig.ErrEmptyEndpoint
// ErrAPIKeyFormat is returned if API key contains invalid characters
ErrAPIKeyFormat = datadogconfig.ErrAPIKeyFormat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to export all of these, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's true I can probably move most things to un-exported variables/functions

Comment on lines +25 to +26
// NonHexRegex is a regex of characters that are always invalid in a Datadog API Key
NonHexRegex = datadogconfig.NonHexRegex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Comment on lines +32 to +37
// DefaultSite is the default site for the Datadog API.
DefaultSite = datadogconfig.DefaultSite
// NonHexChars is a regex of characters that are always invalid in a Datadog API key.
NonHexChars = datadogconfig.NonHexChars
// DefaultReporterPeriod is the default amount of time between sending fleet automation payloads to Datadog.
DefaultReporterPeriod = 20 * time.Minute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Comment on lines +47 to +48
// ReporterPeriod sets the amount of time between sending fleet automation payloads to Datadog.
ReporterPeriod time.Duration `mapstructure:"reporter_period"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to make this configurable?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we allow it to be configured in hostmetadata reporter so I figured it made sense to keep the configuration the same. I don't actually care if we let users configure it or not and would just as easily remove it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just remove it for this first version

Comment on lines +32 to +37
// APIKeyValidator is a function that validates the API key, provided to newExtension for mocking
APIKeyValidator func(context.Context, string, *zap.Logger, *datadog.APIClient) error
// SourceProviderGetter is a function that returns a source.Provider, provided to newExtension for mocking
SourceProviderGetter func(component.TelemetrySettings, string, time.Duration) (source.Provider, error)
// ForwarderGetter is a function that returns a defaultforwarder.Forwarder, provided to newExtension for mocking
ForwarderGetter func(coreconfig.Component, corelog.Component) defaultforwarder.Forwarder
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can all be private

collectorConfig *confmap.Conf
collectorConfigStringMap map[string]any
ticker *time.Ticker
done chan bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the done channel used for?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is used to close the ticker which sends the metadata every "reporterperiod" amount of time, but it looks like you found it and we can just do a cancellable context instead

}
}

func (e *fleetAutomationExtension) checkHealthCheckV2() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why we do all this. We could just call the same API that the healthcheck extension uses and I think this would be much simpler

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. I don't know that it makes sense to duplicate healthCheckV2 functionality in our own extension, but it would make us less tightly coupled to healthcheckv2 and allow us to iterate on our own cadence.
I'll work on a refactor that does it that way instead and see if that is simpler

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to duplicate its functionality, just the bits that get the info from the host. We don't need to necessarily use the format that the extension is using

if err != nil {
e.telemetry.Logger.Error("Failed to prepare and send fleet automation payloads", zap.Error(err))
}
case <-e.done:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this I would pass a context and wait for the context to be cancelled

e.mu.RLock()
defer e.mu.RUnlock()
if e.hostnameSource == "unset" {
e.telemetry.Logger.Info("Skipping fleet automation payloads since the hostname is empty")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case should just not happen, we should have bailed out if the extension cannot work properly with the user provided configuration

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I thought maybe it made sense to log it, and allow the user to update the config (collector technically allows config changes while running, right?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, well, there is very limited functionality related to this, we can always revert back to not adding an error in the future (but it's harder to go from "no error" -> error)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be in its own internal package

@jackgopack4 jackgopack4 force-pushed the datadogfleetautomation-extension branch 2 times, most recently from 7bb0557 to 3176d4f Compare March 6, 2025 14:09
@jackgopack4 jackgopack4 force-pushed the datadogfleetautomation-extension branch from 3176d4f to e7ed96a Compare March 6, 2025 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants