Created the blog post announcing Data Prepper 2.0 #1066

Merged
151 changes: 151 additions & 0 deletions _posts/2022-10-10-Announcing-Data-Prepper-2.0.0.md
@@ -0,0 +1,151 @@
---
layout: post
title: "Announcing Data Prepper 2.0.0"
authors:
- dlv
- oeyh
date: 2022-10-10 15:00:00 -0500
categories:
- technical-post
---

Today the maintainers are announcing the release of Data Prepper 2.0. It has been over a year since Data Prepper 1.0 was first introduced,
**Contributor:** Let's change this paragraph to:

> The Data Prepper maintainers are proud to announce the release of Data Prepper 2.0. This release makes Data Prepper easier to use and helps you improve your observability stack based on feedback from our users.
>
> Here are some of the major changes and enhancements made for Data Prepper 2.0.

**Contributor:** Or maybe:

> The Data Prepper maintainers are proud to announce the release of Data Prepper 2.0. This release makes Data Prepper easier to use and helps you improve your observability stack based on feedback from you, our users.
>
> Here are some of the major changes and enhancements made for Data Prepper 2.0.

**Contributor:** @dlvenable: Could we add a line in this intro or somewhere in the blog about OpenSearch compatibility? Data Prepper 2.0 is compatible with all OpenSearch versions, correct?

**Member Author:** I added the following:

> Data Prepper 2.0 retains compatibility with all current versions of OpenSearch.

and this release brings significant changes based on feedback from our users. This release makes Data Prepper easier to use and helps
you improve your observability stack. This post will highlight some major changes and enhancements in this release.

## Conditional routing

When ingesting logs, pipeline authors often need to send different logs to different OpenSearch clusters. One example of this is routing logs based on log level.
Perhaps you want INFO-level logs, which produce large volumes of data, to go to a cluster or index with more frequent rollovers or deletions to clear out that data.

Data Prepper now supports conditional routing to help with use cases such as these. A pipeline author can configure routes
by defining a name that is appropriate for the domain along with a Data Prepper expression.
Then, for any sink that should receive only some of the data, the author defines one or more routes to apply. Data Prepper will evaluate
these expressions for each event to determine which sinks to route the event to. Any sink that has no routes defined will accept all events.

Continuing with log levels, consider an application log that includes a log level. A common Java application log might look like the following.

```
2022-10-10T10:10:10,421 [main] INFO org.example.Application - Saving 10 records to SQL table "orders"
```

The text that reads `INFO` indicates that this is an INFO-level log. Data Prepper pipeline authors can now route logs with this level to only certain OpenSearch clusters.

The following example pipeline shows how this works. This pipeline takes application logs from the `http` source, which
accepts log data from external sources such as Fluent Bit. The pipeline then uses the `grok` processor to split the log line into multiple fields.
Now the event has a field named `loglevel` that authors can use in routes. This pipeline has two OpenSearch sinks. The first sink only receives
logs with a log level of `WARN` or `ERROR`. The second sink has no routes defined, so Data Prepper will route all events to it.

```
application-log-pipeline:
  workers: 4
  delay: "50"
  source:
    http:
  processor:
    - grok:
        match:
          log: [ "%{NOTSPACE:time} %{NOTSPACE:thread} %{NOTSPACE:loglevel} %{NOTSPACE:class} - %{GREEDYDATA:message}" ]

  route:
    - warn_and_above: '/loglevel == "WARN" or /loglevel == "ERROR"'
  sink:
    - opensearch:
        routes:
          - warn_and_above
        hosts: ["https://opensearch:9200"]
        insecure: true
        username: "admin"
        password: "admin"
        index: warn-and-above-logs
    - opensearch:
        hosts: ["https://opensearch:9200"]
        insecure: true
        username: "admin"
        password: "admin"
        index: all-logs
```
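
For the sample log line above, the `grok` processor would produce an event along these lines (a sketch; the exact field set depends on the grok match, shown here as the JSON that Data Prepper uses to represent events):

```
{
  "log": "2022-10-10T10:10:10,421 [main] INFO org.example.Application - Saving 10 records to SQL table \"orders\"",
  "time": "2022-10-10T10:10:10,421",
  "thread": "[main]",
  "loglevel": "INFO",
  "class": "org.example.Application",
  "message": "Saving 10 records to SQL table \"orders\""
}
```

Because `loglevel` is `INFO`, the `warn_and_above` route does not match, so this event goes only to the `all-logs` sink.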

There are many other use cases that conditional routing can support. If there are other conditional expressions
you'd like to see support for, please create an issue in GitHub.

## Peer Forwarder

Data Prepper supports stateful aggregations for traces and logs. With these, pipeline authors can improve the quality of the data going into OpenSearch.
Prior to Data Prepper 2.0, performing stateful trace aggregations required the `peer-forwarder` processor plugin.
But this plugin worked only for traces and would send data back to the source. Also, log aggregations worked only on a single node.

Data Prepper 2.0 introduces peer forwarding as a core feature. This allows pipeline authors to perform stateful
aggregations on multiple Data Prepper nodes. When performing stateful aggregations, Data Prepper uses a hash ring to determine
which nodes are responsible for processing different events based on the values of certain fields. Data Prepper's core peer-forwarder
routes events to the node responsible for processing the event. That node then holds all the state necessary for performing the aggregation.
**Collaborator:** Suggested change:

```
- routes events to the node responsible for processing the event. That node then holds all the state necessary for performing the aggregation.
+ routes events to the node responsible for processing them. That node then holds all the states necessary for performing the aggregation.
```

**Member Author:** I'm not sure about the change to "states" here. Using a singular noun for state is quite common.

> In information technology and computer science, a system is described as stateful if it is designed to remember preceding events or user interactions; the remembered information is called the state of the system.

https://en.wikipedia.org/wiki/State_(computer_science)


To use peer forwarding, you configure how Data Prepper discovers other nodes and how connections are secured in your
`data-prepper-config.yaml` file. The following snippet shows an example of how to do this.

```
peer_forwarder:
  discovery_mode: dns
  domain_name: "my-data-prepper-cluster.production"
  ssl_certificate_file: /usr/share/data-prepper/config/my-certificate.crt
  ssl_key_file: /usr/share/data-prepper/config/my-certificate.key
  ssl_fingerprint_verification_only: true
  authentication:
    mutual_tls:
```

In the example above, Data Prepper will discover other peers using DNS. It will perform a DNS query on the domain `my-data-prepper-cluster.production`.
This DNS record should be an A record with a list of IP addresses for peers. The configuration uses a custom certificate and private key.
It performs host verification by checking the fingerprint of the certificate. Finally, it configures each server to authenticate requests using
mutual TLS (mTLS) to prevent tampering with data.
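
For illustration, the A record for peer discovery might look like the following (hypothetical zone file entries with made-up addresses), with one entry per Data Prepper node:

```
my-data-prepper-cluster.production.  60  IN  A  10.0.1.11
my-data-prepper-cluster.production.  60  IN  A  10.0.1.12
my-data-prepper-cluster.production.  60  IN  A  10.0.1.13
```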

## Directory structure

Previously, Data Prepper was distributed as a single executable JAR file. This was simple and convenient, but it also made it difficult for Data Prepper
to include custom plugins. Data Prepper 2.0 changes this and now distributes the application as a bundled directory structure.
The new directory structure features a shell script to launch Data Prepper and dedicated subdirectories for JAR files, configurations, pipelines, logs, and more.
The directory structure looks like this:

```
data-prepper-2.0.0/
  bin/
    data-prepper               # Shell script to run Data Prepper
  config/
    data-prepper-config.yaml   # The Data Prepper configuration file
    log4j.properties           # Logging configuration
  pipelines/                   # New directory for pipelines
    trace-analytics.yaml
    log-ingest.yaml
  lib/
    data-prepper-core.jar
    ... any other jar files
  logs/
```

With this change, a user can launch Data Prepper by simply running `bin/data-prepper`. No additional command-line arguments or Java system property definitions
are required. Instead, the application will load its configuration from the `config/` subdirectory.

Data Prepper will also read pipeline configurations from the `pipelines/` subdirectory. Users can now define pipelines across
multiple YAML files in this subdirectory, where each file contains the configuration for one or more pipelines, as shown in the sketch below. This will
allow users to keep their pipeline definitions distinct and thus more compact and focused.
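
For example, a minimal sketch of what one such file might contain (the file name and pipeline are hypothetical; the `stdout` sink is used only to keep the example small):

```
# pipelines/log-ingest.yaml
log-pipeline:
  source:
    http:
  sink:
    - stdout:
```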

## JSON & CSV parsing

Many of our users have incoming data with embedded JSON or CSV fields. Now Data Prepper supports parsing either JSON or CSV.

A common example of this is when one larger object includes a serialized JSON string. If your incoming event data has a
serialized JSON string, you can use the `parse_json` processor to extract the fields from the JSON into your event.
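
For example, here is a minimal sketch of the `parse_json` processor (the `source` field name is an assumption for illustration):

```
processor:
  - parse_json:
      # "message" is assumed to be the field holding the serialized JSON string
      source: "message"
```

Given an event such as `{"message": "{\"user\": \"alice\", \"status\": 200}"}`, the processor would add `user` and `status` as fields on the event.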

Data Prepper can now import CSV- or TSV-formatted files from Amazon S3 sources. This is useful for systems like Amazon CloudFront,
**Collaborator:** Can we remove "formatted"? Otherwise, this would need to be "CSV- or TSV-formatted files".

which write their access logs as TSV files. Now you can parse these logs using Data Prepper. Additionally, if your events have
CSV or TSV fields, Data Prepper has a `csv` processor that can create fields from your incoming CSV data.
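
For example, a minimal sketch of the `csv` processor (the column names here are hypothetical):

```
processor:
  - csv:
      # "message" is assumed to be the field holding the raw CSV text
      source: "message"
      # Hypothetical columns for incoming lines such as "2022-10-10,GET,200"
      column_names: ["date", "method", "status"]
```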

## Other improvements

Data Prepper 2.0 includes a number of other improvements. We’d like to highlight a few of them.

* The OpenSearch sink now supports create actions to OpenSearch. When Data Prepper writes documents to OpenSearch, it normally does this via an index action, which creates the document if it does not exist or updates it if it does. Now a pipeline author can configure Data Prepper to use the create action instead. When this is configured, the OpenSearch cluster will not update the document if it already exists. Some scenarios call for this so that documents are only saved once and never updated. A sketch of this configuration appears after this list.
* The HTTP source now supports loading TLS/SSL credentials from either Amazon S3 or AWS Certificate Manager. The OTel trace source already supported these options, and now pipeline authors can configure them for their log ingestion use cases as well.
* Data Prepper now requires Java 11 and the Docker image deploys with JDK 17.
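
As an example of the create action mentioned above, here is a sketch of an OpenSearch sink configured to only create documents (the index name is hypothetical, and the `action` setting shown is an assumption for illustration):

```
sink:
  - opensearch:
      hosts: ["https://opensearch:9200"]
      username: "admin"
      password: "admin"
      index: immutable-logs   # hypothetical index name
      action: create          # create documents only; never update existing ones
```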

Please see our release notes for a complete list.
**Contributor:** Do we have a link to these release notes?