Skip to content

jeremyber-aws/kinesis-data-analytics-apache-flink-async-io

 
 

Repository files navigation

Async IO for Flink Applications with Amazon Kinesis Data Analytics

This project is heavily influenced by the Apache Flink AsyncIO documentation and code samples which describe the methods for making asynchronous calls to an external data source. The overall idea behind the AsyncIO function is that we are utilizing the fact that this call can be made asynchronously (at the same time as other requests) to ultimately improve throughput.

dag cw

How does it work?

This codebase reads from a Kinesis Data Stream, invokes the AsyncFunction with an Apache HTTP Client and posts to an HTTP endpoint. The results are then returned asynchronously and streamed to the sink (KDS). This was done to showcase that the throughput was keeping up despite a slow Rest API.

It also performs the same Rest API call synchronously via a Map function. This was done for illustrative purposes, and should not be done in production. If you need to control concurrent requests to an external data source, use capacity instead.

Some considerations:

The most important parameters when defining an AsyncFunction are:

  • Capacity - how many requests are in-flight concurrently per parallel sub-task
  • Timeout - the timeout duration of an individual request to the external data source.

Ensure you are allocating enough capacity to account for the throughput, but not more than the external data source can handle.

Example Application with parallelism of 5 and capacity of 10 will send 50 concurrent requests to your external data source.

About

Showcases the AsyncIO Functionality within Apache Flink for Kinesis Data Analytics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 100.0%