MetadataHandler
As part of forming a query execution plan that includes a federated data source, Athena needs a way to obtain key metadata from your source. More precisely, Athena needs a way to obtain:
- List of schemas (also known as databases).
- List of tables in a given schema.
- Table definitions (e.g. column names, column types).
- Partitions that should be queried for a given Schema, Table, and Predicate.
- How to split up and parallelize reads of a partition.
The Athena Query Federation SDK provides MetadataHandler, an abstract class that you can extend to implement the above functionality via the following functions:
- doListSchemas(...) - lists available schemas.
- doListTables(...) - lists available tables in a schema.
- doGetTable(...) - gets the definition of a Table.
- doGetTableLayout(...) - provides partition information and optionally performs partition pruning.
- doGetSplits(...) - tells Athena how it can split up and parallelize reads of a Partition.
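To make the division of labor between these five functions concrete, here is a minimal, self-contained sketch. It does not use the SDK: `SketchMetadataHandler`, `InMemoryMetadataHandler`, the plain `String`/`List`/`Map` signatures, and the `sales.orders` sample data are all hypothetical stand-ins. The real MetadataHandler methods take SDK request and response objects that are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the SDK's abstract MetadataHandler. The method
// names mirror the SDK, but these simplified signatures are assumptions.
abstract class SketchMetadataHandler {
    abstract List<String> doListSchemas();
    abstract List<String> doListTables(String schema);
    abstract Map<String, String> doGetTable(String schema, String table); // column name -> type
    abstract List<String> doGetTableLayout(String schema, String table, String predicate);
    abstract List<String> doGetSplits(String partition);
}

// Toy source with one schema ("sales"), one table ("orders"), two partitions.
class InMemoryMetadataHandler extends SketchMetadataHandler {
    private static final List<String> PARTITIONS = Arrays.asList("year=2022", "year=2023");

    private static boolean isKnownTable(String schema, String table) {
        return "sales".equals(schema) && "orders".equals(table);
    }

    @Override
    List<String> doListSchemas() {
        return Collections.singletonList("sales");
    }

    @Override
    List<String> doListTables(String schema) {
        if ("sales".equals(schema)) {
            return Collections.singletonList("orders");
        }
        return Collections.<String>emptyList();
    }

    @Override
    Map<String, String> doGetTable(String schema, String table) {
        Map<String, String> columns = new LinkedHashMap<>();
        if (isKnownTable(schema, table)) {
            columns.put("order_id", "bigint");
            columns.put("year", "int");
        }
        return columns;
    }

    // Partition pruning: a null predicate keeps every partition; otherwise
    // only the partition that exactly matches the predicate string is kept.
    @Override
    List<String> doGetTableLayout(String schema, String table, String predicate) {
        List<String> kept = new ArrayList<>();
        if (!isKnownTable(schema, table)) {
            return kept;
        }
        for (String partition : PARTITIONS) {
            if (predicate == null || predicate.equals(partition)) {
                kept.add(partition);
            }
        }
        return kept;
    }

    // Each partition is divided into two splits that could be read in parallel.
    @Override
    List<String> doGetSplits(String partition) {
        return Arrays.asList(partition + "/split-0", partition + "/split-1");
    }
}

class MetadataHandlerDemo {
    public static void main(String[] args) {
        SketchMetadataHandler handler = new InMemoryMetadataHandler();
        // Plan a read of sales.orders restricted to a single partition.
        for (String partition : handler.doGetTableLayout("sales", "orders", "year=2023")) {
            System.out.println(handler.doGetSplits(partition));
        }
    }
}
```

In this sketch the predicate is a single partition string only to keep the pruning step visible. A real connector would evaluate whatever constraints Athena supplies.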
Also provided is a partial implementation of these methods that uses the AWS Glue Data Catalog for metadata. The GlueMetadataHandler can jump-start your MetadataHandler if your source lacks its own metadata store. The athena-redis connector is an example that uses the AWS Glue Data Catalog, since Redis lacks a traditional metastore to help Athena interpret your Redis keys, prefixes, and zsets as Tables and Columns.
In most cases you will deploy a MetadataHandler and RecordHandler together in the same Lambda function by using a CompositeHandler. There are, however, some cases where you may want to deploy them independently. Athena supports this, and it is most often done for one of the following reasons:
- You have a centralized source of metadata for all your data sources (e.g. a single source of truth) that is in its own VPC.
- Your data sources themselves are in separate VPCs that do not contain the metadata source.
- Your metadata operations and data reads require different scale or languages in their Lambda functions.
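The combined deployment described above can be pictured as one entry point that forwards each incoming request to the appropriate handler. The sketch below illustrates that idea only, on the assumption that requests can be told apart by kind. `SketchRequest`, `SketchRequestKind`, `SketchHandler`, and `SketchCompositeHandler` are hypothetical names, not the SDK's CompositeHandler API.

```java
// Hypothetical request model; the real SDK defines its own request types.
enum SketchRequestKind { METADATA, RECORDS }

class SketchRequest {
    final SketchRequestKind kind;
    final String payload;

    SketchRequest(SketchRequestKind kind, String payload) {
        this.kind = kind;
        this.payload = payload;
    }
}

// Hypothetical common interface for both kinds of handler.
interface SketchHandler {
    String handle(SketchRequest request);
}

// One entry point (think: a single Lambda function) that delegates to a
// metadata handler or a record handler depending on the request kind.
class SketchCompositeHandler implements SketchHandler {
    private final SketchHandler metadataHandler;
    private final SketchHandler recordHandler;

    SketchCompositeHandler(SketchHandler metadataHandler, SketchHandler recordHandler) {
        this.metadataHandler = metadataHandler;
        this.recordHandler = recordHandler;
    }

    @Override
    public String handle(SketchRequest request) {
        switch (request.kind) {
            case METADATA:
                return metadataHandler.handle(request);
            case RECORDS:
                return recordHandler.handle(request);
            default:
                throw new IllegalArgumentException("Unknown request kind: " + request.kind);
        }
    }
}

class CompositeHandlerDemo {
    public static void main(String[] args) {
        SketchHandler metadata = request -> "metadata:" + request.payload;
        SketchHandler records = request -> "records:" + request.payload;
        SketchHandler composite = new SketchCompositeHandler(metadata, records);

        System.out.println(composite.handle(new SketchRequest(SketchRequestKind.METADATA, "doListSchemas")));
        System.out.println(composite.handle(new SketchRequest(SketchRequestKind.RECORDS, "readSplit")));
    }
}
```

Deploying the handlers independently corresponds to dropping the composite and exposing each handler as its own entry point, which is what lets the two sit in different VPCs or scale separately.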