Sink for writing APC (i.e. passenger count) data to Parquet files, which are stored in Blob Storage.
Building a runnable JAR:
./gradlew shadowJar
Use Gradle or Docker to run the service locally. A connection to Apache Pulsar is required, and two environment variables must be specified:
- BLOB_CONNECTION_STRING: connection string to the blob storage
- BLOB_CONTAINER: name of the blob container to be used
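As a pre-flight check before starting the service, the two variables can be verified with a small script. This is only a sketch and not part of the service: the helper name `missing_settings` is hypothetical, and treating an empty value as missing is an assumption.

```python
import os
import sys

# Names taken from the list above.
REQUIRED_VARIABLES = ("BLOB_CONNECTION_STRING", "BLOB_CONTAINER")


def missing_settings(env):
    """Return the required variable names that are unset or empty in env."""
    # Assumption: an empty string is treated the same as an unset variable.
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        sys.exit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```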
Data is written to Parquet files for which the schema can be found here.
Each file contains data for 15 minutes, based on the time the data was received. File names are in the format apc_<date>T<hour>-<minute>.parquet, where <date> is the date in ISO 8601 format, <hour> is the hour of the day (0-23) and <minute> is 1-4 for each quarter of the hour. File names use the UTC time zone.
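The naming rule can be illustrated with a short sketch that derives a file name from a receive time. The function name `apc_file_name` is hypothetical, and two details are assumptions not confirmed by this README: the date is written as YYYY-MM-DD, and the hour is not zero-padded.

```python
from datetime import datetime, timezone


def apc_file_name(received_at: datetime) -> str:
    """Build the file name for the 15-minute window containing received_at."""
    if received_at.tzinfo is None:
        raise ValueError("received_at must be timezone-aware")
    utc = received_at.astimezone(timezone.utc)  # file names use UTC
    quarter = utc.minute // 15 + 1  # 1-4, one value per quarter of the hour
    # Assumptions: YYYY-MM-DD date, hour without zero padding.
    return f"apc_{utc.date().isoformat()}T{utc.hour}-{quarter}.parquet"
```

Under these assumptions, data received at 09:47 UTC on 2023-05-01 falls in the fourth quarter of hour 9, so the sketch returns apc_2023-05-01T9-4.parquet.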
Metadata and index tags are added to the blob when it is uploaded to Blob Storage. The metadata are row_count, which is the number of rows in the Parquet file, and parquet_crc, which is the CRC code of the file contents encoded in Base64. The index tags are min_tst, which is the smallest timestamp (tst) in the file, and max_tst, which is the largest timestamp in the file.
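A consumer that downloads a blob could recompute these values to verify the file. The sketch below shows one way to do it, under assumptions this README does not confirm: the CRC variant is taken to be CRC-32 serialized as 4 big-endian bytes before Base64 encoding, and timestamps are taken to be UTC ISO 8601 strings of identical format, so that string order matches time order. The function names are hypothetical.

```python
import base64
import zlib


def blob_metadata(parquet_bytes: bytes, row_count: int) -> dict[str, str]:
    """Compute row_count and parquet_crc metadata values for a file."""
    # Assumption: CRC-32 over the whole file, 4 big-endian bytes, then Base64.
    crc = zlib.crc32(parquet_bytes)
    crc_b64 = base64.b64encode(crc.to_bytes(4, "big")).decode("ascii")
    return {"row_count": str(row_count), "parquet_crc": crc_b64}


def blob_index_tags(timestamps: list[str]) -> dict[str, str]:
    """Compute min_tst and max_tst index tags from the tst values."""
    # Assumption: uniform UTC ISO 8601 strings, which sort chronologically.
    return {"min_tst": min(timestamps), "max_tst": max(timestamps)}
```

If the recomputed parquet_crc or row_count differs from the value stored on the blob, either the file was altered or the service uses a different CRC variant or encoding than assumed here.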