
Speech-to-text data collection with Kafka, Airflow, and Spark, building a pipeline that can be deployed to process posting and receiving text and audio files from and into a data lake, apply transformation in a distributed manner, and load it into a warehouse in a suitable format to train a speech-to-text model.


TECH-IN-EYE/Speech-to-text-data_collection

 
 


Speech-to-text-data_collection

Introduction

In this project, we design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file.

By the end of this project, we will have a deployable tool that posts and receives text and audio files to and from a data lake, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.
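As a rough illustration of what "a format suitable for training" could look like, the sketch below pairs one recording with its sentence and writes a JSON line containing the audio path, the text, and the clip duration. This layout is an assumption, not the format used by this repository; the function names, field names, and the example path are all hypothetical.

```python
import io
import json
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def manifest_line(audio_path: str, transcript: str, wav_bytes: bytes) -> str:
    """Build one JSON line pairing an audio file with its transcript.

    The field names are assumptions made for this sketch.
    """
    row = {
        "audio_path": audio_path,
        "text": transcript.strip(),
        "duration": round(wav_duration_seconds(wav_bytes), 3),
    }
    # ensure_ascii=False keeps Amharic text readable in the output file.
    return json.dumps(row, ensure_ascii=False)


def make_silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Create a silent mono 16-bit WAV, used here only as demo input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical object path; a real pipeline would use its own layout.
    print(manifest_line("audio/0001.wav", " ሰላም ", make_silent_wav(2.0)))
```

One line per recording keeps the file easy to append to and easy to read in a distributed job, which is why this layout is common for speech datasets.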

Table of Contents

Technologies

  • Apache Kafka: To sequentially log streaming data into specific topics
  • Apache Airflow: To create, orchestrate, and monitor data workflows
  • Apache Spark: To transform and load data from the Kafka cluster
  • S3 Buckets: For storing transformed streaming data
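
To make the "post a sentence, receive an audio file" flow more concrete, here is a minimal sketch of how an audio submission might be turned into the key and value bytes that a Kafka producer sends to a topic. It uses only the standard library and does not talk to a broker. The topic names, the JSON layout, and the function names are assumptions for illustration, not this project's actual schema.

```python
import base64
import json
from typing import Tuple

# Hypothetical topic names, not taken from this repository.
TEXT_TOPIC = "text-corpus"
AUDIO_TOPIC = "audio-recordings"


def encode_submission(sentence_id: str, audio: bytes) -> Tuple[bytes, bytes]:
    """Return (key, value) bytes for one recorded sentence.

    The audio is base64-encoded so the whole record is valid JSON.
    """
    value = {
        "sentence_id": sentence_id,
        "audio_b64": base64.b64encode(audio).decode("ascii"),
    }
    return sentence_id.encode("utf-8"), json.dumps(value).encode("utf-8")


def decode_submission(value: bytes) -> Tuple[str, bytes]:
    """Reverse encode_submission on the consumer side."""
    payload = json.loads(value.decode("utf-8"))
    return payload["sentence_id"], base64.b64decode(payload["audio_b64"])


if __name__ == "__main__":
    key, value = encode_submission("sent-0001", b"\x00\x01\x02")
    print(AUDIO_TOPIC, key, decode_submission(value))
```

Using the sentence ID as the record key means that, with Kafka's default partitioner, all recordings of the same sentence land in the same partition. For long recordings, a pipeline might instead store the audio in the S3 bucket and send only its path through Kafka.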

Volunteer

To help us collect audio data for the Amharic language, visit datacollectionpipeline. On the home page, go to 'CONTRIBUTE AUDIO'.

You will be presented with a statement in Amharic. Click 'Record' and read the statement out loud. Once you have finished, click 'Stop' and then send the recording. (Figure: Record Audio)

Architecture

The following detailed technical diagram shows the configuration of the architecture. (Figure: Architecture)

