Speech-to-text-data_collection

Introduction

In this project, we are going to design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file.

By the end of this project, we will have a deployable tool that handles posting and receiving text and audio files to and from a data lake, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.

Technologies

  • Apache Kafka: To sequentially log streaming data into specific topics
  • Apache Airflow: To create, orchestrate, and monitor data workflows
  • Apache Spark: To transform and load data from the Kafka cluster
  • S3 Buckets: For storing transformed streaming data
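
As a rough illustration of how a sentence and its recording might travel through Kafka topics, here is a minimal sketch using only the Python standard library. The topic names, field names, and helper functions are hypothetical and are not taken from this repository; the sender is passed in as a plain callable so the sketch runs without a broker.

```python
import base64
import json
import uuid
from typing import Callable, Optional

# Hypothetical topic names; the project's real topics are not given in this README.
TEXT_TOPIC = "sentences"
AUDIO_TOPIC = "recordings"


def make_sentence_message(text: str, sentence_id: Optional[str] = None) -> bytes:
    """Serialize a sentence to post to volunteers as UTF-8 JSON bytes."""
    record = {"sentence_id": sentence_id or str(uuid.uuid4()), "text": text}
    # ensure_ascii=False keeps Amharic characters readable in the payload.
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def make_audio_message(sentence_id: str, audio: bytes) -> bytes:
    """Serialize a recording, linked to its sentence by sentence_id."""
    record = {
        "sentence_id": sentence_id,
        # Raw audio bytes are not valid JSON, so they are base64-encoded here.
        "audio_b64": base64.b64encode(audio).decode("ascii"),
    }
    return json.dumps(record).encode("utf-8")


def publish(send: Callable[[str, bytes], object], topic: str, payload: bytes) -> None:
    """Hand a payload to any sender that accepts (topic, bytes)."""
    send(topic, payload)


if __name__ == "__main__":
    sent = []  # in-memory stand-in for a Kafka producer

    def fake_send(topic: str, payload: bytes) -> None:
        sent.append((topic, payload))

    publish(fake_send, TEXT_TOPIC, make_sentence_message("ሰላም", "s-1"))
    publish(fake_send, AUDIO_TOPIC, make_audio_message("s-1", b"\x00\x01"))
    for topic, payload in sent:
        print(topic, payload.decode("utf-8"))
```

With a real cluster, the `send` argument could be the `send` method of a kafka-python `KafkaProducer`, which takes the topic and value as its first two arguments. That pairing is an assumption about how one might wire this up, not a description of this project's code.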

Volunteer

To help us collect audio data for the Amharic language, visit datacollectionpipeline. On the home page, go to 'CONTRIBUTE AUDIO'.

You will be presented with a statement in Amharic. Click 'Record' and read the statement out loud. Once you have finished, click 'Stop' and send the recording.

[Image: Record Audio]

Architecture

The following detailed technical diagram shows the configuration of the architecture.

[Image: Architecture]