
Building My Own Speaker with a Raspberry Pi

HW team members: 심하민, 성원희, 윤재선
SW team members: 박지완, 박은주, 남경현
[Notion]

How does it work?

Server:
  • Listens for requests (specifically POST requests in this project).
  • When a POST request hits the 'rapa' endpoint, the file content received in the POST is uploaded to /server_uploaded (and removed when the request ends).
  • Reads the file from that path and processes the data (the model part).
  • Finally, returns the answer text, generated by the GPT chat API with prompting (a minimal sketch follows this list).
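
A minimal sketch of this flow, assuming FastAPI (the `uvicorn server:api --reload` command mentioned below implies the ASGI app object is named `api`); the parameter names and the `process_audio` helper are illustrative, not the project's exact code:

```python
# server.py (sketch) -- run with: uvicorn server:api --reload
import os
from fastapi import FastAPI, UploadFile, File

api = FastAPI()
UPLOAD_DIR = "server_uploaded"

def process_audio(path: str) -> str:
    """Hypothetical stand-in for the model part: STT, speaker
    classification (sex/age), then a prompted GPT answer."""
    raise NotImplementedError

@api.post("/rapa")
async def rapa(file: UploadFile = File(...)):
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    path = os.path.join(UPLOAD_DIR, file.filename)
    with open(path, "wb") as f:
        f.write(await file.read())      # save the POSTed file content
    try:
        answer = process_audio(path)    # read it back and run the model part
    finally:
        os.remove(path)                 # removed when the request ends
    return {"answer": answer}
```
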
Client:
  • Sends tasks to the server (specifically POSTing the recorded file in this project).
  • When the user records a .wav file with the button, the client asks the server to produce an appropriate text response for that audio, taking the user's sex and age into account.
  • Finally, receives the answer text from the server, converts it into an audio file with Google TTS, and speaks it to the user (a sketch follows this list).
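
A sketch of the client side; the gTTS package stands in for "Google TTS" here (the repo's tts.py uses Google Cloud's API), and the URL, paths, and playback command are placeholders:

```python
# client.py (sketch)
import os
import requests
from gtts import gTTS

AUDIO_PATH = "recorded_audio/input.wav"      # placeholder path to the recorded .wav
SERVER_URL = "http://localhost:8000/rapa"    # placeholder server address

# POST the recorded file and receive the answer text
with open(AUDIO_PATH, "rb") as f:
    resp = requests.post(SERVER_URL, files={"file": f})
answer = resp.json()["answer"]

# turn the answer into speech and play it to the user
gTTS(answer, lang="ko").save("answer.mp3")
os.system("mpg321 answer.mp3")               # playback command depends on the device
```
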
Model:
  • Get personal info (sex, age): a Whisper model fine-tuned on Korean audio data.
  • Recognize the text of the audio: the Google Speech-to-Text (STT) API (a sketch follows this list).
  • Generate an appropriate answer text by prompting GPT.
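
For the STT step, a minimal Google Cloud Speech-to-Text call could look like this (assumes the google-cloud-speech package with a service-account JSON exported as GOOGLE_APPLICATION_CREDENTIALS; the encoding, sample rate, and path are assumptions):

```python
# Speech-to-Text (sketch) using the Google Cloud client library
from google.cloud import speech

client = speech.SpeechClient()               # picks up GOOGLE_APPLICATION_CREDENTIALS
with open("server_uploaded/input.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,  # assumed .wav encoding
    sample_rate_hertz=16000,                                   # assumed sample rate
    language_code="ko-KR",                                     # Korean input audio
)
response = client.recognize(config=config, audio=audio)
text = " ".join(r.alternatives[0].transcript for r in response.results)
```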

Explanation of the Python files

  • server.py: runs the server; to start it, open a terminal and run uvicorn server:api --reload
  • virtual_model.py: executed when the server calls it; first runs STT and detects sex and age from the audio file, then builds an answer with the prompted GPT
  • client.py: after recording the .wav file (the audio input), save it to recorded_audio and run this file (with the appropriate path set)
  • stt.py: Google Cloud Speech-to-Text; the auth JSON file must be set up before use
  • tts.py: like stt.py, a Google service; authentication must be set up first
  • ask_gpt.py: takes the extra info (sex, age) and the message transcribed from the audio input by the STT code (a prompting sketch follows this list)
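
A sketch of the prompting in ask_gpt.py, using the 2023-era openai ChatCompletion API; the function signature, model name, and the rule for adapting the prompt to the speaker are assumptions:

```python
# ask_gpt.py (sketch) -- prompt GPT with the transcript plus speaker info
import openai  # assumes OPENAI_API_KEY is set in the environment

def ask_gpt(message: str, sex: str, age: str) -> str:
    # assumed rule-based prompt tuning: adapt tone to the classified speaker
    tone = "a simple, friendly tone" if age == "child" else "a polite tone"
    system = (f"You are an AI speaker. The user is a {age} ({sex}). "
              f"Answer in Korean, in {tone}.")
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",               # assumed model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": message},
        ],
    )
    return resp.choices[0].message.content
```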

SW Team

- Initial plan: take the audio file received from the HW team, generate an appropriate response through morphological, syntactic, semantic, and discourse analysis, convert it back into an audio file, and hand it back to the HW team.

OpenAI Whisper, the model used for STT

https://openai.com/research/whisper
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.

  • Whisper STT converts the speech to text (a sketch follows this list), and speaker information is then extracted with the fine-tuned model.
  • Rule-based prompt tuning is applied according to that speaker information, and the GPT API generates text matching the input text.
  • Finally, the generated text is converted into an audio file using TTS.
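
A minimal sketch of the Whisper STT step, assuming the open-source openai-whisper package; the model size and file path are placeholders:

```python
# Whisper STT (sketch): speech -> text before speaker classification and GPT
import whisper

model = whisper.load_model("large")          # assumed model size
result = model.transcribe("recorded_audio/input.wav", language="ko")
print(result["text"])                        # transcript handed to the GPT step
```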

Datasets used for speaker classification

  • The dataset was a key factor in classifying speakers.
  • To distinguish child speakers from adults, we used the children's speech dataset from AI Hub, labeling children as 0 and adults as 1.
  • Going beyond the child/adult distinction, we also classified the speaker's sex.
  • To classify sex, we used the Korean speech dataset from AI Hub, labeling female speakers as 0 and male speakers as 1.

Model used for speaker classification

model (openai/whisper-large)

feature extractor (based on a CNN)


encoder (based on a Transformer)


linear classifier layer


model training

fine-tuning whisper

Freeze the model's feature extractor and encoder, and train only the randomly initialized linear layer (a sketch follows).
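
A sketch of this setup with Hugging Face Transformers and PyTorch: load openai/whisper-large, freeze its convolutional feature extractor and Transformer encoder, and train only a new linear head. The label conventions (child=0/adult=1, female=0/male=1) follow the dataset section above; the pooling choice and everything else are assumptions:

```python
# speaker classifier (sketch): frozen Whisper encoder + trainable linear layer
import torch.nn as nn
from transformers import WhisperModel

class SpeakerClassifier(nn.Module):
    def __init__(self, checkpoint="openai/whisper-large", num_labels=2):
        super().__init__()
        # the CNN feature extractor is the conv front-end inside the encoder
        self.encoder = WhisperModel.from_pretrained(checkpoint).encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # freeze feature extractor + encoder
        # randomly initialized linear classifier layer -- the only trained part
        self.classifier = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_features):       # log-mel features from WhisperFeatureExtractor
        hidden = self.encoder(input_features).last_hidden_state
        pooled = hidden.mean(dim=1)          # assumed: mean-pool over time
        return self.classifier(pooled)       # logits, e.g. child=0 / adult=1
```

Only self.classifier.parameters() would be passed to the optimizer; the same architecture can be trained twice, once per labeling scheme (child/adult and female/male).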

HW Team

Explanation of the Python files

  • audiolist.py: identifies the device numbers of the current audio input and output devices (a sketch follows this list)
  • stt.py: uses Google Cloud's Text-to-Speech API to convert text into speech and save it as an audio file
  • finrecord.py: records audio, sends it to the server, then receives the corresponding audio file and plays it through the speaker
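
A sketch of the device enumeration audiolist.py performs, assuming PyAudio (a common choice on the Raspberry Pi; the actual library is not stated in this README):

```python
# audiolist.py (sketch): print each audio device index with its I/O channels
import pyaudio

pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    print(f"{i}: {info['name']} "
          f"(inputs: {info['maxInputChannels']}, outputs: {info['maxOutputChannels']})")
pa.terminate()
```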

About

Prometheus, Raspberry Pi AI speaker team!
