Voice recording

To communicate with the voice assistant, we use the Python package SpeechRecognition (imported as speech_recognition) together with its Google speech recognition backend. The package can capture audio and transcribe it into plain text.

Our implementation of a speech-to-text function takes the audio from the microphone (as a stream) and transcribes it to text with the recognize_google() function. Once the audio is in text form, it can be processed by language models and used to formulate responses. This is the basis of how the voice assistant works.

Recording Audio:

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    audio = r.listen(source)
    # alternatively, record a fixed length: audio = r.record(source, duration=5)
text = r.recognize_google(audio)
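The snippet above assumes that recognition always succeeds. As a minimal sketch, the hypothetical helper transcribe_or_none() below wraps recognize_google() and handles the two exceptions the package raises for it: UnknownValueError when the speech is unintelligible and RequestError when the Google service cannot be reached. The helper name, the decision to return None, and the re-raise as RuntimeError are assumptions for illustration. The import sits inside the function only so that the sketch can be loaded on machines where the package is not installed.

```python
def transcribe_or_none(recognizer, audio, language="en-US"):
    """Return the transcript of `audio`, or None if it could not be understood."""
    import speech_recognition as sr  # lazy import, see the note above

    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        # The service answered but could not make out any words.
        return None
    except sr.RequestError as exc:
        # Network problem or the service is unavailable.
        raise RuntimeError(f"speech service request failed: {exc}") from exc

# Usage (needs network access):
#   text = transcribe_or_none(r, audio)
#   if text is None:
#       print("Sorry, I did not catch that.")
```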

The Recognizer class provides two methods for capturing audio. Understanding what each one can and cannot do helps to decide which fits the desired behaviour.

Recognizer.listen(source, timeout=None, phrase_time_limit=None)

  • waits until speech is detected, then captures a single phrase from the audio stream

  • buffers the audio data until a pause in speech occurs, the input ends, or phrase_time_limit is reached

source

  • audio input via sr.Microphone or sr.AudioFile

timeout

  • timeout=None => waits indefinitely until speech is detected

  • timeout is not None => maximum time (in seconds) to wait for speech to start before a WaitTimeoutError is raised

phrase_time_limit

  • phrase_time_limit=None => audio is recorded until a pause in speech occurs

  • phrase_time_limit is not None => maximum length of a phrase (in seconds), counted from the moment speech is detected; when it is reached, listening stops and the audio captured so far is returned
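The two limits can be combined so that a call to listen() never blocks forever. The following is a minimal sketch, not the project's actual code: listen_once() is a hypothetical helper, and the default values of 5 and 10 seconds are arbitrary assumptions. The import is placed inside the function only so that the sketch loads without the package installed.

```python
def listen_once(recognizer, source, timeout=5, phrase_time_limit=10):
    """Capture one phrase from `source`.

    Returns the captured audio, or None if nobody starts speaking
    within `timeout` seconds.
    """
    import speech_recognition as sr  # lazy import, see the note above

    try:
        return recognizer.listen(
            source, timeout=timeout, phrase_time_limit=phrase_time_limit
        )
    except sr.WaitTimeoutError:
        # No speech started before the timeout expired.
        return None

# Usage (needs a microphone):
#   r = sr.Recognizer()
#   with sr.Microphone() as source:
#       audio = listen_once(r, source)
```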

Recognizer.record(source, duration=None, offset=None)

  • captures a fixed span of the audio input, without waiting for or detecting speech

source

  • audio input via sr.Microphone or sr.AudioFile

duration

  • duration=None => records until the source has no more audio; a microphone stream does not end on its own, so a duration should be set when recording from a microphone

  • duration is not None => records for at most the given number of seconds

offset

  • number of seconds from the beginning of source to wait before starting recording
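Unlike listen(), record() does not look at the audio content. The following simplified model illustrates which part of the input duration and offset select. It is not the package's implementation: record_model() and its arguments are hypothetical, and the source is represented as a plain list of samples.

```python
def record_model(samples, sample_rate, duration=None, offset=None):
    """Return the samples that a record()-style call would capture.

    samples:     the whole input, as a list
    sample_rate: samples per second
    duration:    seconds to capture, or None for "until the input ends"
    offset:      seconds to skip at the beginning, or None for no skip
    """
    start = int((offset or 0) * sample_rate)
    if duration is None:
        end = len(samples)
    else:
        end = start + int(duration * sample_rate)
    # Slicing clips automatically if the input is shorter than requested.
    return samples[start:end]


# Five seconds of input at 2 samples per second.
five_seconds = list(range(10))
whole = record_model(five_seconds, 2)                          # everything
middle = record_model(five_seconds, 2, duration=2, offset=1)   # seconds 1 to 3
```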

 

Improving Audio-Input

Reducing the influence of background noise allows for better and faster recognition. The easiest way to do this is the adjust_for_ambient_noise() method of the Recognizer class, which calibrates the recognizer to the current noise level.

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source)
    audio = r.listen(source)
text = r.recognize_google(audio)

The function adjust_for_ambient_noise() captures a short section of audio input from the source and dynamically calculates the energy threshold above which sound is treated as speech. Nobody should be speaking during this calibration. It is best practice to set the duration to at least 0.5 seconds, to get a representative sample of the noise, and at most 1 second, to avoid capturing speech.

Recognizer.adjust_for_ambient_noise(source, duration=1)

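  • captures a short segment of audio as a noise baseline

  • sets the recognizer's energy threshold from this baseline, so that sounds quieter than the threshold are not treated as speech in later listen() calls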

source

  • audio input via sr.Microphone or sr.AudioFile

duration

  • duration => maximum time in seconds the function spends capturing audio input for the calibration
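To make the effect of duration more concrete, the sketch below models how a threshold can be moved step by step towards the measured noise level. It is a simplified model, not the package's code: calibrate_threshold() is a hypothetical function, and the starting threshold of 300, the damping of 0.15 and the ratio of 1.5 are assumed values that should be checked against the installed version of the package.

```python
def calibrate_threshold(chunk_energies, chunk_seconds, duration=1.0,
                        energy_threshold=300.0, damping=0.15, ratio=1.5):
    """Return the threshold after calibrating on `duration` seconds of noise.

    chunk_energies: loudness (e.g. RMS) of consecutive audio chunks
    chunk_seconds:  length of one chunk in seconds
    """
    # Shorter chunks get less weight, so the result depends on time, not chunk count.
    per_chunk_damping = damping ** chunk_seconds
    elapsed = 0.0
    for energy in chunk_energies:
        elapsed += chunk_seconds
        if elapsed > duration:
            break
        target = energy * ratio  # keep the threshold somewhat above the noise
        energy_threshold = (energy_threshold * per_chunk_damping
                            + target * (1 - per_chunk_damping))
    return energy_threshold


# Constant noise of energy 100 in half-second chunks: the target is 150.
noise = [100.0] * 5
after_one_second = calibrate_threshold(noise, 0.5, duration=1.0)
after_two_seconds = calibrate_threshold(noise, 0.5, duration=2.0)
```

In this model a longer duration brings the threshold closer to the target, which is why a very short calibration may leave the threshold near its starting value.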
