Voice recording
To communicate with the voice assistant, we use the Python package SpeechRecognition (imported as speech_recognition) together with its Google speech recognition backend. The package can record audio and transcribe it into plain text.
Our speech-to-text function takes the audio stream from the microphone and transcribes it to text with the package's Google recognition function, recognize_google(). Once the audio is available as text, it can be processed by language models and used to formulate responses. This is the basis of how the voice assistant works.
Recording Audio:
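import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    # Capture a single phrase; alternatively: audio = r.record(source, duration=5)
    # (a live microphone never runs out of input, so record() needs a duration)
    audio = r.listen(source)
# Send the captured audio to the Google Web Speech API and get the transcript
text = r.recognize_google(audio)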
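The speech_recognition package provides two methods for capturing audio, listen() and record(). listen() waits for speech to begin and records a single phrase until it detects silence, whereas record() captures audio for a fixed duration regardless of what is said. It is important to understand their capabilities and limitations when deciding which one fits the desired behaviour.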
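Recognizer.listen(source, timeout=None, phrase_time_limit=None) | |
---|---|
source | Opened AudioSource instance to record from, such as sr.Microphone() inside a with block. |
timeout | Maximum number of seconds to wait for a phrase to start. If exceeded, speech_recognition.WaitTimeoutError is raised. None waits indefinitely. |
phrase_time_limit | Maximum number of seconds a phrase may last before recording stops and the audio captured so far is returned. None means no limit. |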
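The timeout and phrase_time_limit parameters can be combined with the package's exception types. The following is a minimal sketch, not the assistant's actual code: the helper name listen_once and the limits of 5 and 10 seconds are assumptions made for illustration, and returning None on every failure is only one possible policy.

```python
def listen_once(recognizer, source, timeout=5, phrase_time_limit=10):
    """Capture one phrase and return its transcript, or None on failure.

    `listen_once` is a hypothetical helper; the default limits are
    illustrative assumptions, not values taken from the assistant.
    """
    # Imported lazily so the helper can be defined on machines
    # where the package or audio hardware is not available.
    import speech_recognition as sr

    try:
        audio = recognizer.listen(
            source, timeout=timeout, phrase_time_limit=phrase_time_limit
        )
    except sr.WaitTimeoutError:
        # Nobody started speaking within `timeout` seconds.
        return None

    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        # Audio was captured but the speech was unintelligible.
        return None
    except sr.RequestError:
        # The recognition service could not be reached.
        return None
```

Inside a `with sr.Microphone() as source:` block, this would be called as `listen_once(sr.Recognizer(), source)`.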
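Recognizer.record(source, duration=None, offset=None) | |
---|---|
source | Opened AudioSource instance to record from, such as sr.Microphone() or an sr.AudioFile inside a with block. |
duration | Maximum number of seconds to record. None records until the source has no more audio, which never happens for a live microphone. |
offset | Number of seconds to skip at the start of the source before recording begins. None starts at the beginning. |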
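Because record() captures a fixed span, a longer input can be read in consecutive pieces by calling it repeatedly. The sketch below is illustrative only: record_in_chunks is a hypothetical helper, and it assumes that successive record() calls on the same open source continue where the previous call stopped and that an exhausted source yields audio with empty frame_data. The max_chunks bound is there because a microphone never runs out of input.

```python
def record_in_chunks(recognizer, source, chunk_seconds, max_chunks=100):
    """Record consecutive fixed-length chunks from an already opened source.

    Hypothetical helper: stops when the source yields no more audio
    or after `max_chunks` chunks, whichever comes first.
    """
    chunks = []
    for _ in range(max_chunks):
        audio = recognizer.record(source, duration=chunk_seconds)
        if not audio.frame_data:
            # Nothing was captured, so the source is assumed to be exhausted.
            break
        chunks.append(audio)
    return chunks
```

Each returned chunk can then be passed to recognize_google() separately.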
Improving Audio-Input
Compensating for background noise allows for better and faster recognition. The easiest way to do this is with the adjust_for_ambient_noise() function in the speech_recognition package.
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    # Calibrate the energy threshold on ambient noise before listening
    r.adjust_for_ambient_noise(source)
    audio = r.listen(source)
text = r.recognize_google(audio)
The function adjust_for_ambient_noise() captures a short section of audio input from the source and dynamically calculates an energy threshold that separates speech from background noise. It is best practice to set the capture duration to at least 0.5 seconds, to get a representative sample, and at most 1 second (the default), to avoid capturing speech.
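Recognizer.adjust_for_ambient_noise(source, duration=1) | |
---|---|
source | Opened AudioSource instance to sample ambient noise from, such as sr.Microphone() inside a with block. |
duration | Maximum number of seconds spent adjusting the threshold before returning. Defaults to 1. |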
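To give an intuition for how a threshold can be calculated dynamically from ambient audio, here is a simplified sketch of the general idea. It is not the library's code: the function names, the starting threshold of 300, the damping of 0.15 and the ratio of 1.5 are illustrative assumptions. Each chunk of samples nudges the threshold toward a value slightly above the measured noise energy.

```python
import math


def rms(samples):
    """Root-mean-square energy of a sequence of integer PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def calibrate_threshold(chunks, seconds_per_chunk,
                        start_threshold=300.0, damping=0.15, ratio=1.5):
    """Move the threshold toward `ratio` times the ambient energy.

    All default values are illustrative assumptions.
    """
    threshold = start_threshold
    for chunk in chunks:
        energy = rms(chunk)
        # Scale the damping to the chunk length so the result depends
        # on the total listening time rather than on the chunk size.
        chunk_damping = damping ** seconds_per_chunk
        target = energy * ratio
        # Exponential moving average between the old threshold and the target.
        threshold = threshold * chunk_damping + target * (1 - chunk_damping)
    return threshold
```

With very little audio the result stays close to the starting value, which is one way to see why a capture shorter than about half a second gives a poor estimate.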