Speech recognition is the process of converting audio into text. It is commonly used in voice assistants such as Alexa and Siri. Python provides an API called SpeechRecognition that lets us convert audio into text for further processing. In this article, we will look at converting large or long audio files into text using the SpeechRecognition API in Python.
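As a quick baseline before dealing with long recordings, a short clip can be transcribed directly with SpeechRecognition. The sketch below is a minimal example; the file name short_clip.wav is a placeholder for any short WAV file in the working directory.

# minimal sketch: transcribe one short WAV file with SpeechRecognition
# ("short_clip.wav" is a placeholder file name)
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("short_clip.wav") as source:
    # read the entire file into an AudioData object
    audio = r.record(source)

try:
    # send the audio to the Google web speech API
    print(r.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand audio")
except sr.RequestError:
    print("Could not request results. check your internet connection")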
Processing large audio files
Speech recognition accuracy drops when the input audio file is long, and the Google speech recognition API cannot recognize long audio files reliably. We therefore need to process the audio file into smaller chunks and then feed those chunks to the API. Doing so improves accuracy and allows us to recognize large audio files.
Splitting the audio based on silence
One way to process the audio file is to split it into chunks of constant size. For example, we could take a 10-minute audio file and split it into 60 chunks of 10 seconds each, feed these chunks to the API, and concatenate the results to convert the speech to text. This method is inaccurate: splitting the audio into fixed-size chunks can cut sentences in the middle, so we may lose important words in the process, because a chunk may end before a word has been fully spoken and Google will not be able to recognize incomplete words.
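For illustration only, the constant-size approach can be sketched with pydub slicing, which works in milliseconds. The 10-second window and the file name long_audio.wav below are placeholder values, not part of the method used later in this article.

# sketch of fixed-interval splitting with pydub
# ("long_audio.wav" and the 10-second window are placeholders)
from pydub import AudioSegment

song = AudioSegment.from_wav("long_audio.wav")
chunk_ms = 10 * 1000  # 10 seconds per chunk, in milliseconds

for i, start in enumerate(range(0, len(song), chunk_ms)):
    # slicing an AudioSegment uses millisecond indices
    chunk = song[start:start + chunk_ms]
    chunk.export("fixed_chunk{0}.wav".format(i), format="wav")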
Another approach is to split the audio file based on silence. Humans pause for a short moment between sentences. If we split the audio file into chunks at these silences, we can process the file sentence by sentence and concatenate the results. This approach is more accurate than the previous one because we do not cut sentences in the middle: each chunk contains complete sentences without interruption, and we no longer need to split the audio into chunks of constant length.
The drawback of this approach is that it is hard to decide how long a silence to split on, because different speakers pause differently: one speaker may pause for 1 second between sentences while another pauses for only 0.5 seconds.
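One common way to make these settings less speaker-dependent (an assumption here, not part of the code later in this article) is to derive the silence threshold from the clip's own average loudness using pydub's dBFS property, and to tune min_silence_len per speaker. The values in this sketch are starting points, not universal settings.

# sketch: derive silence parameters from the recording itself
# ("long_audio.wav" and the numeric values are placeholders)
from pydub import AudioSegment
from pydub.silence import split_on_silence

song = AudioSegment.from_wav("long_audio.wav")
chunks = split_on_silence(song,
    # pauses shorter than 700 ms are not treated as sentence breaks
    min_silence_len = 700,
    # "silent" means at least 14 dB quieter than the clip's average loudness
    silence_thresh = song.dBFS - 14
)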
Libraries required:
Pydub: sudo pip3 install pydub
Speech recognition: sudo pip3 install SpeechRecognition
Example:
Input: peacock.wav
Output:
exporting chunk0.wav
Processing chunk 0
exporting chunk1.wav
Processing chunk 1
exporting chunk2.wav
Processing chunk 2
exporting chunk3.wav
Processing chunk 3
exporting chunk4.wav
Processing chunk 4
exporting chunk5.wav
Processing chunk 5
exporting chunk6.wav
Processing chunk 6
Code:
# importing libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# a function that splits the audio file into chunks
# and applies speech recognition
def silence_based_conversion(path = "alice-medium.wav"):

    # open the audio file stored in
    # the local system as a wav file.
    song = AudioSegment.from_wav(path)

    # open a file where we will concatenate
    # and store the recognized text
    fh = open("recognized.txt", "w+")

    # split track where silence is 0.5 seconds
    # or more and get chunks
    chunks = split_on_silence(song,
        # must be silent for at least 0.5 seconds
        # or 500 ms. adjust this value based on user
        # requirement. if the speaker stays silent for
        # longer, increase this value. else, decrease it.
        min_silence_len = 500,

        # consider it silent if quieter than -16 dBFS
        # adjust this per requirement
        silence_thresh = -16
    )

    # create a directory to store the audio chunks.
    try:
        os.mkdir('audio_chunks')
    except FileExistsError:
        pass

    # move into the directory to
    # store the audio files.
    os.chdir('audio_chunks')

    i = 0
    # process each chunk
    for chunk in chunks:

        # create a 0.5 second silence chunk
        chunk_silent = AudioSegment.silent(duration = 500)

        # add 0.5 sec silence to beginning and
        # end of audio chunk. This is done so that
        # it doesn't seem abruptly sliced.
        audio_chunk = chunk_silent + chunk + chunk_silent

        # export audio chunk and save it in
        # the current directory.
        print("exporting chunk{0}.wav".format(i))

        # specify the bitrate to be 192k
        audio_chunk.export("./chunk{0}.wav".format(i), bitrate = '192k', format = "wav")

        # the name of the newly created chunk
        filename = 'chunk' + str(i) + '.wav'

        print("Processing chunk " + str(i))

        # keep the name of the newly created chunk
        # in a variable for later use.
        file = filename

        # create a speech recognition object
        r = sr.Recognizer()

        # recognize the chunk
        with sr.AudioFile(file) as source:
            # remove this if it is not working
            # correctly.
            r.adjust_for_ambient_noise(source)
            # read the whole chunk into an AudioData object
            audio_listened = r.record(source)

        try:
            # try converting it to text
            rec = r.recognize_google(audio_listened)
            # write the output to the file.
            fh.write(rec + ". ")

        # catch any errors.
        except sr.UnknownValueError:
            print("Could not understand audio")

        except sr.RequestError:
            print("Could not request results. check your internet connection")

        i += 1

    fh.close()
    os.chdir('..')


if __name__ == '__main__':
    print('Enter the audio file path')
    path = input()
    silence_based_conversion(path)
Output:
recognized.txt:
The peacock is the national bird of India. They have colourful feathers, two legs and
a small beak. They are famous for their dance. When a peacock dances it spreads its
feathers like a fan. It has a long shiny dark blue neck. Peacocks are mostly found in
the fields they are very beautiful birds. The females are known as 'Peahen'. Their
feathers are used for making jackets, purses etc. We can see them in a zoo.