I converted my eBook to an audiobook using Microsoft's Azure Cognitive Services Speech Service

Quick link to the free Audiobook
A Sample:
Prerequisites
- Azure subscription [Only the free account needed]
- Create a Speech resource in the Azure portal.
- Get the resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys. For more information about Cognitive Services resources, see Get the keys for your resource.
Text-to-speech Azure Docs
- https://speech.microsoft.com/portal
- https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/get-started-text-to-speech?tabs=macos%2Cterminal&pivots=programming-language-python
Why Microsoft Azure?
I tried another Python module called gTTS, but its output was very robotic. Microsoft's voices were the most human-like of the TTS modules I tried.
OK, let's start.
First Clean the files
In order to use Pandoc, I originally created my chapters as Markdown files. When I passed the files to the Text to Speech [TTS] service, it took the mark-up literally and read it aloud into the audio. So, a # would be read as "hash".
I went through and deleted all my formatting in each of the files manually.
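The manual cleanup could probably be scripted for simple cases. Below is a minimal sketch, assuming the chapters only use headings, bullet lists, links, bold, italics, and inline code; strip_markdown is a hypothetical helper, not part of the original workflow, and it does not cover the full Markdown syntax.

```python
import re

def strip_markdown(text: str) -> str:
    """Remove common Markdown markup so a TTS engine does not read it aloud."""
    # Headings: drop the leading '#' characters and the spaces after them
    text = re.sub(r"^#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Bullet markers at the start of a line
    text = re.sub(r"^[ \t]*[-+*][ \t]+", "", text, flags=re.MULTILINE)
    # Links: keep only the visible link text
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Bold first, then italics, so '**' is not treated as two '*' markers
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    text = re.sub(r"\*(.+?)\*", r"\1", text)
    # Inline code: keep the content, drop the backticks
    text = re.sub(r"`([^`]*)`", r"\1", text)
    return text

sample = "# Chapter One\n\nThis is **bold** and *italic* with a [link](https://example.com).\n- first item\n"
print(strip_markdown(sample))
```

It is still worth skimming the cleaned text afterwards, since a regex-based approach like this can miss nested or unusual markup.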
Next, I wrote a main.py Python script to read all the text content from each of the Markdown files.
import os
# Path to the directory
directory = './chapters'
# Use scandir to loop through the directories in the directory
with os.scandir(directory) as entries:
    for entry in sorted(entries, key=lambda e: e.name):
        if entry.is_dir():
            # Print the name of the directory
            print(f"Directory: {entry.name}")
            # Loop through the files in the directory
            with os.scandir(entry) as files:
                for file in sorted(files, key=lambda f: f.name):
                    if file.is_file() and file.name != '.DS_Store':
                        # Print the name of the file
                        print(f" File: {file.name}")
                        # Open the file and read its contents
                        with open(file, 'r') as f:
                            contents = f.read()
                        # Print the contents of the file
                        print(f" Contents: {contents}")
This script loops through the directories in the directory variable using os.scandir, and for each directory it opens a new directory handle using os.scandir and loops through the files in it. Both the entries and files iterables are passed through the sorted function, which uses a lambda function as the key argument to sort the DirEntry objects by their name attribute. For each file, it checks whether the file's name is .DS_Store and skips it if so. Otherwise, it opens the file using the open function and reads its contents using the read method. Finally, it prints the contents of the file.
This will be the basis for the main script: I will read the text, then send it over to Microsoft Azure to convert it to an MP3.
Next, I needed a script to do the actual work of converting text to speech
I saved this new file as text_to_mp3.py so that I can import it into the main.py script and other scripts later on.
The way to use it is to type the following in your terminal:
python text_to_mp3.py -t "Hey" -o "" -n "my_file"
It will send the text "Hey" up to Microsoft Azure, and then save out an MP3 with a nice British fellow saying "Hey".
You will need to read the Azure docs to set up a resource and get some API keys. It is pretty straightforward and well documented in the links at the start of this article.
import argparse
import os
from pathlib import Path
import sys
import azure.cognitiveservices.speech as speechsdk
def t2m(**kwargs):
    """ Converts text to an MP3 """
    text = kwargs.get('text')
    output_dir = kwargs.get('output_dir')
    new_name = kwargs.get('new_name')
    # New file name save path (relative to this script's directory)
    the_path = f'{Path(__file__).absolute().parent}/{output_dir}/{new_name}.mp3'
    print(the_path)
    # This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
    speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))
    # The language of the voice that speaks.
    # TODO add an arg to choose speaker
    # en-AU-KimNeural or en-GB-ThomasNeural
    speech_config.speech_synthesis_voice_name = 'en-GB-ThomasNeural'
    # Set the audio output format to MP3
    # Audio48Khz192KBitRateMonoMp3
    speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Audio48Khz192KBitRateMonoMp3)
    audio_config = speechsdk.audio.AudioOutputConfig(filename=the_path)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    # Synthesize the text and write the audio to the output file.
    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()
    if speech_synthesis_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesized for text [{}]".format(text))
    elif speech_synthesis_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_synthesis_result.cancellation_details
        print("Speech synthesis canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            if cancellation_details.error_details:
                print("Error details: {}".format(cancellation_details.error_details))
                print("Did you set the speech resource key and region values?")
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-t', action='store', dest='text',
                        help='Enter the text to convert', required=True)
    parser.add_argument('-o', action='store', dest='output_dir',
                        help='Enter save location directory', required=True)
    parser.add_argument('-n', action='store', dest='new_name',
                        help='Enter new file name', required=True)
    args = parser.parse_args()
    # Convert the argparse.Namespace to a dictionary: vars(args)
    arg_dict = vars(args)
    # Pass the dictionary to t2m as keyword arguments
    t2m(**arg_dict)
    sys.exit(0)
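Since the script reads SPEECH_KEY and SPEECH_REGION from the environment, a missing variable only shows up later as a cancelled synthesis. A small check up front can fail faster with a clearer message. This is a sketch only; load_speech_settings is a hypothetical helper and not part of the Azure SDK.

```python
import os
from typing import Mapping, Tuple

def load_speech_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (key, region), or raise a clear error if either is missing or empty."""
    required = ("SPEECH_KEY", "SPEECH_REGION")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variable(s): {', '.join(missing)}")
    return env["SPEECH_KEY"], env["SPEECH_REGION"]

# Example with an explicit mapping instead of the real environment
key, region = load_speech_settings({"SPEECH_KEY": "example-key", "SPEECH_REGION": "westus"})
print(region)
```

The two returned values could then be passed to speechsdk.SpeechConfig in place of the os.environ.get calls.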
Next, let's update the main.py script to add some nice MP3 tags to the file.
To tag an MP3 file as an audiobook using Python, you can use the eyed3 library, which provides a convenient interface for reading and writing metadata to MP3 files.
Here is an example of how you can use eyed3 to set the "audiobook" genre on an MP3 file:
import eyed3
# Load the MP3 file
audiofile = eyed3.load('file.mp3')
# Set the 'genre' tag to 'Audiobook'
audiofile.tag.genre = u'Audiobook'
# Save the changes to the MP3 file
audiofile.tag.save()
This script loads the MP3 file using the eyed3.load function, sets the genre tag to 'Audiobook', and then saves the changes to the file using the save method.
I added that to the main.py script to fill out the artist, publisher, publisher_url, album, and genre tags.
I also added a fancy function to count the number of markdown files so I could set the track number programmatically.
I also needed a fancy cover for the audiobook, so I used open() to pull in the cover.png file:
# Open the image file
with open('cover.png', 'rb') as f:
    # Read the image file as a bytes object
    image_bytes = f.read()
# Add the cover photo to the MP3 file (3 is the ID3 front cover picture type)
audiofile.tag.images.set(3, image_bytes, 'image/png', u'Cover')
Finally, I saved the MP3 tag using audiofile.tag.save()
Since I was hitting the TTS API pretty rapidly, I initially got rate limited by Azure. To combat that, I threw a 30-second sleep into the loop.
Azure Speech service has a batch process, but you have to switch from Free to Pay-as-you-go, so I didn't bother (I think that's how it works... Microsoft pricing is as convoluted as ever).
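An alternative to a fixed sleep is to retry only when a request fails, waiting longer each time. The sketch below is generic and makes an assumption the original script does not meet: the wrapped function must return True on success, so t2m would need to be changed to return a success flag. retry_with_backoff is a hypothetical helper.

```python
import time
from typing import Callable

def retry_with_backoff(
    func: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call func until it returns True, doubling the wait after each failure."""
    for attempt in range(attempts):
        if func():
            return True
        # With the default base_delay this waits 2, 4, 8, ... seconds;
        # there is no wait after the final failed attempt
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False
```

Passing sleep as a parameter keeps the helper easy to test without actually waiting. Whether this beats a flat 30-second pause depends on how the service throttles, which I have not measured here.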
Once everything is set up, the main script runs and outputs a bunch of MP3s to my output directory.
The Final main.py Script
import os
import eyed3
import time
from text_to_mp3 import t2m
# Path to the directory
directory = './chapters'
output_dir = 'output'
# MP3 Tags
author = 'Jörg Schneider'
book_title = "Blockchains and NFTs: A Beginner's Guide"
publisher = "Fun Internet Things"
publisher_url = "https://funinternetthings.com"
def get_file_count(directory):
    """ Fancy function for file counting """
    file_count = 0
    # Walk through the directory tree
    for root, dirs, files in os.walk(directory):
        # Count the files in the current directory
        # Get rid of MacOS .DS_Store (only if it is actually present)
        if '.DS_Store' in files:
            files.remove('.DS_Store')
        file_count += len(files)
    # Print the number of files
    print(f'Number of files: {file_count}')
    return file_count
count = 1
total_files = get_file_count(directory=directory)
# Use scandir to loop through the directories in the directory
with os.scandir(directory) as entries:
    for entry in sorted(entries, key=lambda e: e.name):
        if entry.is_dir():
            # Print the name of the directory
            print(f"Directory: {entry.name}")
            directory = entry.name
            # Loop through the files in the directory
            with os.scandir(entry) as files:
                for file in sorted(files, key=lambda f: f.name):
                    if file.is_file() and file.name != '.DS_Store':
                        # Print the name of the file
                        print(f" File: {file.name}")
                        print(f" File: {file.path}")
                        fname = file.name
                        # Remove the extension; str.strip('.md') would strip
                        # any leading or trailing '.', 'm', or 'd' characters
                        filename = os.path.splitext(fname)[0]
                        print(filename)
                        # Take the part after the first '-' in the file name
                        flist = filename.split('-')
                        base_name = flist[1]
                        # Create a new name for the mp3 files
                        # Pad number with zeros
                        new_name = f"{str(count).zfill(2)}-{base_name}"
                        # Open the file and read its contents
                        with open(file, 'r') as f:
                            contents = f.read()
                        # Print the contents of the file
                        print(f" Contents: {contents}")
                        # Text to MP3
                        t2m(
                            text=contents,
                            output_dir=output_dir,
                            new_name=new_name
                        )
                        # Add mp3 tags to file
                        # Load the MP3 file
                        audiofile = eyed3.load(f'./{output_dir}/{new_name}.mp3')
                        audiofile.initTag()
                        audiofile.tag.title = f'{filename}'
                        audiofile.tag.artist = f'{author}'
                        audiofile.tag.publisher = f'{publisher}'
                        audiofile.tag.publisher_url = f'{publisher_url}'
                        audiofile.tag.album = f'{book_title}'
                        audiofile.tag.genre = 'Audiobook'
                        audiofile.tag.track_num = (count, total_files)
                        # Open the image file
                        with open('cover.png', 'rb') as f:
                            # Read the image file as a bytes object
                            image_bytes = f.read()
                        # Add the cover photo to the MP3 file
                        audiofile.tag.images.set(3, image_bytes, 'image/png', u'Cover')
                        # Save the changes to the MP3 file
                        audiofile.tag.save()
                        print(f"Saved: {new_name}, time to sleep.")
                        count += 1
                        time.sleep(30)
print('FINISHED')
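The step that turns a chapter file name into a numbered MP3 name is easy to get subtly wrong, because str.strip removes a set of characters from both ends rather than a suffix. Here is a small sketch of that naming step on its own, assuming chapter files are named like 03-intro.md; mp3_name is a hypothetical helper, and unlike the script it keeps everything after the first hyphen.

```python
import os

def mp3_name(count: int, md_filename: str) -> str:
    """Build a zero-padded MP3 base name from a chapter file name."""
    # Split off the extension as a suffix, not as a set of characters
    stem, _ext = os.path.splitext(md_filename)
    # Drop the leading number, keeping everything after the first '-'
    base_name = stem.split("-", 1)[1]
    return f"{str(count).zfill(2)}-{base_name}"

print(mp3_name(7, "07-command.md"))
# For comparison, strip() also eats the trailing 'd' of "command"
print("07-command.md".strip(".md"))
```

A file name without a hyphen would raise an IndexError here, so this only fits a folder where every chapter follows the same naming pattern.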
Adding to Apple Books to verify the audiobook works
I copied the output directory over to Apple Books to see if it would import correctly.

And by golly, IT WORKED!! Wow!
Final Thoughts
As this book is my first, I mainly wanted to get the workflow done and learn how to build out the pieces that I can string together to export from a bunch of markdown files into an audiobook.
This has been a really cool experience overall. I think from the initial writing using ChatGPT, to figuring out a Gumroad store, and then exporting an audiobook using Microsoft Azure and some Python, I really learned a bunch this week.
If you have any questions or comments, please feel free to reach out to me on Mastodon.
Jeremy