
I converted my eBook to an audiobook using Microsoft's Azure Cognitive Services Speech Service

Robots reading from robots

Quick link to the free Audiobook

Audiobook: Blockchains and NFTs: A Beginner’s Guide
Are you ready to dive into the exciting world of blockchains and NFTs, but don’t have the time to sit down and read a book? We’ve got you covered! Our new audiobook format allows you to listen to the book on the go, in your car, or while doing other tasks. Our audiobook is available in MP3 format, wi…

A Sample:

[Audio: Chapter 1 Sample, 1:37]

Prerequisites

  • Azure subscription (only the free account is needed)
  • Create a Speech resource in the Azure portal.
  • Get the resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys. For more information about Cognitive Services resources, see Get the keys for your resource.

Text-to-speech Azure Docs

Why Microsoft Azure?

I first tried another Python module called gTTS, but its output sounded very robotic. Microsoft's voices were the most human-like of the TTS options I tried.


OK, let's start.

First, clean the files

I originally wrote my chapters as Markdown files so that I could use Pandoc. When I passed those files to the text-to-speech (TTS) service, it treated the markup as literal text and read it aloud. So, a # would be read as "hash".

I went through and deleted all my formatting in each of the files manually.
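This step was done by hand, but it could probably be scripted. The following is a minimal sketch, assuming the chapters only use basic Markdown (headings, bullets, links, images, emphasis, inline code). The function name strip_markdown and the chosen patterns are illustrative assumptions, not part of the original workflow, and the output would still need proofreading.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove a few common Markdown constructs so TTS does not read them aloud."""
    # Images: ![alt](url) -> alt
    text = re.sub(r'!\[([^\]]*)\]\([^)]*\)', r'\1', text)
    # Links: [label](url) -> label
    text = re.sub(r'\[([^\]]+)\]\([^)]*\)', r'\1', text)
    # Heading markers at the start of a line: "## Title" -> "Title"
    text = re.sub(r'^[ ]{0,3}#{1,6}[ \t]+', '', text, flags=re.MULTILINE)
    # Bullet markers at the start of a line: "- item" -> "item"
    text = re.sub(r'^[ \t]*[-*+][ \t]+', '', text, flags=re.MULTILINE)
    # Remaining emphasis and inline-code markers: **bold**, *italic*, `code`
    text = re.sub(r'[*`]', '', text)
    return text
```

Tables, numbered lists, and embedded HTML are not handled here, so a dedicated Markdown-to-text converter may be a better fit for more complex chapters.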


Next, I wrote a main.py Python script to read the text content from each of the Markdown files.

import os

# Path to the directory
directory = './chapters'

# Use scandir to loop through the directories in the directory
with os.scandir(directory) as entries:
    for entry in sorted(entries, key=lambda e: e.name):
        if entry.is_dir():
            # Print the name of the directory
            print(f"Directory: {entry.name}")
            # Loop through the files in the directory
            with os.scandir(entry) as files:
                for file in sorted(files, key=lambda f: f.name):
                    if file.is_file() and file.name != '.DS_Store':
                        # Print the name of the file
                        print(f"  File: {file.name}")
                        # Open the file and read its contents
                        with open(file, 'r') as f:
                            contents = f.read()
                        # Print the contents of the file
                        print(f"    Contents: {contents}")

This script will loop through the directories in the directory variable using os.scandir, and for each directory it will open a new directory handle using os.scandir and loop through the files in the directory. Both the entries and files iterables are passed through the sorted function, which uses a lambda function as the key argument to sort the DirEntry objects by their name attribute. For each file, it will check if the file's name is .DS_Store and skip it if it is. Otherwise, it will open the file using the open function and read its contents using the read method. Finally, it will print the contents of the file.
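The same traversal can be wrapped in a reusable generator. This is a minimal sketch using pathlib instead of os.scandir. The name iter_chapter_files is hypothetical, and the sketch assumes the chapter files are UTF-8 text sitting exactly one directory level below the root.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_chapter_files(root: str) -> Iterator[Tuple[Path, str]]:
    """Yield (path, contents) for each file one level below root, in name order."""
    for chapter_dir in sorted(Path(root).iterdir(), key=lambda p: p.name):
        if not chapter_dir.is_dir():
            continue
        for path in sorted(chapter_dir.iterdir(), key=lambda p: p.name):
            # Skip macOS metadata files and anything that is not a regular file
            if path.is_file() and path.name != '.DS_Store':
                yield path, path.read_text(encoding='utf-8')
```

Sorting by name is lexicographic, so a file named 10-end.md sorts before 2-start.md. Zero-padded prefixes such as 02- and 10- keep the chapters in reading order.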

This will be the basis for the main script: I parse out the text, then send it over to Microsoft Azure to convert it to an MP3.


Next, I needed a script to do the actual work of converting text to speech

I saved this new file as text_to_mp3.py so that I can import it into the main.py script and other scripts later on.

The way to use it is to type the following in your terminal:

python text_to_mp3.py -t "Hey" -o "" -n "my_file"

It will send the text "Hey" up to Microsoft Azure, and then save out an MP3 with a nice British fellow saying "Hey".

[Audio: My file, 0:01]

You will need to read the Azure docs to set up a Speech resource and get your API key. The process is straightforward and well documented in the links at the start of this article.

import argparse
import os
from pathlib import Path
import sys
import azure.cognitiveservices.speech as speechsdk

def t2m(**kwargs):
    """ Converts text to an Mp3 """
    text = kwargs.get('text')
    output_dir = kwargs.get('output_dir')
    new_name = kwargs.get('new_name')

    # New file name save path
    the_path = f'{Path(__file__).absolute().parent}/{output_dir}/{new_name}.mp3'
    print(the_path)
    # This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
    speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))

    # The language of the voice that speaks.
    # TODO add an arg to choose speaker
    # en-AU-KimNeural or en-GB-ThomasNeural
    speech_config.speech_synthesis_voice_name='en-GB-ThomasNeural'

    # Set Customize audio format to MP3
    # Audio48Khz192KBitRateMonoMp3
    speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Audio48Khz192KBitRateMonoMp3)
    audio_config = speechsdk.audio.AudioOutputConfig(filename=the_path)

    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

    # Synthesize the text and write the audio to the MP3 file at the_path.

    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()

    if speech_synthesis_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesized for text [{}]".format(text))
    elif speech_synthesis_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_synthesis_result.cancellation_details
        print("Speech synthesis canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            if cancellation_details.error_details:
                print("Error details: {}".format(cancellation_details.error_details))
                print("Did you set the speech resource key and region values?")



if __name__ == '__main__':
    parser = argparse.ArgumentParser()


    parser.add_argument('-t', action='store', dest='text',
                        help='Enter the text to convert', required=True)

    parser.add_argument('-o', action='store', dest='output_dir',
                        help='Enter save location directory', required=True)

    parser.add_argument('-n', action='store', dest='new_name',
                        help='Enter new file name', required=True)

    args = parser.parse_args()

    # Convert the argparse.Namespace to a dictionary: vars(args)
    arg_dict = vars(args)
    # pass dictionary to main
    t2m(**arg_dict)
    sys.exit(0)
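The script reads SPEECH_KEY and SPEECH_REGION from the environment. A small pre-flight check can make a missing variable obvious before any request is sent. This is a minimal sketch: the helper name load_speech_credentials is hypothetical and is not part of the Azure SDK.

```python
import os


def load_speech_credentials(env=os.environ):
    """Return (key, region), or raise a clear error if either one is missing."""
    key = env.get('SPEECH_KEY')
    region = env.get('SPEECH_REGION')
    # Collect the names of any variables that are unset or empty
    missing = [
        name
        for name, value in (('SPEECH_KEY', key), ('SPEECH_REGION', region))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variable(s): {', '.join(missing)}")
    return key, region
```

The returned values could then be passed to speechsdk.SpeechConfig as the subscription and region arguments, in place of the two os.environ.get calls.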


Next, let's update the main.py script to add some nice MP3 tags to the file.

To tag an MP3 file as an audiobook using Python, you can use the eyed3 library, which provides a convenient interface for reading and writing metadata to MP3 files.

Here is an example of how you can use eyed3 to set the "audiobook" genre on an MP3 file:

import eyed3

# Load the MP3 file
audiofile = eyed3.load('file.mp3')

# Set the 'genre' tag to 'Audiobook'
audiofile.tag.genre = u'Audiobook'

# Save the changes to the MP3 file
audiofile.tag.save()

This script loads the MP3 file using the eyed3.load function, sets the genre tag to 'Audiobook', and then saves the changes to the file using the save method.

I added that to the main.py script to fill out the artist, publisher, publisher_url, album, and genre tags.

I also added a fancy function to count the number of markdown files so I could set the track number programmatically.

I also needed a fancy cover for the audiobook, so I used open() to read in the cover.png file:

# Open the image file
with open('cover.png', 'rb') as f:
    # Read the image file as a bytes object
    image_bytes = f.read()
    # Add the cover photo to the MP3 file
    audiofile.tag.images.set(3, image_bytes, 'image/png', u'Cover')

Finally, I saved the MP3 tag using audiofile.tag.save()

Because I was hitting the TTS API in quick succession, Azure rate limited me at first. To work around that, I added a 30-second sleep to the loop.
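A fixed sleep is the simplest fix. One alternative is to retry only when a request fails, waiting longer after each failure. The sketch below is a generic illustration, not what the final script does. It assumes a callable that returns True on success, which t2m as written does not do (it only prints the result), so t2m would need to return a success flag first. The name call_with_backoff and its parameters are hypothetical.

```python
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it returns True, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        if func():
            return True
        # Wait 1x, 2x, 4x ... the base delay, but not after the final attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False
```

The sleep parameter exists so the function can be exercised without real waiting, for example by passing a list's append method to record the delays.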

Azure Speech service has a batch process, but you have to switch from Free to Pay-as-you-go, so I didn't bother (I think that's how it works... Microsoft pricing is as convoluted as ever).

Once everything is set up, the main script runs and writes a bunch of MP3s to my output directory.

The Final main.py Script

import os
import eyed3
import time

from text_to_mp3 import t2m


# Path to the directory
directory = './chapters'
output_dir = 'output'

# MP3 Tags
author = 'Jörg Schneider'
book_title = "Blockchains and NFTs: A Beginner's Guide"
publisher = "Fun Internet Things"
publisher_url = "https://funinternetthings.com"


def get_file_count(directory):
    """ Fancy function for file counting """
    file_count = 0
    # Walk through the directory tree
    for root, dirs, files in os.walk(directory):
        # Count the files in the current directory
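        # Skip the macOS .DS_Store file; a list comprehension avoids the
        # ValueError that files.remove() raises when the file is absent
        files = [name for name in files if name != '.DS_Store']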
        file_count += len(files)
    # Print the number of files
    print(f'Number of files: {file_count}')
    return file_count

count = 1
total_files = get_file_count(directory=directory)


# Use scandir to loop through the directories in the directory
with os.scandir(directory) as entries:
    for entry in sorted(entries, key=lambda e: e.name):
        if entry.is_dir():
            # Print the name of the directory
            print(f"Directory: {entry.name}")
            directory = entry.name
            # Loop through the files in the directory
            with os.scandir(entry) as files:
                for file in sorted(files, key=lambda f: f.name):
                    if file.is_file() and file.name != '.DS_Store':
                        # Print the name of the file
                        print(f"  File: {file.name}")
                        print(f"  File: {file.path}")
                        fname = file.name
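                        # Drop only the extension; str.strip('.md') would also
                        # remove leading or trailing '.', 'm' and 'd' characters
                        filename = os.path.splitext(fname)[0]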
                        print(filename)
                        flist = filename.split('-')
                        base_name = flist[1]
                        # Create a new name for the mp3 files
                        # Pad number with zeros
                        new_name = f"{str(count).zfill(2)}-{base_name}"
                        # Open the file and read its contents
                        with open(file, 'r') as f:
                            contents = f.read()
                        # Print the contents of the file
                        print(f"    Contents: {contents}")

                        # Text to MP3
                        t2m(
                            text=contents,
                            output_dir=output_dir,
                            new_name=new_name
                        )

                        # Add mp3 tags to file
                        # Load the MP3 file
                        audiofile = eyed3.load(f'./{output_dir}/{new_name}.mp3')
                        audiofile.initTag()
                        audiofile.tag.title = f'{filename}'
                        audiofile.tag.artist = f'{author}'
                        audiofile.tag.publisher = f'{publisher}'
                        audiofile.tag.publisher_url = f'{publisher_url}'
                        audiofile.tag.album = f'{book_title}'
                        audiofile.tag.genre = 'Audiobook'
                        audiofile.tag.track_num = (count, total_files)
                        # Open the image file
                        with open('cover.png', 'rb') as f:
                            # Read the image file as a bytes object
                            image_bytes = f.read()
                            # Add the cover photo to the MP3 file
                            audiofile.tag.images.set(3, image_bytes, 'image/png', u'Cover')
                        # Save the changes to the MP3 file
                        audiofile.tag.save()
                        print(f"Saved: {new_name}, time to sleep.")
                        count += 1
                        time.sleep(30)

print('FINISHED')
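The naming step in the script can be isolated into a small function that is easy to check by hand. This is a minimal sketch, assuming chapter files are named like 01-Introduction.md. The function name output_name is hypothetical. Unlike the script, which keeps only the text between the first and second hyphen, this version keeps everything after the first hyphen.

```python
import os


def output_name(chapter_file: str, track: int) -> str:
    """Build the MP3 base name, e.g. '01-Introduction', from a chapter file name."""
    # Drop the extension only; str.strip('.md') strips characters, not a suffix
    stem = os.path.splitext(chapter_file)[0]
    # Split off the leading number at the first hyphen
    parts = stem.split('-', 1)
    base_name = parts[1] if len(parts) > 1 else stem
    # Zero-pad the track number so the files sort in playback order
    return f"{str(track).zfill(2)}-{base_name}"
```

For example, output_name("01-Introduction.md", 1) returns "01-Introduction", and a file with no hyphen, such as notes.md, falls back to the whole stem.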



Adding to Apple Books to verify the audiobook works

I copied the output directory over to Apple Books to see if it would import correctly.

And by golly, IT WORKED!! Wow!


Final Thoughts

As this is my first book, I mainly wanted to get the workflow working and learn how to build the pieces I can string together to turn a bunch of Markdown files into an audiobook.

This has been a really cool experience overall. I think from the initial writing using ChatGPT, to figuring out a Gumroad store, and then exporting an audiobook using Microsoft Azure and some Python, I really learned a bunch this week.


If you have any questions or comments, please feel free to reach out to me on Mastodon.

Jeremy

