In this article, we are going to add a text-to-speech button to every WordPress post using the Web Speech API and standard WordPress hooks. The end of the article also covers a solution based on an existing text-to-speech plugin.

Advantages of the Web Speech API

The Web Speech API allows you to convert text to speech directly in the browser. It’s free and easy to use.

The two Web Speech API interfaces used in our example, SpeechSynthesis and SpeechSynthesisUtterance, are supported by all major web browsers, so you can use them confidently in production applications.

Add the text-to-speech Button in the Post With the Proper Hooks

Using the the_content hook, we can include the HTML of the text-to-speech button at the beginning of the post. This is where text-to-speech players are usually positioned.

//Add the button HTML at the beginning of the post content
add_filter( 'the_content', 'add_content_before');

function add_content_before($content){

    //Put the content to speech in a javascript variable
    $script = '<script>var ss_player_content = ' . json_encode($content) . ';</script>';

    //Generate the button
    $play_button = '<button class="ss-button" id="ss-play-content" data-post-id="' . get_the_ID() . '">Play Content</button>';

    return $script . $play_button . $content;

}

The code above does the following:

  1. It stores the post content in a JavaScript variable. The JavaScript part will use this variable to retrieve the content.
  2. It adds a “Play” button to the HTML of the post. Visitors can play the spoken version of the post by clicking this button.

To conclude, enqueue a JavaScript file that will be used to create and play the spoken version of the post:

//Load public js
add_action( 'wp_enqueue_scripts', 'enqueue_scripts' );

function enqueue_scripts()
{

    wp_enqueue_script('ss-player',
        plugin_dir_url( __FILE__ ) . 'ss-player.js',
        array('jquery'),
        '1.00', true);

}

At this point, clicking the “Play” button will not produce results. In the next section, we’ll be creating the JavaScript implementation.

Perform the Conversion With the Web Speech API

The Web Speech API allows web developers to integrate text-to-speech functionality into their web applications. This API enables you to convert text into spoken words, making web content more accessible to users with visual impairments or for various other use cases where audio output is preferred.

Now, let’s create a function that speaks a given string using the SpeechSynthesis and SpeechSynthesisUtterance interfaces.

function speak(phrase) {

    // Create a new SpeechSynthesisUtterance object
    const utterance = new SpeechSynthesisUtterance();

    // Set the text to be spoken
    utterance.text = phrase;

    // Use the first available voice; the list depends on the browser and operating system
    const voices = speechSynthesis.getVoices();
    if (voices.length > 0) {
        utterance.voice = voices[0]; // You can change the index to use a different voice
    }

    // Speak the phrase
    speechSynthesis.speak(utterance);

}

This function does the following:

  1. It creates a new SpeechSynthesisUtterance object. This interface represents a speech synthesis request, including the text to be spoken, voice selection, pitch, rate, and more.
  2. It sets the text to be spoken using the text property of the utterance.
  3. It assigns a voice to the utterance from the list of available voices. Here the voice with index 0 is used; which voice that is depends on the browser and the operating system.
  4. It speaks the phrase using the speak() method.
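Because the order of the voice list is not the same everywhere, you may prefer to pick a voice by language instead of by index. The following is a minimal sketch of that idea, not part of the original implementation: pickVoice is a hypothetical helper, and it only relies on the lang and default properties that each voice returned by getVoices() exposes. Note that in some browsers getVoices() returns an empty list until the voices have been loaded (the voiceschanged event signals this), so the helper returns null in that case.

```javascript
// Hypothetical helper: choose a voice for a language tag instead of relying on index 0.
// `voices` is expected to be the array returned by speechSynthesis.getVoices().
function pickVoice(voices, lang) {
    if (!voices || voices.length === 0) {
        return null; // the list can be empty until the browser has loaded its voices
    }

    // Some platforms report "en_US" instead of "en-US", so normalize both sides
    const normalize = (tag) => String(tag).toLowerCase().replace('_', '-');
    const wanted = normalize(lang);

    // 1. Exact language match, e.g. "en-US"
    const exact = voices.find((v) => normalize(v.lang) === wanted);
    if (exact) {
        return exact;
    }

    // 2. Same base language, e.g. "en-GB" when "en-AU" was requested
    const base = wanted.split('-')[0];
    const partial = voices.find((v) => normalize(v.lang).split('-')[0] === base);
    if (partial) {
        return partial;
    }

    // 3. The voice flagged as default, or the first one in the list
    return voices.find((v) => v.default) || voices[0];
}

// Browser-only usage inside speak() (assumption: the page declares its language):
// const voice = pickVoice(speechSynthesis.getVoices(), document.documentElement.lang || 'en-US');
// if (voice) { utterance.voice = voice; }
```

When no voice is assigned, the browser falls back to its own default voice, so skipping the assignment is a safe choice.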

The final step is to handle clicks on the play button: speak the post content when speechSynthesis is available in the browser, or log a message to the console when it is not (for example, in old browser versions).

//Handle the click event with plain JavaScript
const playButton = document.getElementById('ss-play-content');

//The script is loaded on every page, but the button exists only where post content is displayed
if (playButton) {
    playButton.addEventListener('click', function() {

        'use strict';

        const phrase = window.ss_player_content;

        if ('speechSynthesis' in window) {
            speak(phrase);
        } else {
            console.log("SpeechSynthesis API is not supported in this browser");
        }

    });
}
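The handler above passes the whole post to the synthesizer as a single utterance. For long posts, one possible refinement is to queue the text as several shorter utterances, relying on the fact that speak() appends each utterance to a queue. The sketch below is only an illustration under that assumption: splitIntoChunks and speakInChunks are hypothetical helpers, and the 200-character threshold is an arbitrary value to tune.

```javascript
// Hypothetical helper: group words into chunks, closing a chunk at the first
// sentence boundary (., ! or ?) once it holds at least minLength characters.
function splitIntoChunks(text, minLength = 200) {
    const words = String(text).split(/\s+/).filter(Boolean);
    const chunks = [];
    let current = '';

    for (const word of words) {
        current = current ? current + ' ' + word : word;
        if (/[.!?]$/.test(word) && current.length >= minLength) {
            chunks.push(current);
            current = '';
        }
    }

    // Keep whatever is left, even without final punctuation
    if (current) {
        chunks.push(current);
    }
    return chunks;
}

// Hypothetical helper: queue one utterance per chunk and return the number queued.
// `synth` is expected to behave like window.speechSynthesis.
function speakInChunks(text, synth, makeUtterance, minLength = 200) {
    const chunks = splitIntoChunks(text, minLength);
    chunks.forEach((chunk) => synth.speak(makeUtterance(chunk)));
    return chunks.length;
}

// Browser-only usage:
// speakInChunks(window.ss_player_content, window.speechSynthesis,
//     (chunk) => new SpeechSynthesisUtterance(chunk));
```

Passing the synthesizer and the utterance factory as arguments keeps the helpers easy to try outside the browser.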

Clean the Post Content

The current implementation speaks the entire HTML of the content. This means that when, for example, paragraph opening and closing tags are encountered, the letter “p” is pronounced. When shortcode tags are encountered, the shortcode name is pronounced, etc.

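We clearly don’t want that, so the enhanced callback below (register it on the the_content filter in place of add_content_before) removes all the tags and shortcodes before passing the content to JavaScript: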

function add_content_before_enhanced($content){

    $cleaned_content = strip_shortcodes($content);
    $cleaned_content = wp_strip_all_tags($cleaned_content, true);

    //Put the content to speech in a javascript variable
    $script = '<script>var ss_player_content = ' . json_encode($cleaned_content) . ';</script>';

    //Generate the button
    $play_button = '<button class="ss-button" id="ss-play-content" data-post-id="' . get_the_ID() . '">Play Content</button>';

    return $script . $play_button . $content;

}

For a real-world implementation, consider replacing specific characters or strings with their speakable counterparts. For example, you might want to replace the &gt; HTML entity with the “greater than” string, the &lt; HTML entity with the “less than” string, etc.
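As a rough illustration of this idea, the replacement could also be done on the JavaScript side before calling speak(). In the sketch below, makeSpeakable and the replacement table are hypothetical: only the &gt; and &lt; entries come from the example above, while the &amp; and &nbsp; entries are assumptions added to show how the table can grow.

```javascript
// Hypothetical replacement table; extend it after testing with your own articles
const SPEAKABLE_REPLACEMENTS = [
    [/&gt;/g, ' greater than '],
    [/&lt;/g, ' less than '],
    [/&amp;/g, ' and '],   // assumption: read the ampersand as "and"
    [/&nbsp;/g, ' '],      // assumption: a non-breaking space is just a pause
];

// Replace each entity with its speakable counterpart
function makeSpeakable(text) {
    let result = String(text);
    for (const [pattern, replacement] of SPEAKABLE_REPLACEMENTS) {
        result = result.replace(pattern, replacement);
    }
    // Collapse the extra whitespace introduced by the replacements
    return result.replace(/\s+/g, ' ').trim();
}

// Browser-only usage:
// speak(makeSpeakable(window.ss_player_content));
```

The same table could equally be applied in PHP inside the filter callback; which side is more convenient depends on your setup.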

In general, you should decide which elements you want to remove based on the specific website needs and after testing your custom text-to-speech implementation with articles of your website.

Create Your Custom Player

You can build your custom audio player by creating custom HTML elements and calling the dedicated SpeechSynthesis methods in the event callbacks:

  • pause() – Pauses the SpeechSynthesis object.
  • resume() – Resumes a paused SpeechSynthesis object.
  • speak() – Adds the utterance to the queue; the configured text is then spoken.

This implementation requires experience with JavaScript, HTML, and CSS. In general, the steps to perform this process are:

  1. Create a player using HTML and CSS.
  2. Create event listeners for the clicks on the controls of the custom audio player, specifically the play and pause buttons.
  3. Run the methods of the SpeechSynthesis interface in the callback of the event listeners.
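The steps above can be sketched as a small controller whose methods you call from the click listeners. This is a minimal sketch under stated assumptions, not a complete player: createPlayer is a hypothetical name, the synthesizer is passed in as an argument, and the stop control based on cancel() is an addition to the play and pause buttons mentioned in the steps.

```javascript
// Hypothetical controller. `synth` is expected to behave like window.speechSynthesis,
// `makeUtterance` like (text) => new SpeechSynthesisUtterance(text).
function createPlayer(synth, makeUtterance, text) {
    return {
        // Callback for a combined play/pause button
        toggle() {
            if (synth.speaking && !synth.paused) {
                synth.pause();
                return 'paused';
            }
            if (synth.speaking && synth.paused) {
                synth.resume();
                return 'playing';
            }
            // Nothing is being spoken yet: start from the beginning
            synth.speak(makeUtterance(text));
            return 'playing';
        },
        // Callback for a stop button: cancel() removes all utterances from the queue
        stop() {
            synth.cancel();
            return 'stopped';
        },
    };
}

// Browser-only usage (assumption: the button id used earlier in the article):
// const player = createPlayer(window.speechSynthesis,
//     (t) => new SpeechSynthesisUtterance(t), window.ss_player_content);
// document.getElementById('ss-play-content').addEventListener('click', () => player.toggle());
```

The returned state string can be used to update the button label, for example switching between “Play” and “Pause”.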

In this basic example, the pause() method is used to pause an utterance being spoken:

const ss = window.speechSynthesis;

const utterance1 = new SpeechSynthesisUtterance("Hello world.");

//Speak the utterance
ss.speak(utterance1);

//Pause the utterance
ss.pause();

Use the Web Speech API in WordPress With an Existing Plugin

There are many text-to-speech plugins for WordPress. One plugin that uses the Web Speech API is Real Voice.

The settings of this plugin have customization options that reflect the Web Speech API options. Specifically, you can set voice language, voice pitch, voice speed, and voice volume.

The “SpeechSynthesis” section of the Real Voice plugin with options to control the type of generated voice.
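In a custom implementation, the same four options map to properties of the SpeechSynthesisUtterance object: lang, pitch, rate, and volume. The sketch below shows one way to apply them; it is not the plugin's code, and applyVoiceOptions and clamp are hypothetical helpers. The ranges used for clamping are the ones defined by the Web Speech API (pitch 0 to 2, rate 0.1 to 10, volume 0 to 1).

```javascript
// Clamp a number into [min, max]; use `fallback` when the value is not numeric
function clamp(value, min, max, fallback) {
    const n = Number(value);
    if (!Number.isFinite(n)) {
        return fallback;
    }
    return Math.min(max, Math.max(min, n));
}

// Hypothetical helper: copy user settings onto an utterance
function applyVoiceOptions(utterance, options = {}) {
    if (options.lang) {
        utterance.lang = options.lang;                     // language tag, e.g. "en-US"
    }
    utterance.pitch = clamp(options.pitch, 0, 2, 1);       // 0 to 2, default 1
    utterance.rate = clamp(options.rate, 0.1, 10, 1);      // 0.1 to 10, default 1
    utterance.volume = clamp(options.volume, 0, 1, 1);     // 0 to 1, default 1
    return utterance;
}

// Browser-only usage:
// const utterance = applyVoiceOptions(new SpeechSynthesisUtterance('Hello world.'),
//     { lang: 'en-US', pitch: 1.2, rate: 0.9, volume: 1 });
// speechSynthesis.speak(utterance);
```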

The plugin also allows you to configure to which post type the TTS player should be added. For example, if you want to add the text-to-speech button only on your blog articles, select “post” from the Post Type option.

The “Audio Player Location” settings section with the option used to configure the post type where the text-to-speech button should be applied.

Alternatives to the Web Speech API

SpeechSynthesis allows you to perform conversions directly in the browser without using an external API. However, for higher-quality speech, support for SSML, and more customization options, consider the dedicated text-to-speech services available on the web.

Major web companies offer text-to-speech services of this kind.

There are also many alternatives from other companies that are usually easier to use.

I have recently tried ElevenLabs, a text-to-speech service with impressive AI-based voices. It requires just a simple API call using the provided token to convert any text string to an audio file. See the ElevenLabs API Documentation for more information.

Note that compared to the Web Speech API, the disadvantage of these services is that they have a cost, usually per converted character.