Synthesizing Audio from Text using Bark

Providing real-time audio descriptions of visuals through text-to-speech synthesis significantly improves website accessibility. Today, we will dive into how to generate such audio descriptions.

Oct 13, 2025

Note: The texts in this article were partly generated by artificial intelligence and corrected and revised by us.

Introduction

In an increasingly digital world, websites are the primary portal to information and engagement for billions of individuals worldwide. However, for many users, particularly those with visual impairments, traditional web content can be challenging to consume. Text-to-speech synthesis is rapidly becoming a crucial element in bridging this gap, offering the potential to transform websites into truly accessible experiences. Audio descriptions – accompanying audio content that narrates visuals – are no longer a luxury but a fundamental requirement under European law, specifically the Web Accessibility Directive and its subsequent amendments. These laws mandate that websites must be accessible to all users, regardless of their abilities, including meaningful alternatives to visual content for those who cannot see it. Audio descriptions, delivered through text-to-speech, are a vital component of this effort: they ensure that individuals with visual impairments can fully participate in online activities, fostering inclusivity and equal opportunity for everyone navigating the digital landscape.

In today's post, we will look at how to generate audio descriptions from text using Bark, an open-source neural model from Suno. Let's dive into it!

Speech Synthesis

Before we go into how to produce audio files containing spoken text, we want to quickly summarize what the term Speech Synthesis means.

When we talk about Speech Synthesis (also known as Text-to-Speech), we refer to a technology that converts written text into spoken words. It uses algorithms and models to analyze text and generate audio output – essentially, creating speech from written words.

Here’s a breakdown of key aspects:

  • Conversion Process: It works by analyzing the text’s structure (words, punctuation, etc.) and mapping those elements to specific phonetic sounds and intonation patterns.
  • Algorithms & Models: Modern speech synthesis relies on sophisticated algorithms like Hidden Markov Models and deep learning techniques that learn to mimic human speech.
  • Variations: There’s a range of speech synthesis technologies – from simple, robotic voices to more natural-sounding voices with varying accents and intonation.

Essentially, it's the process of transforming textual information into audible form. (Source 1, Source 2)

Generating Texts

In order to utilize the neural network, we begin by adding the necessary imports to the script we are going to execute later on:

import torch

import numpy as np

from scipy.io import wavfile
from IPython.display import Audio
from transformers import BarkModel
from transformers import AutoProcessor

bark.ipynb

Next up, we can download the models using the Hugging Face API:

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print("Target Device:", device)

# Load the small Bark checkpoint together with its matching processor.
model = BarkModel.from_pretrained("suno/bark-small")
model = model.to(device)
processor = AutoProcessor.from_pretrained("suno/bark-small")

bark.ipynb

Note: We load the Bark model onto the GPU when one is available, but the inputs are not moved there automatically. The processor returns several tensors, and each of them must be placed on the same device as the model before generation. Our code addresses this by explicitly calling .to(device) on every tensor we pass to generate.
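As a sketch of what this can look like – the helper below is our own, not part of the notebook – we can recursively move every tensor the processor returns, including nested structures such as the voice-preset history prompt, to the target device:

from collections.abc import Mapping

def move_to_device(obj, device):
    # Recursively move tensors to the target device; the Bark processor can
    # return nested mappings (e.g. the voice-preset history prompt).
    if torch.is_tensor(obj):
        return obj.to(device)
    if isinstance(obj, Mapping):
        return {k: move_to_device(v, device) for k, v in obj.items()}
    return obj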

Now, Bark offers us two options:

  1. Manually assign a speaker voice for the given text, or
  2. Automatically select a voice for the given text.

We will go over both options, starting with the automatic option.

Automatically assigned Speakers

In order to generate audio with an automatically assigned speaker, we can utilize the following function:

def GenerateAudioOutputWithoutSpeaker(prompt):
    # Tokenize the prompt and return PyTorch tensors.
    inputs = processor(prompt, return_tensors="pt")
    speech_output = model.generate(
        input_ids=inputs["input_ids"].to(device),
        attention_mask=inputs["attention_mask"].to(device),
        pad_token_id=processor.tokenizer.pad_token_id)
    sampling_rate = model.generation_config.sample_rate
    return Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

bark.ipynb

Now, we can begin generating audio. The generated clip will be playable directly from the notebook.
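A call then looks like this (the prompt here is just an illustrative placeholder):

GenerateAudioOutputWithoutSpeaker("Text-to-speech makes the web more accessible.")

Since the function returns an IPython Audio object, evaluating it as the last expression of a cell embeds a playable widget.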

We provide a variety of sample outputs below. We begin with a text from this source.

When Republika Srpska’s National Assembly passed its controversial “foreign agent” law earlier this year, it did not just target civil society, it drew a direct line under the independence of the media.
[Audio sample: NationalAssembly, ~12 s]

We also pulled an alternative text from this source:

In recent years, investigative journalists have made a significant impact exposing labor and environmental abuses by the fossil fuel and palm oil industries. But noticeably less scrutiny has been given to extractive industries providing the resources for solar panels, electric car batteries, and other alternative fuel components.
[Audio sample: GreenEnergy, ~14 s]

Currently, we are achieving relatively good results. Sound quality is sometimes inconsistent, but overall the language is clear. The speaking pace, however, can be uncomfortable depending on the listener's language proficiency.

Beyond these issues, Bark has a maximum audio duration of roughly 13 seconds. While workarounds exist – splitting the text into smaller parts and concatenating the results – these methods don't readily address the need for a consistent speaking pace and a single speaker across the entire output.

In order to showcase this issue, here is a sample text that cannot be spoken in under ~13 seconds: (Source)

These harms range from less obvious risks, such as toxic waste tailings dams perilously located just upstream of villages inside earthquake zones, to more brazen — including the forced labor practices behind some solar cell components, the deforestation connection to new car models, and a loss of clean water access for Indigenous communities. In addition, end users of these affected supply chains typically have little to no visibility of the human costs and problematic origins of the green energy products they are enthusiastically embracing. Sometimes these distant consumers may even be the targets of savvy greenwashing campaigns, pushing misleading information about the extraction of critical raw materials such as bauxite, cobalt, rubber, and lithium.
[Audio sample: HarmsGreenEnergy, ~12 s]

We can clearly hear that the audio is cut off before the end of the text. Sometimes, the model even begins to hallucinate additional information, though we do not observe that behavior in this example. To work around the duration limit, we can split the text into smaller prompts and concatenate the generated audio:

def GenerateAudioWithMultiplePrompts(prompts):
    audioOutputs = []

    for prompt in prompts:
        # Each prompt is generated independently; note that the automatically
        # chosen speaker may change from chunk to chunk.
        inputs = processor(prompt, return_tensors="pt")
        speech_output = model.generate(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device),
            pad_token_id=processor.tokenizer.pad_token_id)
        audioOutputs.append(speech_output[0].cpu().numpy())

    # Stitch all chunks together into a single waveform.
    sampling_rate = model.generation_config.sample_rate
    audioOutputs = np.concatenate(audioOutputs)
    return Audio(audioOutputs, rate=sampling_rate)

bark.ipynb

With this code, we can render longer texts into a single audio file by splitting the prompt into smaller pieces.
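How the text is split is up to us. As a minimal sketch (the naive regex below is our own heuristic and mishandles abbreviations such as "e.g."; longText stands in for the full passage), we can cut at sentence boundaries:

import re

def SplitIntoSentences(text):
    # Split after sentence-ending punctuation that is followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

audio = GenerateAudioWithMultiplePrompts(SplitIntoSentences(longText))

This is how the same text sounds when generated in multiple pieces: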

[Audio sample: HarmsGreenEnergyParts, ~58 s]

To summarize, the mode where speakers are automatically assigned to texts seems beneficial in the following situations:

  1. The text is in English – otherwise, language detection might be faulty.
  2. The text is not too long and does not need to be generated in pieces – otherwise, speakers will vary across generation steps.
  3. We do not have a particular speaker in mind and are interested in hearing how the text might be highlighted through changing tone and speed.

Manually selecting Speakers

If we do not want a speaker selected for us, we also have the option to assign one ourselves, using the following function:

def GenerateAudioOutputWithSpeaker(prompt, speaker):
    # voice_preset makes the processor return a history_prompt that encodes
    # the chosen speaker; generate needs it for the preset to take effect.
    inputs = processor(prompt, voice_preset=speaker, return_tensors="pt")

    speech_output = model.generate(
        input_ids=inputs["input_ids"].to(device),
        attention_mask=inputs["attention_mask"].to(device),
        history_prompt=inputs["history_prompt"],
        pad_token_id=processor.tokenizer.pad_token_id)
    sampling_rate = model.generation_config.sample_rate
    return Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

bark.ipynb

In this function, we provide a prompt alongside one of the speaker-voice identifiers for the Bark model. For example, we can utilize the speaker "v2/en_speaker_6" when generating for an English audience, or alternatively "v2/de_speaker_9" when aiming for a German audience.
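A call then looks like this (sampleTextEn and sampleTextDe are placeholders for the quoted passages below):

audioEn = GenerateAudioOutputWithSpeaker(sampleTextEn, "v2/en_speaker_6")
audioDe = GenerateAudioOutputWithSpeaker(sampleTextDe, "v2/de_speaker_9")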

Note: While the speaker list describes a range of characteristics for each voice, the generated tone and even gender may still vary in practice.

In order to showcase the capabilities of Bark, we will now generate a text from above using the speaker "v2/en_speaker_5".

When Republika Srpska’s National Assembly passed its controversial “foreign agent” law earlier this year, it did not just target civil society, it drew a direct line under the independence of the media.
[Audio sample: NationalAssemblySpeaker, ~13 s]

Alternatively, we can specify speakers from other languages to generate audio descriptions in those languages. Here is a sample for German: (Source)

Um verteidigungsfähig zu werden, brauche die Bundeswehr bis zu 60.000 neue Soldaten, heißt es immer wieder. Der geplante Ausbau von Unterkünften deckt aber nur die Hälfte davon – fast doppelt so viele wären nötig.
(In English: To become capable of defense, the Bundeswehr reportedly needs up to 60,000 new soldiers. The planned expansion of accommodations, however, covers only half of that – almost twice as many would be needed.)
[Audio sample: BundeswehrSpeaker, ~13 s]

Now we can also showcase how to create long descriptions and save the result to disk. For this, we can use the following code:

def GenerateAudioWithMultiplePromptsSpeakerAndSaveToFile(prompts, speaker, fileName):
    audioOutputs = []

    for prompt in prompts:
        inputs = processor(prompt, voice_preset=speaker, return_tensors="pt")

        speechOutput = model.generate(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device),
            history_prompt=inputs["history_prompt"],
            pad_token_id=processor.tokenizer.pad_token_id)
        audioOutputs.append(speechOutput[0].cpu().numpy())

    samplingRate = model.generation_config.sample_rate
    audioOutputs = np.concatenate(audioOutputs)
    # Write the concatenated waveform to disk as a WAV file.
    wavfile.write(fileName, samplingRate, audioOutputs)
    return Audio(audioOutputs, rate=samplingRate)

bark.ipynb
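A call might look like this (the sentence splitter from above and the file name are our own illustrative choices; articleText stands in for the passage below):

GenerateAudioWithMultiplePromptsSpeakerAndSaveToFile(
    SplitIntoSentences(articleText),
    "v2/en_speaker_6",
    "national_assembly.wav")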

We used the following text to showcase the performance: (Source)

“I never thought this would happen here,” says Bosnian legal expert and activist Azra Berbic. “But now, we are all foreign agents in our own country.”
Their fear is well-founded. The law, passed in late February and now under the review of the Bosnian Constitutional Court, mandates that any organization or media outlet receiving foreign support and deemed to be engaged in “political activity” must submit exhaustive financial disclosures and visibly mark all its work as the product of a foreign agent. What qualifies as “political activity” is left intentionally vague, giving authorities sweeping discretion.
“We all know what it means when someone is labeled a ‘foreign agent.’ It’s an attack on legitimacy, on public trust.” — Elvir Padalovic from the independent outlet Buka
Almost immediately after the law’s adoption, police raided the offices of Capital.ba, a leading investigative outlet known for its critical reporting. For its editor-in-chief, Sinisa Vukelic, the message was unmistakable. “It’s not about transparency,” Vukelic told Gerila media. “It’s about control.”
[Audio sample: NationalAssemblySpeakerParts, ~2 min]

In this piece of audio, we can identify the model's characteristics and limits quite clearly. Despite the fact that we specified a single speaker, the model still generates voices from different people and even different genders. On top of that, the output contains sound hallucinations.

Note: As long as the model cannot reliably generate longer audio descriptions, we recommend sticking to single sentences for each generation step to avoid unnecessary confusion.
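When we do concatenate chunks, the cuts between them can additionally sound abrupt. A small mitigation – our own sketch, with an arbitrarily chosen pause length – is to insert a short stretch of silence between consecutive chunks before concatenating:

def ConcatenateWithPauses(chunks, samplingRate, pauseSeconds=0.25):
    # Insert a short block of silence (zeros) between consecutive audio
    # chunks so sentence boundaries do not sound abruptly clipped.
    silence = np.zeros(int(samplingRate * pauseSeconds), dtype=chunks[0].dtype)
    padded = []
    for i, chunk in enumerate(chunks):
        padded.append(chunk)
        if i < len(chunks) - 1:
            padded.append(silence)
    return np.concatenate(padded)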

TL;DR

In this article, we showcased Bark, an open-source neural model, to generate audio descriptions from text. Bark offers a selection of predefined speaker voices, but suffers from inconsistencies in audio quality and speaking pace, which can be uncomfortable for listeners. The model sometimes produces flawed audio output, including hallucinations. We explored ways to mitigate these issues with multiple prompts and targeted speaker selection. Despite its current limitations, this approach offers a promising path towards truly accessible online content. Overall, we think Bark marks a promising step for open-source text-to-speech models, although it still requires manual intervention whenever the input is longer than a sentence per generation.