Last updated: February 5, 2025

Synthesize speech in documents using Ruby

Tim Koschützki

Co-founder · Berlin, Germany

Converting text documents into speech can enhance accessibility and offer users new ways to engage with your content. In this DevTip, we explore how to synthesize speech in documents using Ruby and the Google Cloud Text-to-Speech library, complete with practical examples and best practices.

Setting up Google Cloud Text-to-Speech

Google Cloud Text-to-Speech provides high-quality voices, extensive language support, and seamless integration. Set up your Ruby environment by adding the gem to your Gemfile:

gem 'google-cloud-text_to_speech'

Or install it directly:

gem install google-cloud-text_to_speech

Set up authentication by creating a service account and downloading the credentials file. Then, set the environment variable:

export GOOGLE_APPLICATION_CREDENTIALS="path/to/service-account-key.json"
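
A missing or mistyped credentials path tends to surface only when the first request fails. As a minimal sketch, a small pre-flight check can catch that earlier. The helper name `credentials_problem` is hypothetical and not part of the Google gem. It assumes you rely on the environment variable rather than another source of application default credentials, and it only confirms that the file is readable JSON, not that the key is valid.

```ruby
require "json"

# Hypothetical helper (not part of the Google gem): returns nil when the
# credentials variable points at a readable JSON file, or a message otherwise.
def credentials_problem(env = ENV)
  path = env["GOOGLE_APPLICATION_CREDENTIALS"]
  return "GOOGLE_APPLICATION_CREDENTIALS is not set" if path.nil? || path.empty?
  return "Credentials file not found: #{path}" unless File.file?(path)

  JSON.parse(File.read(path))
  nil
rescue JSON::ParserError
  "Credentials file is not valid JSON: #{path}"
end
```

Calling `credentials_problem` at startup and aborting when it returns a message keeps configuration mistakes separate from API errors.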

Basic document narration script

Below is a simple script that converts text documents into audio files using Google Cloud Text-to-Speech. It supports plain text and HTML documents. HTML parsing relies on the nokogiri gem, so add it to your Gemfile as well.

require "google/cloud/text_to_speech"
require "nokogiri"

class DocumentNarrator
  def initialize(credentials_path = nil)
    ENV["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path if credentials_path
    @client = Google::Cloud::TextToSpeech.text_to_speech
  end

  def narrate_text(text, output_path, language_code: "en-US")
    input = { text: text }
    voice = { language_code: language_code, ssml_gender: "FEMALE" }
    audio_config = { audio_encoding: "MP3", speaking_rate: 1.0, pitch: 0.0 }

    response = @client.synthesize_speech(
      input: input,
      voice: voice,
      audio_config: audio_config
    )

    File.open(output_path, "wb") do |file|
      file.write(response.audio_content)
    end
  end

  def narrate_file(input_path, output_path, language_code: "en-US")
    text = extract_text(input_path)
    narrate_text(text, output_path, language_code: language_code)
  end

  private

  def extract_text(file_path)
    case File.extname(file_path).downcase
    when ".txt"
      File.read(file_path)
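    when ".html", ".htm"
      doc = Nokogiri::HTML(File.read(file_path))
      # Drop non-visible content so scripts and styles are not read aloud.
      doc.search("script, style").remove
      # Collapse runs of whitespace left over from the markup.
      doc.xpath("//text()").map(&:text).join(" ").gsub(/\s+/, " ").strip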
    else
      raise ArgumentError, "Unsupported file format"
    end
  end
end
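
The script above sends plain text through `input = { text: text }`. The synthesis input also accepts an `ssml` field in place of `text`, which makes it possible to add pauses between paragraphs. The following sketch builds such markup from plain text. The helper name `paragraphs_to_ssml` and the 500 ms default pause are illustrative assumptions, not part of the library.

```ruby
require "cgi"

# Hypothetical helper: wraps each paragraph in <p> tags and inserts a pause
# between paragraphs. The text is XML-escaped so characters such as & and <
# cannot break the markup.
def paragraphs_to_ssml(text, pause_ms: 500)
  paragraphs = text.split(/\n\s*\n/).map(&:strip).reject(&:empty?)
  body = paragraphs
    .map { |paragraph| "<p>#{CGI.escapeHTML(paragraph)}</p>" }
    .join(%(<break time="#{pause_ms}ms"/>))
  "<speak>#{body}</speak>"
end

ssml = paragraphs_to_ssml("Tom & Jerry.\n\nThe end.")
# ssml is now:
# <speak><p>Tom &amp; Jerry.</p><break time="500ms"/><p>The end.</p></speak>
```

The result would be passed as `input: { ssml: ssml }` instead of `input: { text: text }`. Keep in mind that the markup adds to the size of the request.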

Advanced features and error handling

Enhance your narrator with robust error handling and additional configuration options. This version supports multi-language selection, custom audio configuration, and detailed error reporting.

require "google/cloud/text_to_speech"
require "nokogiri"

class DocumentNarratorError < StandardError; end

class DocumentNarrator
  SUPPORTED_LANGUAGES = {
    "en-US" => "English (US)",
    "en-GB" => "English (UK)",
    "fr-FR" => "French",
    "de-DE" => "German",
    "es-ES" => "Spanish"
  }.freeze

  def initialize(credentials_path = nil, config = {})
    ENV["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path if credentials_path
    @client = Google::Cloud::TextToSpeech.text_to_speech
    @config = default_config.merge(config)
  end

  def narrate_text(text, output_path, options = {})
    validate_text(text)
    validate_output_path(output_path)

    input = { text: text }
    voice = build_voice_config(options)
    audio_config = build_audio_config(options)

    begin
      response = @client.synthesize_speech(
        input: input,
        voice: voice,
        audio_config: audio_config
      )

      File.open(output_path, "wb") do |file|
        file.write(response.audio_content)
      end
    rescue Google::Cloud::Error => e
      raise DocumentNarratorError, "Synthesis failed: #{e.message}"
    end
  end

  private

  def default_config
    {
      language_code: "en-US",
      speaking_rate: 1.0,
      pitch: 0.0,
      audio_encoding: "MP3"
    }
  end

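  def build_voice_config(options)
    language_code = options[:language_code] || @config[:language_code]
    # Reject language codes outside the list this narrator declares support for.
    unless SUPPORTED_LANGUAGES.key?(language_code)
      raise DocumentNarratorError, "Unsupported language: #{language_code}"
    end

    {
      language_code: language_code,
      ssml_gender: options[:gender] || "FEMALE"
    }
  end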

  def build_audio_config(options)
    {
      audio_encoding: options[:audio_encoding] || @config[:audio_encoding],
      speaking_rate: options[:speaking_rate] || @config[:speaking_rate],
      pitch: options[:pitch] || @config[:pitch]
    }
  end

  def validate_text(text)
    raise DocumentNarratorError, "Text cannot be empty" if text.nil? || text.strip.empty?
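    # The request limit is measured in bytes, so multi-byte characters count more than once.
    raise DocumentNarratorError, "Text too long (limit is 5000 bytes)" if text.bytesize > 5000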
  end

  def validate_output_path(path)
    dir = File.dirname(path)
    raise DocumentNarratorError, "Invalid output directory" unless Dir.exist?(dir)
  end
end
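
The length check above rejects oversized input outright. For longer documents, one option is to split the text at sentence boundaries and synthesize each piece separately. This is a minimal sketch: `split_into_chunks` and `hard_split` are hypothetical helpers, the sentence detection is a naive regular expression, and the limit is measured in bytes on the assumption that the API's 5,000 limit is byte-based.

```ruby
# Hypothetical helper: greedily packs sentences into chunks of at most
# max_bytes bytes. Sentences are detected with a naive regex (., !, or ?
# followed by whitespace), which will misfire on abbreviations such as "e.g.".
def split_into_chunks(text, max_bytes: 5000)
  sentences = text.strip.split(/(?<=[.!?])\s+/)
  chunks = []
  current = +""

  sentences.each do |sentence|
    # A single sentence longer than the limit is cut on character boundaries.
    pieces = sentence.bytesize > max_bytes ? hard_split(sentence, max_bytes) : [sentence]

    pieces.each do |piece|
      candidate = current.empty? ? piece : "#{current} #{piece}"
      if candidate.bytesize <= max_bytes
        current = candidate
      else
        chunks << current unless current.empty?
        current = piece
      end
    end
  end

  chunks << current unless current.empty?
  chunks
end

# Cuts a string into pieces of at most max_bytes bytes without splitting
# a multi-byte character.
def hard_split(string, max_bytes)
  pieces = [+""]
  string.each_char do |char|
    too_big = pieces.last.bytesize + char.bytesize > max_bytes
    pieces << +"" if too_big && !pieces.last.empty?
    pieces.last << char
  end
  pieces
end

split_into_chunks("One. Two! Three?", max_bytes: 10)
# => ["One. Two!", "Three?"]
```

Each chunk can then be passed to `narrate_text` with its own output path. Joining the resulting audio files is a separate step that this sketch does not cover.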

Best practices for speech synthesis

Consider these best practices when implementing text-to-speech functionality:

Text preprocessing:
- Break long text into chunks that fit within the per-request limit (5,000 bytes, which can be fewer than 5,000 characters for non-ASCII text).
- Remove unnecessary whitespace and special characters.
- Expand abbreviations and numbers for improved pronunciation.
Voice configuration:
- Select the appropriate language and voice for your content.
- Adjust speaking rate and pitch for natural-sounding audio.
- Leverage SSML markup for fine-grained control, including pauses and emphasis.
Performance optimization:
- Implement caching to reduce redundant API calls.
- Use batch processing for large documents to enhance efficiency.
- Consider streaming methods for long audio files.
Security considerations:
- Securely store and manage your API credentials.
- Implement rate limiting and input validation to prevent abuse.
- Monitor API usage and set alerts to control costs.
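
The caching advice above can be sketched as a small file-based cache keyed by a digest of the text and voice settings. `AudioCache` and its directory layout are hypothetical, the `.mp3` extension assumes MP3 output, and the block passed to `fetch` stands in for the actual API call.

```ruby
require "digest"
require "fileutils"
require "json"

# Hypothetical file-based cache: identical text and settings map to the same
# file, so the synthesis block only runs on a cache miss.
class AudioCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Returns the path of the cached audio, calling the block to produce the
  # audio bytes only when no cached file exists yet.
  def fetch(text, settings = {})
    path = File.join(@dir, "#{cache_key(text, settings)}.mp3")
    File.binwrite(path, yield) unless File.exist?(path)
    path
  end

  private

  # Settings are sorted by key so that option order does not change the digest.
  def cache_key(text, settings)
    normalized = settings.map { |key, value| [key.to_s, value] }.sort_by(&:first)
    Digest::SHA256.hexdigest(JSON.generate([text, normalized]))
  end
end
```

A caller would wrap the synthesis request in the block, for example `cache.fetch(text, { language_code: "en-US" }) { response.audio_content }`, so repeated narration of unchanged text does not trigger another API call.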

Alternative solutions

While Google Cloud Text-to-Speech is a robust solution, you might consider other options:

Amazon Polly:

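require "aws-sdk-polly"

polly = Aws::Polly::Client.new(region: "us-west-2")
response = polly.synthesize_speech({
  text: "Hello World",
  output_format: "mp3",
  voice_id: "Joanna"
})

# The audio arrives as a stream; copy it to disk to get a playable file.
IO.copy_stream(response.audio_stream, "output.mp3")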

Microsoft Azure Cognitive Services:

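require "net/http"

# Microsoft ships no official Ruby Speech SDK, so this sketch calls the REST API directly.
uri = URI("https://your-region.tts.speech.microsoft.com/cognitiveservices/v1")
request = Net::HTTP::Post.new(uri,
  "Ocp-Apim-Subscription-Key" => "your-key",
  "Content-Type" => "application/ssml+xml",
  "X-Microsoft-OutputFormat" => "audio-16khz-128kbitrate-mono-mp3")
request.body = "<speak version='1.0' xml:lang='en-US'><voice name='en-US-JennyNeural'>Hello World</voice></speak>"
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }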

Open-source Coqui TTS:

require "net/http"
require "uri"

# Assumes a locally running Coqui TTS demo server, which reads the text
# from the `text` query parameter rather than from a JSON body.
uri = URI("http://localhost:5002/api/tts")
uri.query = URI.encode_www_form(text: "Hello World")

response = Net::HTTP.get_response(uri)
raise "Coqui TTS request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

File.binwrite("output.wav", response.body)

Conclusion

With a modest amount of Ruby, you can turn text and HTML documents into audio. Google Cloud Text-to-Speech offers a wide range of voices and languages, and Amazon Polly, Azure, and Coqui TTS are workable alternatives. If you would rather not run this pipeline yourself, consider Transloadit's Text to Speech Robot together with Uppy and Tus.
