Converting text documents into speech can improve accessibility and give users another way to engage with your content. In this DevTip, we'll use Ruby to synthesize speech from documents and build a small text-to-speech tool for document narration.

Essential Ruby libraries for speech synthesis

Ruby has a few libraries for text-to-speech. We'll use the espeak-ruby gem (loaded with require 'espeak'), which provides a Ruby interface to the open-source eSpeak speech synthesizer.

Setting up the environment

First, let's set up our Ruby environment with the necessary dependencies. eSpeak is open-source and works offline, which makes it a convenient choice for this example.

Add the gem to your Gemfile:

gem 'espeak-ruby'

Or install it directly:

gem install espeak-ruby

Make sure the eSpeak synthesizer is installed on your system. The gem's save method pipes audio through the lame encoder, so install that as well:

For Ubuntu/Debian:

sudo apt-get install espeak lame

For macOS:

brew install espeak lame
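Before running the examples, it can help to confirm that the synthesizer binary is actually reachable from Ruby. The following is a minimal sketch rather than anything provided by the gem: find_executable is a hypothetical helper that scans the PATH environment variable for an executable file with the given name.

```ruby
# Hypothetical helper (not part of the espeak gem): returns the full path of
# the first executable called `name` found in `path_env`, or nil if none.
def find_executable(name, path_env = ENV.fetch('PATH', ''))
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Warn early instead of failing later in the middle of a narration run.
warn 'espeak was not found on PATH; install it first' unless find_executable('espeak')
```

The second argument exists only to make the helper easy to test with a custom search path; in normal use you would call it with the binary name alone.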

Basic document narration script

Here's a simple script that converts text documents to speech:

require 'espeak'

include ESpeak

text = File.read('input.txt')

# Pass the options to the constructor, as the later examples do
speech = Speech.new(
  text,
  voice: 'en', # Voice/language code
  speed: 120,  # Words per minute
  pitch: 50    # Voice pitch (0-99)
)
speech.save('output.mp3') # Encoded to MP3 via lame

This script reads the text from input.txt, synthesizes speech with the eSpeak engine, and saves the result to output.mp3. The gem's save method encodes the audio with lame, so the output is an MP3 file. Adjust the voice, speed, and pitch options to change how the narration sounds.
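When speed and pitch come from user input or a config file, it may be worth keeping them inside the ranges eSpeak accepts. The sketch below assumes the ranges listed in the eSpeak 1.48 command-line help (speed 80 to 450 words per minute, pitch 0 to 99); check them against your installed version. The clamp_speech_options helper and both constants are hypothetical names, not part of the gem.

```ruby
# Assumed ranges, taken from the eSpeak 1.48 command-line help.
ESPEAK_SPEED_RANGE = (80..450).freeze
ESPEAK_PITCH_RANGE = (0..99).freeze

# Hypothetical helper: returns a copy of `options` with :speed and :pitch
# clamped into the ranges above. Other keys (such as :voice) pass through.
def clamp_speech_options(options)
  clamped = options.dup
  if clamped.key?(:speed)
    clamped[:speed] = clamped[:speed].clamp(ESPEAK_SPEED_RANGE.first, ESPEAK_SPEED_RANGE.last)
  end
  if clamped.key?(:pitch)
    clamped[:pitch] = clamped[:pitch].clamp(ESPEAK_PITCH_RANGE.first, ESPEAK_PITCH_RANGE.last)
  end
  clamped
end

options = clamp_speech_options(voice: 'en', speed: 600, pitch: -5)
puts options[:speed] # 450
puts options[:pitch] # 0
```

The clamped hash can then be passed to Speech.new in place of hand-written options.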

Enhanced multi-language support

Let's extend our script to handle multiple languages and provide more configuration options:

require 'espeak'
require 'nokogiri'

include ESpeak

class DocumentNarrator
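  SUPPORTED_LANGUAGES = {
    'en' => 'English',
    'fr' => 'French',
    'de' => 'German',
    'es' => 'Spanish'
  }.freeze

  def initialize(config = {})
    @config = {
      voice: 'en',
      speed: 120,
      pitch: 50
    }.merge(config)

    # Fail early if the requested voice is not in the supported list
    return if SUPPORTED_LANGUAGES.key?(@config[:voice])

    raise ArgumentError, "Unsupported voice: #{@config[:voice]}"
  end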

  def narrate_document(input_path, output_path)
    text = extract_text(input_path)
    speech = Speech.new(text, voice: @config[:voice], speed: @config[:speed], pitch: @config[:pitch])
    speech.save(output_path)
  end

  private

  def extract_text(file_path)
    case File.extname(file_path)
    when '.txt'
      File.read(file_path)
    when '.html', '.htm'
      doc = Nokogiri::HTML(File.read(file_path))
      # Drop script and style elements so their contents are not read aloud
      doc.search('script, style').remove
      doc.text.gsub(/\s+/, ' ').strip
    else
      raise "Unsupported file format: #{File.extname(file_path)}"
    end
  end
end

# Usage example with configuration
narrator = DocumentNarrator.new(
  voice: 'es',   # Spanish
  speed: 130,
  pitch: 60
)

narrator.narrate_document('document.html', 'narration.mp3')

In this script, we've created a DocumentNarrator class that supports multiple languages and can extract text from both plain text and HTML files. The extract_text method handles different file types, and the configuration options allow you to customize the voice settings.
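To narrate the same document in several languages, one option is to build a list of jobs first and then run each through the narrator. This is only a sketch: narration_jobs is a hypothetical helper, and the default mp3 extension is an assumption based on the gem encoding its output with lame, so pass whatever extension matches the format your save call produces.

```ruby
# Hypothetical helper: builds one narration job per voice code, deriving an
# output name such as "docs/guide.es.mp3" from the input path.
def narration_jobs(input_path, voices, extension = 'mp3')
  dir = File.dirname(input_path)
  base = File.basename(input_path, '.*')
  voices.map do |voice|
    { voice: voice, output_path: File.join(dir, "#{base}.#{voice}.#{extension}") }
  end
end

narration_jobs('docs/guide.html', %w[en es]).each do |job|
  # With the DocumentNarrator class above, each job could be run as:
  #   DocumentNarrator.new(voice: job[:voice])
  #                   .narrate_document('docs/guide.html', job[:output_path])
  puts "#{job[:voice]} -> #{job[:output_path]}"
end
```

Keeping job construction separate from synthesis makes it easy to inspect or filter the list before any audio is generated.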

Best practices for speech synthesis

  1. Text Preprocessing:

    • Normalize text: Collapse redundant whitespace and strip characters the synthesizer would otherwise read aloud, such as leftover markup.

    • Handle abbreviations and numbers: Expand abbreviations, and spell out numbers where the synthesizer's default reading is unclear.

  2. Audio Quality:

    • Adjust speech rate and pitch: Fine-tuning these settings can result in more natural-sounding audio.

    • Use high-quality voices: If higher quality voices are needed, consider using premium services or more advanced open-source tools.

  3. Performance Optimization:

    • Process in chunks: For large documents, process text in smaller chunks to manage memory usage.

    • Caching: Implement caching mechanisms if you're converting the same text multiple times.
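The preprocessing, chunking, and caching ideas above can be sketched in plain Ruby. Everything here is illustrative: the abbreviation table, the 500-character default, and the helper names normalize_text, chunk_text, and cache_key are assumptions, not part of the espeak gem. Number expansion is left out.

```ruby
require 'digest'

# Hypothetical abbreviation table; extend it for your own documents.
ABBREVIATIONS = {
  'Dr.' => 'Doctor',
  'e.g.' => 'for example',
  'vs.' => 'versus'
}.freeze

# Collapses runs of whitespace and expands the abbreviations listed above.
def normalize_text(text)
  cleaned = text.gsub(/\s+/, ' ').strip
  ABBREVIATIONS.reduce(cleaned) { |result, (abbr, full)| result.gsub(abbr, full) }
end

# Splits text into chunks of at most max_chars, breaking after sentence-ending
# punctuation. A single sentence longer than max_chars is kept whole.
def chunk_text(text, max_chars = 500)
  sentences = text.split(/(?<=[.!?])\s+/)
  chunks = sentences.each_with_object(['']) do |sentence, acc|
    candidate = acc.last.empty? ? sentence : "#{acc.last} #{sentence}"
    if acc.last.empty? || candidate.length <= max_chars
      acc[-1] = candidate
    else
      acc << sentence
    end
  end
  chunks.reject(&:empty?)
end

# Derives a cache key from the text and voice settings, so a repeated request
# can reuse an existing audio file instead of synthesizing again.
def cache_key(text, options)
  settings = options.sort_by { |key, _| key.to_s }.map { |key, value| "#{key}=#{value}" }.join('&')
  Digest::SHA256.hexdigest("#{settings}\n#{text}")
end

chunks = chunk_text(normalize_text("Dr.  Smith arrived.\nThe talk began."), 25)
chunks.each { |chunk| puts "#{cache_key(chunk, voice: 'en', speed: 120)[0, 8]}  #{chunk}" }
```

Each chunk could then be synthesized separately and written to a file named after its cache key, skipping chunks whose file already exists.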

Error handling and validation

Validate inputs up front so that common problems, such as a missing input file or a nonexistent output directory, produce clear errors:

class DocumentNarratorError < StandardError; end

class DocumentNarrator
  # ... [previous code] ...

  def narrate_document(input_path, output_path)
    validate_file(input_path)
    validate_output_path(output_path)
    text = extract_text(input_path)
    speech = Speech.new(text, voice: @config[:voice], speed: @config[:speed], pitch: @config[:pitch])
    speech.save(output_path)
  end

  private

  def validate_file(file_path)
    raise DocumentNarratorError, 'File not found' unless File.exist?(file_path)
    raise DocumentNarratorError, 'File is empty' if File.zero?(file_path)
    # Additional validations as needed
  end

  def validate_output_path(path)
    dir = File.dirname(path)
    raise DocumentNarratorError, 'Invalid output directory' unless Dir.exist?(dir)
    # Additional validations as needed
  end

  # ... [rest of the class] ...
end
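Callers then need to decide what to do when validation fails. One possible approach, sketched below, is a small wrapper that rescues DocumentNarratorError and returns a result hash. The narrate_safely helper is hypothetical, and the error class is repeated only so the snippet runs on its own.

```ruby
# Repeated here so the sketch is self-contained; it matches the class above.
class DocumentNarratorError < StandardError; end

# Hypothetical wrapper: runs a narration and reports the outcome as a hash
# instead of letting validation errors propagate to the caller.
# `narrator` is any object that responds to narrate_document(input, output).
def narrate_safely(narrator, input_path, output_path)
  narrator.narrate_document(input_path, output_path)
  { ok: true, output_path: output_path }
rescue DocumentNarratorError => e
  { ok: false, error: e.message }
end

# Possible usage with the DocumentNarrator class from this article:
#   result = narrate_safely(DocumentNarrator.new, 'missing.txt', 'out.mp3')
#   warn "Narration failed: #{result[:error]}" unless result[:ok]
```

Only DocumentNarratorError is rescued, so unexpected failures still surface as exceptions rather than being silently swallowed.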

Conclusion

Speech synthesis in Ruby is a practical way to add audio narration to your document processing workflows. With the espeak-ruby gem and the eSpeak synthesizer, you can build customizable text-to-speech tools that run offline. For production environments that need high-quality speech synthesis with support for multiple languages and voices, consider using Transloadit's Text to Speech Robot.