Synthesize speech in documents using Ruby

Converting text documents into speech can enhance accessibility and offer users new ways to engage with your content. In this DevTip, we explore how to synthesize speech in documents using Ruby and the Google Cloud Text-to-Speech library, complete with practical examples and best practices.
Setting up Google Cloud Text-to-Speech
Google Cloud Text-to-Speech provides high-quality voices, extensive language support, and seamless integration. Set up your Ruby environment by adding the gem to your Gemfile:
gem 'google-cloud-text_to_speech'
Or install it directly:
gem install google-cloud-text_to_speech
Set up authentication by creating a service account and downloading the credentials file. Then, set the environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="path/to/service-account-key.json"
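Before making the first API call, it can help to confirm that the variable points at a readable JSON key file. The following is a minimal sketch using only the Ruby standard library. The helper name `credentials_problem` is hypothetical, and it only checks the shape of the file, not whether Google accepts the key.

```ruby
require "json"

# Hypothetical helper: returns nil when the credentials file looks usable,
# or a short description of the problem otherwise.
def credentials_problem(path = ENV["GOOGLE_APPLICATION_CREDENTIALS"])
  return "GOOGLE_APPLICATION_CREDENTIALS is not set" if path.nil? || path.empty?
  return "File not found: #{path}" unless File.file?(path)

  key = JSON.parse(File.read(path))
  return "Credentials file is not a JSON object" unless key.is_a?(Hash)
  # Service account key files carry a "type" field set to "service_account".
  return "Not a service account key" unless key["type"] == "service_account"

  nil
rescue JSON::ParserError
  "Credentials file is not valid JSON"
end

problem = credentials_problem
warn "Credentials check failed: #{problem}" if problem
```

Running a check like this at startup surfaces a missing or malformed key file with a clear message rather than as an authentication error on the first request.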
Basic document narration script
Below is a simple script that converts text documents into audio files using Google Cloud Text-to-Speech. The script supports both plain text and HTML documents.
require "google/cloud/text_to_speech"
require "nokogiri"

class DocumentNarrator
  def initialize(credentials_path = nil)
    ENV["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path if credentials_path
    @client = Google::Cloud::TextToSpeech.text_to_speech
  end

  # Synthesizes the given text and writes the MP3 audio to output_path.
  def narrate_text(text, output_path, language_code: "en-US")
    input = { text: text }
    voice = { language_code: language_code, ssml_gender: "FEMALE" }
    audio_config = { audio_encoding: "MP3", speaking_rate: 1.0, pitch: 0.0 }

    response = @client.synthesize_speech(
      input: input,
      voice: voice,
      audio_config: audio_config
    )

    File.open(output_path, "wb") do |file|
      file.write(response.audio_content)
    end
  end

  # Extracts the text from a .txt or .html file and narrates it.
  def narrate_file(input_path, output_path, language_code: "en-US")
    text = extract_text(input_path)
    narrate_text(text, output_path, language_code: language_code)
  end

  private

  def extract_text(file_path)
    case File.extname(file_path).downcase
    when ".txt"
      File.read(file_path)
    when ".html", ".htm"
      doc = Nokogiri::HTML(File.read(file_path))
      # Drop script and style elements so their contents are not read aloud.
      doc.css("script, style").remove
      # Join the text nodes and collapse runs of whitespace.
      doc.xpath("//text()").map(&:text).join(" ").gsub(/\s+/, " ").strip
    else
      raise ArgumentError, "Unsupported file format"
    end
  end
end
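The script above always requests MP3, whatever the extension of `output_path`. As a small sketch, assuming you want the output extension to drive the format, a lookup table can map extensions to `audio_encoding` values. The helper name `encoding_for` and the extension mapping are illustrative choices, not part of the library; `MP3`, `LINEAR16`, and `OGG_OPUS` are encodings the API accepts.

```ruby
# Illustrative mapping from output file extension to audio_encoding value.
ENCODINGS_BY_EXTENSION = {
  ".mp3" => "MP3",
  ".wav" => "LINEAR16", # returned with a WAV header
  ".ogg" => "OGG_OPUS"
}.freeze

# Hypothetical helper: picks the encoding that matches an output path.
def encoding_for(output_path)
  ext = File.extname(output_path).downcase
  ENCODINGS_BY_EXTENSION.fetch(ext) do
    raise ArgumentError, "Unsupported audio format: #{ext.inspect}"
  end
end

audio_config = {
  audio_encoding: encoding_for("chapter-1.wav"),
  speaking_rate: 1.0,
  pitch: 0.0
}
```

Failing early on an unknown extension avoids writing, for example, MP3 data into a file named `.wav`.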
Advanced features and error handling
Enhance your narrator with robust error handling and additional configuration options. This version supports multi-language selection, custom audio configuration, and detailed error reporting.
require "google/cloud/text_to_speech"
require "nokogiri"

class DocumentNarratorError < StandardError; end

class DocumentNarrator
  # Language codes offered to callers, with display names.
  SUPPORTED_LANGUAGES = {
    "en-US" => "English (US)",
    "en-GB" => "English (UK)",
    "fr-FR" => "French",
    "de-DE" => "German",
    "es-ES" => "Spanish"
  }.freeze

  # The API limits the input of a single request to 5,000 bytes.
  MAX_INPUT_BYTES = 5000

  def initialize(credentials_path = nil, config = {})
    ENV["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path if credentials_path
    @client = Google::Cloud::TextToSpeech.text_to_speech
    @config = default_config.merge(config)
  end

  # Per-call options override the defaults passed to the constructor.
  def narrate_text(text, output_path, options = {})
    validate_text(text)
    validate_output_path(output_path)

    input = { text: text }
    voice = build_voice_config(options)
    audio_config = build_audio_config(options)

    begin
      response = @client.synthesize_speech(
        input: input,
        voice: voice,
        audio_config: audio_config
      )

      File.open(output_path, "wb") do |file|
        file.write(response.audio_content)
      end
    rescue Google::Cloud::Error => e
      # Wrap API errors so callers only need to rescue one error class.
      raise DocumentNarratorError, "Synthesis failed: #{e.message}"
    end
  end

  private

  def default_config
    {
      language_code: "en-US",
      speaking_rate: 1.0,
      pitch: 0.0,
      audio_encoding: "MP3"
    }
  end

  def build_voice_config(options)
    {
      language_code: options[:language_code] || @config[:language_code],
      ssml_gender: options[:gender] || "FEMALE"
    }
  end

  def build_audio_config(options)
    {
      audio_encoding: options[:audio_encoding] || @config[:audio_encoding],
      speaking_rate: options[:speaking_rate] || @config[:speaking_rate],
      pitch: options[:pitch] || @config[:pitch]
    }
  end

  def validate_text(text)
    raise DocumentNarratorError, "Text cannot be empty" if text.nil? || text.strip.empty?
    # Measure bytes, not characters: multibyte text reaches the limit sooner.
    raise DocumentNarratorError, "Text too long" if text.bytesize > MAX_INPUT_BYTES
  end

  def validate_output_path(path)
    dir = File.dirname(path)
    raise DocumentNarratorError, "Invalid output directory" unless Dir.exist?(dir)
  end
end
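Because `validate_text` rejects oversized input, longer documents need to be split before synthesis. Below is one possible sketch that packs whole sentences into chunks under a byte budget. The helper name `chunk_text`, the sentence-splitting regex, and the 5,000-byte default are assumptions for illustration; a single sentence longer than the budget is cut on character boundaries as a fallback.

```ruby
# Hypothetical helper: splits text into chunks of at most max_bytes bytes,
# preferring sentence boundaries.
def chunk_text(text, max_bytes: 5000)
  # A UTF-8 character can take up to 4 bytes, so smaller budgets cannot work.
  raise ArgumentError, "max_bytes must be at least 4" if max_bytes < 4

  # Naive sentence split: break after ., ! or ? followed by whitespace.
  sentences = text.strip.split(/(?<=[.!?])\s+/)
  chunks = []
  current = ""

  sentences.each do |sentence|
    candidate = current.empty? ? sentence : "#{current} #{sentence}"
    if candidate.bytesize <= max_bytes
      current = candidate
      next
    end

    # The sentence does not fit: close the current chunk and start a new one.
    chunks << current unless current.empty?

    # Fallback for one oversized sentence: cut on character boundaries.
    while sentence.bytesize > max_bytes
      piece = +""
      sentence.each_char do |char|
        break if piece.bytesize + char.bytesize > max_bytes
        piece << char
      end
      chunks << piece
      sentence = sentence[piece.length..]
    end
    current = sentence
  end

  chunks << current unless current.empty?
  chunks
end
```

Each chunk can then be passed to `narrate_text` with its own output path, for example `part-001.mp3`, `part-002.mp3`, and so on.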
Best practices for speech synthesis
Consider these best practices when implementing text-to-speech functionality:
- Text preprocessing:
  - Break long text into chunks of 5,000 bytes or fewer, the input limit for a single request.
  - Remove unnecessary whitespace and special characters.
  - Expand abbreviations and numbers for improved pronunciation.
- Voice configuration:
  - Select the appropriate language and voice for your content.
  - Adjust speaking rate and pitch for natural-sounding audio.
  - Use SSML markup for fine-grained control, including pauses and emphasis.
- Performance optimization:
  - Cache generated audio to avoid repeating API calls for unchanged text.
  - Process large documents in batches of chunks.
  - Consider streaming methods for long audio files.
- Security considerations:
  - Securely store and manage your API credentials.
  - Implement rate limiting and input validation to prevent abuse.
  - Monitor API usage and set alerts to control costs.
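For the SSML suggestion above, the request input takes an `ssml` field in place of `text`. The sketch below builds a small SSML document with a pause between paragraphs. The helper names `escape_ssml` and `paragraphs_to_ssml` and the 500 ms default pause are illustrative assumptions. Reserved XML characters are escaped so that document text cannot break the markup.

```ruby
XML_ESCAPES = {
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;",
  '"' => "&quot;",
  "'" => "&apos;"
}.freeze

# Hypothetical helper: escapes characters that are reserved in XML.
def escape_ssml(text)
  text.gsub(/[&<>"']/, XML_ESCAPES)
end

# Hypothetical helper: wraps paragraphs in <speak>, separated by a pause.
def paragraphs_to_ssml(paragraphs, pause: "500ms")
  body = paragraphs
         .map { |para| "<p>#{escape_ssml(para.strip)}</p>" }
         .join(%(<break time="#{pause}"/>))
  "<speak>#{body}</speak>"
end

ssml = paragraphs_to_ssml(["Tom & Jerry", "Chapter 2"])
# Pass this hash as `input:` to synthesize_speech instead of { text: ... }.
input = { ssml: ssml }
```

Keep in mind that the markup is part of the request input, so tags count toward the size of each chunk.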
Alternative solutions
While Google Cloud Text-to-Speech is a robust solution, you might consider other options:
- Amazon Polly:
require "aws-sdk-polly"

polly = Aws::Polly::Client.new(region: "us-west-2")
response = polly.synthesize_speech({
  text: "Hello World",
  output_format: "mp3",
  voice_id: "Joanna"
})
- Microsoft Azure Cognitive Services:
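require "net/http"
# Azure ships no Speech SDK for Ruby, so this sketch calls the REST endpoint; the voice name is an example.
uri = URI("https://your-region.tts.speech.microsoft.com/cognitiveservices/v1")
headers = { "Ocp-Apim-Subscription-Key" => "your-key", "Content-Type" => "application/ssml+xml",
            "X-Microsoft-OutputFormat" => "audio-16khz-128kbitrate-mono-mp3", "User-Agent" => "document-narrator" }
ssml = "<speak version='1.0' xml:lang='en-US'><voice name='en-US-JennyNeural'>Hello World</voice></speak>"
response = Net::HTTP.post(uri, ssml, headers)
File.binwrite("output.mp3", response.body) if response.is_a?(Net::HTTPSuccess)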
- Open-source Coqui TTS:
require "net/http"
require "uri"

# Assumes a local Coqui TTS demo server, which reads the text from a request parameter rather than a JSON body.
uri = URI("http://localhost:5002/api/tts")
uri.query = URI.encode_www_form(text: "Hello World")
response = Net::HTTP.get_response(uri)
File.binwrite("output.wav", response.body) if response.is_a?(Net::HTTPSuccess)
Conclusion
Modern text-to-speech solutions empower you to create engaging, accessible audio content from documents using Ruby. Google Cloud Text-to-Speech, along with other alternatives, offers high-quality voices and multi-language support. For a simplified solution with excellent quality and easy integration, consider using Transloadit's Text to Speech Robot with Uppy and Tus.