Filter files in Ruby: a practical guide

Efficient file filtering is essential for everything from log processing to media management. Ruby provides robust tools for file selection that go beyond simple pattern matching. Let's explore practical techniques that scale from basic to advanced scenarios.
Requirements
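This guide requires Ruby 2.5 or later, since the examples use Dir.children (added in Ruby 2.5) and the safe navigation operator &. (added in Ruby 2.3). The examples presented have been verified using Ruby 3.2.3 and mime-types 3.6.0. For MIME type detection, you'll need the mime-types gem:
# Add to your Gemfile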
gem 'mime-types', '~> 3.6'
# Or install directly
gem install mime-types
Note: The mime-types gem employs modified semantic versioning to track both API changes and registry data updates. For further details, refer to the mime-types documentation.
Core filtering methods
Ruby's standard library offers several immediate solutions for file filtering:
# Filter by extension
pdf_files = Dir.children('/docs').select { |f| File.extname(f) == '.pdf' }
# Filter by size (1 MB threshold)
large_files = Dir.glob('*').select do |f|
  begin
    File.size(f) > 1_000_000
  rescue Errno::ENOENT, Errno::EACCES => e
    warn "Error accessing #{f}: #{e.message}"
    false
  end
end
# Filter by modification time (last 24 hours)
recent_files = Dir.glob('*').select do |f|
  begin
    File.mtime(f) > (Time.now - 86400)
  rescue Errno::ENOENT, Errno::EACCES => e
    warn "Error accessing #{f}: #{e.message}"
    false
  end
end
These methods use the File class utilities for quick checks without loading file contents.
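One detail worth keeping in mind: Dir.children returns bare entry names, not paths, so any File call on them needs the directory joined back on. A minimal sketch of that pattern, assuming a hypothetical helper named files_with_extension (it is not part of Ruby or of the examples above):

```ruby
# Hypothetical helper: returns the full paths of regular files in `dir`
# whose extension equals `ext`, ignoring case.
def files_with_extension(dir, ext)
  Dir.children(dir)
     .map { |name| File.join(dir, name) } # Dir.children yields names only
     .select { |path| File.file?(path) && File.extname(path).casecmp?(ext) }
     .sort
end
```

Calling files_with_extension('/docs', '.pdf') would then return paths that can be passed straight to File.size or File.mtime, whatever the current working directory is.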
Advanced pattern matching with glob
Ruby's Dir.glob supports UNIX-style pattern matching with some Ruby-specific enhancements:
# Match nested Markdown files
markdown_files = Dir.glob('**/*.md')
# Match files whose names contain a January 2024 date (this checks the name, not the modification time)
jan_files = Dir.glob('*').grep(/2024-01-\d{2}/)
# Combined size and type filter
big_images = Dir.glob('*.{jpg,png}').select do |f|
  begin
    File.size(f) > 500_000
  rescue Errno::ENOENT, Errno::EACCES => e
    warn "Error accessing #{f}: #{e.message}"
    false
  end
end
Use double star (**) for recursive directory traversal and brace expansion for multiple extensions.
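Two further Dir.glob options can be useful here: the base: keyword (Ruby 2.5+) resolves the pattern against a directory and returns paths relative to it, and the File::FNM_DOTMATCH flag lets wildcards match names that start with a dot. The sketch below combines them; the helper name markdown_files_under and its include_hidden parameter are assumptions made for illustration.

```ruby
# Hypothetical helper: lists Markdown files under `root` as paths relative
# to `root`. Pass include_hidden: true to also match dotfiles and files
# inside dot-directories.
def markdown_files_under(root, include_hidden: false)
  flags = include_hidden ? File::FNM_DOTMATCH : 0
  Dir.glob('**/*.md', flags, base: root).sort
end
```

Because the returned paths are relative to root, join them with root again (File.join(root, path)) before passing them to File.size or File.mtime.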
MIME type detection
To classify files by media type rather than by a hand-maintained list of extensions, use the mime-types gem:
require 'mime/types'
def media_files(dir)
  Dir.children(dir).select do |f|
    begin
      mime = MIME::Types.type_for(f).first
      mime&.media_type == 'image' || mime&.media_type == 'video'
    rescue StandardError => e
      warn "Error processing #{f}: #{e.message}"
      false
    end
  end
end
# Usage:
visual_assets = media_files('/content/assets')
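This approach looks up each file name's extension in the MIME type registry, so a single check covers every extension registered for a media type. It does not inspect file contents, so a renamed file is still classified by its extension. For additional details, see the mime-types gem documentation.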
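When the extension cannot be trusted, a different technique is to read the first few bytes of the file and compare them against known signatures ("magic numbers"). The following is a rough sketch of that idea using only the standard library. It is not something the mime-types gem does, and the MAGIC_SIGNATURES table and sniff_type helper are hypothetical names covering just a handful of formats.

```ruby
# Hypothetical signature table: leading bytes for a few common formats.
# This is an illustrative subset, not a complete registry.
MAGIC_SIGNATURES = {
  'image/png'       => "\x89PNG\r\n\x1A\n".b,
  'image/jpeg'      => "\xFF\xD8\xFF".b,
  'image/gif'       => 'GIF8'.b,
  'application/pdf' => '%PDF-'.b
}.freeze

# Returns the MIME type string whose signature matches the start of the
# file, or nil when nothing matches or the file cannot be read.
def sniff_type(path)
  header = File.binread(path, 8) || '' # nil-safe for empty files
  MAGIC_SIGNATURES.each { |type, magic| return type if header.start_with?(magic) }
  nil
rescue SystemCallError => e
  warn "Error reading #{path}: #{e.message}"
  nil
end
```

A check like this costs one small read per file, so it probably makes sense to apply it only to files that already passed a cheaper extension or size filter.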
Metadata filtering
Combine multiple metadata points for precise selection:
def recent_documents(path)
  Dir.glob("#{path}/*").select do |f|
    begin
      next unless File.file?(f)
      ext = File.extname(f).downcase
      size = File.size(f)
      modified = File.mtime(f)
      (ext == '.pdf' || ext == '.docx') &&
        size.between?(10_000, 5_000_000) &&
        modified > (Time.now - 7*86400)
    rescue StandardError => e
      warn "Error processing #{f}: #{e.message}"
      false
    end
  end
end
This selects PDF/DOCX files modified in the last week between 10 KB and 5 MB.
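File.file?, File.size, and File.mtime each query the file system separately. As a possible refinement, File.stat returns a File::Stat object that answers all three questions from a single call. The helper below is a sketch under that assumption; the name matches_metadata? and its keyword parameters are illustrative, not part of the original example.

```ruby
# Hypothetical helper: true when `path` is a regular file whose size in
# bytes falls within `size_range` and which was modified less than
# `max_age` seconds ago. One File.stat call supplies all three facts.
def matches_metadata?(path, size_range:, max_age:)
  stat = File.stat(path)
  stat.file? && size_range.cover?(stat.size) && (Time.now - stat.mtime) < max_age
rescue SystemCallError => e
  warn "Error processing #{path}: #{e.message}"
  false
end
```

For the one-week, 10 KB to 5 MB rule above, the call would look like matches_metadata?(f, size_range: 10_000..5_000_000, max_age: 7 * 86_400).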
Performance considerations
When processing large directories:
- Lazy Evaluation: Use lazy with large result sets
  Dir.glob('**/*').lazy
    .select { |f| File.size(f) > 1_000_000 }
    .first(10)
- Early Exit: Fail fast with break when possible
  Dir.children('/logs').each do |f|
    begin
      next unless f.end_with?('.log')
      path = File.join('/logs', f) # Dir.children returns names, not paths
      break if File.size(path) > 1_000_000_000 # Stop at first huge log
      process_log(path)
    rescue StandardError => e
      warn "Error processing #{f}: #{e.message}"
    end
  end
- Metadata Caching: Store frequently accessed data
  # Note: As of mime-types 3.6.0, caching is based on the data gem version instead of the mime-types version.
  file_cache = {}
  Dir.glob('*').each do |f|
    begin
      file_cache[f] = { mtime: File.mtime(f), size: File.size(f) }
    rescue StandardError => e
      warn "Error caching #{f}: #{e.message}"
    end
  end
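To make the effect of lazy evaluation visible, the sketch below counts how many File.size calls are actually made before enough matches are found. It assumes a hypothetical helper named first_large_files and uses Dir.each_child (Ruby 2.5+), which returns an enumerator when called without a block.

```ruby
# Hypothetical helper: returns up to `limit` entry names in `dir` that are
# larger than `min_size` bytes, plus the number of File.size calls made.
def first_large_files(dir, min_size:, limit:)
  checks = 0
  names = Dir.each_child(dir).lazy.select do |name|
    checks += 1 # counts how many entries were actually inspected
    File.size(File.join(dir, name)) > min_size
  end.first(limit)
  [names, checks]
end
```

With an eager select every entry would be checked before first(limit) runs. With lazy, iteration stops as soon as limit matches have been collected, so the count can be far smaller than the number of entries.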
Production-grade example
Here's a complete filtering module with error handling:
require 'mime/types'
class FileFilter
  def initialize(root_dir)
    @root = root_dir
  end

  # Returns a lazy enumerator of matching file paths.
  def find_files(extensions: [], min_size: 0, max_age: Float::INFINITY)
    Dir.glob(File.join(@root, '**', '*')).lazy.select do |path|
      begin
        next unless File.file?(path)
        valid_extension = extensions.empty? || extensions.include?(File.extname(path))
        valid_size = File.size(path) >= min_size
        valid_age = (Time.now - File.mtime(path)) < max_age
        valid_extension && valid_size && valid_age
      rescue StandardError => e
        warn "Error processing #{path}: #{e.message}"
        false
      end
    end
  rescue Errno::ENOENT => e
    warn "Directory error: #{e.message}"
    [].lazy
  end
end
# Usage:
filter = FileFilter.new('/user/uploads')
recent_images = filter.find_files(
  extensions: ['.jpg', '.png'],
  min_size: 100_000,
  max_age: 3600 # 1 hour
).first(100) # first forces evaluation; take would return another lazy enumerator
These examples illustrate practical ways to filter files in Ruby with robust error handling. For more advanced file processing solutions, consider exploring Transloadit's API; visit our documentation for additional details.
Handling file encodings
Ruby typically handles file name encodings automatically when using UTF-8. However, if you encounter issues with non-ASCII characters in file names, you can enforce UTF-8 encoding as follows:
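# Relabel file names as UTF-8 (the bytes themselves are not converted)
utf8_files = Dir.glob('*').map { |f| f.force_encoding('UTF-8') }
Note that force_encoding only changes the encoding label on the string and does not transcode its bytes. This works when the names are already UTF-8 bytes carrying the wrong label; if they may come from a file system that uses a different encoding, check them with valid_encoding? before relying on the result.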
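A slightly more defensive variant could leave the original string untouched and replace any bytes that are not valid UTF-8. The sketch below assumes a hypothetical helper named utf8_name and relies only on String#valid_encoding? and String#scrub from core Ruby.

```ruby
# Hypothetical helper: returns a UTF-8 copy of `name`. Byte sequences that
# are not valid UTF-8 are replaced with `replacement`.
def utf8_name(name, replacement: '?')
  copy = name.dup.force_encoding(Encoding::UTF_8)
  copy.valid_encoding? ? copy : copy.scrub(replacement)
end
```

Applied to a listing, this would read Dir.glob('*').map { |f| utf8_name(f) }. Keep in mind that a scrubbed name no longer matches the name on disk, so it is suited to display or logging rather than to reopening the file.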
Conclusion
Throughout this guide, we explored practical techniques to filter files in Ruby, from simple extension checks and glob patterns to MIME type detection and metadata filtering. By combining these methods, you can build efficient file processing pipelines tailored to your needs. For large-scale, production-ready solutions, consider exploring the robust file processing API offered by Transloadit. Visit our documentation to learn more.