Efficient file filtering is essential for everything from log processing to media management. Ruby provides robust tools for file selection that go beyond simple pattern matching. Let's explore practical techniques that scale from basic to advanced scenarios.

Requirements

This guide requires Ruby 2.5 or later, since the examples use Dir.children and the safe navigation operator (&.). They have been verified with Ruby 3.2.3 and mime-types 3.6.0. For MIME type detection, you'll need the mime-types gem:

# Add to your Gemfile
gem 'mime-types', '~> 3.6'

# Or install directly
gem install mime-types

Note: The mime-types gem employs modified semantic versioning to track both API changes and registry data updates. For further details, refer to the mime-types documentation.

Core filtering methods

Ruby's standard library offers several immediate solutions for file filtering:

# Filter by extension (case-insensitive)
pdf_files = Dir.children('/docs').select { |f| File.extname(f).downcase == '.pdf' }

# Filter by size (1 MB threshold)
large_files = Dir.glob('*').select do |f|
  begin
    File.size(f) > 1_000_000
  rescue Errno::ENOENT, Errno::EACCES => e
    warn "Error accessing #{f}: #{e.message}"
    false
  end
end

# Filter by modification time (last 24 hours)
recent_files = Dir.glob('*').select do |f|
  begin
    File.mtime(f) > (Time.now - 86400)
  rescue Errno::ENOENT, Errno::EACCES => e
    warn "Error accessing #{f}: #{e.message}"
    false
  end
end

These methods rely on File class utilities such as File.extname, File.size, and File.mtime for quick metadata checks without reading file contents.
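
If you need several of these checks at once, File.stat returns a File::Stat object that bundles size, mtime, and file type in a single call. Here is a minimal sketch combining the size and age checks above; the 1 MB and 24-hour thresholds are the same illustrative values used earlier.

# Combine size and age checks with a single stat call per file
recent_large = Dir.glob('*').select do |f|
  begin
    stat = File.stat(f)
    stat.file? && stat.size > 1_000_000 && stat.mtime > (Time.now - 86400)
  rescue Errno::ENOENT, Errno::EACCES => e
    warn "Error accessing #{f}: #{e.message}"
    false
  end
end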

Advanced pattern matching with glob

Ruby's Dir.glob supports UNIX-style pattern matching with some Ruby-specific enhancements:

# Match nested Markdown files
markdown_files = Dir.glob('**/*.md')

# Select files whose names contain a January 2024 date (matches names, not modification times)
jan_files = Dir.glob('*').grep(/(2024-01-\d{2})/)

# Combined size and type filter
big_images = Dir.glob('*.{jpg,png}').select do |f|
  begin
    File.size(f) > 500_000
  rescue Errno::ENOENT, Errno::EACCES => e
    warn "Error accessing #{f}: #{e.message}"
    false
  end
end

Use double star (**) for recursive directory traversal and brace expansion for multiple extensions.
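
The two can be combined in a single pattern. The sketch below assumes a hypothetical assets/ directory and an illustrative set of extensions; the base: keyword (Ruby 2.5+) scopes the search to a directory without prefixing the results with its path.

# Recursively match several image extensions under an assumed assets/ directory
image_files = Dir.glob('assets/**/*.{jpg,jpeg,png,gif}')

# Scope the same search with base: so results are relative to assets/
relative_images = Dir.glob('**/*.{jpg,jpeg,png,gif}', base: 'assets')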

MIME type detection

For more robust type checks than hard-coded extension comparisons, use the mime-types gem:

require 'mime/types'

def media_files(dir)
  Dir.children(dir).select do |f|
    begin
      mime = MIME::Types.type_for(f).first
      mime&.media_type == 'image' || mime&.media_type == 'video'
    rescue StandardError => e
      warn "Error processing #{f}: #{e.message}"
      false
    end
  end
end

# Usage:
visual_assets = media_files('/content/assets')

MIME::Types.type_for maps a file name's extension through the gem's type registry, so you get canonical media types without hand-maintaining extension lists; note that it does not inspect file contents. For additional details, see the mime-types gem documentation.
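
The same lookup can target a specific content type. The following sketch, which assumes a hypothetical /docs directory, keeps only files whose registered type is application/pdf:

# Keep files whose extension maps to application/pdf in the registry
pdf_docs = Dir.children('/docs').select do |f|
  MIME::Types.type_for(f).any? { |t| t.content_type == 'application/pdf' }
end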

Metadata filtering

Combine multiple metadata points for precise selection:

def recent_documents(path)
  Dir.glob("#{path}/*").select do |f|
    begin
      next unless File.file?(f)

      ext = File.extname(f).downcase
      size = File.size(f)
      modified = File.mtime(f)

      (ext == '.pdf' || ext == '.docx') &&
        size.between?(10_000, 5_000_000) &&
        modified > (Time.now - 7*86400)
    rescue StandardError => e
      warn "Error processing #{f}: #{e.message}"
      false
    end
  end
end

This selects PDF and DOCX files between 10 KB and 5 MB that were modified within the last week.
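
A small usage sketch, with a placeholder path, that lists the matches newest first:

# Hypothetical usage: print matching documents, most recently modified first
recent_documents('/shared/reports')
  .sort_by { |f| File.mtime(f) }
  .reverse
  .each { |f| puts f }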

Performance considerations

When processing large directories, apply these techniques (a combined sketch follows the list):

  1. Lazy Evaluation: Use lazy with large result sets

    Dir.glob('**/*').lazy
      .select { |f| File.size(f) > 1_000_000 }
      .first(10)
    
  2. Early Exit: Fail fast with break when possible

    Dir.children('/logs').each do |f|
      begin
        next unless f.end_with?('.log')
        break if File.size(f) > 1_000_000_000  # Stop at first huge log
        process_log(f)
      rescue StandardError => e
        warn "Error processing #{f}: #{e.message}"
      end
    end
    
  3. Metadata Caching: Store frequently accessed data

    # Note (about the gem's own registry cache, not this hash): as of mime-types 3.6.0,
    # caching is keyed on the mime-types data gem version rather than the mime-types version.
    file_cache = {}
    Dir.glob('*').each do |f|
      begin
        file_cache[f] = {
          mtime: File.mtime(f),
          size: File.size(f)
        }
      rescue StandardError => e
        warn "Error caching #{f}: #{e.message}"
      end
    end
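
These techniques compose well. The sketch below, using illustrative thresholds, fills a stat cache once and then answers several queries without touching the filesystem again:

# Build the cache with one stat call per file
stats = {}
Dir.glob('**/*').each do |f|
  begin
    stats[f] = File.stat(f)
  rescue Errno::ENOENT, Errno::EACCES => e
    warn "Error caching #{f}: #{e.message}"
  end
end

# Reuse the cached File::Stat objects for multiple filters
large_files  = stats.select { |_, s| s.file? && s.size > 1_000_000 }.keys
recent_files = stats.select { |_, s| s.file? && s.mtime > (Time.now - 86400) }.keys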
    

Production-grade example

Here's a complete filtering module with error handling:

require 'mime/types'

class FileFilter
  def initialize(root_dir)
    @root = root_dir
  end

  def find_files(extensions: [], min_size: 0, max_age: Float::INFINITY)
    Dir.glob(File.join(@root, '**', '*')).lazy.select do |path|
      begin
        next unless File.file?(path)

        valid_extension = extensions.empty? || extensions.include?(File.extname(path))
        valid_size = File.size(path) >= min_size
        valid_age = (Time.now - File.mtime(path)) < max_age

        valid_extension && valid_size && valid_age
      rescue StandardError => e
        warn "Error processing #{path}: #{e.message}"
        false
      end
    end
  rescue Errno::ENOENT => e
    warn "Directory error: #{e.message}"
    [].lazy
  end
end

# Usage:
filter = FileFilter.new('/user/uploads')
recent_images = filter.find_files(
  extensions: ['.jpg', '.png'],
  min_size: 100_000,
  max_age: 3600 # 1 hour
).first(100) # first forces the lazy enumerator and returns an array

These examples illustrate practical ways to filter files in Ruby with robust error handling. For more advanced file processing solutions, consider exploring Transloadit's API—visit our documentation for additional details.

Handling file encodings

Ruby typically handles file name encodings automatically when using UTF-8. However, if you encounter issues with non-ASCII characters in file names, you can enforce UTF-8 encoding as follows:

# Relabel file names as UTF-8 (no transcoding takes place)
utf8_files = Dir.glob('*').map { |f| f.force_encoding('UTF-8') }

Keep in mind that force_encoding only relabels the bytes as UTF-8 without transcoding or validating them, so it works best when the names are already valid UTF-8 but carry the wrong encoding label.
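
If some names contain bytes that are not valid UTF-8, String#scrub (Ruby 2.1+) can replace the offending sequences. A sketch with an illustrative replacement character:

# Relabel as UTF-8, then replace any byte sequences that are still invalid
clean_names = Dir.glob('*').map { |f| f.force_encoding('UTF-8').scrub('?') }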

Conclusion

Throughout this guide, we explored practical techniques to filter files in Ruby—from simple extension checks and glob patterns to MIME type detection and metadata filtering. By combining these methods, you can build efficient file processing pipelines tailored to your needs. For large-scale, production-ready solutions, consider exploring the robust file processing API offered by Transloadit. Visit our documentation to learn more.