File archiving is a routine part of managing software projects. The tar command bundles multiple files and directories into a single archive file. The techniques in this guide, including parallel compression, exclusion patterns, incremental backups, and streaming over SSH, can shorten archiving time and make backups easier to automate.

Understanding tar and its importance

The tar (tape archive) command is a staple in Unix-like systems for creating and manipulating archive files. It is widely used for backup purposes, software distribution, and combining multiple files into one for easier handling. This guide uses GNU tar 1.34 and pigz 2.8.

Basic tar commands

  • Creating an archive:

    tar -cf archive.tar /path/to/directory
    
  • Extracting an archive:

    tar -xf archive.tar
    
  • Listing contents of an archive:

    tar -tf archive.tar
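
The three commands above can be tried end to end in a scratch directory. This is a minimal sketch: the work variable and the project/src/main.txt layout are made up for illustration. It uses -C to change directory before archiving, so member names are stored relative to that directory. When an absolute path is given instead, GNU tar strips the leading / from member names.

```shell
# Round trip: create, list, and extract an archive in a scratch directory.
work=$(mktemp -d)                       # hypothetical scratch location
mkdir -p "$work/project/src"
echo 'hello' > "$work/project/src/main.txt"

# -C switches directory first, so members are stored as project/...
tar -cf "$work/archive.tar" -C "$work" project

# List the members without extracting anything.
tar -tf "$work/archive.tar"

# Extract into a separate directory instead of the current one.
mkdir "$work/restore"
tar -xf "$work/archive.tar" -C "$work/restore"
cat "$work/restore/project/src/main.txt"
```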
    

Advanced compression options with tar

Compressing archives reduces storage space and shortens transfers. tar supports several compression methods, so you can trade compression speed against compression ratio.

Using gzip

The most common compression method with tar is using gzip:

tar -czf archive.tar.gz /path/to/directory

The -z option tells tar to compress the archive using gzip.

Using bzip2

For better compression at the cost of speed, use bzip2:

tar -cjf archive.tar.bz2 /path/to/directory

The -j option uses bzip2 for compression.

Using xz

For maximum compression ratio:

tar -cJf archive.tar.xz /path/to/directory

The -J option uses xz, which provides higher compression ratios than gzip or bzip2, albeit with slower compression speed.
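
To see the trade-off on your own data, one option is to build the same archive with each compressor and compare sizes. The sketch below assumes a generated sample file (the names are made up) and uses -a (--auto-compress), which picks the compressor from the archive suffix instead of -z, -j, or -J. Compressors that are not installed are skipped.

```shell
work=$(mktemp -d)
mkdir "$work/data"
# Repetitive text compresses well, which makes size differences visible.
for i in $(seq 1 2000); do
  echo "line $i: the quick brown fox jumps over the lazy dog"
done > "$work/data/sample.txt"

for ext in gz bz2 xz; do
  case "$ext" in
    gz)  tool=gzip ;;
    bz2) tool=bzip2 ;;
    xz)  tool=xz ;;
  esac
  # Skip compressors that are not installed on this machine.
  command -v "$tool" >/dev/null 2>&1 || continue
  # -a (--auto-compress) chooses the compressor from the file suffix.
  tar -caf "$work/sample.tar.$ext" -C "$work" data
  printf '%-4s %s bytes\n' "$ext" "$(wc -c < "$work/sample.tar.$ext")"
done
```

Results depend heavily on the input, so it is worth running a comparison like this against a representative directory before settling on a format.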

Accelerating compression with pigz

Traditional compression tools like gzip utilize a single CPU core, which can be a bottleneck on modern multi-core systems. pigz (Parallel Implementation of GZip) addresses this by using multiple cores for compression.

Installing pigz

On Ubuntu/Debian

sudo apt-get update
sudo apt-get install pigz

On macOS (using Homebrew)

brew install pigz

On CentOS/RHEL

sudo yum update
sudo yum install pigz

Using tar with pigz

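To use pigz with tar, specify it as the compression program. The -I option is GNU tar syntax and is short for --use-compress-program; the bsdtar that ships with macOS gives -I a different meaning, so use the long form there: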

tar -I pigz -cf archive.tar.gz /path/to/directory

To trade speed for a smaller archive, pass a compression level to pigz; -9 is the highest standard gzip level:

tar -I 'pigz -9' -cf archive.tar.gz /path/to/directory
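
A script that has to run on machines with and without pigz can pick the compressor at run time. The following is a sketch under that assumption: the compressor variable and the sample paths are made up, and -p 4 is an arbitrary thread cap.

```shell
work=$(mktemp -d)
mkdir "$work/data"
echo 'sample payload' > "$work/data/file.txt"

# Prefer pigz when it is installed; otherwise fall back to gzip.
if command -v pigz >/dev/null 2>&1; then
  compressor='pigz -6 -p 4'   # -p caps the number of compression threads
else
  compressor='gzip -6'
fi

# --use-compress-program is the long form of -I.
tar --use-compress-program="$compressor" -cf "$work/archive.tar.gz" -C "$work" data

# Both tools write standard gzip output, so plain -z can read it back.
tar -tzf "$work/archive.tar.gz"
```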

Excluding files and directories

When creating archives, you might want to exclude certain files or directories that are unnecessary or too large.

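Excluding files by pattern

tar -czf archive.tar.gz --exclude='*.log' /path/to/directory

This command excludes all files ending with .log. Place --exclude before the paths it applies to: some GNU tar releases treat the option as position-sensitive and ignore or reject an exclusion given after the paths.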

Using an exclude file

Create a file exclude.txt containing patterns to exclude:

*.log
node_modules
.git

Then use the --exclude-from option:

tar -czf archive.tar.gz --exclude-from='exclude.txt' /path/to/directory
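
A quick way to confirm that an exclude file does what you expect is to archive a small test tree and list the result. This sketch assumes a made-up app directory with one log file and a node_modules folder. The patterns are unanchored by default, so node_modules matches at any depth.

```shell
work=$(mktemp -d)
mkdir -p "$work/app/node_modules/pkg" "$work/app/src"
echo 'code'  > "$work/app/src/index.js"
echo 'noise' > "$work/app/debug.log"
echo 'dep'   > "$work/app/node_modules/pkg/index.js"

printf '%s\n' '*.log' 'node_modules' > "$work/exclude.txt"

# Exclusion options come before the path they apply to.
tar -czf "$work/app.tar.gz" --exclude-from="$work/exclude.txt" -C "$work" app

# The listing should contain app/src/index.js but no .log or node_modules entries.
tar -tzf "$work/app.tar.gz"
```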

Incremental backups with tar

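tar can perform incremental backups by archiving only files that have changed since the last backup. With --listed-incremental, GNU tar records the state of each run in a snapshot file (backup.snar in the examples) and compares the directory against it on the next run. If the snapshot file does not exist yet, tar creates it and the backup is a full one.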

Creating a full backup

tar --listed-incremental=backup.snar -czf backup-full.tar.gz /path/to/directory

Performing an incremental backup

tar --listed-incremental=backup.snar -czf backup-incremental-$(date +%F).tar.gz /path/to/directory

Note: Ensure that the snapshot file (e.g., backup.snar) is not deleted between backups; removing it will cause subsequent backups to be full backups instead of incremental ones.
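
The full cycle, including the restore, can be rehearsed on a throwaway directory. This is a sketch that assumes GNU tar; the file names are made up. Archives are restored oldest first, and --listed-incremental=/dev/null tells tar to treat each archive as incremental without reading or updating a snapshot file.

```shell
work=$(mktemp -d)
mkdir "$work/data" "$work/restore"
echo 'v1' > "$work/data/a.txt"

# Level 0 (full) backup: the snapshot file does not exist yet.
tar --listed-incremental="$work/backup.snar" -czf "$work/full.tar.gz" -C "$work" data

# A change made after the full backup. The pause keeps the new file's
# timestamp clearly later than the snapshot on coarse-grained filesystems.
sleep 1
echo 'v2' > "$work/data/b.txt"

# Level 1 backup: same snapshot file, new archive name.
tar --listed-incremental="$work/backup.snar" -czf "$work/incr.tar.gz" -C "$work" data

# Restore in order, oldest archive first.
tar --listed-incremental=/dev/null -xzf "$work/full.tar.gz" -C "$work/restore"
tar --listed-incremental=/dev/null -xzf "$work/incr.tar.gz" -C "$work/restore"
ls "$work/restore/data"
```

Because tar updates the snapshot file on every run, copying it aside before an incremental run is one way to keep producing level 1 backups against the same full backup.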

Archiving over SSH: remote backups

You can create archives on remote systems or transfer archives over the network using SSH.

Archiving a remote directory locally

Create an archive of a remote directory with maximum compression and save it locally. tar and pigz both run on the remote host here, so pigz must be installed there:

ssh user@remote "tar -I 'pigz -9' -cf - /path/to/remote/directory" > archive.tar.gz

For progress indication, if you have pv (Pipe Viewer) installed, you can monitor the transfer:

ssh user@remote "tar -I 'pigz -9' -cf - /path/to/remote/directory" | pv > archive.tar.gz

Archiving a local directory to a remote host

Create an archive of a local directory and save it on a remote host:

tar -I 'pigz -9' -cf - /path/to/directory | ssh user@remote "cat > /path/to/save/archive.tar.gz"

Or, with progress monitoring using pv:

tar -I 'pigz -9' -cf - /path/to/directory | pv | ssh user@remote "cat > /path/to/save/archive.tar.gz"
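
All of these commands rely on the same mechanism: -f - makes tar write the archive to standard output (or read it from standard input), so no intermediate file is needed. The sketch below shows the mechanism with a local pipe standing in for the SSH connection; the directory names are made up. Here the receiving side unpacks the stream directly instead of saving it with cat.

```shell
work=$(mktemp -d)
mkdir -p "$work/src/docs" "$work/dest"
echo 'streamed' > "$work/src/docs/readme.txt"

# "-f -" writes the archive to stdout; the receiving tar reads stdin.
# Over SSH, the right-hand side would look like:
#   ssh user@remote "tar -xzf - -C /path/to/restore"   (hypothetical path)
tar -czf - -C "$work/src" . | tar -xzf - -C "$work/dest"

cat "$work/dest/docs/readme.txt"
```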

Combining tar with find for selective archiving

Using find, you can selectively include files in an archive based on criteria like modification time or size.

Example: archiving files modified in the last 7 days

find /path/to/directory -type f -mtime -7 -print0 | tar -czf archive.tar.gz --null -T -

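The -T - option tells tar to read the list of file names from standard input, and --null says the names are separated by null characters, which matches find's -print0 and keeps names containing spaces or newlines intact. --null must appear before -T because it only affects the -T options that follow it.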
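
The same pattern works for other find criteria. As an illustration, this sketch selects by size instead of modification time and includes a file name with a space to show that null-separated names survive the pipe. The file names and the 100 KiB threshold are arbitrary.

```shell
work=$(mktemp -d)
mkdir "$work/data"
head -c 204800 /dev/zero > "$work/data/large file.bin"   # 200 KiB, name with a space
echo 'tiny' > "$work/data/small.txt"

# Run find from inside the directory so tar stores relative names.
(
  cd "$work/data" &&
  find . -type f -size +100k -print0 |
    tar -czf "$work/large-only.tar.gz" --null -T -
)

# Only the large file should be listed.
tar -tzf "$work/large-only.tar.gz"
```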

Splitting large archives into smaller parts

When dealing with very large archives, you might need to split them into smaller chunks for storage or transfer.

Splitting an archive

tar -czf - /path/to/directory | split -b 500M - archive_part_

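This command splits the compressed archive stream into parts of at most 500 MiB each (split's M suffix means 1024 × 1024 bytes), named archive_part_aa, archive_part_ab, etc. The last part is usually smaller. The parts are slices of a single stream, so none of them can be extracted on its own.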

Reassembling the archive

cat archive_part_* > archive.tar.gz
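
The parts can also be piped straight into tar without writing the reassembled archive to disk first. The sketch below rehearses the whole round trip with a deliberately tiny part size and made-up file names, then compares the restored file with the original.

```shell
work=$(mktemp -d)
mkdir "$work/data" "$work/restore"
# Random bytes barely compress, so the archive is large enough to split.
head -c 300000 /dev/urandom > "$work/data/blob.bin"

# Split the compressed stream into 65536-byte parts (small only for the demo).
tar -czf - -C "$work" data | split -b 65536 - "$work/part_"
ls "$work"/part_*

# The shell expands part_* in sorted order, which matches split's suffixes.
cat "$work"/part_* | tar -xzf - -C "$work/restore"

cmp "$work/data/blob.bin" "$work/restore/data/blob.bin" && echo 'round trip OK'
```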

Automating tar tasks in your development workflow

Automating archiving tasks saves time and ensures consistency.

Bash script example

Create a script backup.sh:

#!/bin/bash
# Create a dated, pigz-compressed backup of SOURCE_DIR in BACKUP_DIR.
set -u  # treat unset variables as errors

TIMESTAMP=$(date +%F)
BACKUP_DIR="/path/to/backup"
SOURCE_DIR="/path/to/directory"
EXCLUDE_FILE="/path/to/exclude.txt"
ARCHIVE="$BACKUP_DIR/backup-$TIMESTAMP.tar.gz"

# Test tar's exit status directly instead of inspecting $? afterwards.
if ! tar -I 'pigz -9' -cf "$ARCHIVE" \
    --exclude-from="$EXCLUDE_FILE" "$SOURCE_DIR"; then
  echo "Backup failed: $ARCHIVE" >&2
  exit 1
fi

echo "Backup successful: $ARCHIVE"

Make the script executable:

chmod +x backup.sh

Scheduling with cron

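Schedule the script to run daily at midnight. cron runs jobs with a minimal PATH, so confirm that pigz is reachable from it or call pigz by its absolute path in the script. Edit your crontab: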

crontab -e

Then add the following line:

0 0 * * * /path/to/backup.sh

Best practices for efficient file archiving

  • Regular Backups: Schedule backups regularly to protect against data loss.
  • Exclude Unnecessary Files: Use --exclude options to avoid archiving files that are not needed.
  • Monitor Backup Processes: Check logs or set up alerts to ensure backups complete successfully.
  • Store Backups Securely: Save backups in secure, redundant locations.
  • Verify Archives: After creating an archive, read it back with a command like tar -tf archive.tar.gz and check that tar exits with status 0. A truncated or corrupt archive makes the command fail.
  • Use Parallel Compression: Leverage pigz on multi-core systems for faster compression.
  • Resource Considerations: pigz -9 costs noticeably more CPU time than the default level (-6) for a modest reduction in size, and pigz memory use grows with the number of threads. Lower the level or cap the thread count with -p if the system is constrained.
  • Document Procedures: Maintain clear documentation for backup and restore processes.
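
The verification step from the list above can be wrapped in a small function and called at the end of a backup script. This is a sketch only: verify_archive is a hypothetical helper name, and the truncated file simulates an interrupted transfer.

```shell
# verify_archive is a hypothetical helper name, not a tar feature.
verify_archive() {
  # gzip -t checks the compressed stream; tar -t then reads every member header.
  gzip -t "$1" 2>/dev/null && tar -tzf "$1" >/dev/null 2>&1
}

work=$(mktemp -d)
mkdir "$work/data"
echo 'payload' > "$work/data/file.txt"
tar -czf "$work/good.tar.gz" -C "$work" data

# Simulate an interrupted transfer by keeping only the first 40 bytes.
head -c 40 "$work/good.tar.gz" > "$work/truncated.tar.gz"

for f in "$work/good.tar.gz" "$work/truncated.tar.gz"; do
  if verify_archive "$f"; then
    echo "OK:      $f"
  else
    echo "CORRUPT: $f" >&2
  fi
done
```

A listing confirms that the archive is readable, not that its contents match the source. For that, GNU tar's --compare (-d) mode can check an archive against the file system.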

By implementing these advanced tar techniques, you can create efficient, automated backup solutions that protect your data while optimizing system resources.