Skip to main content

Command Palette

Search for a command to run...

Database Data Archiving Strategy

Published
5 min read

As databases grow, performance often degrades while storage costs rise. Data Archiving is the process of moving inactive ("cold") data from a high-performance ("hot") production environment to a cost-effective secondary storage system. This strategy ensures strict compliance with data retention policies while optimizing the performance and cost-efficiency of the active application.

1. Objectives

  • Performance Optimization: Reduce the size of active tables to speed up queries, indexing, and backups.

  • Cost Reduction: Move terabytes of historical data from expensive SSD-based "Hot" storage to cheaper HDD or Object-based "Cold" storage.

  • Compliance: Ensure data is retained for legal/regulatory periods (e.g., 7 years for financial records) without cluttering the live system.

  • Maintainability: Reduce maintenance windows (e.g., vacuuming, index rebuilding) for the production database.


2. Data Classification: What to Archive?

Before implementation, data must be classified. This is a business decision.

ClassificationDefinitionExampleAction
Hot DataFrequently accessed, mission-critical, read/write heavy.Orders from the last 12 months; Active user profiles.Keep in Production DB (SSD)
Warm DataOccasionally accessed, read-only mostly.Orders 1-2 years old; Closed projects.Keep in Partition or Separate DB
Cold DataRarely accessed, retained for compliance/history.Orders > 2 years old; Logs > 6 months.Move to Archive Storage (HDD/S3)

Criteria for Archiving

  • Time-Based: creation_date < NOW() - 2 YEARS

  • Status-Based: status = 'CLOSED' AND updated_at < NOW() - 1 YEAR

  • Event-Based: User account deletion (archive all related data after 30 days).


3. Archiving Architecture

This strategy employs a Tiered Storage Architecture.

Tier 1: Hot Database (Production)

  • Technology: PostgreSQL / MySQL / SQL Server on High-IOPS SSD (e.g., AWS RDS).

  • Goal: Maximum throughput and low latency.

  • Cost: High ($$$).

Tier 2: Cold Archive (The Destination)

  • Technology:

    • Option A (Analytics): Data Warehouse (Snowflake, BigQuery).

    • Option B (Cost-Saver): Object Storage (AWS S3 Glacier, Azure Blob Archive).

  • Goal: Massive scalability and lowest cost.

  • Cost: Low ($).

  • Access Pattern: High latency (seconds to hours for retrieval).


4. Archiving Strategies

Strategy A: Partitioning (In-Place Archiving)

  • Best For: Huge tables where recent data is hot and old data is rarely queried but must remain "online".

  • Mechanism: Use Table Partitioning (e.g., by Year).

  • Process:

    1. Detach the partition orders_2020.

    2. Move the partition to a slower tablespace (on cheaper disks) or keep it detached but available.

  • Pros: Transparent to application; extremely fast "deletes".

  • Cons: Does not reduce total database size if kept attached.

Strategy B: Separate Archive Database

  • Best For: When you need to run SQL queries on old data without impacting production.

  • Mechanism: Replicate structure to a second DB server on cheaper hardware.

  • Process (ETL):

    1. INSERT INTO archive_db.orders SELECT * FROM prod_db.orders WHERE date < '2022-01-01'

    2. DELETE FROM prod_db.orders WHERE date < '2022-01-01'

  • Pros: Offloads storage impact completely. SQL compatible.

  • Cons: Managing a second database server.

Strategy C: Cold Storage (The "Data Lake" Approach)

  • Best For: Long-term retention (5+ years), massive datasets, and lowest cost.

  • Mechanism: Extract data to Parquet/CSV files and upload to AWS S3/Azure Blob.

  • Process:

    1. Run a script to dump old records to a .parquet file.

    2. Upload file to S3 Glacier.

    3. Verify checksums.

    4. Delete from Production.

  • Pros: Cheapest option. Infinite scale.

  • Cons: Hard to query single records (requires tools like Amazon Athena).


5. Application Implementation: Accessing Archived Data

As discussed, accessing archived data requires an architectural pattern similar to Sharding, where the application must be "aware" of data location.

Pattern 1: The Fallback (Look-aside)

Use this when users rarely request archived data (e.g., "View Order #123").

  1. Check Hot: App queries Production DB.

  2. Found? Return data.

  3. Not Found? App switches connection string or API endpoint to query the Archive System (e.g., Athena or Archive DB).

  4. Return: Data is displayed with a "Archived/ReadOnly" flag.

Pattern 2: The Union (Reporting)

Use this for "Show all my history" features.

  • Logic: The application (or a middleware/reporting tool) runs two queries in parallel and merges the results in memory.

      -- Application Logic Representation
      results = query(HotDB, "SELECT * FROM orders WHERE user_id=1")
                + query(ColdDB, "SELECT * FROM orders WHERE user_id=1")
      return sort_by_date(results)
    

6. Step-by-Step Implementation Plan

  1. Define Policy: Agree on the retention period (e.g., "Move data older than 2 years").

  2. Analyze Dependencies: Map all Foreign Keys. You cannot archive a Parent record (Order) without archiving Children (OrderItems).

  3. Develop Scripts: Write the ETL scripts (Extract, Verify, Delete).

  4. Dry Run: Execute on a staging copy of production. Measure timing—large deletes can lock tables!

  5. Schedule: Set up a Cron job / Airflow DAG to run this monthly during low-traffic windows.

  6. Automate Deletion: Use BATCH deletes (e.g., delete 1000 rows at a time) to avoid locking the production DB.


7. Best Practices & Pitfalls

Do's

  • Batch Your Deletes: Never run DELETE FROM orders WHERE date < '2020-01-01' on a massive table. It will lock the table and fill the transaction log. Delete in chunks (e.g., LIMIT 5000).

  • Archive Child First: Always move/delete OrderItems before Orders to satisfy Foreign Keys.

  • Use Checksums: Verify row counts or data hashes in the archive before deleting from production.

  • Compress Data: Cold storage is cheap, but compressed Parquet/Avro is even cheaper and faster to query than CSV.

  • Add Metadata: Add a column archived_at to the archived records for audit trails.

Don'ts

  • Don't Archive "Live" Data: Ensure the data is truly static. Archiving a record that might still receive updates creates data consistency nightmares.

  • Don't Ignore Restore: Test your ability to pull data back from the archive into production. This is your "Undo" button.

  • Don't Forget Indexes: Your archive database (if using Strategy B) needs indexes too, or historical queries will be painfully slow.


8. Summary of Cost & Roles

ComponentResponsibilityProvider
Hot StorageServing live traffic, instant IO.AWS RDS, Google Cloud SQL, Azure SQL.
Cold StorageStoring history cheaply.AWS S3 Glacier, Azure Archive Blob.
Archival EngineThe script moving the data.AWS Glue, Custom Python Script, Cron Job.