How a Simple Migration Took Down Our System for 20 Minutes — And How to Never Let That Happen Again

“ALTER TABLE looked harmless. On staging, it ran in 0.4s. In production, it locked the system for 20 minutes.”

This post walks through how one tiny change brought down a Rails app backed by PostgreSQL — and how you can prevent it using strong_migrations, PostgreSQL timeout settings, and safe retry logic.

⚠️ The Incident: When ALTER TABLE Took Down Production

We run a high-traffic news system with over 1 billion rows in the posts table.

One day, we decided to enforce a new constraint by making the category column non-nullable:

ALTER TABLE posts
ALTER COLUMN category SET NOT NULL;

On staging (30K rows), it took 0.4 seconds. Easy.

In production?

💥 20 minutes of downtime.

Here’s what happened:

  • PostgreSQL locked the table with an ACCESS EXCLUSIVE lock.
  • It scanned every row to ensure no NULLs — and ignored the index.
  • All CRUD operations (SELECT, INSERT, UPDATE, DELETE) were blocked.
  • After 2 minutes, queries started timing out.
  • DevOps were paged. Panic. Rollback. Post-mortem.

We were lucky. Thanks to statement_timeout, queries didn’t hang forever. And we had retry logic in place to avoid data loss.


Why Postgres Locks on ALTER TABLE SET NOT NULL

🧠 Locking & Scanning Mechanics

  • ACCESS EXCLUSIVE lock: the strictest lock level—blocks all other operations.
  • SET NOT NULL: requires scanning every row to validate the constraint.
  • Why no index? PostgreSQL doesn’t use indexes for this check—it ensures data integrity via full table scan.
  • Staging immune: small datasets hide the risk.

📈 How to Detect This in Production

Use the built-in catalog:

SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event_type LIKE 'Lock%';

You’ll see locking conflicts when DDL blocks queries.

🛡️ The Fix: Use strong_migrations in Rails

strong_migrations is a Ruby gem that proactively warns or blocks unsafe migrations like:

  • add_column with default + null: false
  • add_index without algorithm: :concurrently
  • change_column_null on big tables
  • change_column_default that rewrites data

🔧 How It Works

  • Hooks into ActiveRecord migration methods using Ruby metaprogramming.
  • Analyzes your intended change.
  • If it’s unsafe, it warns or raises, and suggests a safe alternative.

🚫 Dangerous Migration

add_column :posts, :category, :string, default: "news", null: false

✅ Safe (with strong_migrations)

add_column :posts, :category, :string
change_column_default :posts, :category, "news"
execute <<-SQL
  UPDATE posts SET category = 'news'
  WHERE category IS NULL
SQL
change_column_null :posts, :category, false
The gem won't let you run the first version and will prompt the second.

✅ How to Safely Add a NOT NULL Column in PostgreSQL

Here’s the right way to do it without blocking your table:


-- 1. Add the column as nullable
ALTER TABLE posts ADD COLUMN category TEXT;

-- 2. Backfill existing rows in batches
UPDATE posts SET category = 'Uncategorized' WHERE category IS NULL;

-- 3. Add a NOT NULL constraint (but not validated yet)
ALTER TABLE posts ADD CONSTRAINT posts_category_not_null CHECK (category IS NOT NULL) NOT VALID;

-- 4. Validate the constraint without locking writes
ALTER TABLE posts VALIDATE CONSTRAINT posts_category_not_null;

-- 5. (Optional) Make the column officially NOT NULL
ALTER TABLE posts ALTER COLUMN category SET NOT NULL;

You can skip step 5 — the constraint already enforces the rule.

strong_migrations can help you generate these steps.

⏱️ Use Timeouts to Avoid Production Blackouts

Even with safe migrations, it’s best to configure defensive timeouts.

🔧 Recommended Settings

  • lock_timeout: cancels queries waiting too long for a lock
  • statement_timeout: kills long-running queries
  • idle_in_transaction_session_timeout: cleans up stuck transactions

In rails:

# database.yml
production:
  variables:
    statement_timeout: 2000          # 2s query timeout
    lock_timeout: 1000               # 1s for lock wait
    idle_in_transaction_session_timeout: 10000 # 10s for stuck txns

You can set by each session:

SET LOCAL lock_timeout = '1s';
SET LOCAL statement_timeout = '2s';

OR set in postgres config:

-- In postgresql.conf or via ALTER SYSTEM + reload
lock_timeout = '5s'
statement_timeout = '1min'
idle_in_transaction_session_timeout = '60s'

🔁 Add Retry Logic for Transient Failures

When a query fails due to timeout or lock, retrying might help.

Here’s a simple Ruby helper for your services or jobs:

def with_retry_on_timeout(max_retries: 3)
  retries = 0
  begin
    yield
  rescue ActiveRecord::QueryCanceled => e
    if retries < max_retries
      retries += 1
      sleep(2 ** retries) # exponential backoff
      retry
    else
      raise e
    end
  end
end

# Example usage
with_retry_on_timeout do
  Post.where(category: nil).update_all(category: 'Uncategorized')
end

🧾 Summary

✅ Best Practices🚫 Don’t Do This
Use strong_migrations for safe migrationsRun ALTER TABLE SET NOT NULL on large tables without care
Add nullable columns → backfill → add constraintAdd default + NOT NULL in a single step
Use lock_timeout, statement_timeout, idle_timeoutAssume staging test catches production behavior
Implement retry logic for canceled queriesLet queries hang and crash your API
Backfill in batches (in_batches, chunked updates)Update billions of rows in one huge query

🔚 Final Thoughts

Migrations can be more dangerous than code changes — especially in high-traffic production systems.

Use this checklist:

  • ✅ Do you understand the lock impact?
  • ✅ Have you tested with realistic data volumes?
  • ✅ Is strong_migrations helping you?
  • ✅ Are your DB timeouts set properly?
  • ✅ Is retry logic in place?

📢 If you found this helpful…

Share it with your team. Help your future self avoid a 20-minute outage from a one-line migration.

And remember:

Being a senior doesn’t mean you don’t make mistakes — it means you build systems that forgive them.

Post Comment