How a Simple Migration Took Down Our System for 20 Minutes — And How to Never Let That Happen Again
“ALTER TABLE looked harmless. On staging, it ran in 0.4s. In production, it locked the system for 20 minutes.”
This post walks through how one tiny change brought down a Rails app backed by PostgreSQL — and how you can prevent it using strong_migrations
, PostgreSQL timeout settings, and safe retry logic.
⚠️ The Incident: When ALTER TABLE Took Down Production
We run a high-traffic news system with over 1 billion rows in the posts
table.
One day, we decided to enforce a new constraint by making the category
column non-nullable:
ALTER TABLE posts
ALTER COLUMN category SET NOT NULL;
On staging (30K rows), it took 0.4 seconds. Easy.
In production?
💥 20 minutes of downtime.
Here’s what happened:
- PostgreSQL locked the table with an
ACCESS EXCLUSIVE
lock. - It scanned every row to ensure no NULLs — and ignored the index.
- All CRUD operations (
SELECT
,INSERT
,UPDATE
,DELETE
) were blocked. - After 2 minutes, queries started timing out.
- DevOps were paged. Panic. Rollback. Post-mortem.
We were lucky. Thanks to statement_timeout
, queries didn’t hang forever. And we had retry logic in place to avoid data loss.
Why Postgres Locks on ALTER TABLE SET NOT NULL

🧠 Locking & Scanning Mechanics
ACCESS EXCLUSIVE lock
: the strictest lock level—blocks all other operations.SET NOT NULL
: requires scanning every row to validate the constraint.- Why no index? PostgreSQL doesn’t use indexes for this check—it ensures data integrity via full table scan.
- Staging immune: small datasets hide the risk.
📈 How to Detect This in Production
Use the built-in catalog:
SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event_type LIKE 'Lock%';
You’ll see locking conflicts when DDL blocks queries.
🛡️ The Fix: Use strong_migrations
in Rails

strong_migrations
is a Ruby gem that proactively warns or blocks unsafe migrations like:
add_column
with default +null: false
add_index
withoutalgorithm: :concurrently
change_column_null
on big tableschange_column_default
that rewrites data
🔧 How It Works
- Hooks into ActiveRecord migration methods using Ruby metaprogramming.
- Analyzes your intended change.
- If it’s unsafe, it warns or raises, and suggests a safe alternative.
🚫 Dangerous Migration
add_column :posts, :category, :string, default: "news", null: false
✅ Safe (with strong_migrations
)
add_column :posts, :category, :string
change_column_default :posts, :category, "news"
execute <<-SQL
UPDATE posts SET category = 'news'
WHERE category IS NULL
SQL
change_column_null :posts, :category, false
The gem won't let you run the first version and will prompt the second.
✅ How to Safely Add a NOT NULL
Column in PostgreSQL
Here’s the right way to do it without blocking your table:
-- 1. Add the column as nullable
ALTER TABLE posts ADD COLUMN category TEXT;
-- 2. Backfill existing rows in batches
UPDATE posts SET category = 'Uncategorized' WHERE category IS NULL;
-- 3. Add a NOT NULL constraint (but not validated yet)
ALTER TABLE posts ADD CONSTRAINT posts_category_not_null CHECK (category IS NOT NULL) NOT VALID;
-- 4. Validate the constraint without locking writes
ALTER TABLE posts VALIDATE CONSTRAINT posts_category_not_null;
-- 5. (Optional) Make the column officially NOT NULL
ALTER TABLE posts ALTER COLUMN category SET NOT NULL;
You can skip step 5 — the constraint already enforces the rule.
✅
strong_migrations
can help you generate these steps.
⏱️ Use Timeouts to Avoid Production Blackouts
Even with safe migrations, it’s best to configure defensive timeouts.
🔧 Recommended Settings
lock_timeout
: cancels queries waiting too long for a lockstatement_timeout
: kills long-running queriesidle_in_transaction_session_timeout
: cleans up stuck transactions
In rails:
# database.yml
production:
variables:
statement_timeout: 2000 # 2s query timeout
lock_timeout: 1000 # 1s for lock wait
idle_in_transaction_session_timeout: 10000 # 10s for stuck txns
You can set by each session:
SET LOCAL lock_timeout = '1s';
SET LOCAL statement_timeout = '2s';
OR set in postgres config:
-- In postgresql.conf or via ALTER SYSTEM + reload
lock_timeout = '5s'
statement_timeout = '1min'
idle_in_transaction_session_timeout = '60s'
🔁 Add Retry Logic for Transient Failures
When a query fails due to timeout or lock, retrying might help.
Here’s a simple Ruby helper for your services or jobs:
def with_retry_on_timeout(max_retries: 3)
retries = 0
begin
yield
rescue ActiveRecord::QueryCanceled => e
if retries < max_retries
retries += 1
sleep(2 ** retries) # exponential backoff
retry
else
raise e
end
end
end
# Example usage
with_retry_on_timeout do
Post.where(category: nil).update_all(category: 'Uncategorized')
end
🧾 Summary
✅ Best Practices | 🚫 Don’t Do This |
---|---|
Use strong_migrations for safe migrations | Run ALTER TABLE SET NOT NULL on large tables without care |
Add nullable columns → backfill → add constraint | Add default + NOT NULL in a single step |
Use lock_timeout , statement_timeout , idle_timeout | Assume staging test catches production behavior |
Implement retry logic for canceled queries | Let queries hang and crash your API |
Backfill in batches (in_batches , chunked updates) | Update billions of rows in one huge query |
🔚 Final Thoughts
Migrations can be more dangerous than code changes — especially in high-traffic production systems.
Use this checklist:
- ✅ Do you understand the lock impact?
- ✅ Have you tested with realistic data volumes?
- ✅ Is
strong_migrations
helping you? - ✅ Are your DB timeouts set properly?
- ✅ Is retry logic in place?
📢 If you found this helpful…
Share it with your team. Help your future self avoid a 20-minute outage from a one-line migration.
And remember:
Being a senior doesn’t mean you don’t make mistakes — it means you build systems that forgive them.
Post Comment
You must be logged in to post a comment.