ActiveRecord Query Patterns That Actually Scale
by Eric Hanson, Backend Developer at Clean Systems Consulting
The query that works fine in development
Development databases have hundreds of rows. Production databases have millions. A query that returns in 2ms against a seeded development database can take 8 seconds against production data — not because the query changed, but because ActiveRecord's convenient abstractions hide the execution plan until it matters.
The patterns below are about building queries that remain correct and performant as data grows. Some are optimizations. Several are correctness issues that happen to surface as performance problems.
Select only what you need
Model.all fetches every column. For tables with binary columns, text blobs, or JSON fields with significant payload, this means pulling megabytes of data you discard immediately:
# Fetches every column including avatar_data (binary), bio (text), settings (jsonb)
users = User.where(active: true)
# Fetches only what the operation needs
users = User.where(active: true).select(:id, :email, :name)
The cost isn't just network transfer — a row store reads whole pages, and wide rows mean fewer rows per page. On a table with 2M rows and a 20KB average row size, a full-table scan moves 40GB. With select, only the requested columns cross the wire, and if an index covers them the table heap may not be touched at all.
select returns AR objects with only the specified attributes populated. Accessing an unselected attribute raises ActiveModel::MissingAttributeError immediately. This is the right behavior — it tells you when downstream code has a wider dependency than expected.
For bulk reads where you don't need AR objects at all, pluck is faster:
# Returns an array of email strings — no AR object allocation
User.where(active: true).pluck(:email)
# Returns an array of [id, email] arrays
User.where(active: true).pluck(:id, :email)
pluck executes immediately and returns primitives. You can't chain further AR methods after it. Use select when the relation needs to stay lazy; use pluck when you're at the end of the chain and want raw values.
exists? vs count vs any? — pick the right check
Three ways to check if records exist, with meaningfully different SQL:
User.where(role: :admin).count # SELECT COUNT(*) — visits every matching row
User.where(role: :admin).any? # SELECT 1 ... LIMIT 1 — stops at first match
User.where(role: :admin).exists? # SELECT 1 ... LIMIT 1 — same execution
count forces the database to visit every matching row to produce a number you immediately discard. For large tables with many matching records, that's far more work than necessary. Use exists? or any? whenever the question is "does at least one record match?", not "how many records match?"
any? and exists? produce identical SQL when called on an unloaded relation; on an already-loaded relation, any? checks the in-memory array without another query. exists? also accepts conditions directly: User.exists?(role: :admin) — useful for one-off checks without building a relation first.
The antipattern to avoid:
# Loads all matching records into memory to check length
if User.where(role: :admin).all.length > 0
This is a full table scan into memory for a boolean check. Even count is better than this, and exists? is better still.
Avoid N+1 at the query design level
includes solves N+1 for known associations. But the N+1 that matters most in mature codebases is the one introduced by methods that look like model accessors:
orders.each do |order|
  puts order.user.subscription_tier       # N+1 if :user not eager-loaded
  puts order.most_recent_payment.amount   # N+1 if this is a method firing a query
end
The most_recent_payment case is the harder one. If it's defined on Order as:
def most_recent_payment
  payments.order(created_at: :desc).first
end
Then includes(:payments) doesn't save you: chaining .order onto the preloaded association builds a new scope, so .first fires a fresh query per order anyway. The correct pattern for "the most recent of an association" is a has_one with a scope, or a subquery:
# has_one approach — eager-loadable with includes
has_one :most_recent_payment, -> { order(created_at: :desc) }, class_name: "Payment"
# Then eager-loadable
Order.includes(:most_recent_payment)
For complex cases — or for large associations, since preloading a has_one with an order scope still fetches every payment and keeps one per order in Ruby — a lateral join or subquery brings the work into a single database round-trip:
Order.joins(
  "JOIN LATERAL (
     SELECT * FROM payments
     WHERE payments.order_id = orders.id
     ORDER BY created_at DESC
     LIMIT 1
   ) latest_payment ON true"
).select("orders.*, latest_payment.amount AS latest_payment_amount")
This requires raw SQL but executes in one query regardless of result set size. For reporting queries over large datasets, this matters more than the code elegance.
find_each and find_in_batches for full-table operations
Loading a large AR collection with .each loads all matching records into memory simultaneously:
# Loads 500k records at once — triggers major GC mid-loop
Order.where(status: :pending).each { |o| process(o) }
find_each batches the query and processes in configurable batch sizes:
Order.where(status: :pending).find_each(batch_size: 500) { |o| process(o) }
Two important constraints: find_each always orders by primary key (it ignores any order you've applied — with a warning, or an error if config.active_record.error_on_ignored_order is set), and an outer limit was ignored before Rails 6.1, though current versions respect it. If you need a specific order within batches, find_in_batches gives you explicit batch arrays:
Order.where(status: :pending).find_in_batches(batch_size: 500) do |batch|
  batch.sort_by(&:created_at).each { |o| process(o) }
end
Sorting within each batch is O(batch_size log batch_size) rather than O(n log n) over the full set — acceptable for batch processing, not for UI queries that need total ordering.
Use database functions instead of Ruby
A common pattern that moves work from the database to Ruby unnecessarily:
# Ruby does the grouping — database returns all records
orders.group_by { |o| o.created_at.to_date }
# Database does the grouping — Ruby receives counts
Order.group("DATE(created_at)").count
The database has indexes, can parallelize aggregation, and returns a small hash instead of potentially millions of objects. Any aggregation — sum, count, average, group — belongs in the database. Ruby does it after the fact as a fallback, not as a first choice.
Date truncation, string manipulation, conditional aggregation — PostgreSQL handles all of these. Don't filter large result sets in Ruby when a WHERE clause does it in the database:
# Wrong — fetches everything, filters in Ruby
User.all.select { |u| u.created_at > 30.days.ago }
# Correct — database filters before returning
User.where("created_at > ?", 30.days.ago)
Indexes for the queries you actually run
The queries above are pointless on large tables without supporting indexes. Three index patterns worth understanding explicitly:
Partial indexes — index a subset of rows matching a condition:
CREATE INDEX idx_active_users ON users (email) WHERE deleted_at IS NULL;
A partial index on active rows in a soft-delete table is a fraction of the size of a full index, and the planner will use it for any query whose WHERE clause matches the index condition. Add one when a large percentage of rows are logically inactive.
Composite indexes — index multiple columns together:
CREATE INDEX idx_orders_user_status ON orders (user_id, status);
A composite index on (user_id, status) satisfies WHERE user_id = ? AND status = ? in a single index scan. Column order matters: the index is useful for queries that filter on user_id alone, but not for queries that filter on status alone. Lead with the column used in equality conditions most frequently.
Covering indexes — include additional columns in the index so the query is satisfied from the index alone without touching the table:
CREATE INDEX idx_users_email_covering ON users (email) INCLUDE (id, name);
INCLUDE (PostgreSQL 11+) adds columns to the index leaf pages without affecting sort order. A query SELECT id, name FROM users WHERE email = ? can be answered entirely from the index — no table heap access. Use covering indexes for hot read paths where the query columns are known and stable.
In Rails migrations:
add_index :users, :email, where: "deleted_at IS NULL", name: "idx_active_users_email"
add_index :orders, [:user_id, :status]
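The covering index has no inline add_index form in older Rails. A sketch of both routes, assuming PostgreSQL — the include: option is Rails 7.1+ and should be verified against your Rails version:

```ruby
# Rails 7.1+ (PostgreSQL): include: adds non-key columns to the index leaf pages
add_index :users, :email, include: [:id, :name], name: "idx_users_email_covering"

# Earlier Rails: fall back to raw SQL inside a migration
def up
  execute "CREATE INDEX idx_users_email_covering ON users (email) INCLUDE (id, name)"
end
```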
Preventing count queries on large tables
COUNT(*) without a selective WHERE clause must visit every live row (or every entry of some index) in PostgreSQL — unlike MySQL's MyISAM, which stored row counts separately. For dashboard displays showing total record counts, this query can take seconds:
# Potentially seconds on a large table
User.count
Three alternatives depending on the use case:
counter_cache — Rails maintains a count column on the parent model, incremented/decremented by callbacks. Correct for association counts. Incorrect if records are inserted or deleted outside of Rails.
pg_stat_user_tables — PostgreSQL maintains approximate row counts in system tables. Fast, slightly stale:
result = ActiveRecord::Base.connection.execute(
  "SELECT reltuples::bigint FROM pg_class WHERE relname = 'users'"
)
result.first["reltuples"]
Accurate within a few percent immediately after an ANALYZE. Useful for "approximately X users" dashboard displays.
Materialized views or periodic background jobs — compute and store the count on a schedule. Precise, always slightly behind, correct for any table regardless of insert source.
The query review checklist
Before any query goes to production on a table expected to exceed 100k rows:
Run EXPLAIN ANALYZE against production data (or a production-scale clone). Confirm the query planner uses an index for the primary filter. Check for Seq Scan on large tables — that's the signal an index is missing or not being used.
Check the result set size. A query that returns 50k rows to Ruby for filtering in memory is a query that should be reworked with a WHERE clause.
Check join cardinality. A join between two large tables on an unindexed column produces a nested loop or hash join over millions of rows. Index both sides of every join.
These aren't optimizations you apply later. They're queries you write correctly the first time, because adding an index to a table with 10M rows in production requires careful scheduling and CREATE INDEX CONCURRENTLY to avoid locking the table against writes.