Ruby Performance Tips I Learned the Hard Way on a Production System
by Eric Hanson, Backend Developer at Clean Systems Consulting
The only performance rule that matters first
Profile before optimizing. Every tip in this article came from a flame graph or an allocation trace, not from intuition. Ruby has enough counterintuitive performance characteristics that guessing where the bottleneck is wastes more time than the optimization saves.
The two tools worth knowing: stackprof for CPU profiling (sampling-based, low overhead, safe to run in production with a low sampling interval) and memory_profiler for allocation analysis. For Rails specifically, rack-mini-profiler surfaces SQL queries and allocation counts per request in development with zero configuration beyond adding it to the Gemfile.
Everything below came from one of those three tools showing something unexpected.
N+1 is still the most expensive mistake
Every team knows about N+1 queries. Most teams still have them in production because they're easy to introduce invisibly — especially through serializers, presenters, and service objects that call model methods without knowing what's underneath:
# The classic
orders.each do |order|
  puts order.user.email # fires a query per order
end
# The fix
orders = Order.includes(:user)
orders.each { |order| puts order.user.email }
The variant that bypasses includes even when you've written it:
# includes won't help here — you're calling a method that triggers a new query
orders.includes(:user).each do |order|
  puts order.user.recent_activity.last # fresh query each time
end
includes preloads the association you named, not arbitrary methods on that association. If recent_activity scopes or loads additional data, it fires per record regardless.
The tool: bullet in development. It detects N+1 queries and missing counter_cache columns and logs them. It produces false positives on legitimate lazy loading — review its output, don't treat it as infallible. In production, look for query count per request in your APM. Any request executing more than 20 queries is worth investigating.
Object allocation is where memory pressure actually lives
On a system processing 500 requests per second, allocating 1,000 temporary objects per request means the GC is cleaning up 500,000 short-lived objects per second. That GC pressure manifests as latency spikes, not as a gradual slowdown — requests that normally take 30ms take 150ms when a major GC cycle runs.
The pattern that caused the most allocation pressure I've seen in production: map chains that produce intermediate arrays:
# Allocates three arrays
records
  .map { |r| r.to_h }
  .select { |h| h[:active] }
  .map { |h| h[:email] }
# Allocates one
records
  .filter_map { |r| r.email if r.active? }
filter_map (Ruby 2.7+) combines select and map in a single pass with a single output array. For a collection of 10,000 records processed on every request, eliminating two intermediate arrays meaningfully reduces GC pressure.
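To sanity-check the equivalence, here's a self-contained sketch; the Record struct and sample data are stand-ins for whatever records actually holds:

```ruby
# Hypothetical stand-in for the records collection above.
Record = Struct.new(:email, :active) do
  def active? = active
end

records = [
  Record.new("a@example.com", true),
  Record.new("b@example.com", false),
  Record.new("c@example.com", true)
]

# Three intermediate arrays: map, select, and map each allocate one.
chained = records.map { |r| r.to_h }
                 .select { |h| h[:active] }
                 .map { |h| h[:email] }

# Single pass, single output array.
single = records.filter_map { |r| r.email if r.active? }

chained == single # => true; both are ["a@example.com", "c@example.com"]
```

Same result, two fewer arrays per call, which is what matters once this runs on every request.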
The tool: memory_profiler run against a specific code path shows total allocations, retained objects, and which lines are responsible. Look for methods allocating thousands of strings or arrays on every call. The fix is usually filter_map, lazy enumerators, or preallocating and reusing a result structure.
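Lazy enumerators are the other fix mentioned above. A minimal sketch of a chain that allocates no intermediate arrays and stops as soon as enough results exist:

```ruby
# Without .lazy, mapping over an infinite range would never terminate and a
# finite one would allocate a full intermediate array per stage. With .lazy,
# elements flow through the whole chain one at a time until first(3) is satisfied.
first_even_squares = (1..Float::INFINITY).lazy
  .map { |n| n * n }
  .select(&:even?)
  .first(3)

first_even_squares # => [4, 16, 36]
```

The tradeoff is per-element overhead from the lazy wrapper, so this pays off when collections are large or when early termination saves most of the work.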
String allocation in hot paths
String literals in Ruby allocate a new heap object every time they're evaluated unless you've enabled frozen_string_literal: true, which dedupes each literal into a single shared frozen object. In a hot path — a method called thousands of times per request — repeated string allocation adds up:
# Every call allocates a new string
def content_type_header
  "application/json; charset=utf-8"
end
# frozen_string_literal: true at the top of the file, or:
CONTENT_TYPE = "application/json; charset=utf-8".freeze
def content_type_header
  CONTENT_TYPE
end
Freezing the constant means all callers share the same object. The allocation happens once at load time.
The cases where this matters most: logging formats, header values, fixed SQL fragments, and any method that returns a string constant that gets compared or matched against in a loop. For strings assembled dynamically (interpolation, concatenation), freeze doesn't help — the allocation is inherent to the operation.
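A quick way to verify the sharing is Object#equal?, which compares object identity rather than contents. A small sketch; fresh_header is a made-up counterexample method, and it assumes the file does not have the frozen_string_literal magic comment:

```ruby
CONTENT_TYPE = "application/json; charset=utf-8".freeze

def content_type_header
  CONTENT_TYPE # every caller gets the same frozen object
end

def fresh_header
  "application/json; charset=utf-8" # new object on every call
end

content_type_header.equal?(content_type_header) # => true, one shared object
fresh_header.equal?(fresh_header)               # => false, two allocations
```

The contents are identical either way (== returns true for both pairs); only the allocation behavior differs.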
Hash access patterns and symbol vs string keys
In a tight loop doing hash lookups, the key type matters more than most benchmarks suggest. Symbol keys hash from a value cached in the symbol itself and compare by identity; string keys must hash their bytes on every lookup and then compare contents when the hash matches. For short keys in high-frequency access:
# On a dataset of 100k lookups, 8-character keys, MRI 3.3:
# Symbol keys: ~18ms
# String keys: ~24ms
This gap is relevant in parsers, routing tables, and configuration lookups that run on every request. It's not relevant for hashes that are built and queried once per request regardless of collection size.
The sharper issue: Hash#dig versus chained []. Both perform the same sequence of lookups, so the speed difference is negligible; the behavioral difference matters more. dig returns nil when an intermediate key is missing, while chained [] raises NoMethodError the moment a lookup returns nil. For deeply nested structures accessed in a loop, dig is both safer and cleaner.
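One behavioral difference worth knowing alongside the speed question: dig degrades gracefully on missing intermediate keys, while chained [] does not. A small sketch with a made-up config hash:

```ruby
config = { db: { primary: { host: "db1.internal", port: 5432 } } }

config.dig(:db, :primary, :host) # => "db1.internal"
config.dig(:db, :replica, :host) # => nil, missing key short-circuits safely

# Chained [] raises as soon as an intermediate lookup returns nil,
# because the next [] is then called on nil itself.
error = begin
  config[:db][:replica][:host]
rescue NoMethodError => e
  e.class
end
error # => NoMethodError
```

In config lookups where absent keys are expected, that nil-safety is usually the deciding factor, not the benchmark.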
ActiveRecord: select only what you need
Model.all loads every column from every row into memory. On a users table with a profile_photo binary column, loading 10,000 users to get their emails means pulling megabytes of binary data you immediately discard:
# Loads all columns including binary data
User.where(active: true).each { |u| send_digest(u.email) }
# Loads only what's needed
User.where(active: true).select(:id, :email).each { |u| send_digest(u.email) }
select returns full AR objects with only the specified attributes populated. Accessing an unselected attribute raises ActiveModel::MissingAttributeError, which is the right behavior — it tells you immediately if something downstream needs more columns.
For read-only bulk processing, pluck is faster still — it skips AR object instantiation entirely and returns raw values:
# Returns an array of strings, no AR objects allocated
emails = User.where(active: true).pluck(:email)
On a table with 100k rows where you need one column, pluck runs 3–4x faster than select because it eliminates object allocation entirely. The tradeoff: pluck returns arrays, not relations — it executes immediately and can't be further chained.
Batch processing with find_each
User.all.each loads the entire result set into memory at once. On a table with 500k rows, that's 500k AR objects sitting in memory simultaneously, triggering a major GC before you've processed half of them:
# Loads everything at once
User.active.each { |u| process(u) }
# Loads 1,000 rows at a time, processes, discards, loads next batch
User.active.find_each(batch_size: 1000) { |u| process(u) }
find_each paginates by primary key: each batch runs a query of the form WHERE id > last_seen_id ORDER BY id LIMIT 1000, which stays fast no matter how deep into the table you are (unlike OFFSET-based pagination, which slows down as the offset grows). Memory usage is bounded to one batch at a time. The tradeoff: find_each orders by primary key and ignores any order clause you've applied, since it needs a stable sort for correct pagination. If you need a specific order, find_in_batches yields arrays you can sort within the batch.
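The batching strategy find_each relies on (remember the last primary key seen, then ask for ids greater than it, rather than using an offset) can be sketched in plain Ruby. The each_batch helper and record hashes below are illustrative, not Rails API:

```ruby
# In-memory stand-in for an id-ordered table of 25 rows.
records = (1..25).map { |id| { id: id, email: "user#{id}@example.com" } }

def each_batch(records, batch_size:)
  last_id = 0
  loop do
    # Stand-in for: WHERE id > last_id ORDER BY id LIMIT batch_size
    batch = records.select { |r| r[:id] > last_id }.first(batch_size)
    break if batch.empty?
    yield batch
    last_id = batch.last[:id] # keyset cursor advances past this batch
  end
end

sizes = []
each_batch(records, batch_size: 10) { |b| sizes << b.size }
sizes # => [10, 10, 5]
```

Only one batch is alive at a time, and each "query" is anchored to an indexed key instead of scanning past skipped rows, which is why this stays fast on large tables.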
For very large datasets where even 1,000 in-memory AR objects is expensive, in_batches yields relations rather than loaded records, so you can pluck inside each batch and skip AR instantiation entirely:
User.active.in_batches(of: 5000) do |relation|
  emails = relation.pluck(:email) # one query per batch, no AR objects allocated
  bulk_send(emails)
end
The GC knobs worth knowing
MRI Ruby's GC is tunable through environment variables. Three settings that help high-throughput services:
RUBY_GC_HEAP_GROWTH_FACTOR — controls how aggressively Ruby grows its object heap when it runs out of free slots. Default is 1.8 (80% growth per expansion). Lower values (1.1–1.3) reduce peak memory at the cost of more frequent minor GCs. Useful when your pods have tight memory limits.
RUBY_GC_MALLOC_LIMIT and RUBY_GC_MALLOC_LIMIT_MAX — thresholds that trigger GC based on C-level malloc calls. The defaults are conservative. For services that allocate heavily in C extensions (JSON parsing, protobuf, nokogiri), raising these limits reduces GC frequency at the cost of higher peak memory.
These are not set-and-forget values. Tune them against your actual allocation profile with GC.stat before and after. The right values depend on your workload, pod size, and traffic pattern — values that help a batch job will hurt a low-latency API and vice versa.
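A minimal sketch of the before/after measurement with GC.stat; the 200k-string loop is an arbitrary allocation-heavy stand-in for your real workload:

```ruby
# Snapshot the GC counters, run allocation-heavy work, and diff.
before = GC.stat.slice(:count, :minor_gc_count, :major_gc_count)

# Stand-in workload: 200k short-lived interpolated strings.
200_000.times { |i| "temporary-#{i}" }

after = GC.stat.slice(:count, :minor_gc_count, :major_gc_count)
gcs_triggered = after[:count] - before[:count]
```

Run the same measurement before and after changing a tuning variable; if gcs_triggered doesn't drop (or peak memory rises more than your pod limit allows), the new value isn't helping.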
The profiling workflow that finds the real problem
For any performance complaint, the sequence that avoids wasted effort:
- Reproduce with production-scale data in a staging environment
- Run stackprof for 60 seconds under load and generate the flame graph
- Find the widest frame — that's where time is actually spent
- If the frame is in database code, check the query log for count and duration
- If the frame is in Ruby code, run memory_profiler on that specific path to check whether allocation is the driver
- Make one change, re-profile, and verify the improvement is real
The failure mode I've seen most often: optimizing the second-widest frame because the widest one looked hard, then measuring no meaningful improvement in production because the bottleneck didn't move. Profile first, optimize the actual bottleneck, measure after.