Event-Driven Architecture: The Service Communication Style Worth Understanding
by Eric Hanson, Backend Developer at Clean Systems Consulting
What event-driven architecture actually solves
You have a checkout flow where confirming an order triggers inventory reservation, payment processing, email confirmation, and analytics logging. With synchronous REST calls, you chain these sequentially: checkout latency becomes the sum of every call, and a failure in something as incidental as email confirmation can fail the entire checkout. This is the problem event-driven architecture was designed to solve.
Event-driven architecture (EDA) means services communicate by publishing and subscribing to events rather than by calling each other directly. When an order is confirmed, the Order Service publishes an OrderConfirmed event to a durable message broker — typically Apache Kafka, RabbitMQ with quorum queues, or AWS EventBridge. Every downstream service that needs to react subscribes independently. The Order Service never calls them. They process at their own pace, independently, and if they're down when the event is published, they catch up when they recover.
The core benefit is temporal decoupling: publisher and subscriber don't need to be available simultaneously. This is the property that eliminates the cascading failure problem inherent in synchronous service chains.
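Temporal decoupling falls out of the broker being a durable, replayable log. A minimal sketch of the property, with an in-memory stand-in for the broker (the Broker class and all names here are illustrative, not a real client API):

```java
import java.util.ArrayList;
import java.util.List;

// The "broker" retains every event, so a subscriber that was offline
// at publish time still receives everything once it attaches.
public class TemporalDecouplingDemo {
    static class Broker {
        private final List<String> log = new ArrayList<>();

        void publish(String event) {
            log.add(event); // durable append; nothing is pushed to subscribers
        }

        List<String> readFrom(int offset) {
            return log.subList(offset, log.size()); // replay from any offset
        }
    }

    public static void main(String[] args) {
        Broker broker = new Broker();
        broker.publish("OrderConfirmed:42"); // subscriber is "down" here
        broker.publish("OrderConfirmed:43");
        // Subscriber comes online later and catches up from offset 0.
        List<String> caughtUp = broker.readFrom(0);
        System.out.println(caughtUp.size()); // 2
    }
}
```

The publisher never waits on the subscriber, and the subscriber's availability at publish time is irrelevant, which is exactly the property a synchronous call chain lacks.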
The Kafka model in practice
Kafka is the dominant choice for internal service events because of its durability guarantees and replay semantics. Topics are partitioned logs retained for a configurable period (often 7–30 days). Consumer groups track their offset, and if a consumer falls behind or restarts, it reads from where it left off.
// Producer: Order Service publishes after a successful DB write
@Service
public class OrderService {

    private final OrderRepository orderRepository;
    private final KafkaTemplate<String, OrderConfirmedEvent> kafkaTemplate;

    public OrderService(OrderRepository orderRepository,
                        KafkaTemplate<String, OrderConfirmedEvent> kafkaTemplate) {
        this.orderRepository = orderRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    public Order confirmOrder(OrderRequest request) {
        Order order = orderRepository.save(Order.from(request));
        kafkaTemplate.send("orders.confirmed",
                order.getId().toString(), // partition key: routes same order to same partition
                OrderConfirmedEvent.builder()
                        .orderId(order.getId())
                        .userId(order.getUserId())
                        .items(order.getItems())
                        .total(order.getTotal())
                        .confirmedAt(Instant.now())
                        .build());
        return order;
    }
}
// Consumer: Inventory Service reacts independently
@KafkaListener(topics = "orders.confirmed", groupId = "inventory-service")
public void handleOrderConfirmed(OrderConfirmedEvent event) {
    inventoryReservationService.reserve(event.getOrderId(), event.getItems());
}
The Inventory Service processes events in its own consumer group. If you add a new downstream service (say, a fraud detection service), you create a new consumer group subscribed to the same topic. The Order Service is unchanged.
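Adding that fraud-detection service is nothing more than a new listener in its own consumer group. A sketch, assuming the same OrderConfirmedEvent type and an illustrative fraudCheckService that is not part of the code above:

```java
// Hypothetical fraud-detection consumer: same topic, new consumer group.
// No producer-side change is needed for it to start receiving events.
@KafkaListener(topics = "orders.confirmed", groupId = "fraud-detection-service")
public void handleOrderConfirmed(OrderConfirmedEvent event) {
    fraudCheckService.score(event.getOrderId(), event.getUserId(), event.getTotal());
}
```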
The consistency problem you cannot ignore
The trade-off is consistency. With synchronous calls, you know immediately if inventory reservation failed. With events, you don't. The Order Service has published the confirmation event and returned a success to the user — but Inventory Service processing might fail ten seconds later.
This is eventual consistency, and it requires designing explicitly for compensating transactions. If Inventory Service can't reserve stock for an order, it publishes an InventoryReservationFailed event. Order Service (or a saga orchestrator) consumes that event and initiates a compensating action: cancel the order, notify the user, refund the payment.
The saga pattern formalizes this:
OrderConfirmed
  → InventoryService: reserve stock
      → success: StockReserved
          → PaymentService: charge card
              → success: PaymentCollected → order fulfilled
              → failure: PaymentFailed → compensate: release inventory
      → failure: InventoryFailed → compensate: cancel order
Each step publishes either a success or failure event. Compensating transactions undo the effects of earlier steps. This is conceptually clean but operationally demanding — you need to handle partial failures, duplicate events, and out-of-order processing.
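The branching above can be sketched as a plain decision flow, with no broker involved. The State enum and releaseInventory stub are illustrative; only the event names come from the diagram:

```java
// Saga outcome sketch: each failure branch triggers its compensating action.
public class SagaSketch {
    enum State { FULFILLED, CANCELLED }

    static State run(boolean stockReserved, boolean paymentCollected) {
        if (!stockReserved) {
            return State.CANCELLED;   // InventoryFailed → compensate: cancel order
        }
        // StockReserved → attempt payment
        if (!paymentCollected) {
            releaseInventory();       // PaymentFailed → compensate: release inventory
            return State.CANCELLED;
        }
        return State.FULFILLED;       // PaymentCollected → order fulfilled
    }

    static void releaseInventory() {
        // compensating action stub: would publish an InventoryReleased event
    }

    public static void main(String[] args) {
        System.out.println(run(true, true));   // FULFILLED
        System.out.println(run(true, false));  // CANCELLED
        System.out.println(run(false, true));  // CANCELLED
    }
}
```

In a real saga each branch is driven by consuming the corresponding success or failure event, not by booleans; the sketch only shows which outcome each path must reach.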
Duplicate events: design for idempotency
In any at-least-once delivery system (e.g. Kafka with enable.auto.commit=false and offsets committed after processing), events can be delivered more than once: if a consumer crashes after processing a message but before committing the offset, it reprocesses that message on restart. Every consumer must therefore be idempotent: processing the same event twice must produce the same result as processing it once.
@KafkaListener(topics = "orders.confirmed", groupId = "inventory-service")
@Transactional
public void handleOrderConfirmed(OrderConfirmedEvent event) {
    // Idempotency check: skip if this event was already processed
    if (processedEventRepository.existsByEventId(event.getEventId())) {
        return;
    }
    inventoryReservationService.reserve(event.getOrderId(), event.getItems());
    processedEventRepository.save(ProcessedEvent.of(event.getEventId()));
}
The processed_events table (with the event ID as a unique key) prevents double-processing. The transactional boundary ensures the inventory write and the idempotency record are committed atomically.
The outbox pattern: avoiding dual-write failures
A common failure mode: the service writes to its database and then publishes to Kafka. If the process dies between those two operations, the event is lost. The DB state changed but no downstream service knows.
The outbox pattern solves this by writing the event to an outbox table in the same DB transaction as the domain write:
BEGIN;
INSERT INTO orders (...) VALUES (...);
INSERT INTO outbox (event_id, event_type, payload, created_at)
VALUES (gen_random_uuid(), 'order.confirmed', '...', NOW());
COMMIT;
A separate relay process (a CDC connector such as Debezium, or a simple polling loop) reads from the outbox table, publishes to Kafka, and marks events as published. The relay can retry safely: since consumers are idempotent, duplicate publishes are harmless.
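The polling variant of the relay reduces to a small loop. A sketch with in-memory stand-ins for the outbox table and the broker (all names illustrative; a production relay would use Debezium via CDC, or SELECT ... FOR UPDATE polling against the real table):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Outbox relay sketch: drain unpublished rows, publish, mark as done.
public class OutboxRelaySketch {
    record OutboxRow(String eventId, String payload) {}

    static final Deque<OutboxRow> outbox = new ArrayDeque<>();
    static final List<String> publishedToKafka = new ArrayList<>();

    static void relayOnce() {
        OutboxRow row;
        while ((row = outbox.poll()) != null) {
            // at-least-once: if this "publish" fails, the row is retried later;
            // a real relay marks the row published only after the broker ack
            publishedToKafka.add(row.payload());
        }
    }

    public static void main(String[] args) {
        outbox.add(new OutboxRow("e1", "order.confirmed:42"));
        relayOnce();
        System.out.println(publishedToKafka.size()); // 1
    }
}
```

The relay is the only component that talks to both the database and the broker, which is what removes the dual-write window.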
What you're trading
EDA trades synchronous complexity (cascading failures, latency amplification) for asynchronous complexity (eventual consistency, idempotency requirements, saga design, consumer lag monitoring). Neither is free.
If you adopt EDA, invest immediately in consumer lag monitoring (Kafka's consumer lag metric via JMX or Prometheus), dead letter queues (DLQs) with alerting for failed events, and distributed tracing with correlation IDs propagated through events. Without those, debugging a failed saga becomes archaeology.
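For the DLQ piece, Spring Kafka (2.8+) ships the building blocks. A configuration sketch, assuming a KafkaTemplate bean is already defined; the bean name and retry values are illustrative:

```java
// After the retries are exhausted, the failed record is routed to
// a dead letter topic (by default, "<original-topic>.DLT").
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    var recoverer = new DeadLetterPublishingRecoverer(template);
    return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3)); // 3 retries, 1s apart
}
```

Alert on the DLQ topic's message count, not just on consumer errors: a quietly filling DLQ is a saga silently stalling.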