
KafkaJS Works… Until It Doesn't: Lessons from Real-World Streaming Architectures

The hidden roadblocks of using KafkaJS in production, what they reveal about Node.js as a streaming platform, and how those lessons led me to build @jescrich/nestjs-kafka-client — a Kafka runtime layer for the NestJS ecosystem.

November 4, 2025 · 12–14 min read · English
Kafka · NestJS · Node.js · Streaming · Open Source · Architecture

When you first integrate Kafka into a Node.js microservice, KafkaJS feels like a gift from heaven. Modern API, clean promises, solid documentation, and a community that seems to "get it." It all looks great — until your architecture grows up.

When you start handling hundreds of topics, cross-service event chains, multiple tenants, and autoscaling consumers running on Kubernetes, something happens. You realize you're no longer in the realm of tutorials — you're running a streaming system that lives, breathes, and occasionally explodes.

And when it does, you discover the invisible cracks that most teams never talk about.

This article is about those cracks — the hidden roadblocks of using KafkaJS in production, what they reveal about Node.js as a streaming platform, and how those lessons led me to build @jescrich/nestjs-kafka-client — a library designed to bring enterprise-grade reliability and observability to Kafka in the NestJS ecosystem.

01 · Premise

The illusion of simplicity

KafkaJS markets itself as "a modern Apache Kafka client for Node.js." And that's true — it's elegant, straightforward, and lightweight. But what most developers underestimate is what Kafka itself demands from its clients:

  • Offset management.
  • Backpressure handling.
  • Graceful shutdown.
  • Error recovery.
  • Consumer rebalancing.
  • Metrics and observability.
  • Throughput optimization.

KafkaJS doesn't abstract those things — it simply exposes them. That's not a flaw; it's a design decision. But it means you're building your own streaming runtime without realizing it.

At a small scale, that's fine. At enterprise scale, it's a disaster waiting to happen.

02 · Hidden problem #1

Lifecycle and connection hell

When you deploy multiple microservices consuming from Kafka, lifecycle management becomes your first nightmare.

KafkaJS doesn't integrate with application lifecycles — it's up to you to decide when to connect, disconnect, pause, resume, or commit offsets. In a containerized world (Docker, Kubernetes, ECS), your services restart all the time. A consumer that didn't close cleanly can keep a session open, block a partition, or trigger a rebalance storm.

You end up debugging strange logs like:

[Runner] consumer group rebalancing, reason: The member is rejoining

for hours — only to realize that your shutdown hook never completed before the pod died.

It's not Kafka's fault. It's not Node's fault either. It's the gap between a stateless runtime and a stateful streaming system.
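The failure mode above — a pod killed before the consumer has left its group cleanly — can be guarded against by putting a deadline on disconnection. Below is a minimal sketch, assuming only that the consumer exposes KafkaJS's real `disconnect()` method; `shutdownWithDeadline` and `KafkaConsumerLike` are hypothetical names introduced for illustration:

```typescript
// Sketch: race consumer.disconnect() against a deadline so the process never
// dies mid-session and leaves a zombie member behind to trigger rebalances.
interface KafkaConsumerLike {
  disconnect(): Promise<void>;
}

async function shutdownWithDeadline(
  consumer: KafkaConsumerLike,
  deadlineMs: number,
): Promise<'clean' | 'timed-out'> {
  const timeout = new Promise<'timed-out'>((resolve) =>
    setTimeout(() => resolve('timed-out'), deadlineMs),
  );
  const clean = consumer.disconnect().then(() => 'clean' as const);
  return Promise.race([clean, timeout]);
}

// Wire it to the signal Kubernetes sends before killing the pod:
// process.once('SIGTERM', async () => {
//   const result = await shutdownWithDeadline(consumer, 5_000);
//   process.exit(result === 'clean' ? 0 : 1);
// });
```

The point of the return value is observability: a `'timed-out'` shutdown is exactly the event you want logged before the next rebalance storm, not after.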

03 · Hidden problem #2

Backpressure and concurrency

Node.js is single-threaded. Kafka isn't.

When you consume messages in batches, KafkaJS delivers them faster than your business logic can handle — especially if you're performing async I/O (database writes, HTTP calls, file operations). Without backpressure control, the event loop becomes a bottleneck. CPU climbs. Memory balloons. Latency spikes. And you're left wondering why your microservice suddenly feels like a distributed queue instead of a consumer.

Backpressure in streaming is not optional — it's a survival mechanism. But in KafkaJS, you must build it manually: queue pools, concurrency tokens, semaphore guards. Most developers don't — until production teaches them the hard way.
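The "semaphore guard" mentioned above is small but easy to get wrong. Here is a minimal sketch of a concurrency gate, using only the standard library; `ConcurrencyGate` is a hypothetical name, not a KafkaJS API. A batch handler would await `gate.run(...)` per message instead of firing every handler at once:

```typescript
// Sketch: a concurrency gate that caps in-flight handlers. When a task
// finishes, its slot is handed directly to the next waiter, so the cap
// can never be exceeded by a caller racing the release.
class ConcurrencyGate {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(private readonly limit: number) {}

  private async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) next(); // transfer the slot straight to a waiting task
    else this.active--;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}
```

With a gate like this in front of your async I/O, Kafka can deliver batches as fast as it likes — only `limit` handlers ever touch the event loop at once.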

04 · Hidden problem #3

Observability gaps

KafkaJS has decent logging. But when things go wrong — lag builds up, rebalances loop infinitely, offsets drift — you need visibility, not logs.

Enterprise environments depend on telemetry. We monitor metrics like:

  • consumer_lag
  • rebalance_count
  • messages_processed_per_second
  • offset_commit_latency
  • retry_backlog

KafkaJS doesn't expose those metrics. You have to wrap it, patch it, or inject custom interceptors. Without metrics, you're blind. You only know your consumer is in trouble when customers start complaining.
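The wrapping the text describes usually starts as something like this — hand-rolled counters around each handler. A minimal sketch, with no metrics backend assumed; `ConsumerMetrics` and `instrument` are hypothetical names, not KafkaJS APIs:

```typescript
// Sketch: counters you would scrape or push to Prometheus, maintained by
// a wrapper around every message handler.
class ConsumerMetrics {
  private counters = new Map<string, number>();

  increment(name: string, by = 1): void {
    this.counters.set(name, (this.counters.get(name) ?? 0) + by);
  }

  get(name: string): number {
    return this.counters.get(name) ?? 0;
  }
}

// Wrap a handler so every invocation feeds the counters, success or failure.
function instrument<T>(
  metrics: ConsumerMetrics,
  handler: (msg: T) => Promise<void>,
): (msg: T) => Promise<void> {
  return async (msg) => {
    const start = Date.now();
    try {
      await handler(msg);
      metrics.increment('messages_processed');
    } catch (err) {
      metrics.increment('messages_failed');
      throw err; // still surface the error to the consumer runner
    } finally {
      metrics.increment('processing_ms_total', Date.now() - start);
    }
  };
}
```

It works, but now every team maintains its own copy — which is exactly the problem a shared runtime layer should solve.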

05 · Hidden problem #4

Configuration spaghetti

Every KafkaJS consumer or producer instance needs its own configuration. You pass brokers, client IDs, group IDs, timeouts, retries, and SSL options manually.

Multiply that across 10 microservices — and suddenly you're maintaining a zoo of .env files and "copy-pasted" connection code.

In enterprise settings, this is unacceptable. You need centralized configuration, dependency injection, and shared lifecycle control. Otherwise, you can't enforce consistency, rotate secrets, or apply observability hooks globally.
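The first step out of the spaghetti is a single typed loader that every service shares. A minimal sketch — the variable names (`KAFKA_BROKERS`, `KAFKA_CLIENT_ID`, `KAFKA_SSL`) are illustrative assumptions, not a published convention:

```typescript
// Sketch: one typed config object built once from the environment,
// instead of copy-pasted connection code in every microservice.
interface KafkaConfig {
  brokers: string[];
  clientId: string;
  ssl: boolean;
}

type Env = Record<string, string | undefined>;

function loadKafkaConfig(env: Env): KafkaConfig {
  return {
    brokers: (env.KAFKA_BROKERS ?? 'localhost:9092').split(','),
    clientId: env.KAFKA_CLIENT_ID ?? 'unknown-service',
    ssl: env.KAFKA_SSL === 'true',
  };
}
```

In a DI framework the loader becomes a provider, so rotating a secret or adding an observability hook happens in one place rather than ten.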

06 · Hidden problem #5

Error handling and recovery

Here's a dirty secret: most KafkaJS apps don't handle errors well. They log them. Maybe retry once. Then die.

Distributed systems don't forgive failure — they multiply it.

When your consumer throws inside an async handler, the message isn't acknowledged, the offset isn't committed, and the same message re-enters the queue indefinitely. Without retry policies, dead-letter topics, and circuit breakers, your system ends up consuming its own chaos.

KafkaJS gives you the rope. What you build with it is up to you.
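An explicit failure policy for a handler can be sketched in a few lines: bounded retries with exponential backoff, then a dead-letter hand-off instead of an infinite redelivery loop. `withRetry` and `onExhausted` are hypothetical names for illustration — `onExhausted` is where you would produce to a dead-letter topic; nothing here is a KafkaJS API:

```typescript
// Sketch: bounded retries with exponential backoff, then dead-letter.
async function withRetry<T>(
  task: () => Promise<T>,
  opts: {
    attempts: number;
    baseDelayMs: number;
    onExhausted: (err: unknown) => Promise<void>;
  },
): Promise<T | undefined> {
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt === opts.attempts) {
        await opts.onExhausted(err); // hand off to a dead-letter topic
        return undefined;
      }
      // wait 1x, 2x, 4x… the base delay between attempts
      await new Promise((r) => setTimeout(r, opts.baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  return undefined;
}
```

The important property is that every message has exactly one of two fates — processed or dead-lettered — so the consumer never ends up "consuming its own chaos."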

07 · The opportunity

Why NestJS needs a smarter Kafka layer

NestJS offers structure, lifecycle, dependency injection, and modules — all the ingredients for a clean abstraction over Kafka.

I realized that instead of patching KafkaJS over and over, what we needed was a Kafka runtime that behaves like a first-class citizen in the NestJS ecosystem.

That's how @jescrich/nestjs-kafka-client was born.

08 · The library

Introducing @jescrich/nestjs-kafka-client

@jescrich/nestjs-kafka-client is not another wrapper — it's a Kafka runtime layer for NestJS. It extends KafkaJS with enterprise features:

  • Multi-level backpressure.
  • Integration with NestJS lifecycle hooks (onModuleInit, onModuleDestroy).
  • Auto-reconnect and graceful shutdown.
  • Centralized configuration.
  • Decorators for publishers and subscribers.
  • Observability hooks for metrics and tracing.
  • Retry and recovery mechanisms.
  • Namespace-aware topic management for multi-tenant environments.

Architecture overview

At its core, the library is built around three pillars:

1. Client module

Handles configuration, connection pooling, and integration with NestJS dependency injection.

@Module({
  imports: [
    KafkaClientModule.forRoot({
      brokers: ['kafka:9092'],
      clientId: 'orders-service',
    }),
  ],
})
export class OrdersModule {}

2. Subscriber decorators

Define consumers declaratively, without boilerplate.

@KafkaSubscriber('orders.created')
export class OrdersConsumer {
  async handle(message: OrderCreatedEvent) {
    console.log('Order received:', message);
  }
}

3. Publisher service

Simplifies producing events with consistent metadata and error handling.

await this.kafkaPublisher.emit('orders.created', { id: 123, status: 'paid' });

Under the hood, it manages offsets, backpressure, retries, and graceful lifecycle transitions — without manual orchestration.

Enterprise comparison

Capability              KafkaJS             @jescrich/nestjs-kafka-client
Backpressure            Manual throttling   Automatic multi-level control
Lifecycle integration   Manual              NestJS lifecycle aware
Observability           Logs only           Metrics, tracing, hooks
Configuration           Scattered           Centralized, DI-based
Multi-tenancy           Unsupported         Namespace isolation
Recovery & retries      DIY                 Built-in retry policies

09 · The real inspiration

Flink, AWS, and reality

Before I built this library, I had been orchestrating Flink jobs on AWS, managing streaming pipelines that processed terabytes of data through Kafka topics, distributed microservices, and multi-tenant event routing.

We used Kafka → Flink → ClickHouse chains for event analytics and replay pipelines. In that environment, one missing commit could mean days of lag. One unhandled rebalance could freeze an entire consumer group.

That's where I learned the difference between "it works" and "it scales." Between running Kafka and owning Kafka as a platform.

Node.js didn't need another client. It needed an opinionated, reliable runtime.

10 · Closing

Lessons learned

  1. Reliability must be explicit. Async code hides complexity; streaming exposes it. Every handler must define its failure policy.
  2. Observability isn't optional. Metrics tell you what's happening long before incidents do.
  3. Lifecycle management is everything. Start-up and shutdown are where most Kafka issues are born.
  4. Abstractions win over wrappers. A thin wrapper doesn't fix systemic issues — architecture does.
  5. You can't fake backpressure. Streams need real throttling, not async hopes.

KafkaJS is a brilliant project. It made Kafka accessible to thousands of Node.js developers. But accessibility is not the same as resilience.

If you're working on a hobby project or a simple event processor, KafkaJS is perfect. If you're building enterprise pipelines, multi-tenant architectures, or distributed data flows — you'll hit the same walls I did.

That's why I built @jescrich/nestjs-kafka-client: to bridge the gap between developer convenience and enterprise discipline. It's not just about publishing and consuming messages — it's about building streaming systems that can survive real-world conditions.

Because in production, KafkaJS works… until it doesn't.

José Escrich

Fractional CTO and software architect. Built in Bariloche, Patagonia — working with teams worldwide.
