From Zero to MEGA: Rebuilding Kafka Concepts for Learning and Fun


TL;DR

This article explores the core concepts behind distributed messaging systems like Apache Kafka through the lens of MEGA, my educational project that rebuilds these concepts from scratch. We'll cover message brokers, event-driven architectures, log-based storage, and more. Whether you're new to distributed systems or looking to deepen your understanding, this practical guide shows how building a simplified version can provide powerful insights into these complex systems.

When I first encountered Kafka, I was overwhelmed by its complexity. To truly understand it, I decided to build a simplified version from scratch. This journey taught me more than any documentation could. Now, I'm sharing these insights with you—whether you're building your first microservice architecture or scaling complex event-driven applications.

Why We Need Message Brokers

Imagine you're building an e-commerce platform. When a customer places an order, multiple things need to happen: inventory updates, payment processing, shipping notifications, analytics tracking, and more. How should these systems communicate?

The Problem: Direct Communication Doesn't Scale

The simplest approach is direct communication between services:

Order Service ----> Inventory Service
    \
     \----> Payment Service
       \
        \----> Notification Service

This works for small systems, but quickly becomes problematic as you grow. Here's why large tech companies don't build their systems this way:

  1. Tight coupling:
      When services communicate directly, each one needs to know exactly where to find all the others. When Service B changes its location or API, you must update references everywhere. Your team wastes time maintaining connection details instead of building features.
  2. Point-to-point complexity: The number of connections grows quadratically
      With n services there can be up to n × (n − 1) directional connections. Just 5 services already means up to 20; scale to 50 services and you're looking at potentially 2,450. At that point, your architecture diagram starts to look like a mess of spaghetti that no one can manage.
  3. Different communication protocols:
      Different teams use different protocols. Service A uses REST, B uses gRPC, C uses GraphQL. Every service must speak multiple "languages," creating a fragmented and inconsistent codebase.
  4. Resilience challenges:
      If Service B is temporarily down, Service A must handle the failures. Without built-in resilience patterns, Service A needs complex retry logic, timeout handling, and failure modes for each service it connects to. This duplicated effort across all services makes your system less reliable and harder to maintain.

The Solution: Message Brokers

A message broker is like a smart postal service for your microservices. It accepts messages from senders (producers), stores them temporarily if needed, and delivers them to the right recipients (consumers).

                  ┌─────────────┐
Order Service ───►│   Message   │◄─── Inventory Service
                  │    Broker   │
Payment Service ─►│             │◄─── Notification Service
                  └─────────────┘

This simple change transforms your architecture:

  • Decoupling: Services only know about the broker, not each other. You can replace, update, or scale individual services without disrupting others.
  • Linear Scaling: As your system grows from 5 to 500 services, each new service only needs one connection to the broker.
  • Consistent Communication: All services use the same protocol to talk to the broker, eliminating the need to support multiple communication methods.
  • Built-in Resilience: If a service goes down, messages wait safely in the broker until it recovers. No message loss, no complex retry logic.

Two Approaches to Messaging

Not all message brokers work the same way. There are two fundamental paradigms that dramatically affect how systems behave and what they're good for.

Push-Based Systems: The Active Delivery Model

In push-based systems like RabbitMQ or ActiveMQ, the broker actively delivers messages to consumers as soon as they arrive:

┌──────────┐    Publish    ┌──────────┐    Push     ┌──────────┐
│ Producer ├──────────────►│  Broker  ├────────────►│ Consumer │
└──────────┘               └──────────┘             └──────────┘

Think of this like a postal worker who immediately delivers each letter to your door as soon as it arrives at the post office.

Pull-Based Systems: The Log Model

Pull-based systems like Kafka and MEGA use a fundamentally different approach. Messages are stored in an ordered log, and consumers request messages when they're ready:

                ┌─────────────────────────┐
                │                         │
                │  ┌───┬───┬───┬───┬───┐  │
┌──────────┐    │  │ 0 │ 1 │ 2 │ 3 │ 4 │  │    ┌──────────┐
│ Producer ├────┼─►└───┴───┴───┴───┴───┘  │◄───┤ Consumer │
└──────────┘    │          Topic          │    └──────────┘
                │                         │    (pulls from offset 2)
                └─────────────────────────┘
                          Broker

This is more like a post office that keeps all mail in chronological order. You visit when you have time and pick up everything that arrived since your last visit.
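In code, the pull model boils down to a consumer asking the log for everything after its current position, whenever it's ready. Here's a minimal Java sketch against a hypothetical in-memory log (`InMemoryLog` and `pollFrom` are illustrative names, not MEGA's actual client API):

```java
import java.util.ArrayList;
import java.util.List;

public class PullConsumerSketch {
    // Hypothetical in-memory log standing in for the broker's topic.
    static class InMemoryLog {
        private final List<String> records = new ArrayList<>();

        synchronized void append(String record) {
            records.add(record); // producer side: always append to the end
        }

        // Consumer side: fetch everything from `offset` onward, when ready.
        synchronized List<String> pollFrom(int offset) {
            int from = Math.min(offset, records.size());
            return new ArrayList<>(records.subList(from, records.size()));
        }
    }

    public static void main(String[] args) {
        InMemoryLog log = new InMemoryLog();
        log.append("order-1");
        log.append("order-2");

        int offset = 0;                        // the consumer owns its position
        List<String> batch = log.pollFrom(offset);
        offset += batch.size();                // advance only after processing
        System.out.println(batch);             // [order-1, order-2]
    }
}
```

Note that the broker never tracks what each consumer has seen; the consumer's single integer offset is the only state involved.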

Which Approach Should You Choose?

Neither approach is universally better—they serve different needs. Here's a quick comparison to help you decide:

| Consideration     | Push Model (RabbitMQ)                                    | Pull Model (Kafka/MEGA)                           |
| ----------------- | -------------------------------------------------------- | ------------------------------------------------- |
| Best for          | Real-time notifications, task distribution               | Stream processing, event sourcing, data pipelines |
| Latency           | Lower (immediate delivery)                               | Potentially higher (polling interval)             |
| Throughput        | Limited by slowest consumer                              | Can achieve much higher throughput                |
| Message Ordering  | Can be challenging to guarantee                          | Natural guarantee through log structure           |
| Consumer Control  | Broker controls delivery rate                            | Consumer controls consumption rate                |
| Replay Capability | Limited (messages typically disappear after consumption) | Strong (consumers can restart from any point)     |

Real-World Example

Imagine you're building a food delivery app. You might use both approaches for different parts of the system:

  • Push model for real-time delivery notifications: "Your driver is 5 minutes away!"
  • Pull model for order history and analytics: Reliably process all orders for business intelligence, even if the analytics service was down for maintenance.

The Log: A Deceptively Simple Data Structure

At the heart of pull-based systems like Kafka and MEGA lies a surprisingly simple data structure: the append-only log. Despite its simplicity, this structure enables powerful distributed system patterns.

What Exactly Is a Log?

In this context, a log is just an append-only sequence of records ordered by time. New records are always added to the end, and existing records are never modified:

┌─────────────────────────────────────────────────────────────────────────►
│  Record 0    Record 1    Record 2    Record 3    Record 4    Record 5
└─────────────────────────────────────────────────────────────────────────►
   Oldest                                                        Newest
   (Time)                                                        (Time)

Each record gets a sequential identifier (offset) that never changes. This simple structure provides powerful properties:

  1. Total Ordering: Every record has a precise position relative to others, making it easy to reason about sequences of events.
  2. Immutability: Past events cannot be altered, providing a reliable history that multiple consumers can trust.
  3. Sequential Access: Reading and writing patterns are highly efficient and predictable.
  4. Position Tracking: Consumers simply track their position as a number, making it easy to resume after disconnections.
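These properties take very little code to demonstrate. The sketch below is illustrative (not MEGA's actual classes): offsets are assigned once and never change, and reads never mutate the log:

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetTracking {
    private final List<String> log = new ArrayList<>();

    // Total ordering: each record's offset is its final, permanent position.
    public int append(String record) {
        log.add(record);
        return log.size() - 1;
    }

    // Immutability: reads never change the log, only the consumer's own offset.
    public String read(int offset) {
        return log.get(offset);
    }

    // The next offset a consumer would receive if it read to the end.
    public int endOffset() {
        return log.size();
    }
}
```

A consumer that disconnects after processing offset 1 simply persists `2` and resumes from there; the log itself keeps no per-consumer state.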

From Logs to Topics

In systems like Kafka and MEGA, logs are organized into named topics that provide logical separation of different event streams:

Topic: "user_events"
┌─────────────────────────────────────────────────────────────────────────►
│  Event 0     Event 1     Event 2     Event 3     Event 4     Event 5
└─────────────────────────────────────────────────────────────────────────►

Topic: "payment_transactions"
┌─────────────────────────────────────────────────────────────────────────►
│  Trans 0     Trans 1     Trans 2     Trans 3
└─────────────────────────────────────────────────────────────────────────►

This allows consumers to subscribe only to the data they care about. For example, the email service might only need "user_events" while the fraud detection service needs "payment_transactions".

How MEGA Implements Logs

In MEGA, the Topic class implements this log concept with a straightforward approach:

java
public class Topic {
    private final String name;
    private final Queue<Message> messages;
    private AtomicInteger currentOffset;

    // Methods for appending and reading messages
    public int produce(Message message) { /* ... */ }
    public Message consume(int offset) { /* ... */ }
}

While this in-memory implementation is simplified, it captures the essence of the log structure. In production systems like Kafka, logs are typically persisted to disk, partitioned across multiple nodes, and replicated for fault tolerance.

Binary Protocols: Efficiency at the Wire Level

When building high-performance messaging systems, every byte matters. While text-based protocols like JSON are human-readable and developer-friendly, they're not always the most efficient choice for systems that need to handle millions of messages per second.

Text vs. Binary: A Practical Comparison

Let's compare how a simple message might be represented in different formats:

JSON (Text-based):

{
  "correlationId": 123456,
  "messageType": "PRODUCE",
  "topic": "user_events",
  "timestamp": 1651234567890,
  "payload": "User logged in"
}

Binary (schematized):

[4 bytes: 123456][1 byte: 0x01][11 bytes: "user_events"][8 bytes: 1651234567890][14 bytes: "User logged in"]

The binary format is not only more compact (saving network bandwidth) but also faster to parse and process. This efficiency becomes critical at scale.
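You can check the arithmetic with Java's `DataOutputStream`. The layout below mirrors the example fields above (a real format would also length-prefix the variable-size fields); it's a sketch, not MEGA's exact wire protocol:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class EncodingComparison {
    static byte[] encodeBinary(int correlationId, byte type, String topic,
                               long timestamp, String payload) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(correlationId);                        // 4 bytes
            out.writeByte(type);                                // 1 byte
            out.write(topic.getBytes(StandardCharsets.UTF_8));  // 11 bytes for "user_events"
            out.writeLong(timestamp);                           // 8 bytes
            out.write(payload.getBytes(StandardCharsets.UTF_8)); // 14 bytes
            return buf.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for an in-memory stream
        }
    }

    public static void main(String[] args) {
        byte[] binary = encodeBinary(123456, (byte) 0x01, "user_events",
                                     1651234567890L, "User logged in");
        String json = "{\"correlationId\":123456,\"messageType\":\"PRODUCE\","
                + "\"topic\":\"user_events\",\"timestamp\":1651234567890,"
                + "\"payload\":\"User logged in\"}";
        System.out.println("binary=" + binary.length + " bytes, json="
                + json.getBytes(StandardCharsets.UTF_8).length + " bytes");
        // binary=38 bytes, json=123 bytes
    }
}
```

The binary form comes to 4 + 1 + 11 + 8 + 14 = 38 bytes, under a third of the 123-byte minified JSON.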

MEGA's Binary Protocol

MEGA uses a simple binary protocol where each field has a precise size and position. Here's a simplified view of a message header:

+----------------+---------------+---------------+-------------+--------------+----------------+
|  Correlation   |   Message     |    Topic      |  Topic Name |  Timestamp   |    Optional    |
|      ID        |     Type      |    Length     |             |              |     Fields     |
+----------------+---------------+---------------+-------------+--------------+----------------+
|     4 bytes    |    1 byte     |    4 bytes    |   Variable  |    8 bytes   |    Variable    |
+----------------+---------------+---------------+-------------+--------------+----------------+

This structured format provides several advantages:

  • Space Efficiency: No need to repeat field names in every message
  • Parsing Speed: The receiver knows exactly where each field starts and ends
  • Type Safety: Each field has a specific data type and size
  • Network Efficiency: Less data transmitted means better throughput

The Trade-offs

Binary protocols aren't without disadvantages. They're harder to debug (you can't just read them in a text editor), require special tools for inspection, and make schema evolution more challenging. But for high-performance messaging systems, these trade-offs are often worth it.

Correlation IDs: A Small Feature with Big Impact

Sometimes the simplest ideas have the most profound effects. Correlation IDs—unique identifiers that track requests and responses across system boundaries—are one such concept.

The Problem: Matching Requests and Responses

In asynchronous communication, a client might send multiple requests before receiving any responses. When responses finally arrive, how does the client know which response belongs to which request?

The Solution: Correlation IDs

A correlation ID is a unique identifier generated by the client and included in the request. The server then includes the same identifier in the corresponding response:

Client                                  Server
  |                                       |
  |-- Request (Correlation ID: 42) ------>|
  |-- Request (Correlation ID: 43) ------>|
  |                                       |-- Process Request 42
  |                                       |-- Process Request 43
  |<-- Response (Correlation ID: 43) -----|
  |<-- Response (Correlation ID: 42) -----|

Even though the responses arrived out of order, the client can match each response to its original request using the correlation ID.
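On the client side, the bookkeeping is just a map from correlation ID to a pending future. A hypothetical sketch (not MEGA's actual client code):

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PendingRequests {
    private final AtomicInteger nextId = new AtomicInteger();
    private final Map<Integer, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // On send: allocate a fresh correlation ID and remember the caller's future.
    public int register(CompletableFuture<String> future) {
        int id = nextId.incrementAndGet();
        pending.put(id, future);
        return id;
    }

    // On receive: responses may arrive in any order; the ID finds the right future.
    public void complete(int correlationId, String response) {
        CompletableFuture<String> f = pending.remove(correlationId);
        if (f != null) {
            f.complete(response);
        }
    }
}
```

Completing ID 43 before ID 42 still resolves each caller's future correctly, which is exactly the out-of-order scenario in the diagram.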

Implementation in MEGA

In MEGA, every message includes a correlation ID as part of its header:

public class Message {
    private int correlationId;
    // Other fields...
    
    public int getCorrelationId() {
        return this.correlationId;
    }
}

The server handler ensures this ID is reflected in responses:

private void sendProduceResponse(int correlationId, boolean success, long timestamp, int offset)
        throws IOException {
    output.writeInt(correlationId); // Echo back the client's correlation ID
    // Write other response fields...
}

Beyond Simple Request Matching

Correlation IDs enable more sophisticated patterns beyond just matching requests and responses:

Distributed Tracing

Track a single request as it flows through multiple services in your architecture, making it easier to debug complex interactions.

Request Timeout Management

Detect and handle stalled requests by tracking which correlation IDs haven't received responses within the expected timeframe.

Idempotency Guarantees

Identify and deduplicate repeated requests, ensuring that operations like "process payment" only happen once even if the request is sent multiple times.
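As a concrete illustration of this pattern, a handler can cache results keyed by correlation ID so a repeated delivery never re-runs the side effect (a hypothetical sketch, not MEGA code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentHandler {
    private final Map<Integer, String> results = new ConcurrentHashMap<>();
    private int invocations = 0;

    // A repeated correlation ID returns the cached result instead of re-processing.
    public String handle(int correlationId, String request) {
        return results.computeIfAbsent(correlationId, id -> process(request));
    }

    private String process(String request) {
        invocations++; // the side effect (e.g. charging a card) runs at most once per ID
        return "processed:" + request;
    }

    public int invocations() {
        return invocations;
    }
}
```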

Debugging

Trace message flow through complex systems by following the correlation ID through logs across different services and components.

This simple concept becomes increasingly valuable as system complexity grows, providing a thread that connects related operations across distributed components.

How MEGA Makes Life Easier for Developers

The Problem with Traditional Integration

Developers shouldn't need to become messaging experts to use a messaging system. Yet many platforms force teams to:

  • Learn complex messaging concepts
  • Handle low-level protocol details
  • Adapt to unfamiliar programming patterns

MEGA's Layer Cake Approach

Instead, MEGA uses a layered architecture that shields developers from complexity:

┌─────────────────────────────────┐
│    Application Code (User)      │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│    Client Library (Language)    │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│    Protocol Implementation      │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│    Network Transport Layer      │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│     Broker Implementation       │
└─────────────────────────────────┘

What This Means in Practice

Use your language

Python, JavaScript, Go, Java—it doesn't matter

Write natural code

No need to learn new paradigms

Join the community

Create libraries in your favorite language

This same approach helped Kafka become an industry standard. When messaging feels like a natural extension of your existing code, adoption happens naturally.

Language-Specific APIs: Code That Feels Natural

Let's see how the same operation—producing a message—looks in different language-specific client libraries:

Raw Protocol (Java):

// Low-level protocol implementation
byte[] payload = "Hello World".getBytes();
int correlationId = generateCorrelationId();
outputStream.writeInt(correlationId);
outputStream.writeByte(MESSAGE_TYPE_PRODUCE);
outputStream.writeInt("my-topic".length());
outputStream.writeBytes("my-topic");
outputStream.writeLong(System.currentTimeMillis());
outputStream.writeInt(payload.length);
outputStream.write(payload);
// ... handle response manually

Idiomatic Java API:

// High-level idiomatic Java API
ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "Hello World");
producer.send(record).get(); // send() returns a future; get() blocks until acknowledged

Idiomatic Python API:

# High-level idiomatic Python API
async with Producer() as producer:
    await producer.send("my-topic", "Hello World")

Idiomatic JavaScript API:

// High-level idiomatic JavaScript API
const producer = new Producer();
await producer.connect();
await producer.send({
  topic: "my-topic",
  messages: [{ value: "Hello World" }]
});

Each example accomplishes the same task but in a way that feels natural to developers in that language ecosystem. This dramatically improves developer experience and reduces the learning curve.

Real-World Applications

Let's explore how distributed messaging systems like MEGA can be applied in real-world scenarios with clear, practical examples.

Event Broadcasting: One Message, Many Listeners

                       ┌──────────┐
                    ┌─►│ Email    │
                    │  │ Service  │ Welcome email
                    │  └──────────┘
                    │
┌──────────┐    ┌───┴────┐  ┌──────────┐
│  User    │───►│ Topic: │◄─┤Analytics │ Track signup
│ Service  │    │ user.  │  │ Service  │
└──────────┘    │created │  └──────────┘
                └───┬────┘
                    │  ┌────────────┐
                    └─►│Recommend.  │ Initial 
                       │Service     │ suggestions
                       └────────────┘

How It Works

  1. A user registers on your website through the User Service
  2. The User Service publishes a single "UserCreated" event to the "user.created" topic
  3. Three separate services subscribe to this topic:
    • The Email Service sends a personalized welcome email
    • The Analytics Service updates user acquisition metrics
    • The Recommendations Service prepares initial content suggestions

Key Benefits

  • The User Service doesn't need to know which systems need the information
  • New consumers can be added without modifying the producer
  • If any service fails, others continue working independently
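The whole pattern fits in a few lines when sketched in-process (a real broker does this across the network, with the log providing durability):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class TopicBroadcast {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    public void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    // One publish, every subscriber sees the event; the producer knows none of them.
    public void publish(String event) {
        for (Consumer<String> s : subscribers) {
            s.accept(event);
        }
    }
}
```

Adding a fourth consumer (say, a fraud check) is just one more `subscribe` call; the producer's code never changes.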

Work Distribution: Parallel Processing

                    ┌─────────────────┐
                    │   Task Queue    │
┌──────────┐        │  ┌───┬───┬───┐  │        ┌──────────┐
│  Upload  │───┬───►│  │ 1 │ 2 │ 3 │  │◄───┬───┤ Worker 1 │
│ Service  │   │    │  └───┴───┴───┘  │    │   └──────────┘
└──────────┘   │    │                 │    │   ┌──────────┐
               └───►│  ┌───┬───┬───┐  │◄───┼───┤ Worker 2 │
                    │  │ 4 │ 5 │ 6 │  │    │   └──────────┘
                    │  └───┴───┴───┘  │    │   ┌──────────┐
                    │                 │◄───┴───┤ Worker 3 │
                    └─────────────────┘        └──────────┘

How It Works

  1. A user uploads multiple images to be processed (resized, filtered, etc.)
  2. The Upload Service creates a task for each image and sends it to the task queue topic
  3. Multiple worker services consume from this topic in parallel
  4. Each worker processes its assigned task and reports completion
  5. If a worker crashes, the task can be automatically reassigned to another worker

Key Benefits

  • Processing happens in parallel, reducing total time
  • Workers can be added or removed dynamically based on load
  • The system is resilient to worker failures
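Here's a compact in-process sketch of the pattern: one shared queue, several workers, each task processed exactly once. (Real systems add acknowledgement and reassignment on worker failure, which this toy version omits.)

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class WorkerPoolSketch {
    // Each task is taken by exactly one worker; results land in a shared map.
    static ConcurrentMap<Integer, String> processAll(List<Integer> taskIds, int workers) {
        LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>(taskIds);
        ConcurrentMap<Integer, String> done = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                Integer id;
                while ((id = queue.poll()) != null) {          // pull the next free task
                    done.put(id, "resized-image-" + id);       // stand-in for real work
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done;
    }
}
```

Scaling out is just raising the `workers` count; the queue naturally balances the load because idle workers pull the next task themselves.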

Event Sourcing: The Log as Source of Truth

┌───────────┐    ┌───────────────┐    ┌───────────┐
│ Commands  │───►│  Event Store  │◄───┤ Query     │
│ (deposit, │    │  Topic:       │    │ Services  │
│ withdraw) │    │  tx.events    │    └───────────┘
└───────────┘    │ ┌───┬───┬───┐ │
                 │ │ 1 │ 2 │ 3 │ │
                 │ └───┴───┴───┘ │
                 └───────┬───────┘
                         │            ┌────────────┐
                         └───────────►│ Projections│
                                      │ (current   │
                                      │  balances) │
                                      └────────────┘

Banking Example

Instead of updating account balances directly, every transaction is stored as an immutable event:

  1. Customer deposits $100 → "Deposit" event stored in log
  2. Customer withdraws $30 → "Withdrawal" event stored in log
  3. Current balance ($70) is calculated by replaying all events
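The replay step is literally a fold over the log. A minimal sketch of the banking example:

```java
import java.util.List;

public class EventReplay {
    record Event(String type, long amount) {}

    // The current balance is derived by replaying the immutable event log in order.
    static long balance(List<Event> events) {
        long total = 0;
        for (Event e : events) {
            switch (e.type()) {
                case "Deposit" -> total += e.amount();
                case "Withdrawal" -> total -= e.amount();
            }
        }
        return total;
    }
}
```

Because the events themselves are never overwritten, you can recompute the balance as of any point in time by folding over a prefix of the log.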

Key Benefits

  • Complete audit trail built into the system architecture
  • Historical states can be reconstructed at any point in time
  • Business logic bugs can be fixed by reprocessing events
  • New business insights can be derived from existing events

Learning Through Implementation

MEGA (Minimalist Event Gateway Architecture) is my educational open-source project that demonstrates the core principles of distributed messaging systems. I created it to help developers understand these concepts through hands-on experience.

Why Build Your Own?

Even if you ultimately use existing systems in production, building a simplified version yourself provides several benefits:

Deeper Understanding

You'll see exactly how and why certain design decisions are made, not just follow documentation.

Debugging Ability

When issues arise in production systems, you'll have mental models for troubleshooting.

Architectural Insight

You'll make better decisions about when and how to use different messaging patterns.

Innovation Potential

You might discover improvements or alternatives to existing approaches.

Balancing Theory and Practice

While implementing systems is invaluable, deep theoretical knowledge provides the foundation for good design decisions. The most powerful learning happens when you combine both:

  • Read the literature: Papers like "The Log: What every software engineer should know about real-time data's unifying abstraction" by Jay Kreps provide critical insights
  • Study distributed systems theory: Understanding concepts like consensus algorithms, consistency models, and failure modes
  • Analyze existing systems: Review the architecture of systems like Kafka, RabbitMQ, and NATS
  • Build your own implementation: Start simple and gradually add features as you learn

Contributing to MEGA

MEGA is an open-source project that welcomes contributions from the community. There are many ways to get involved:

  • Client Libraries: Implement the protocol in different programming languages
  • Feature Extensions: Add new capabilities like persistence or partitioning
  • Documentation: Improve explanations and examples
  • Testing: Create robust test suites and benchmarks

By contributing, you not only improve your own understanding but also help others learn these important concepts.

If you found this article helpful, consider exploring my educational project MEGA on GitHub, which implements these concepts in a simplified form designed for learning.

Watch the MEGA code review in Arabic. The best way to understand these systems is to combine theoretical study with hands-on implementation.

Follow me on Twitter @Z_MEGATR0N for more updates
