From Zero to MEGA: Rebuilding Kafka Concepts for Learning and Fun
TL;DR
This article explores the core concepts behind distributed messaging systems like Apache Kafka through the lens of MEGA, my educational project that rebuilds these concepts from scratch. We'll cover message brokers, event-driven architectures, log-based storage, and more. Whether you're new to distributed systems or looking to deepen your understanding, this practical guide shows how building a simplified version can provide powerful insights into these complex systems.
When I first encountered Kafka, I was overwhelmed by its complexity. To truly understand it, I decided to build a simplified version from scratch. This journey taught me more than any documentation could. Now, I'm sharing these insights with you—whether you're building your first microservice architecture or scaling complex event-driven applications.
Why We Need Message Brokers
Imagine you're building an e-commerce platform. When a customer places an order, multiple things need to happen: inventory updates, payment processing, shipping notifications, analytics tracking, and more. How should these systems communicate?
The Problem: Direct Communication Doesn't Scale
The simplest approach is direct communication between services:
Order Service ----> Inventory Service
       \
        \---------> Payment Service
         \
          \-------> Notification Service
This works for small systems, but quickly becomes problematic as you grow. Here's why large tech companies don't build their systems this way:
- Tight coupling: When services communicate directly, each one needs to know exactly where to find all the others. When Service B changes its location or API, you must update references everywhere. Your team wastes time maintaining connection details instead of building features.
- Point-to-point complexity: With n services, the number of direct connections grows roughly quadratically, up to n(n-1) directed links. With just 5 services you might already have up to 20 direct connections; scale to 50 services and you're looking at potentially 2,450. At that point, your architecture diagram starts to look like spaghetti that no one can manage.
- Different communication protocols: Different teams use different protocols. Service A uses REST, B uses gRPC, C uses GraphQL. Every service must speak multiple "languages," creating a fragmented and inconsistent codebase.
- Resilience challenges: If Service B is temporarily down, Service A must handle the failures. Without built-in resilience patterns, Service A needs complex retry logic, timeout handling, and failure modes for each service it connects to. This duplicated effort across all services makes your system less reliable and harder to maintain.
The Solution: Message Brokers
A message broker is like a smart postal service for your microservices. It accepts messages from senders (producers), stores them temporarily if needed, and delivers them to the right recipients (consumers).
                       ┌─────────────┐
Order Service   ────►  │   Message   │  ◄──── Inventory Service
                       │   Broker    │
Payment Service ────►  │             │  ◄──── Notification Service
                       └─────────────┘
This simple change transforms your architecture:
- Decoupling: Services only know about the broker, not each other. You can replace, update, or scale individual services without disrupting others.
- Linear Scaling: As your system grows from 5 to 500 services, each new service only needs one connection to the broker.
- Consistent Communication: All services use the same protocol to talk to the broker, eliminating the need to support multiple communication methods.
- Built-in Resilience: If a service goes down, messages wait safely in the broker until it recovers. No message loss, no complex retry logic.
Two Approaches to Messaging
Not all message brokers work the same way. There are two fundamental paradigms that dramatically affect how systems behave and what they're good for.
Push-Based Systems: The Active Delivery Model
In push-based systems like RabbitMQ or ActiveMQ, the broker actively delivers messages to consumers as soon as they arrive:
┌──────────┐    Publish    ┌──────────┐     Push     ┌──────────┐
│ Producer ├──────────────►│  Broker  ├─────────────►│ Consumer │
└──────────┘               └──────────┘              └──────────┘
Think of this like a postal worker who immediately delivers each letter to your door as soon as it arrives at the post office.
Pull-Based Systems: The Log Model
Pull-based systems like Kafka and MEGA use a fundamentally different approach. Messages are stored in an ordered log, and consumers request messages when they're ready:
                  ┌─────────────────────────┐
                  │                         │
                  │   ┌───┬───┬───┬───┬───┐ │
┌──────────┐      │   │ 0 │ 1 │ 2 │ 3 │ 4 │ │      ┌──────────┐
│ Producer ├─────►│   └───┴───┴───┴───┴───┘ │◄─────┤ Consumer │
└──────────┘      │           Topic         │      └──────────┘
                  │                         │      (pulls from offset 2)
                  └─────────────────────────┘
                             Broker
This is more like a post office that keeps all mail in chronological order. You visit when you have time and pick up everything that arrived since your last visit.
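To make the pull loop concrete, here is a tiny, self-contained Java sketch. The list stands in for the broker's log and all the names are illustrative rather than MEGA's actual API; the point is simply that the consumer owns its offset and reads at its own pace.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Toy illustration of the pull model: the broker only stores an ordered log,
// and the consumer decides when to read and from which offset.
public class PullDemo {

    // A stand-in "topic": an append-only list where the index is the offset.
    static final List<String> topic = new CopyOnWriteArrayList<>();

    public static void main(String[] args) {
        // Producer side: append records; their position in the list is their offset.
        topic.add("order-1 created");
        topic.add("order-2 created");
        topic.add("order-3 created");

        // Consumer side: resume from a remembered offset and pull at its own pace.
        int offset = 1; // offset 0 was already processed in a previous run
        while (offset < topic.size()) {
            String record = topic.get(offset);  // pull the record at this offset
            System.out.println("offset " + offset + ": " + record);
            offset++;                           // advance only after processing
        }
        // A real consumer would keep polling here as new records arrive.
    }
}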
Which Approach Should You Choose?
Neither approach is universally better—they serve different needs. Here's a quick comparison to help you decide:
| Consideration | Push Model (RabbitMQ) | Pull Model (Kafka/MEGA) |
|---|---|---|
| Best for | Real-time notifications, task distribution | Stream processing, event sourcing, data pipelines |
| Latency | Lower (immediate delivery) | Potentially higher (polling interval) |
| Throughput | Limited by slowest consumer | Can achieve much higher throughput |
| Message Ordering | Can be challenging to guarantee | Natural guarantee through log structure |
| Consumer Control | Broker controls delivery rate | Consumer controls consumption rate |
| Replay Capability | Limited (messages typically disappear after consumption) | Strong (consumers can restart from any point) |
Real-World Example
Imagine you're building a food delivery app. You might use both approaches for different parts of the system:
- Push model for real-time delivery notifications: "Your driver is 5 minutes away!"
- Pull model for order history and analytics: Reliably process all orders for business intelligence, even if the analytics service was down for maintenance.
The Log: A Deceptively Simple Data Structure
At the heart of pull-based systems like Kafka and MEGA lies a surprisingly simple data structure: the append-only log. Despite its simplicity, this structure enables powerful distributed system patterns.
What Exactly Is a Log?
In this context, a log is just an append-only sequence of records ordered by time. New records are always added to the end, and existing records are never modified:
┌─────────────────────────────────────────────────────────────────────────►
│  Record 0    Record 1    Record 2    Record 3    Record 4    Record 5
└─────────────────────────────────────────────────────────────────────────►
   Oldest                                                          Newest
   (Time)                                                          (Time)
Each record gets a sequential identifier (offset) that never changes. This simple structure provides powerful properties:
- Total Ordering: Every record has a precise position relative to others, making it easy to reason about sequences of events.
- Immutability: Past events cannot be altered, providing a reliable history that multiple consumers can trust.
- Sequential Access: Reading and writing patterns are highly efficient and predictable.
- Position Tracking: Consumers simply track their position as a number, making it easy to resume after disconnections.
From Logs to Topics
In systems like Kafka and MEGA, logs are organized into named topics that provide logical separation of different event streams:
Topic: "user_events"
┌─────────────────────────────────────────────────────────────────────────►
│ Event 0 Event 1 Event 2 Event 3 Event 4 Event 5
└─────────────────────────────────────────────────────────────────────────►
Topic: "payment_transactions"
┌─────────────────────────────────────────────────────────────────────────►
│ Trans 0 Trans 1 Trans 2 Trans 3
└─────────────────────────────────────────────────────────────────────────►
This allows consumers to subscribe only to the data they care about. For example, the email service might only need "user_events" while the fraud detection service needs "payment_transactions".
How MEGA Implements Logs
In MEGA, the Topic class implements this log concept with a straightforward approach:
public class Topic {
    private final String name;
    private final Queue<Message> messages;
    private AtomicInteger currentOffset;

    // Methods for appending and reading messages
    public int produce(Message message) { /* ... */ }
    public Message consume(int offset) { /* ... */ }
}
While this in-memory implementation is simplified, it captures the essence of the log structure. In production systems like Kafka, logs are typically persisted to disk, partitioned across multiple nodes, and replicated for fault tolerance.
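The method bodies are elided above. As a thought experiment, a minimal in-memory version could look roughly like the sketch below; this is my own guess under simplifying assumptions (a list-backed log, String payloads, no persistence), not MEGA's actual code.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical minimal in-memory log in the spirit of MEGA's Topic class.
public class SimpleTopic {
    private final String name;
    private final List<String> messages = new CopyOnWriteArrayList<>();
    private final AtomicInteger nextOffset = new AtomicInteger(0);

    public SimpleTopic(String name) {
        this.name = name;
    }

    // Append a message to the end of the log and return the offset it was assigned.
    // (A real broker would make the append and the offset assignment atomic.)
    public int produce(String message) {
        messages.add(message);
        return nextOffset.getAndIncrement();
    }

    // Read the message stored at a given offset, or null if it does not exist yet.
    public String consume(int offset) {
        if (offset < 0 || offset >= messages.size()) {
            return null;
        }
        return messages.get(offset);
    }

    public static void main(String[] args) {
        SimpleTopic topic = new SimpleTopic("user_events");
        int offset = topic.produce("User logged in");
        System.out.println("stored at offset " + offset + ": " + topic.consume(offset));
    }
}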
Binary Protocols: Efficiency at the Wire Level
When building high-performance messaging systems, every byte matters. While text-based protocols like JSON are human-readable and developer-friendly, they're not always the most efficient choice for systems that need to handle millions of messages per second.
Text vs. Binary: A Practical Comparison
Let's compare how a simple message might be represented in different formats:
JSON (Text-based):
{
"correlationId": 123456,
"messageType": "PRODUCE",
"topic": "user_events",
"timestamp": 1651234567890,
"payload": "User logged in"
}
Binary (schematized):
[4 bytes: 123456][1 byte: 0x01][11 bytes: "user_events"][8 bytes: 1651234567890][14 bytes: "User logged in"]
The binary format is considerably more compact (the pretty-printed JSON above is roughly 140 bytes, while this binary frame is about 38 bytes), saving network bandwidth, and it is also faster to parse and process. This efficiency becomes critical at scale.
MEGA's Binary Protocol
MEGA uses a simple binary protocol where each field has a precise size and position. Here's a simplified view of a message header:
+----------------+---------------+---------------+-------------+--------------+----------------+
|  Correlation   |    Message    |     Topic     | Topic Name  |  Timestamp   |    Optional    |
|      ID        |     Type      |    Length     |             |              |     Fields     |
+----------------+---------------+---------------+-------------+--------------+----------------+
|    4 bytes     |    1 byte     |    4 bytes    |  Variable   |   8 bytes    |    Variable    |
+----------------+---------------+---------------+-------------+--------------+----------------+
This structured format provides several advantages:
- Space Efficiency: No need to repeat field names in every message
- Parsing Speed: The receiver knows exactly where each field starts and ends
- Type Safety: Each field has a specific data type and size
- Network Efficiency: Less data transmitted means better throughput
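The parsing-speed point is easy to see in code. The sketch below writes a frame shaped like the header above with DataOutputStream and reads it back with DataInputStream; the field order follows the diagram, though the real MEGA wire format may differ in details such as the optional fields.
import java.io.*;
import java.nio.charset.StandardCharsets;

// Sketch of a fixed-layout binary header. Because every field has a known size
// and position, decoding is a series of sequential reads -- no quoting,
// escaping, or text scanning as with JSON.
public class HeaderDemo {

    static void readHeader(DataInputStream in) throws IOException {
        int correlationId = in.readInt();            // 4 bytes
        byte messageType  = in.readByte();           // 1 byte
        int topicLength   = in.readInt();            // 4 bytes (length prefix)
        byte[] topicBytes = new byte[topicLength];   // variable
        in.readFully(topicBytes);
        String topic   = new String(topicBytes, StandardCharsets.UTF_8);
        long timestamp = in.readLong();              // 8 bytes
        System.out.printf("id=%d type=%d topic=%s ts=%d%n",
                correlationId, messageType, topic, timestamp);
    }

    public static void main(String[] args) throws IOException {
        // Build a sample frame with the mirror-image writes, then parse it back.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(123456);                         // correlation ID
        out.writeByte(0x01);                          // message type (PRODUCE)
        byte[] topic = "user_events".getBytes(StandardCharsets.UTF_8);
        out.writeInt(topic.length);                   // topic length
        out.write(topic);                             // topic name
        out.writeLong(1651234567890L);                // timestamp

        readHeader(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
    }
}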
The Trade-offs
Binary protocols aren't without disadvantages. They're harder to debug (you can't just read them in a text editor), require special tools for inspection, and make schema evolution more challenging. But for high-performance messaging systems, these trade-offs are often worth it.
Correlation IDs: A Small Feature with Big Impact
Sometimes the simplest ideas have the most profound effects. Correlation IDs—unique identifiers that track requests and responses across system boundaries—are one such concept.
The Problem: Matching Requests and Responses
In asynchronous communication, a client might send multiple requests before receiving any responses. When responses finally arrive, how does the client know which response belongs to which request?
The Solution: Correlation IDs
A correlation ID is a unique identifier generated by the client and included in the request. The server then includes the same identifier in the corresponding response:
Client                                       Server
  |                                            |
  |-- Request (Correlation ID: 42) --------->  |
  |-- Request (Correlation ID: 43) --------->  |
  |                                            |-- Process Request 42
  |                                            |-- Process Request 43
  |  <-------- Response (Correlation ID: 43) --|
  |  <-------- Response (Correlation ID: 42) --|
Even though the responses arrived out of order, the client can match each response to its original request using the correlation ID.
Implementation in MEGA
In MEGA, every message includes a correlation ID as part of its header:
public class Message {
    private int correlationId;
    // Other fields...

    public int getCorrelationId() {
        return this.correlationId;
    }
}
The server handler ensures this ID is reflected in responses:
private void sendProduceResponse(int correlationId, boolean success, long timestamp, int offset)
        throws IOException {
    output.writeInt(correlationId); // Echo back the client's correlation ID
    // Write other response fields...
}
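On the client side, the usual way to exploit this is a map from correlation ID to a pending future: the sender registers an entry before writing the request, and the thread that reads responses completes whichever entry the incoming ID points at. The sketch below is generic client-side plumbing, not code taken from MEGA.
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: matching out-of-order responses to their requests via correlation IDs.
public class PendingRequests {

    private final AtomicInteger nextId = new AtomicInteger();
    private final Map<Integer, CompletableFuture<byte[]>> pending = new ConcurrentHashMap<>();

    // Sending side: allocate an ID and register a future before the request is
    // written to the socket with that ID in its header.
    public int register(CompletableFuture<byte[]> future) {
        int correlationId = nextId.incrementAndGet();
        pending.put(correlationId, future);
        return correlationId;
    }

    // Reader side: called whenever a response frame arrives.
    public void complete(int correlationId, byte[] responseBody) {
        CompletableFuture<byte[]> future = pending.remove(correlationId);
        if (future != null) {
            future.complete(responseBody); // wakes up whoever is waiting on this request
        }
    }

    public static void main(String[] args) {
        PendingRequests requests = new PendingRequests();
        CompletableFuture<byte[]> a = new CompletableFuture<>();
        CompletableFuture<byte[]> b = new CompletableFuture<>();
        int idA = requests.register(a);
        int idB = requests.register(b);

        // Responses arrive out of order, but each one still finds its request.
        requests.complete(idB, "response B".getBytes());
        requests.complete(idA, "response A".getBytes());
        System.out.println(new String(a.join()) + " / " + new String(b.join()));
    }
}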
Beyond Simple Request Matching
Correlation IDs enable more sophisticated patterns beyond just matching requests and responses:
Distributed Tracing
Track a single request as it flows through multiple services in your architecture, making it easier to debug complex interactions.
Request Timeout Management
Detect and handle stalled requests by tracking which correlation IDs haven't received responses within the expected timeframe.
Idempotency Guarantees
Identify and deduplicate repeated requests, ensuring that operations like "process payment" only happen once even if the request is sent multiple times.
Debugging
Trace message flow through complex systems by following the correlation ID through logs across different services and components.
This simple concept becomes increasingly valuable as system complexity grows, providing a thread that connects related operations across distributed components.
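As a small illustration of the idempotency point, a handler can remember which correlation IDs (or client-supplied idempotency keys) it has already applied and skip repeats. This is a generic sketch, not MEGA's server code.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of deduplication keyed by correlation ID: a retried request with the
// same ID is acknowledged but the operation is not applied a second time.
public class DeduplicatingHandler {

    private final Set<Integer> seen = ConcurrentHashMap.newKeySet();

    public void handle(int correlationId, Runnable operation) {
        // add() returns false if this ID was already processed.
        if (seen.add(correlationId)) {
            operation.run(); // first time: apply the operation
        }
        // Either way, respond with the same correlationId so the client stops retrying.
    }

    public static void main(String[] args) {
        DeduplicatingHandler handler = new DeduplicatingHandler();
        handler.handle(42, () -> System.out.println("processing payment 42"));
        handler.handle(42, () -> System.out.println("processing payment 42")); // duplicate, skipped
    }
}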
How MEGA Makes Life Easier for Developers
The Problem with Traditional Integration
Developers shouldn't need to become messaging experts to use a messaging system. Yet many platforms force teams to:
- Learn complex messaging concepts
- Handle low-level protocol details
- Adapt to unfamiliar programming patterns
MEGA's Layer Cake Approach
Instead, MEGA uses a layered architecture that shields developers from complexity:
┌─────────────────────────────────┐
│     Application Code (User)     │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│   Client Library (Language)     │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│     Protocol Implementation     │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│     Network Transport Layer     │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│      Broker Implementation      │
└─────────────────────────────────┘
What This Means in Practice
- Use your language: Python, JavaScript, Go, Java—it doesn't matter
- Write natural code: no need to learn new paradigms
- Join the community: create libraries in your favorite language
This same approach helped Kafka become an industry standard. When messaging feels like a natural extension of your existing code, adoption happens naturally.
Language-Specific APIs: Code That Feels Natural
Let's see how the same operation—producing a message—looks in different language-specific client libraries:
Raw Protocol (Java):
// Low-level protocol implementation
byte[] payload = "Hello World".getBytes();
int correlationId = generateCorrelationId();
outputStream.writeInt(correlationId);
outputStream.writeByte(MESSAGE_TYPE_PRODUCE);
outputStream.writeInt("my-topic".length());
outputStream.writeBytes("my-topic");
outputStream.writeLong(System.currentTimeMillis());
outputStream.writeInt(payload.length);
outputStream.write(payload);
// ... handle response manually
Idiomatic Java API:
// High-level idiomatic Java API
ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "Hello World");
producer.send(record).get(); // send() returns a future; get() blocks until the broker acknowledges
Idiomatic Python API:
# High-level idiomatic Python API
async with Producer() as producer:
    await producer.send("my-topic", "Hello World")
Idiomatic JavaScript API:
// High-level idiomatic JavaScript API
const producer = new Producer();
await producer.connect();
await producer.send({
  topic: "my-topic",
  messages: [{ value: "Hello World" }]
});
Each example accomplishes the same task but in a way that feels natural to developers in that language ecosystem. This dramatically improves developer experience and reduces the learning curve.
Real-World Applications
Let's explore how distributed messaging systems like MEGA can be applied in real-world scenarios with clear, practical examples.
Event Broadcasting: One Message, Many Listeners
                        ┌──────────┐
                   ┌───►│  Email   │
                   │    │ Service  │   Welcome email
                   │    └──────────┘
                   │
┌──────────┐   ┌───┴────┐    ┌──────────┐
│  User    │──►│ Topic: │◄───┤Analytics │   Track signup
│ Service  │   │ user.  │    │ Service  │
└──────────┘   │created │    └──────────┘
               └───┬────┘
                   │    ┌────────────┐
                   └───►│ Recommend. │   Initial
                        │  Service   │   suggestions
                        └────────────┘
How It Works
- A user registers on your website through the User Service
- The User Service publishes a single "UserCreated" event to the "user.created" topic
- Three separate services subscribe to this topic:
- The Email Service sends a personalized welcome email
- The Analytics Service updates user acquisition metrics
- The Recommendations Service prepares initial content suggestions
Key Benefits
- The User Service doesn't need to know which systems need the information
- New consumers can be added without modifying the producer
- If any service fails, others continue working independently
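A toy sketch of why this works: each consumer keeps its own offset into the log, so every service reads the full stream independently and a new subscriber is just one more offset. The topic is modeled as a plain list and the service names are illustrative.
import java.util.List;

// Fan-out sketch: one event stream, several independent readers, each with its
// own position in the log. None of them blocks or even knows about the others.
public class FanOutDemo {

    public static void main(String[] args) {
        List<String> userCreated = List.of("user:alice", "user:bob"); // the "user.created" topic

        int emailOffset = 0, analyticsOffset = 0, recommendationsOffset = 0;

        while (emailOffset < userCreated.size()) {
            System.out.println("email service:   welcome " + userCreated.get(emailOffset++));
        }
        while (analyticsOffset < userCreated.size()) {
            System.out.println("analytics:       track signup " + userCreated.get(analyticsOffset++));
        }
        while (recommendationsOffset < userCreated.size()) {
            System.out.println("recommendations: seed suggestions " + userCreated.get(recommendationsOffset++));
        }
        // Adding a fourth consumer tomorrow would not touch the User Service at all.
    }
}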
Work Distribution: Parallel Processing
                     ┌─────────────────┐
                     │                 │
┌──────────┐         │  ┌───┬───┬───┐  │         ┌──────────┐
│  Upload  │────┬───►│  │ 1 │ 2 │ 3 │  │◄───┬────┤ Worker 1 │
│ Service  │    │    │  └───┴───┴───┘  │    │    └──────────┘
└──────────┘    │    │   Task Queue    │    │    ┌──────────┐
                └───►│                 │◄───┼────┤ Worker 2 │
                     │  ┌───┬───┬───┐  │    │    └──────────┘
                     │  │ 4 │ 5 │ 6 │  │    │    ┌──────────┐
                     │  └───┴───┴───┘  │◄───┴────┤ Worker 3 │
                     └─────────────────┘         └──────────┘
How It Works
- A user uploads multiple images to be processed (resized, filtered, etc.)
- The Upload Service creates a task for each image and sends it to the task queue topic
- Multiple worker services consume from this topic in parallel
- Each worker processes its assigned task and reports completion
- If a worker crashes, the task can be automatically reassigned to another worker
Key Benefits
- Processing happens in parallel, reducing total time
- Workers can be added or removed dynamically based on load
- The system is resilient to worker failures
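A minimal sketch of the competing-workers idea: here a shared atomic counter plays the role that partitions and consumer groups play in Kafka, so each worker claims the next unprocessed task exactly once. This is only an illustration of the pattern, not MEGA's API.
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Work-distribution sketch: three workers pull from one task queue in parallel.
public class WorkerPoolDemo {

    public static void main(String[] args) throws InterruptedException {
        List<String> taskQueue = List.of("img-1", "img-2", "img-3", "img-4", "img-5", "img-6");
        AtomicInteger nextOffset = new AtomicInteger(0);

        Runnable worker = () -> {
            while (true) {
                int offset = nextOffset.getAndIncrement(); // atomically claim the next task
                if (offset >= taskQueue.size()) return;    // nothing left to do
                System.out.println(Thread.currentThread().getName()
                        + " processing " + taskQueue.get(offset));
            }
        };

        Thread w1 = new Thread(worker, "worker-1");
        Thread w2 = new Thread(worker, "worker-2");
        Thread w3 = new Thread(worker, "worker-3");
        w1.start(); w2.start(); w3.start();
        w1.join();  w2.join();  w3.join();
        // Adding a fourth worker only means starting one more thread (or process).
    }
}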
Event Sourcing: The Log as Source of Truth
                 ┌─────────────────────┐
┌───────────┐    │     Event Store     │    ┌───────────┐
│ Commands  │───►│  Topic: tx.events   │◄───┤  Query    │
│ (deposit, │    │                     │    │ Services  │
│ withdraw) │    │    ┌───┬───┬───┐    │    └───────────┘
└───────────┘    │    │ 1 │ 2 │ 3 │    │
                 │    └───┴───┴───┘    │
                 └──────────┬──────────┘
                            │           ┌────────────┐
                            └──────────►│ Projections│
                                        │  (current  │
                                        │  balances) │
                                        └────────────┘
Banking Example
Instead of updating account balances directly, every transaction is stored as an immutable event:
- Customer deposits $100 → "Deposit" event stored in log
- Customer withdraws $30 → "Withdrawal" event stored in log
- Current balance ($70) is calculated by replaying all events
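Here is a minimal replay sketch for this example. The event types and amounts are hypothetical; the point is that the current balance is just a projection computed by folding over the immutable log.
import java.util.List;

// Event-sourcing sketch: state is derived by replaying the event log.
public class BalanceProjection {

    record Transaction(String type, long amount) {}

    static long replay(List<Transaction> events) {
        long balance = 0;
        for (Transaction event : events) {
            if ("DEPOSIT".equals(event.type()))         balance += event.amount();
            else if ("WITHDRAWAL".equals(event.type())) balance -= event.amount();
        }
        return balance;
    }

    public static void main(String[] args) {
        List<Transaction> log = List.of(
                new Transaction("DEPOSIT", 100),     // customer deposits $100
                new Transaction("WITHDRAWAL", 30));  // customer withdraws $30
        System.out.println("current balance: $" + replay(log)); // prints 70
    }
}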
Key Benefits
- Complete audit trail built into the system architecture
- Historical states can be reconstructed at any point in time
- Business logic bugs can be fixed by reprocessing events
- New business insights can be derived from existing events
Learning Through Implementation
MEGA (Minimalist Event Gateway Architecture) is my educational open-source project that demonstrates the core principles of distributed messaging systems. I created it to help developers understand these concepts through hands-on experience.
Why Build Your Own?
Even if you ultimately use existing systems in production, building a simplified version yourself provides several benefits:
Deeper Understanding
You'll see exactly how and why certain design decisions are made, not just follow documentation.
Debugging Ability
When issues arise in production systems, you'll have mental models for troubleshooting.
Architectural Insight
You'll make better decisions about when and how to use different messaging patterns.
Innovation Potential
You might discover improvements or alternatives to existing approaches.
Balancing Theory and Practice
While implementing systems is invaluable, deep theoretical knowledge provides the foundation for good design decisions. The most powerful learning happens when you combine both:
- Read the literature: Papers like "The Log: What every software engineer should know about real-time data's unifying abstraction" by Jay Kreps provide critical insights
- Study distributed systems theory: Understanding concepts like consensus algorithms, consistency models, and failure modes
- Analyze existing systems: Review the architecture of systems like Kafka, RabbitMQ, and NATS
- Build your own implementation: Start simple and gradually add features as you learn
Contributing to MEGA
MEGA is an open-source project that welcomes contributions from the community. There are many ways to get involved:
- Client Libraries: Implement the protocol in different programming languages
- Feature Extensions: Add new capabilities like persistence or partitioning
- Documentation: Improve explanations and examples
- Testing: Create robust test suites and benchmarks
By contributing, you not only improve your own understanding but also help others learn these important concepts.
If you found this article helpful, consider exploring my educational project MEGA on GitHub, which implements these concepts in a simplified form designed for learning.
Watch the MEGA code review in Arabic. The best way to understand these systems is to combine theoretical study with hands-on implementation.
Follow me on Twitter @Z_MEGATR0N for more updates