tahnik@portfolio:~$ project-detail

ServeLoop - Minimal-Scale LLM Inference Engine

A compact open-source LLM serving engine implementing continuous batching, paged KV-cache, and GPU scheduling optimization with mixed precision (BF16/FP8) support — built to study scheduler behavior, memory efficiency, and tail-latency trade-offs in inference workloads.

Project Category

AI Engineering

Tech Stacks:

ServeLoop is an open-source, minimal-scale LLM inference engine built from scratch to make the internals of model serving readable, modifiable, and measurable. Rather than wrapping an existing runtime, ServeLoop implements the core serving stack — request lifecycle, batching, KV-cache management, and GPU scheduling — as clear, isolated components designed for learning and research.

Core Architecture

The engine is structured around four main subsystems:

  • Request Manager: Handles intake, sequence tracking, and priority assignment for incoming inference requests. Manages the full lifecycle from arrival through completion, including preemption and cancellation.

  • Continuous Batching Scheduler: Forms micro-batches across active requests at each decode step instead of running a static batch until every sequence finishes. New requests can join and completed requests can exit mid-batch, improving GPU utilization and reducing time-to-first-token.

  • Paged KV-Cache Manager: Allocates and frees cache blocks using a block table approach inspired by virtual memory systems. Avoids the fragmentation and over-allocation problems of contiguous KV-cache layouts, enabling higher concurrency per GPU.

  • Model Execution Loop: Orchestrates prefill and decode steps, coordinating between the scheduler and cache manager to execute forward passes efficiently.
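The block-table idea behind the Paged KV-Cache Manager can be sketched in a few lines. This is a minimal illustration, not ServeLoop's actual implementation; the class and method names (`BlockAllocator`, `append_token`, `free`) are hypothetical, and real engines add reference counting, swapping, and prefix sharing on top.

```python
from dataclasses import dataclass, field

@dataclass
class BlockAllocator:
    """Toy paged KV-cache: a fixed pool of blocks plus a per-sequence block table."""
    num_blocks: int
    block_size: int  # tokens stored per cache block
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> [block ids]

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: int, position: int) -> int:
        """Ensure a block exists for this token position; return its block id."""
        table = self.block_tables.setdefault(seq_id, [])
        # Grab a new block only when the sequence crosses a block boundary,
        # so memory grows in fixed-size pages instead of one contiguous slab.
        if position % self.block_size == 0 and position // self.block_size == len(table):
            if not self.free_blocks:
                raise MemoryError("cache exhausted; scheduler should preempt")
            table.append(self.free_blocks.pop())
        return table[position // self.block_size]

    def free(self, seq_id: int) -> None:
        """Return all of a sequence's blocks to the pool on completion or preemption."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Because blocks are uniform and non-contiguous, a finished sequence's memory is immediately reusable by any other sequence, which is what avoids the fragmentation of contiguous layouts.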

GPU Scheduling Optimization

ServeLoop includes a dedicated GPU scheduling layer designed to study the decisions that sit beneath production inference systems. This covers batch formation strategies (how requests are grouped for execution), memory-aware scheduling (ensuring KV-cache pressure doesn't force unnecessary evictions), and priority-based preemption (allowing latency-sensitive requests to jump the queue when GPU resources are constrained).
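Priority-based preemption of the kind described above can be illustrated with a small heap-based admission sketch. This is an assumption-laden toy, not ServeLoop's scheduler: the names (`PriorityScheduler`, `admit`, `gpu_slots`) are invented for illustration, and a real scheduler would also weigh KV-cache pressure and swap or recompute the evicted request's cache.

```python
import heapq

class PriorityScheduler:
    """Toy preemptive admission: lower priority number = more latency-sensitive."""

    def __init__(self, gpu_slots: int):
        self.gpu_slots = gpu_slots
        # Max-heap via negated priority, so the least-urgent running request
        # sits at the root and is the first candidate for eviction.
        self.running: list = []
        self.waiting: list = []

    def admit(self, req_id: str, priority: int):
        """Try to admit a request; return the id of a preempted request, if any."""
        if len(self.running) < self.gpu_slots:
            heapq.heappush(self.running, (-priority, req_id))
            return None
        worst_neg_prio, worst_id = self.running[0]
        if -worst_neg_prio > priority:
            # The newcomer is more urgent: evict the least-urgent running
            # request (its cache blocks would be freed or swapped out).
            heapq.heapreplace(self.running, (-priority, req_id))
            self.waiting.append(worst_id)
            return worst_id
        self.waiting.append(req_id)  # not urgent enough; queue it
        return None
```

The heap makes the eviction decision O(log n) per arrival, which is why priority queues are a common substrate for this kind of scheduling.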

Mixed Precision Support

The engine implements BF16 and FP8 precision modes to study throughput-vs-accuracy trade-offs in serving workloads. BF16 provides a practical balance for most inference tasks, while FP8 support enables experimentation with aggressive quantization for throughput-critical scenarios — useful for understanding where precision loss becomes unacceptable and how it interacts with different model architectures.
Observability

ServeLoop exposes internal metrics at each stage: queue depth, batch sizes, cache utilization, scheduling decisions, and per-request latency breakdowns (queue wait, prefill, decode). This makes it possible to trace exactly why a request was slow, where memory pressure built up, or how a scheduling change affected tail latency.
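The per-request latency breakdown mentioned above amounts to recording a few timestamps per request and differencing them. A minimal sketch, with hypothetical field names (`RequestTrace`, `breakdown`) not taken from ServeLoop:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Per-request timestamps (seconds) for a latency breakdown."""
    arrived: float        # request hit the queue
    prefill_start: float  # scheduler admitted it to a batch
    first_token: float    # prefill finished, first token emitted
    finished: float       # final token emitted

    def breakdown(self) -> dict:
        return {
            "queue_wait": self.prefill_start - self.arrived,
            "prefill": self.first_token - self.prefill_start,
            "decode": self.finished - self.first_token,
        }
```

Aggregating these dictionaries across requests is enough to attribute a tail-latency regression to queueing, prefill, or decode rather than guessing.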
ServeLoop is being developed as a standalone OSS project. The GitHub repository will be published soon.

SentinelOps - SRE CLI with Agentic Workflow and eBPF

A CLI-based AI SRE tool that combines eBPF telemetry with agent workflows to monitor, diagnose, and remediate issues in Kubernetes and cloud-native systems.

Project Category

AI Engineering

Tech Stacks:

PyTorch

Kubernetes

FastAPI

TypeScript

© Tahnik Ahmed | 2026