Active Work

LLM Eval Harness

TypeScript (npm workspaces) · Express + Prisma (PostgreSQL) · React 18 + Vite · Python FastAPI (scorer) · Bull (job queue) · Docker Compose

About this project

Self-hosted LLM evaluation platform for side-by-side model comparison, automated scoring, and experiment tracking. TypeScript monorepo with an Express API, a React 18 frontend, Prisma ORM, and a Python FastAPI scorer microservice for ROUGE and BERTScore metrics. Run progress streams live to the browser over Server-Sent Events (SSE).

Background

When choosing a language model for a given enterprise task, the only reliable test is to run the candidates on your actual data. Generic benchmarks tell you something about model capability in the abstract, but they don't tell you which model performs best on your specific corpus with your specific prompts. The LLM Eval Harness was built to make that comparative evaluation systematic rather than ad hoc.

The architecture reflects a few practical constraints. The scoring layer is Python because the libraries it relies on for ROUGE and BERTScore (rouge-score, bert-score) are Python packages. The API and frontend are TypeScript because that's where most of the team is comfortable and where the build tooling is mature. Keeping all three as packages in one npm workspaces monorepo gives clean boundaries without requiring a full microservices deployment: you run them together with Docker Compose.
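One consequence of the language split is that the API cannot share types with the scorer, so whatever JSON comes back has to be checked at runtime. The sketch below shows one way that boundary check might look. The ScoreResponse shape, the metric keys, and the parseScoreResponse helper are assumptions made for illustration, not the project's actual contract.

```typescript
// Hypothetical response shape for the scorer service (an assumption, not the real contract).
interface ScoreResponse {
  scores: Record<string, number>; // e.g. { rouge1: 0.5, rougeL: 0.25 }
}

// Validate untyped JSON from the scorer before it reaches the rest of the API.
function parseScoreResponse(raw: unknown): ScoreResponse {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("scorer response must be a JSON object");
  }
  const scores = (raw as { scores?: unknown }).scores;
  if (typeof scores !== "object" || scores === null || Array.isArray(scores)) {
    throw new Error("scorer response is missing a 'scores' object");
  }
  const parsed: Record<string, number> = {};
  for (const [metric, value] of Object.entries(scores)) {
    // Reject strings, nulls, NaN and Infinity so only usable scores are stored.
    if (typeof value !== "number" || !Number.isFinite(value)) {
      throw new Error(`score for '${metric}' is not a finite number`);
    }
    parsed[metric] = value;
  }
  return { scores: parsed };
}

const parsedExample = parseScoreResponse(
  JSON.parse('{"scores":{"rouge1":0.5,"rougeL":0.25}}')
);
```

Failing loudly here keeps a malformed scorer reply from being written into a result's JSON score payload.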

The SSE-driven live run progress was a deliberate UX choice. Evaluation runs can take minutes when you're running multiple models against a large dataset. Showing progress in real time (model by model, prompt by prompt) keeps the experience from feeling like a black box. Polling would have been simpler to implement, but it forces a trade-off: a long polling interval makes progress lag, and a short one loads the server with redundant requests.
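As a rough illustration of what such a stream carries, the sketch below serialises one progress update as an SSE frame. The RunProgressEvent fields, the "progress" event name, and the formatSseFrame helper are hypothetical. Only the wire format itself (id, event, and data lines ended by a blank line) comes from the SSE specification.

```typescript
// Hypothetical progress payload; the field names are assumptions for illustration.
interface RunProgressEvent {
  runId: string;
  model: string;
  completed: number; // prompts scored so far
  total: number; // prompts in the dataset
}

// Serialise one update as a Server-Sent Events frame.
// Each field sits on its own line, and a blank line terminates the frame.
function formatSseFrame(
  eventName: string,
  payload: RunProgressEvent,
  id?: number
): string {
  const lines: string[] = [];
  if (id !== undefined) {
    lines.push(`id: ${id}`);
  }
  lines.push(`event: ${eventName}`);
  // JSON.stringify without an indent argument emits a single line,
  // so one data: field is enough.
  lines.push(`data: ${JSON.stringify(payload)}`);
  return lines.join("\n") + "\n\n";
}

const exampleFrame = formatSseFrame(
  "progress",
  { runId: "run-1", model: "model-a", completed: 3, total: 10 },
  7
);
```

On the server, a handler would typically respond with Content-Type: text/event-stream and write one such frame per update. In the browser, an EventSource can subscribe with addEventListener("progress", ...).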

The Prisma schema is normalised around four domain objects: a User creates Experiments, each Experiment has multiple Runs, and each Run produces one Result per dataset item, with the scores stored as a JSON payload. That structure makes it straightforward to query "which model performed best on this experiment" or "how has model X's performance changed over time" without schema gymnastics.
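To make the first of those queries concrete, here is a minimal in-memory sketch of the same hierarchy. The RunRow and ResultRow shapes, their field names, and the rougeL metric key are assumptions for illustration. They are not the project's Prisma models, and a real implementation would load the rows through Prisma rather than from literals.

```typescript
// Hypothetical row shapes mirroring Experiment → Run → Result.
// Field names are assumptions, not the project's actual schema.
interface ResultRow {
  datasetItemId: string;
  scores: Record<string, number>; // JSON score payload, e.g. { rougeL: 0.5 }
}

interface RunRow {
  id: string;
  experimentId: string;
  model: string;
  results: ResultRow[];
}

// Mean of one metric across a run's results; items missing the metric are skipped.
function meanScore(run: RunRow, metric: string): number | undefined {
  const values = run.results
    .map((r) => r.scores[metric])
    .filter((v): v is number => typeof v === "number");
  if (values.length === 0) return undefined;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// "Which model performed best on this experiment?" for a given metric.
function bestModel(
  runs: RunRow[],
  experimentId: string,
  metric: string
): string | undefined {
  let best: { model: string; mean: number } | undefined;
  for (const run of runs) {
    if (run.experimentId !== experimentId) continue;
    const mean = meanScore(run, metric);
    if (mean !== undefined && (best === undefined || mean > best.mean)) {
      best = { model: run.model, mean };
    }
  }
  return best?.model;
}

const runA: RunRow = {
  id: "run-a",
  experimentId: "exp-1",
  model: "model-a",
  results: [
    { datasetItemId: "item-1", scores: { rougeL: 0.5 } },
    { datasetItemId: "item-2", scores: { rougeL: 0.25 } },
  ],
};
const runB: RunRow = {
  id: "run-b",
  experimentId: "exp-1",
  model: "model-b",
  results: [
    { datasetItemId: "item-1", scores: { rougeL: 0.5 } },
    { datasetItemId: "item-2", scores: { rougeL: 0.5 } },
  ],
};

const winner = bestModel([runA, runB], "exp-1", "rougeL");
```

Because every Result hangs off a Run and every Run off an Experiment, the comparison reduces to a filter on experimentId followed by an aggregate per run.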

Highlights

  • npm workspaces monorepo: packages/api, packages/web, packages/scorer
  • Prisma schema: User → Experiment → Run → Result with dataset items and JSON scores
  • SSE endpoint streams live run progress to React frontend without polling
  • Python scorer microservice: ROUGE-1/2/L via rouge-score, BERTScore via bert-score
  • Supports OpenAI, Anthropic, Google, and Ollama model providers