Blog

The Validation Tool That Couldn't Validate Itself

I've been building Haytham for months. It's a Claude Code plugin that orchestrates AI agents to validate startup ideas, scope MVPs, design architecture, and generate specs. Eight agents, four phases, structured handoffs, the whole thing.

A few weeks ago I had a long conversation with someone who poked holes in the project until I couldn't patch them anymore. The conclusion: Haytham has an identity crisis. It presents as a validation tool, but the actual vision is a lifecycle control plane for AI-built products. The validation pipeline is phase one of a three-phase vision that only I can see. Everyone else sees a validation tool in a crowded market.

That conversation surfaced a specific finding I couldn't shake: the product I'm building (Genesis, the validation pipeline) isn't the real product. The real product is Evolution, the phase where the reasoning graph I'm building gets used to handle change requests without full rewrites. Without Evolution, Genesis is a well-structured validation tool competing with ChatPRD and ValidatorAI. With Evolution, it's something genuinely new.

So I did the obvious thing. I ran /haytham:validate on Haytham's own idea and compared what the tool found to what that conversation found.

I Fed My AI Pipeline Its Own Idea

Haytham is a Claude Code plugin that takes a startup idea through four phases: market research, MVP scoping, architecture decisions, and spec generation. Eight agents, structured handoffs, validation hooks. I've been building it for months.

At some point I had to try the obvious thing: feed Haytham its own idea and see what comes out.

So I typed this into a fresh Claude Code session: "An open source Claude Code plugin that takes an Idea, researches it, provides recommendations on the research and competitors, proposes a MVP, suggests architecture, produce standardised specs and finally builds the system." Then I sat back and let it run.

Why You Should Ship Your Agentic Workflow as a Claude Code Plugin

I'm building an agentic workflow that takes a startup idea, does market research, drafts an MVP, and generates a spec detailed enough to hand straight to a coding agent.

I started out with a full-fledged agentic framework (Strands SDK). That gives you a lot of flexibility and control over the agents, but it also comes with a lot of overhead. If the overhead had fallen on the system alone, that would have been fine, but it fell on the user too.

My goal was to validate the idea with users, which, ironically, is exactly what the product itself does. In my test runs it worked great, but I didn't realise the setup was a deal breaker for most people. That made me rethink my approach, and I decided to sacrifice the bells and whistles for something that can be tested with a single prompt (if you already have Claude Code).

If you're building AI-powered developer tools, the trade-offs below might save you some time.

From Startup Idea to Agent-Ready Spec in 20 Minutes

You describe a startup idea. Twenty minutes later, you have a validated OpenSpec directory tree, complete with SHALL requirements, Gherkin acceptance criteria, and architecture decisions. No hand-writing specs. No prompt engineering. Point Claude Code (or Cursor, or Copilot) at it and start building.
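To make that concrete, here is a hypothetical fragment of what such a spec might contain. The file path, requirement name, and scenario are invented for illustration; they follow the general shape of SHALL requirements paired with Gherkin acceptance criteria, not Haytham's actual output layout.

```gherkin
# specs/payments/escrow.md  (illustrative path, not the real directory tree)

## Requirement: Escrow Payment Release
The system SHALL hold buyer funds in escrow until delivery is confirmed.

### Scenario: Buyer confirms delivery
  Given an order with funds held in escrow
  When the buyer confirms delivery
  Then the funds SHALL be released to the seller
```

A coding agent pointed at a tree of files like this has requirements it can implement and acceptance criteria it can turn into tests, with no extra prompt engineering in between.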

That's the workflow Haytham delivers. This post is about why it matters and what the output actually looks like.

Your Agents Are Playing Telephone


The Case for Decomposition

There are good reasons to split a complex task across multiple agents. A single agent trying to research a market, design an architecture, and write specs all at once will hit context limits, lose focus, and give you no chance to review between phases.

Decomposition buys you three things: specialization (each agent does one thing well), decision gates (you review before the next phase runs), and cost control (change one phase without re-running everything).
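A minimal sketch of those three properties in code. The phase functions, the `Handoff` dataclass, and the `constraints` field are all illustrative assumptions, not Haytham's API; the point is that structured handoffs can thread constraints through every phase verbatim, with a gate between phases where a human (or an assertion) can catch them being dropped.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Structured artifact passed between pipeline phases."""
    summary: str
    # Carried forward verbatim so no phase can smooth them away.
    constraints: list[str] = field(default_factory=list)


def research_phase(idea: str) -> Handoff:
    # A real pipeline would call an LLM here; this stub just shows the shape.
    return Handoff(
        summary=f"market research for: {idea}",
        constraints=["invite-only", "escrow payments", "max 500 sellers"],
    )


def planning_phase(prev: Handoff) -> Handoff:
    # Specialization: this phase only plans. Constraints are threaded
    # through explicitly instead of being re-summarized from prose.
    return Handoff(
        summary=f"MVP plan based on: {prev.summary}",
        constraints=prev.constraints,
    )


def gate(handoff: Handoff) -> Handoff:
    # Decision gate: review the artifact before the next phase runs.
    # Re-running from this point costs one phase, not the whole pipeline.
    assert handoff.constraints, "constraints were dropped upstream"
    return handoff


plan = gate(planning_phase(gate(research_phase("vintage furniture marketplace"))))
print(plan.constraints)  # the distinctive constraints survive every handoff
```

The design choice that matters is the explicit `constraints` field: each agent is free to rewrite the summary, but the input's distinctive requirements travel as structured data that no intermediate summary can erase.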

So you split the work. Research agent, planning agent, design agent, implementation agent. Each one focused and manageable.

But give the pipeline a specific, nuanced input. Something like "build an invite-only marketplace for vintage furniture restorers, with escrow payments, max 500 sellers at launch." What comes out the other end is a generic two-sided marketplace with open signup, Stripe checkout, and infinite scalability. Every distinctive constraint, the things that made the input yours, got smoothed away by agents who each did their job perfectly in isolation.

This is the telephone game, except the players are LLMs and the message is your system.