The SkillsBench Study: Why Your AI Agent Can't Write Its Own Skills

New research shows self-generated AI skills provide zero benefit — but curated skills boost performance by 16 percentage points. Here's what that means for developers.

Last week, a fascinating paper dropped on arXiv that should make every developer reconsider how they’re building AI agents. The researchers tested a simple question: Can AI agents write their own skills — and do those skills actually help?

The answer? Nope. And honestly, I’m not surprised.

The Background

If you’ve been following the AI agent space, you’ve probably heard about “skills” — those structured packages of procedural knowledge that augment LLM agents at inference time. Think of them as little capability modules your agent can pull from when needed.

The hype around self-generating these skills has been building. The idea is seductive: your AI figures out what it needs, writes its own skills, and suddenly gets better at tasks without you having to manually program everything.

But there’s just one problem — nobody had actually measured whether any of this worked.

What SkillsBench Found

The researchers built SkillsBench, a benchmark with 86 tasks across 11 domains. They tested three conditions: no skills, curated skills (written by humans), and self-generated skills (written by the AI itself).

The results are stark:

Curated skills raised the average pass rate by 16.2 percentage points. That's significant — we're talking meaningful improvements on real tasks.

Self-generated skills provide zero benefit on average. The models literally cannot reliably author the procedural knowledge they benefit from consuming.

Let me say that again: The same model that benefits from using skills cannot write useful skills itself. It’s like asking a student to create their own textbook and then learn from it — somehow, they do better using a book written by someone else.

The Domain Variance Is Fascinating

What really caught my eye was how much the benefit varies by domain:

  • Healthcare: +51.9pp — Skills are incredibly valuable here
  • Software Engineering: +4.5pp — Barely better than nothing

Sixteen of the benchmark's 86 tasks actually showed negative deltas with skills. That's right — sometimes adding skills made things worse.

My take? In healthcare, there’s clear, structured procedural knowledge. In software engineering, every project is a snowflake with different requirements. The “best practices” that work in one codebase might be anti-patterns in another.
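
To make that variance concrete, here's a toy delta calculation. The pass rates below are illustrative placeholders, not the paper's numbers — only the two deltas (+51.9pp and +4.5pp) match what was reported, and the negative-delta domain is entirely made up:

```python
# Toy per-domain delta calculation. Base/skilled pass rates are
# invented for illustration; only the healthcare and software
# engineering DELTAS match the paper's reported figures.
pass_rates = {
    # domain: (without_skills, with_curated_skills), in percent
    "healthcare": (30.0, 81.9),
    "software_engineering": (50.0, 54.5),
    "hypothetical_domain": (60.0, 55.0),  # example of a negative delta
}

for domain, (base, skilled) in pass_rates.items():
    delta = skilled - base
    flag = "  <-- skills hurt here" if delta < 0 else ""
    print(f"{domain}: {delta:+.1f}pp{flag}")
```

Running this per domain (instead of only looking at the overall average) is exactly how you'd catch the tasks where skills backfire.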

What This Means For You

If you’re building AI agents today, here’s the practical reality:

Don’t waste time on self-generating skills. The research is clear — it’s not working. Your AI isn’t going to bootstrap itself into competence.

Invest in curated, focused skills. The paper found that 2-3 module “focused” skills beat comprehensive documentation. Think narrow, specific capabilities rather than massive prompt libraries.
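
The paper doesn't prescribe a skill format, but here's a minimal sketch of what a focused, few-module skill could look like as a structured object rendered into an agent's prompt. All names and the rendering scheme are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SkillModule:
    """One narrow piece of procedural knowledge."""
    name: str
    instructions: str

@dataclass
class Skill:
    """A focused skill: a short, named bundle of 2-3 modules."""
    name: str
    modules: list[SkillModule]

    def render(self) -> str:
        """Render the skill as a prompt fragment for the agent."""
        parts = [f"## Skill: {self.name}"]
        for m in self.modules:
            parts.append(f"### {m.name}\n{m.instructions}")
        return "\n\n".join(parts)

# A narrow, focused skill -- not "write good code"
sql_skill = Skill(
    name="validate-sql-injection-patterns",
    modules=[
        SkillModule(
            name="Identify risky string interpolation",
            instructions="Flag any query built via f-strings or '+' concatenation.",
        ),
        SkillModule(
            name="Recommend parameterized queries",
            instructions="Rewrite flagged queries to use bound parameters.",
        ),
    ],
)

print(sql_skill.render())
```

Two modules, one narrow capability — the opposite of a 50-page "comprehensive" document.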

Small models + good skills can match large models. This is big. A smaller model with the right skills can perform as well as a larger model without them. That means you might not need to pay for GPT-5 when a well-equipped GPT-4 does the job.

My Experience

I’ve been playing with Claude Code and similar agentic tools, and honestly? The best results come from writing my own task-specific instructions. I’ve wasted hours trying to get agents to “figure out” their own workflows — they just end up creating convoluted, circular patterns that don’t generalize.

The moment I sat down and wrote clear, specific skill definitions? Things actually worked.

The Bigger Picture

This research lands at an interesting time. We’re seeing a massive push toward “agentic AI” — AI that can do things autonomously. Companies are pouring billions into making agents more capable.

But here’s what worries me: We’re optimizing for the wrong thing. Making agents more autonomous doesn’t make them better. The SkillsBench study shows that hand-crafted human knowledge still beats AI-generated knowledge, even in 2026.

Maybe the future isn’t about making AI write its own skills. Maybe it’s about creating better tools for humans to write better skills.

Common Mistakes I’m Seeing

  1. Prompt libraries instead of skills — People dump 50 prompts and think they have an agent. That’s not a skill system; that’s a mess.

  2. Too broad skills — “Write good code” isn’t a skill. “Validate SQL injection patterns” is.

  3. Ignoring the domain — Not every task benefits from skills. Software engineering showed minimal gains — maybe your use case doesn’t either.

Next Steps

If you’re building with AI agents:

  • Audit your current skills. Are they focused (2-3 modules) or comprehensive (50+ pages)?
  • Test with and without skills on your specific tasks. The variance across domains means your mileage may vary.
  • Invest time in writing your own skills rather than auto-generating them.

The era of “AI builds itself” isn’t here yet. But the era of “humans build better AI skills” definitely is.

What do you think? Are you seeing similar results with self-generated skills? Drop a comment — I’d love to hear about your experience.

Bittalks

Developer and tech enthusiast exploring the intersection of open source, AI, and modern software development.