@mayaanderssondev

mayaandersson

Just a bored curious dev

Palo Alto, CA · Joined May 2026

mayaandersson's blogs

Your LLM-as-judge eval set is too small. Here is the math.

llmasajudge.hashnode.dev · 7 posts

Recently published

mayaandersson · llmasajudge.hashnode.dev

I benchmarked 6 prompt-optimization frameworks on the same task. Here is what each one actually optimizes.

23h ago · 4 min read · TL;DR: I ran six prompt-optimization frameworks against the same task and the same eval metric over a few weeks. They are not interchangeable: some are full programming models, some are single search

LLM-as-judge tools compared: the question is not which one scores, it is which one you can trust

3d ago · 3 min read · TL;DR: I compared the main LLM-as-judge tools (DeepEval's G-Eval, Confident AI, Evidently, Braintrust, Promptfoo, and MLflow) on the axis that actually decides whether the scores mean anything: how we

Stratified sampling for LLM eval sets: why your aggregate pass rate hides the regressions that matter

3d ago · 3 min read · TL;DR: A headline eval pass rate is an average over every kind of input your system sees, and averages hide the thing you most need to catch: a sharp regression in a small but important slice. If refu

We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of "trend"

Jun 11 · 2 min read · We track weekly agreement between an LLM judge and human labels (Cohen's kappa) on a sample of production traces. For three weeks the point estimates told a story: 0.55, then 0.49, then 0.44. The team

More eval traces will not stabilize your kappa. Stratify the ones you have

Jun 9 · 3 min read · TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63 week to week with no rubric change. First instinct was sample size, so we went from 50 weekly traces
