Using AI to run performance reviews: how we evaluate our team with Claude

Captain's log, stardate d162.y42/AB

Jordi Vendrell Farreny
Founder & COO

A few days ago, we shared how we updated our performance review templates to reflect what a great engineer looks like in the age of AI. That post was about what we evaluate. This one is about how we evaluate it, because the process itself has changed just as much as the criteria.

At MarsBased, quality has always been the thing we obsess over. Not just the code we ship (which, yes, we now write with AI), but everything around it: security, performance, DevOps capabilities, technical decision-making, and the soft skills that make a project actually work, such as communication, ownership, and reliability. When we review a Tech Lead, for example, we look at things like whether they can explain technical concepts to a non-technical client, whether they proactively detect security issues before production, whether their weekly reports contain real insight or just a list of tasks, and whether they're mentoring the people around them.

Evaluating all of that properly used to be expensive. You had to attend meetings, re-read reports, browse boards, collect feedback, and then try to assemble a fair picture from memory. Inevitably, recency bias crept in: the last two weeks weighed more than the previous five months.

Then we realised something: in a remote company, the evidence is already there.

A remote company is a fully documented company

We've been remote since day one, which means our work doesn't happen in an office but in tools, and tools keep records:

  • Meeting notes. Every project has notes for internal meetings and client meetings. Who said what, what was decided, what risks were raised.
  • Slack. Every project has a channel, some shared with clients. Daily updates, blockers, how people communicate under pressure.
  • Linear. Our boards hold the entire development history: issues with descriptions, user stories, estimates, comments, templates, and cycles. You can see how someone defines work, how they prioritise, how they handle deviations.
  • Google Drive. The weekly reports we send to clients live there, with information about budget deviations, project phases, deploys, and releases.
  • GitHub. Pull requests, reviews, the kind of errors that show up and how they get fixed.

On top of that, we have the human layer: client feedback, peer feedback, our own perception, and regular 1-on-1s where feedback flows in both directions and career plans take shape.

Put it all together and you have an enormous, honest dataset about how every person at MarsBased actually works. The problem was never lack of information. It was that no human had the time to read all of it.

Enter Claude, MCP, and very large context windows

Two things made this workflow possible recently.

The first is that all these tools now expose MCP servers. Linear, Google Drive, Slack, GitHub: Claude can connect to all of them and read the raw evidence directly, no exports, no copy-pasting.

The second is model capability. With the latest Claude models, like Opus 4.8 or the recently (almost) released Fable, the amount of context an LLM can work with is dramatically larger. Reviewing six months of meeting notes, weekly reports, board activity, and Slack threads for one person is no longer science fiction. It's a prompt.

So we built a performance review procedure on top of this. For each role (Tech Lead, Software Engineer, Product Manager) we have a review template with explicit criteria, and for each criterion, we've encoded how it should be measured into reusable Claude skills. One skill evaluates the quality of weekly client reports. Another checks whether someone has been sending their weekly highlights, by looking at the actual Linear issues. Another reviews how a PM manages their boards: hygiene, prioritisation, issue definition quality. Another goes through meeting notes to assess communication with clients and the team.

Each skill knows where the evidence lives, what the rating scale is, and what "exceeds expectations" versus "partially meets expectations" looks like for that specific item. We run them, and Claude returns a rating, a summary, and the part we care about most: the evidence. Concrete examples, quotes from reports, specific issues, and actual behaviours that justify the rating.
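As a rough illustration of the shape of that output, here is a minimal Python sketch. It is hypothetical and not the actual skill definitions: the class names, the four-level scale (the post only names two of its levels), and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Rating(IntEnum):
    # Assumed four-level scale; only "partially meets" and "exceeds"
    # are named in the post, the other two levels are guesses.
    DOES_NOT_MEET = 1
    PARTIALLY_MEETS = 2
    MEETS = 3
    EXCEEDS = 4


@dataclass
class Evidence:
    source: str   # where it was found, e.g. "Linear", "Slack", "Google Drive"
    excerpt: str  # quote or description of the observed behaviour


@dataclass
class SkillResult:
    criterion: str
    rating: Rating
    summary: str
    evidence: list[Evidence] = field(default_factory=list)

    def is_reviewable(self) -> bool:
        """A rating only counts if it has a summary and at least one piece of evidence."""
        return bool(self.summary.strip()) and len(self.evidence) > 0


# Made-up sample values, purely for illustration.
result = SkillResult(
    criterion="Quality of weekly client reports",
    rating=Rating.MEETS,
    summary="(placeholder summary)",
    evidence=[Evidence(source="Google Drive", excerpt="(placeholder quote from a report)")],
)
print(result.is_reviewable())
```

Treating evidence as a required part of the result, rather than an optional extra, mirrors the point above: the rating is only as useful as the examples that justify it.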

The criteria matter more than the AI

If there's one lesson from setting this up, it's this: the AI is the easy part. The hard part is having evaluation criteria that are genuinely clear.

When the criterion is vague ("communicates well"), the AI's output is vague too. When the criterion is precise ("provides accurate and qualitative information in weekly reports, not just a list of tasks", and "measured by reviewing reports and daily messages"), the AI can actually do the job, and do it across all the reports, not just the ones the reviewer happened to remember.

This forced us to refine our templates. Every item now has a description, a stakeholder, and an explicit measurement method. That exercise alone made our reviews better, AI or not.
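One way to picture such a template item is as a small record that refuses to be half-filled. The sketch below is hypothetical Python, not the real templates: the field names follow the three elements mentioned above, and the "Client" stakeholder value is an assumption added for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str
    stakeholder: str
    measurement: str

    def missing_fields(self) -> list[str]:
        """Return the names of fields left empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# A vague item: nothing says who judges it or how it is measured.
vague = Criterion(
    name="Communication",
    description="Communicates well",
    stakeholder="",
    measurement="",
)

# A precise item, using the wording quoted earlier in this post.
precise = Criterion(
    name="Weekly reports",
    description="Provides accurate and qualitative information in weekly reports, not just a list of tasks",
    stakeholder="Client",  # assumed for illustration
    measurement="Reviewing reports and daily messages",
)

print(vague.missing_fields())
print(precise.missing_fields())
```

A check like this only catches empty fields, not vague ones. Deciding whether a description is precise enough is still a writing exercise for humans.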

What this does NOT replace

To be clear: the AI doesn't decide anyone's review. The output is a draft backed by evidence, and a human, the reviewer, validates it, challenges it, and combines it with the things that don't live in any tool: client feedback, peer feedback, what comes up in 1-on-1s, and plain human judgment.
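The same rule can be expressed as a data constraint: an item has no final rating until a reviewer has confirmed or overridden the draft. This is a hypothetical Python sketch under that assumption. The names `DraftItem`, `confirm`, and `override` are invented for the example and do not describe the actual tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftItem:
    criterion: str
    ai_rating: str                         # the draft rating produced by the model
    reviewer_rating: Optional[str] = None  # set only by the human reviewer
    reviewer_note: str = ""

    @property
    def final_rating(self) -> str:
        """The final rating exists only after a human has validated the draft."""
        if self.reviewer_rating is None:
            raise ValueError(f"'{self.criterion}' has not been validated by a reviewer")
        return self.reviewer_rating


def confirm(item: DraftItem, note: str = "") -> None:
    """Reviewer agrees with the draft rating."""
    item.reviewer_rating = item.ai_rating
    item.reviewer_note = note


def override(item: DraftItem, rating: str, note: str) -> None:
    """Reviewer replaces the draft rating; a justification is required."""
    if not note.strip():
        raise ValueError("an override needs a justification")
    item.reviewer_rating = rating
    item.reviewer_note = note


item = DraftItem(criterion="Weekly reports", ai_rating="meets expectations")
override(item, "exceeds expectations", "Client feedback from 1-on-1s not visible in any tool")
print(item.final_rating)
```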

What AI removes is the administrative burden and the recency bias. What stays human is the conversation, the career planning, and the final call. It's the same philosophy we follow when using Claude for product management: the model handles the mechanical reading, the human provides direction and judgment.

My take

Performance reviews have always suffered from a paradox: the more thorough they are, the more expensive they become, so most companies settle for impressions. In a remote, fully digitalised company, that trade-off is gone. The evidence exists, and AI can finally read it at scale.

For us, this isn't about surveilling people. It's about fairness and quality. Reviews based on six months of actual evidence are fairer than reviews based on the loudest recent memory. And a team that knows the bar is clear, measurable, and consistently applied is a team that trusts the process.

If you're running a remote team, you're probably sitting on the same dataset we were. The tools have MCP servers. The models can handle the context. The only thing missing is writing down, precisely, what "good" means at your company.
