How we evaluate software engineers in the age of AI

Captain's log, stardate d145.y42/AB

Jordi Vendrell Farreny
Founder & COO
At MarsBased, we've spent the last year or so rethinking what good looks like for a software engineer. Not because the fundamentals changed (clean code, solid architecture, and clear communication still matter), but because the baseline has shifted. AI tools have raised the floor, and that changes what we pay attention to.

This week, we finished updating our performance review templates for all roles at the company. It felt like the right moment to share what we actually evaluate, and why we think the technical bar in 2025 looks different from what it was even two years ago.

The fundamentals haven't gone away

Before talking about AI, let's be clear: the things that made a great software engineer in 2019 still matter. A lot.

We still evaluate delivery speed and volume: whether someone can ship a proportional amount of work for the hours they dedicate, manage multiple tasks in parallel, and stay close to their estimates without needing constant overtime or justification. We still look at code quality: no recurring security issues, no unnecessary technical debt, and a modular, maintainable architecture.

We also evaluate something harder to measure: project ownership. Can the engineer answer technical and functional questions about decisions made in the project? Do they understand not just what they built, but why? This is evaluated informally in meetings and 1-on-1s, and it's a reliable signal of whether someone is genuinely invested or just shipping tickets.

Bug rate matters too. A low bug rate isn't about perfection; it's about whether the engineer is thinking through edge cases, writing code they're confident in, and not generating rework for the rest of the team.

None of that has changed. What's changed is that we now expect all of this to happen with AI as part of the workflow.

AI adoption is now a first-class evaluation criterion

We added two specific items to our technical evaluation that didn't exist before:

Uses AI for code writing. We evaluate whether engineers are actively using AI tools in line with company policies and best practices. In our case that means Claude Code, which we've standardised on as a team. This isn't about whether they've installed it. It's about whether AI is genuinely integrated into their daily workflow: writing, reviewing, refactoring, testing.

Claude Code proficiency. Beyond basic usage, we look at deeper fluency: working with agents, managing context effectively, understanding how to structure prompts and tasks for reliable output, using advanced features that actually move the needle on productivity. This is evaluated through conversation in 1-on-1s, not via some arbitrary benchmark.

Why make this explicit? Because we've seen two failure modes. The first is engineers who ignore AI tools entirely and fall behind in throughput. The second, more subtle and arguably more dangerous, is engineers who use AI indiscriminately: blindly accepting output and generating code they don't understand and can't maintain. Neither is acceptable.

The standard we hold our team to is AI-augmented judgment: using AI to go faster and tackle more, while retaining full ownership of the output.

What AI hasn't replaced

When you look at the rest of our evaluation criteria, something becomes clear: most of what we value is deeply human.

Communication, for example, remains one of the most important dimensions. We evaluate daily Slack updates, how engineers document tasks in Linear, whether they proactively ask for help when blocked, and the quality of their written communication overall. AI can help you write better, but it can't replace the judgment of knowing when to escalate a blocker, or the habit of leaving enough context in a ticket for a colleague to understand it three weeks later.

The same goes for proposing and driving technical improvements. We look for engineers who don't just execute tickets but identify refactors, flag issues before they become problems, and push projects forward beyond the immediate scope. This is initiative, not output, and it's something AI tools don't generate on their own.

A note on how we measure all of this

Performance reviews at MarsBased involve multiple stakeholders: the Product Manager evaluates communication and delivery from a project perspective, the Tech Lead assesses technical quality and architecture, and HR looks at culture and commitment. Each criterion has a rating scale that goes from "does not meet expectations" to "exceeds expectations," with qualitative comments required alongside the score.

We deliberately avoid relying on proxy metrics such as commits, PR count, and story points, because they're gameable and don't reflect actual value delivered. Instead, we invest in ongoing observation: attending meetings, reading updates, reviewing code, and having direct conversations about how people work.

Why we are sharing this

Partly because we think transparency about how we evaluate our team makes us a better employer and a more trustworthy partner for clients. If you're working with MarsBased, this is what you're getting: engineers who are held to a clear, well-defined standard that includes AI fluency alongside the fundamentals.

But also because we believe the industry is still figuring out what "good" looks like in this new context. There's no shortage of takes on AI and software engineering, from "AI will replace developers" to "nothing has changed, ship the code." Our view is more practical: AI is a tool that the best engineers use well, and using it well is a skill that needs to be learned, practised, and yes, evaluated.

We don't have all the answers. But we've built a framework that works for us, and we're happy to share it.