
Captain's log, stardate d100.y42/AB

Xavier Redó
Founder & CTO

# How we safely upgraded a legacy Rails app with limited tests

Over the last few months, we’ve been busy modernizing one of the oldest Ruby on Rails applications we run at MarsBased.

And when I say "old", I don’t mean "a couple of years behind". I mean legacy in the way that hurts: big, monolithic, high-traffic, with years of features stacked on top of each other... and not enough automated tests to make a major upgrade feel safe.

We’re talking about ~150,000 lines of Ruby code (counting only Ruby files and excluding comments). By the way, if you ever need a quick, reasonably accurate way to measure project size across languages and folders, we recommend cloc. It runs on Linux, Windows and macOS, and it’s perfect for quick estimates when you want to ignore certain folders, focus on specific languages, and get a reality check before committing to a plan.
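For reference, an invocation like this one (paths are illustrative; the flags come from cloc's own options) gives you a per-language breakdown while skipping vendored code:

```shell
# Count only Ruby code, ignoring vendored and generated folders
cloc . --include-lang=Ruby --exclude-dir=vendor,node_modules,tmp
```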

## An upgrade we can't do piece by piece

This app is a monolith. No independently deployable services. No neat boundaries where we can upgrade one module, ship it, and move on to the next.

Upgrading Ruby and Rails meant upgrading the whole thing. And we had to do it with a relatively small team, under the constraints you'd expect from a production system: security, stability, performance, and lots of traffic from the outside world.

To make things more fun, the app has many edge cases, multiple content types, and a wide surface area where things can break in subtle ways. Like most mature systems, it also leans heavily on the exact parts of the framework and APIs that tend to change and be deprecated between versions.

So the question wasn’t "how do we upgrade Rails?" but "how do we upgrade Rails without relying on a test suite we don’t have?"

## Progressive rollout, not a big-bang deploy

We're lucky in one key aspect: we have strong control over the infrastructure this app runs on.

Instead of doing a classic "deploy day" where everything flips at once, we did a progressive release that gradually moved traffic to the new stack.

The high-level strategy was:

* Bring up a new set of servers running the upgraded Ruby/Rails version.
* Ensure data compatibility between old and new versions.
* Route traffic selectively to the new infrastructure, starting with internal QA and low-risk segments.
* Increase exposure gradually, sending crawlers and bots (Google, Bing, OpenAI, Perplexity, etc.) to the new infrastructure, while observing errors and performance.
* Move background processing and scheduled jobs.
* Flip the remaining traffic once the new setup was already carrying the majority of the load.

## Compatibility

We had to make sure both versions could live side-by-side. Rails upgrades can introduce changes in how data is persisted or represented. A very common example: fields that used to be serialized as YAML in older Rails versions are now often stored as JSON in newer ones.
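During the transition, rows written by the old servers still had to be readable by the new ones (and vice versa). One way to handle the YAML-to-JSON case is a dual-format coder that writes the new format but falls back to the old one when reading. This is a minimal sketch; the class name is made up, and a real model would wire it in with `serialize`:

```ruby
require "json"
require "yaml"

# Hypothetical dual-format coder: writes JSON (the new format) but can
# still read values persisted as YAML by the pre-upgrade servers.
class DualFormatCoder
  def self.dump(value)
    JSON.generate(value)
  end

  def self.load(raw)
    return nil if raw.nil?
    JSON.parse(raw)
  rescue JSON::ParserError
    # Fall back to YAML for rows written by the old Rails version.
    YAML.safe_load(raw, permitted_classes: [Symbol, Time])
  end
end

# In a model this could be wired up as:
#   serialize :preferences, coder: DualFormatCoder
```

Once all traffic runs on the new stack, a background migration can rewrite the remaining YAML rows and the fallback branch can be deleted.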

We also had to deal with session management and authentication libraries. Sessions might not be 100% compatible across versions, but we worked to make them as compatible as possible, minimizing user-facing issues while still moving forward.
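Depending on the versions involved, Rails itself offers a knob for part of this problem: the hybrid cookie serializer, which reads session cookies written in the legacy Marshal format and rewrites them as JSON on the way out.

```ruby
# config/initializers/cookies_serializer.rb
# :hybrid accepts cookies serialized with Marshal (older Rails defaults)
# and migrates them to JSON transparently, easing cross-version sessions.
Rails.application.config.action_dispatch.cookies_serializer = :hybrid
```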

## Routing traffic to the new infrastructure

Once the new servers were ready, we needed a clean way to test them in production without exposing regular users to risk.

We configured our reverse proxy (in our case, Varnish) to route requests differently depending on a cookie. That allowed our QA team to access the upgraded infrastructure transparently, while everyone else continued using the old one.

The important detail here: each environment had its own independent cache, so we didn’t end up with confusing cross-contamination between versions.
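In Varnish terms, the idea looks roughly like the sketch below. Backend names, addresses and the cookie name are all illustrative, and the `vcl_hash` addition is what keeps the two caches independent:

```vcl
vcl 4.0;

backend legacy   { .host = "10.0.0.10"; .port = "8080"; }
backend upgraded { .host = "10.0.0.20"; .port = "8080"; }

sub vcl_recv {
  # QA (and later, ramped-in users) carry the canary cookie.
  if (req.http.Cookie ~ "canary=1") {
    set req.backend_hint = upgraded;
  } else {
    set req.backend_hint = legacy;
  }
}

sub vcl_hash {
  # Partition the cache per backend so the two versions never
  # serve each other's cached objects.
  if (req.http.Cookie ~ "canary=1") {
    hash_data("upgraded");
  }
}
```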

## Using bots as a testing army

This application is content-heavy. It's essentially an online media platform with many public pages and multiple types of content. Which means that it gets crawled. A lot.

Bots hit the public side of the platform constantly, and they’re great at covering massive surface area repeatedly, every day.

So we did something that worked surprisingly well: we started routing a percentage of bot traffic to the new infrastructure. This gave us real-world coverage of the public-facing site and helped us uncover errors organically.
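The ramp-up logic can be as simple as a deterministic bucket per client: hash something stable (the client IP, say) into 0–99 and compare it against the current percentage. This is a hedged sketch; the helper, the bot pattern and the default percentage are all made up for illustration:

```ruby
require "digest"

# Illustrative bot pattern; a real list would be longer.
BOT_PATTERN = /googlebot|bingbot|gptbot|perplexitybot/i

# Decide whether a bot request should hit the new stack. Hashing the
# client IP keeps the decision stable for a given crawler instance.
def route_to_new_stack?(user_agent, client_ip, bot_percentage: 25)
  return false unless user_agent.to_s.match?(BOT_PATTERN)

  # Map the IP to a stable bucket in 0..99 and compare with the ramp level.
  bucket = Digest::MD5.hexdigest(client_ip).to_i(16) % 100
  bucket < bot_percentage
end
```

Raising `bot_percentage` over time moves more crawler traffic over without any crawler flapping between backends from one request to the next.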

Meanwhile, our team, together with the client's team, could focus on QA for the private, logged-in side of the application with production data.

## Expanding to real users: CMS staff first

The platform also includes a CMS used internally by the company’s employees. We progressively enabled the new infrastructure for CMS users by configuring their cookies to access the new servers, increasing the number of internal users running on the upgraded system.
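Conceptually, opting staff in just means setting the cookie the proxy routes on. A minimal sketch (every name here is hypothetical; in the real app this decision would live in a Rails `before_action` that sets `cookies[:canary]`):

```ruby
CANARY_COOKIE = { name: "canary", value: "1" }.freeze

# Decide which cookie (if any) to hand a user so the reverse proxy
# sends them to the upgraded servers. Only internal CMS staff are
# opted in at this stage of the rollout.
def canary_cookie_for(user)
  return nil unless user && user[:role] == :cms_staff

  # Pin them to the new stack for 30 days so their experience is stable.
  CANARY_COOKIE.merge(max_age: 30 * 24 * 60 * 60)
end
```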

At one point, we reached a milestone where around 50% of the platform’s activity was already happening on the new infrastructure.

## Background jobs and scheduled tasks

We moved background jobs and scheduled tasks gradually as well: starting with a controlled subset, monitoring results, then migrating more until eventually all background processing was running on the new infrastructure.
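One way to migrate jobs in batches is to route individual job classes to queues that only the upgraded workers consume. A sketch under that assumption (class and queue names are invented):

```ruby
# Job classes already migrated to the upgraded workers; grows batch by batch.
MIGRATED_JOBS = %w[SitemapRefreshJob NewsletterDigestJob].freeze

# Pick the queue for a job: migrated classes go to a queue consumed only
# by workers on the new stack, everything else stays on the legacy workers.
def queue_for(job_class_name)
  MIGRATED_JOBS.include?(job_class_name) ? "upgraded_default" : "legacy_default"
end
```

When the list covers every job class, the legacy workers drain naturally and can be shut down.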

That got us to a very comfortable position:

* Roughly 75% of total activity (traffic + background work) was already on the new stack.
* 100% of background jobs were running there successfully.

At that point, the final "deploy" wasn't a dramatic event. It was basically: route the remaining users to the new infrastructure.

## Performance testing as you ramp

This progressive rollout gave us more than safer testing. It also gave us a way to measure performance under gradually increasing real load. Because we didn’t flip traffic all at once, we could observe how the new version behaved at 5%, 10%, 25%, 50%, 85%... and spot bottlenecks early. We could fix issues before they turned into incidents.

## Wrapping up

If you’re planning a similar upgrade and want to compare notes, especially around compatibility traps and rollout strategies, reach out. We've navigated those waters many times in our 12 years of history, and we've picked up plenty of invaluable insights along the way.
