A small trick to help you estimate the size of a site

Quick entry today, after a few on the heavy side!

Sometimes, as an agency, you're forced to guesstimate the size of a project in terms of number of pages, especially for migration projects. In most of these cases, the prospective client won't give you access to the code, even if you have signed an NDA. Maybe they are looking to replace their current provider, or they don't want to bother their internal technical team, so it's up to you to find out.

This kind of guesstimate is crucial in SEO-based projects, as you'll need as accurate a number as you can get in order to know where to put your efforts later on. Migrating a site with 1,000 pieces of content is not the same as migrating one with 100,000 pieces of unique content.

One quick trick is to take a look at the sitemap.xml file(s) of the website.

For instance, let's suppose the popular tech portal TechCrunch comes to us and wants us to migrate their site. However, they don't give us access to the code or the database, and we need to submit a proposal with whatever information is publicly available.

For the client, SEO is extremely important but, at the same time, they might not have all the bandwidth available to help all the potential providers gather all the information, so the only instructions we receive are "I don't know how many contents I have, but they all have to be migrated". This happens more often than not, hence this blog post.

First, open this URL: https://techcrunch.com/sitemap.xml (the sitemap usually lives at /sitemap.xml). In this particular project, the sitemap is split into 2 sub-sitemaps. The first one contains 108 sitemap files, and the second one 184. This is a lot of content, but we're talking about one of the main tech sites out there.

Next, count the number of items in each sitemap file and add them up: that total is the number of pieces of content on the site. If the sitemap is generated properly, all sub-sitemap files hold the same number of items, so you can multiply the number of sub-sitemap files by the average number of items per file, and voilà!
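If you'd rather not count by hand, the counting step can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the names `count_urls`, `download` and `local_name` are made up for this post, the sitemap is assumed to follow the standard sitemaps.org format (a `<sitemapindex>` pointing to files that contain `<url>` entries), and the files are assumed to be plain XML rather than gzipped.

```python
import urllib.request
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def download(url):
    """Fetch a URL and return its body as bytes (assumes plain, non-gzipped XML)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def count_urls(sitemap_url, fetch=download):
    """Count <url> entries, following <sitemapindex> files recursively.

    `fetch` is a hypothetical hook so the function can be tried out
    without hitting the network.
    """
    root = ET.fromstring(fetch(sitemap_url))

    if local_name(root.tag) == "sitemapindex":
        # An index file: every <sitemap><loc> points to another sitemap file.
        total = 0
        for sitemap in root:
            for child in sitemap:
                if local_name(child.tag) == "loc" and child.text:
                    total += count_urls(child.text.strip(), fetch)
        return total

    # A regular sitemap file: each direct <url> child is one page.
    return sum(1 for child in root if local_name(child.tag) == "url")
```

Calling `count_urls("https://techcrunch.com/sitemap.xml")` would walk every sub-sitemap and return the total. Bear in mind that this downloads every file, so on a big site it can take a while, and some sites may reject requests that don't come from a browser.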

Key idea: 🚨 Multiple sitemaps or sub-sitemaps mean LOTS OF CONTENT 🚨

Actually, this figure is technically not 100% accurate: sitemap files only list content that has a URL. Unpublished or draft content will not appear there, but we can assume it won't be a significant amount and is negligible in the grand scheme of things. You can always add a buffer to the guesstimate to deal with it later, or leave it for a subsequent phase, once you have taken control of the platform and are more familiar with everything. Maybe it isn't needed at all.
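To put the shortcut and the buffer together, here is a rough sketch of the arithmetic. The function name `estimate_pages`, the idea of sampling only a few sub-sitemap files, and the 10% default buffer are all assumptions made for illustration; pick whatever buffer suits your project.

```python
from statistics import mean


def estimate_pages(file_count, sampled_item_counts, buffer=0.10):
    """Estimate total pages: files x average items per file, plus a buffer.

    file_count          -- number of sub-sitemap files listed in the index
    sampled_item_counts -- item counts from the files you actually opened
    buffer              -- extra share for unpublished or draft content
                           (0.10 means 10%; an arbitrary default)
    """
    if file_count < 0 or not sampled_item_counts:
        raise ValueError("need a non-negative file count and at least one sample")
    average_items = mean(sampled_item_counts)
    return round(file_count * average_items * (1 + buffer))
```

For the TechCrunch example you would pass 108 + 184 = 292 as `file_count`, together with the item counts of the handful of files you opened by hand.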

For bonus points, open the robots.txt file: you might be able to see which content is disallowed for indexing robots. This also needs to be taken into consideration in a migration, because that content has to be ported with the same level of access.

Example: https://startupgenome.com/robots.txt.

robots.txt can also reveal other information, such as the paths of image or asset folders, so it's always worth a look if you're estimating a migration project.
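Reading robots.txt can be scripted too. The snippet below is a small sketch, not a full parser: `parse_robots` is a made-up name, it ignores which user agent each rule belongs to, and it also collects `Sitemap:` lines, which many sites declare in robots.txt and which come in handy when the sitemap doesn't live at /sitemap.xml.

```python
def parse_robots(text):
    """Return (disallowed_paths, sitemap_urls) found in a robots.txt body.

    Simplification: rules are collected for all user agents together.
    """
    disallowed, sitemaps = [], []
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "disallow" and value:
            # An empty Disallow means "nothing is blocked", so skip it.
            if value not in disallowed:
                disallowed.append(value)
        elif field == "sitemap" and value:
            sitemaps.append(value)
    return disallowed, sitemaps
```

Feed it the text of the robots.txt file (for instance, the one in the example above, saved locally) and you get a list of paths to double-check during the migration.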
