How would you compare two version numbers?

Sep 29th, 2021

by Juha-Matti Santala

This was published in categories: python

Earlier this week, I ran into a tweet by Leah Neukirchen who said "You won't believe how much software breaks because Python 3.10 has a two-digit minor version." In the discussion that followed, Tim Yates mentioned using comparing two software version numbers as an interview question and that started to linger in my mind.

Initially I thought that this can be quite a binary interview question: if you're familiar with the schemes of software versioning, you'll have a good shot but if you're like me and have only used the basic stuff and never had to compare them, you may feel like a fish out of the water.

In a good job interview setting, this would probably lead into discussion about the restrictions of version numbering systems that need to be supported, some discussion about how often standards are not always adhered or teams may use different versioning systems. And that would be an interesting discussion, I'd love to be part of one. But having been to job interviews were you're given a task and let alone to code for an hour, I'm not so optimistic and it leads to me approaching technical interview questions with a very different mindset.

So how would we do it then?

But more interesting than just talking about good or bad interviews, I want to talk about this problem and my thoughts on how I would approach it. I mentioned at the beginning that I'm not very familiar with versioning, especially at the level of comparing them programatically. I've only ever read and created them as a human, never with a computer.

A very common system for software versions is called Semantic Versioning or SemVer when you're strapped with characters. On a very basic, it consists of three numbers separated by dots:

x.y.z

where x denotes a MAJOR version, y denotes a MINOR version and z denotes a PATCH version. This is what I know about version numbers without doing any other research so I would start by solving this.

I would split the string by ., convert each part to an integer and then compare them one by one from major to minor to patch until there's a difference. Maybe something like this:

def compare_versions(v1, v2):
    """Compare two version strings (in format x.y.z).
    Returns RESULT.EQUAL if they are the same,
    otherwise returns the one that is higher"""

    if v1 == v2:
        return RESULT.EQUAL

    v1t = [int(num) for num in v1.split('.')]
    v2t = [int(num) for num in v2.split('.')]

    for v1v, v2v in zip(v1t, v2t):
        if v1v > v2v:
            return v1
        elif v2v > v1v:
            return v2

First check if the strings are equal and if not, go through each version part (major, minor, patch) at the time, if they are different, return the version number that is bigger.

That wasn't so hard, was it? But like they say in infomercials, that's not all.

But what about pre-release and build info?

SemVer supports additional information for pre-release versions and metadata for build numbers. At this point, I needed to start doing research, something that most interview sessions probably would not include. But luckily, I'm not interviewing so I can go down this rabbit hole.

A pre-release version MAY be denoted by appending a hyphen and a series of dot separated identifiers immediately following the patch version.

So we learn that a pre-release version is optional and is separated from the version with a hyphen: 1.0.0-alpha or 2.3.1-rc1.

For comparing these, from the SemVer docs we learn two things:

When major, minor, and patch are equal, a pre-release version has lower precedence than a normal version:

Example: 1.0.0-alpha < 1.0.0.

and

Precedence for two pre-release versions with the same major, minor, and patch version MUST be determined by comparing each dot separated identifier from left to right until a difference is found as follows:

1. Identifiers consisting of only digits are compared numerically.

2. Identifiers with letters or hyphens are compared lexically in ASCII sort order.

3. Numeric identifiers always have lower precedence than non-numeric identifiers.

4. A larger set of pre-release fields has a higher precedence than a smaller set, if all of the preceding identifiers are equal.

Example: 1.0.0-alpha < 1.0.0-alpha.1 < 1.0.0-alpha.beta < 1.0.0-beta < 1.0.0-beta.2 < 1.0.0-beta.11 < 1.0.0-rc.1 < 1.0.0.

For build metadata, SemVer says (emphasis mine):

Build metadata MAY be denoted by appending a plus sign and a series of dot separated identifiers immediately following the patch or pre-release version. Identifiers MUST comprise only ASCII alphanumerics and hyphens [0-9A-Za-z-]. Identifiers MUST NOT be empty. Build metadata MUST be ignored when determining version precedence.

Alright, so our function needs to take these two into account when parsing and only consider pre-release version when comparing.

There's a regex for that

^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$

Yeah, I wouldn't come up with that in an interview.

Good, everyone uses SemVer. Right?

Well, obviously not. It's just one system out there amongst many others, some more well-defined than others.

Sometimes people may add a "v" before the version number to say it's a version, like v1.2.0 or they may only use two digits like 1.5.

They also may use pre-release versions but not in a way that SemVer defines their precedence. Or they are not separated by hyphens but rather just appended to the version number like 1.0b3.

Or there might be revision numbers: 1.2.12.102.

A bit from the outfield but branding-wise there might be versions like 95, 98, ME, 2000, XP, 7, 8 and 10. Good luck with those, someone might use that as their version number.

There's also CalVer that uses time in different formats. For example, Ubuntu uses versions like 20.04.3 LTS.

Python uses versioning that is defined as [N!]N(.N)*[{a|b|rc}N][.postN][.devN].

TeX has been using idiosyncratic version numbering system since version 3, "where updates have been indicated by adding an extra digit at the end of the decimal, so that the version number asymptotically approaches π. This is a reflection of the fact that TeX is now very stable, and only minor updates are anticipated. The current version of TeX is 3.141592653".

And then there are those who don't really follow any other system than "this feels like a good version number".

Okay, it's complicated

Probably, if you're writing a version parser or comparing functionality, you know what system you are developing it towards. I only knew about the ones I listed above because I did a lot of googling while writing this blog post.

I wouldn't have even gotten through comparing pre-release versions in a job interview setting and would have mentioned something like "not everyone uses SemVer though so it's quite a wild wild west".

Why I wanted to write this post was two-fold: one, I really got excited about exploring different versioning systems (I find TeX's quite hilarious and creative, albeit not probably most human friendly). Two, I wouldn't have known to even talk about those details in an interview setting because I had never dealt with them on this level.

It took me 15 minutes to google and I could probably write a function to compare most of them now, especially the more defined ones. But depending on the interviewer and the other applicants, I don't know if I would have gotten the job.

That's why I hate most technical questions in job interviews. If you know the answer, it's easy and you can have a great in-depth discussion of details. If you have never encountered it, no matter how simple the problem eventually would be, you'd be stuck on the very surface level stuff.