Document intended usage through tests with doctest

Aug 8th, 2024

by Juha-Matti Santala

This was published in categories: blaugust-2024 python blaugust batteries-included

Batteries included is a blog series about the Python Standard Library. Each day, I share insights, ideas and examples for different parts of the library. Blaugust is an annual blogging festival in August where the goal is to write a blog post every day of the month.

A good test serves many purposes. It makes sure that the specific case it tests is implemented correctly and that later changes in the code base don’t introduce regressions. It also documents how a piece of software is intended to be used.

Writing tests that serve both of these purposes is an art that requires practice and intentionality. In Python, we have many ways to write and run tests, and one of them, the doctest module, ships with the standard library.

There’s a lot to like about the doctest module: it keeps your unit tests as close to your implementation as possible and uses the syntax of interactive Python REPL sessions.

When tests live in the docstrings of your functions, they are right there to find and read, for example when using help() or your code editor’s or IDE’s tooling.

Yet, it has its shortcomings, and I’ll discuss those at the end.

Writing doctests

To add a test to your function, start a REPL-style line with >>> followed by the code to execute, and write the expected output on the following line:

def reverse(sentence):
  """Reverses provided sentence
  
  >>> reverse('Hello world!')
  '!dlrow olleH'
  """
  return sentence[::-1]

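To run these tests, import the doctest module and call its testmod() function. Called without arguments, testmod() tests the docstrings of the module being run as the main script, so a common pattern is to put the call at the bottom of that file:

if __name__ == "__main__":
    import doctest
    doctest.testmod()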

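In our reverse example, testmod() returns the following TestResults value (an interactive session prints it, but a script that calls testmod() stays silent when every test passes unless you pass verbose=True or run it with -v):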

TestResults(failed=0, attempted=1)
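A minimal sketch, not from the original post, of what that result is made of: rather than calling testmod(), it collects and runs the reverse() examples with doctest’s lower-level DocTestFinder and DocTestRunner classes, so it doesn’t depend on being run as the main script. The finder, runner and results variable names are just illustrative.

```python
import doctest


def reverse(sentence):
    """Reverses provided sentence

    >>> reverse('Hello world!')
    '!dlrow olleH'
    """
    return sentence[::-1]


# DocTestFinder extracts the examples from the docstring. The globs dict is
# the namespace the examples run in, so it has to contain reverse itself.
finder = doctest.DocTestFinder()
# verbose=False: only failures are reported (to stdout by default)
runner = doctest.DocTestRunner(verbose=False)

for test in finder.find(reverse, "reverse", globs={"reverse": reverse}):
    runner.run(test)

# summarize() returns the same kind of TestResults named tuple as testmod()
results = runner.summarize(verbose=False)
print(results)  # TestResults(failed=0, attempted=1)
```

Creating the runner with verbose=True instead prints each example and whether it passed, similar to what you see when running a testmod() script with -v.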

What I really like about doctest is that instead of having to cram a test description into a function name, you can write a more descriptive explanation right next to each test:

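def listify(argument):
    """Returns the argument in a list unless it's already a list

    A single item is wrapped into a single-item list
    >>> listify(5)
    [5]

    A list is returned as-is
    >>> listify([1,2,3])
    [1, 2, 3]

    A tuple is turned into a list
    >>> listify((1,2,3))
    [1, 2, 3]
    """
    try:
        # list() builds a new list from any iterable: a list or a tuple,
        # but also a string or a dict
        return list(argument)
    except TypeError:
        # Non-iterables such as ints can't be converted, so wrap them instead
        return [argument]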
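The descriptions between the examples are only documentation. Doctest looks for >>> prompts, ... continuation lines and the expected output that follows them; expected output ends at a blank line or the next prompt, which is why each new description sits after a blank line. A small sketch (not from the original post, reusing the same finder and runner approach) that counts the examples doctest actually parses from listify():

```python
import doctest


def listify(argument):
    """Returns the argument in a list unless it's already a list

    A single item is wrapped into a single-item list
    >>> listify(5)
    [5]

    A list is returned as-is
    >>> listify([1,2,3])
    [1, 2, 3]

    A tuple is turned into a list
    >>> listify((1,2,3))
    [1, 2, 3]
    """
    try:
        return list(argument)
    except TypeError:
        return [argument]


runner = doctest.DocTestRunner(verbose=False)
for test in doctest.DocTestFinder().find(listify, "listify",
                                          globs={"listify": listify}):
    # Each parsed Example holds only source code and expected output;
    # the descriptive sentences in between belong to no example
    print(len(test.examples))  # 3
    runner.run(test)

# results is an illustrative name for the combined TestResults
results = runner.summarize(verbose=False)
```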

In the Soapbox section of the doctest documentation, the author writes about how to choose these examples:

When writing a docstring, choose docstring examples with care. There’s an art to this that needs to be learned—it may not be natural at first. Examples should add genuine value to the documentation. A good example can often be worth many words. If done with care, the examples will be invaluable for your users, and will pay back the time it takes to collect them many times over as the years go by and things change.

The entire Soapbox section is one of the best I’ve read in the Python documentation, and I highly recommend reading it even if you don’t plan to use doctest.

Doctests with pytest

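If you use pytest for your testing, you can run it with the --doctest-modules option to collect and run all the doctests in your Python modules alongside your regular tests. I put all the examples from this blog post into individual files in a folder, including the failing sort_items example discussed under Shortcomings below, and ran pytest: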

➜ pipx run pytest --doctest-modules
============ test session starts ============
platform darwin -- Python 3.12.2, pytest-8.1.1, pluggy-1.4.0
rootdir: /code/testbench/doctests
collected 3 items

listify.py .          [ 33%]
reverse.py .          [ 66%]
sorted.py F           [100%]

============ FAILURES ============
____________ [doctest] sorted.sort_items ____________
002 Returns items in sorted order
003
004     >>> sort_items(['a', 'c', 'b', 'f', 'd'])
Expected:
    ['a','b','c','d','f']
Got:
    ['a', 'b', 'c', 'd', 'f']

/code/testbench/doctests/sorted.py:4: DocTestFailure
============ short test summary info ============
FAILED sorted.py::sorted.sort_items
============ 1 failed, 2 passed in 0.07s ============

Shortcomings

Doctest compares printed output

One of the big differences between doctest and other testing tools is that doctest compares the printed output of each example against the expected text literally, character by character, which can become annoying when writing tests.

def sort_items(items):
	""" Returns items in sorted order
	
	>>> sort_items(['a', 'c', 'b', 'f', 'd'])
	['a','b','c','d','f']
	"""
	return sorted(items)

This looks good at first glance: we call the function and provide the Python list we’d expect as the result. However, this test fails no matter how the function is implemented, because doctest compares the printed representation of the result rather than its value, and a printed list always has a space after each comma.

Failed example:
    sort_items(['a', 'c', 'b', 'f', 'd'])
Expected:
    ['a','b','c','d','f']
Got:
    ['a', 'b', 'c', 'd', 'f']
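A quick way to see the value-versus-text distinction for yourself is the sketch below (not from the original post; result, report and results are illustrative names). It checks the values and their repr() directly, then runs the failing docstring example with DocTestRunner, collecting the failure report in a list instead of printing it:

```python
import doctest


def sort_items(items):
    """ Returns items in sorted order

    >>> sort_items(['a', 'c', 'b', 'f', 'd'])
    ['a','b','c','d','f']
    """
    return sorted(items)


result = sort_items(['a', 'c', 'b', 'f', 'd'])
print(result == ['a','b','c','d','f'])  # True: the values are equal...
print(repr(result))  # ['a', 'b', 'c', 'd', 'f'] ...but the text is not

# out= receives the failure report text; report.append stores it in a list
report = []
runner = doctest.DocTestRunner(verbose=False)
[test] = doctest.DocTestFinder().find(sort_items, "sort_items",
                                      globs={"sort_items": sort_items})
results = runner.run(test, out=report.append)
print(results)  # TestResults(failed=1, attempted=1)
```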

A working version of the test matches the printed list exactly:

def sort_items(items):
	""" Returns items in sorted order
	
	>>> sort_items(['a', 'c', 'b', 'f', 'd'])
	['a', 'b', 'c', 'd', 'f']
	"""
	return sorted(items)
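If matching the exact printed form feels brittle, one common workaround (not from the original post) is to make the example itself evaluate to a boolean, so the expected output is simply True. The trade-off is that readers of the docstring no longer see the actual return value. A minimal sketch:

```python
import doctest


def sort_items(items):
    """ Returns items in sorted order

    Comparing values means the spacing of the printed list no longer matters
    >>> sort_items(['a', 'c', 'b', 'f', 'd']) == ['a','b','c','d','f']
    True
    """
    return sorted(items)


# results is an illustrative name for the returned TestResults
runner = doctest.DocTestRunner(verbose=False)
[test] = doctest.DocTestFinder().find(sort_items, "sort_items",
                                      globs={"sort_items": sort_items})
results = runner.run(test)
print(results)  # TestResults(failed=0, attempted=1)
```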

This also means that adding debugging prints inside the function breaks every test that calls it:

def sort_items(items):
	""" Returns items in sorted order
	
	>>> sort_items(['a', 'c', 'b', 'f', 'd'])
	['a', 'b', 'c', 'd', 'f']
	"""
	print(items)
	return sorted(items)

fails with:

Failed example:
    sort_items(['a', 'c', 'b', 'f', 'd'])
Expected:
    ['a', 'b', 'c', 'd', 'f']
Got:
    ['a', 'c', 'b', 'f', 'd']
    ['a', 'b', 'c', 'd', 'f']

You can work around this by printing to sys.stderr, which doctest doesn’t capture, but that adds friction to every debugging session:

import sys

# Instead of writing
print(items)

# you gotta write
print(items, file=sys.stderr)

It seems like a small thing, but remembering to add it to every print call adds up.
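Doctest only swaps out sys.stdout while an example runs, which is why the stderr workaround keeps tests passing. The sketch below (not from the original post; DOCSTRING, sort_items_stdout, sort_items_stderr and run_example are hypothetical names) runs the same example against both variants using doctest.DocTestParser:

```python
import doctest
import sys

# One shared example (hypothetical), run against two hypothetical variants
# of sort_items that differ only in where they print their debug output
DOCSTRING = """
>>> sort_items(['a', 'c', 'b', 'f', 'd'])
['a', 'b', 'c', 'd', 'f']
"""


def sort_items_stdout(items):
    print(items)  # written to doctest's captured stdout, so it is compared
    return sorted(items)


def sort_items_stderr(items):
    print(items, file=sys.stderr)  # stderr isn't captured or compared
    return sorted(items)


def run_example(func):
    """Run DOCSTRING's example with func bound to the name sort_items."""
    test = doctest.DocTestParser().get_doctest(
        DOCSTRING, {"sort_items": func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the returned counts matter here
    return runner.run(test, out=lambda text: None)


print(run_example(sort_items_stdout))  # TestResults(failed=1, attempted=1)
print(run_example(sort_items_stderr))  # TestResults(failed=0, attempted=1)
```

The stderr variant still shows the debug line in your terminal; it just isn’t part of what doctest compares.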

Gets messy with more complex functions

For functions that take simple arguments, doctest is wonderful. I often struggle with it, though, when functions take more complex objects as arguments: creating the fixture data inside the docstring can make the tests hard to read.
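To make this concrete, here is a hypothetical example (the Order dataclass and the unpaid_customers() function are made up for illustration, not taken from a real code base): even one small dataclass argument means the docstring has to build its fixture objects before it reaches the call it is actually documenting.

```python
import doctest
from dataclasses import dataclass


# Hypothetical domain object, invented for this illustration
@dataclass
class Order:
    customer: str
    items: list[str]
    paid: bool


def unpaid_customers(orders):
    """Returns the names of customers with unpaid orders, sorted

    Before showing the one call we care about, the docstring has to
    build every fixture object inline:
    >>> orders = [
    ...     Order(customer='Ada', items=['book'], paid=True),
    ...     Order(customer='Linus', items=['pen', 'ink'], paid=False),
    ...     Order(customer='Guido', items=['mug'], paid=False),
    ... ]
    >>> unpaid_customers(orders)
    ['Guido', 'Linus']
    """
    return sorted(order.customer for order in orders if not order.paid)


# The examples need both names available in their namespace
runner = doctest.DocTestRunner(verbose=False)
[test] = doctest.DocTestFinder().find(
    unpaid_customers, "unpaid_customers",
    globs={"Order": Order, "unpaid_customers": unpaid_customers},
)
results = runner.run(test)
print(results)  # TestResults(failed=0, attempted=2)
```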

In a way, committing to doctests might push you toward better-designed functions, but real-life code bases are often messier, and other testing tools make it easier to keep test data and fixtures from “polluting” the readability of the tests themselves.

I’d love to see good examples of real-life production software that uses doctest as its main testing tool, so I could learn how developers manage that complexity. If you know any, send them my way!

If something above resonated with you, let's start a discussion about it! Email me at juhamattisantala at gmail dot com and share your thoughts. In 2025, I want to have deeper discussions with people from around the world, and I'd love for you to be part of that.