Juha-Matti Santala
Community Builder. Dreamer. Adventurer.

Data classes in Python with dataclasses

Batteries included is a blog series about the Python Standard Library. Each day, I share insights, ideas and examples for different parts of the library. Blaugust is an annual blogging festival in August where the goal is to write a blog post every day of the month.

PEP 557 introduced dataclasses into Python and described them as “mutable namedtuples with defaults”. I wrote about namedtuples a week ago and this Sunday I wanted to look into another option for creating these types of data structures in Python.

A data class is a class that mainly holds data, possibly with getters and setters, but otherwise contains little functionality and does not operate on its own data. In their book Refactoring: Improving the Design of Existing Code, Martin Fowler and Kent Beck described it as a code smell:

These are classes that have fields, getting and setting methods for the fields, and nothing else. Such classes are dumb data holders and are almost certainly being manipulated in far too much detail by other classes - (Refactoring: Improving the Design of Existing Code, 1st edition)

In a purist view of object-oriented programming, this may very well be true, but personally I think there are definitely good use cases for these types of classes.

How to build data classes

To create a data class, we first import the dataclass decorator from the dataclasses module and decorate our class with it. We then define the fields using the variable annotation syntax defined in PEP 526:

from dataclasses import dataclass

@dataclass
class Coordinate:
  x: int
  y: int
  z: int

The decorator then adds methods like __init__, __eq__ and __repr__ (among others, see the docs for the full list). The generated __init__ assigns x, y and z to attributes of the new instance.

origin = Coordinate(x=0, y=0, z=0)
zero = Coordinate(x=0, y=0, z=0)

print(origin) 
# prints Coordinate(x=0, y=0, z=0) 
# thanks to generated __repr__

print(origin.x, origin.y, origin.z)
# prints 0 0 0 
# thanks to generated __init__ that 
# maps arguments to fields

print(origin == zero) 
# prints True thanks to generated `__eq__` 
# that compares the class and its values

When this is all we need, we get a very nice mutable data structure with just a few lines of code. And when we need more, we can customise the class by giving fields default values or by defining our own methods:

from dataclasses import dataclass

@dataclass
class Character:
  # Mandatory string field
  name: str

  # Integer field with a default value, so it can be omitted
  level: int = 1

  # Compare characters by level
  def __gt__(self, other):
    if isinstance(other, Character):
      return self.level > other.level
    # Let Python handle comparisons with other types
    return NotImplemented

bard = Character('Cacofonix')
wizard = Character('Harry', level=20)
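
As a quick check of the default value and the custom comparison, here is a minimal sketch. It redefines Character so the snippet runs on its own, and the NotImplemented fallback is my assumption about how comparisons with other types should behave:

```python
from dataclasses import dataclass


@dataclass
class Character:
    name: str
    level: int = 1

    def __gt__(self, other):
        if isinstance(other, Character):
            return self.level > other.level
        # Assumption: defer to Python for comparisons with other types
        return NotImplemented


bard = Character('Cacofonix')
wizard = Character('Harry', level=20)

print(bard.level)     # prints 1, the default value
print(wizard > bard)  # prints True because 20 > 1
print(bard < wizard)  # prints True, Python falls back to wizard.__gt__(bard)
```

Only __gt__ is defined here, so >= and <= would still raise a TypeError. If full ordering is needed, the decorator's order=True argument is probably the simpler route.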

Data classes, like most other classes, are mutable by default. Often though, there’s value in making these types of data structures immutable, and dataclasses supports that with the frozen=True argument to the decorator:

from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
  id: str
  entry: str

r1 = Record('0001', 'First')
r1.id = '0003'
# raises dataclasses.FrozenInstanceError: cannot assign to field 'id'
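
When a frozen record does need to "change", one option is to create a modified copy instead. The sketch below uses dataclasses.replace for that and also shows that frozen instances can be used as dictionary keys. The lookup dictionary and its values are made up for illustration:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Record:
    id: str
    entry: str


r1 = Record('0001', 'First')

try:
    r1.id = '0003'
except FrozenInstanceError:
    print('r1 cannot be modified in place')

# replace() builds a new instance and leaves the original untouched
r2 = replace(r1, id='0003')
print(r1)  # prints Record(id='0001', entry='First')
print(r2)  # prints Record(id='0003', entry='First')

# frozen=True together with the default eq=True makes instances hashable
lookup = {r1: 'original', r2: 'copy'}
print(lookup[Record('0001', 'First')])  # prints original
```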

Data classes with pattern matching

In the book Fluent Python (2nd edition), there’s a great example of keyword class patterns in pattern matching and I’ve adjusted that here to use data classes:

from dataclasses import dataclass

@dataclass
class City:
  continent: str
  name: str
  country: str


cities = [
  City('Asia', 'Tokyo', 'JP'),
  City('Asia', 'Delhi', 'IN'),
  City('North America', 'Mexico City', 'MX'),
  City('North America', 'New York', 'US'),
  City('South America', 'São Paulo', 'BR')
]

for city in cities:
  match city:
    case City(continent='Asia'):
      print(city)

# prints
# City(continent='Asia', name='Tokyo', country='JP')
# City(continent='Asia', name='Delhi', country='IN')

I’m personally very satisfied with how elegant and readable the Python code becomes when using these structures.

The differences between namedtuples and data classes

Eric V. Smith, the author of PEP 557 that introduced data classes into Python, described them in relation to namedtuples, so what are the differences and when should you choose one over the other?

PEP 557 describes the differences as:

  • Since namedtuples are tuples, the equality comparison only compares values, so Point3D(2017, 6, 2) == Date(2017, 6, 2) would return True despite being different namedtuples
  • If code unpacks a namedtuple with tuple unpacking (for example: hour, minute = get_time()), adding a field to the underlying namedtuple will cause that code to break
  • namedtuples are always immutable
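  • namedtuples don’t allow default values (true when the PEP was written; since Python 3.7, collections.namedtuple accepts a defaults argument)
  • One cannot control which fields are used in generated dunder methods like __init__ and __repr__
  • namedtuples don’t support combining fields by inheritance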
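
The first point in the list is easy to verify. In this sketch, Point3D and Date follow the example from the PEP, while the data class counterparts PointDC and DateDC are hypothetical names I picked for the comparison:

```python
from collections import namedtuple
from dataclasses import dataclass

Point3D = namedtuple('Point3D', ['x', 'y', 'z'])
Date = namedtuple('Date', ['year', 'month', 'day'])

# namedtuples compare like plain tuples: only the values matter
print(Point3D(2017, 6, 2) == Date(2017, 6, 2))  # prints True
print(Point3D(2017, 6, 2) == (2017, 6, 2))      # prints True


@dataclass
class PointDC:
    x: int
    y: int
    z: int


@dataclass
class DateDC:
    year: int
    month: int
    day: int


# The generated __eq__ only compares instances of the same class
print(PointDC(2017, 6, 2) == DateDC(2017, 6, 2))   # prints False
print(PointDC(2017, 6, 2) == PointDC(2017, 6, 2))  # prints True
```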

Fluent Python (2nd edition) has a nice section on data class style structures, comparing namedtuples, typing.NamedTuples and dataclasses. In it, the author Luciano Ramalho brings up a couple of main points:

  • namedtuples are always immutable, while data classes are mutable by default (though they can also be defined as immutable)
  • data classes’ class syntax makes it easier to add new methods and docstrings
  • data classes are type hinted

In last week’s namedtuples blog post, I wrote about how I like namedtuples, and one reason for that is their backwards compatibility with regular tuples, which allows improving the code base one piece at a time. Data classes don’t have the same benefit, but as the differences listed above imply, they bring many benefits to the robustness and stability of the code base.
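
To make the compatibility difference concrete, here is a small sketch. The get_time function and the Time and TimeDC names are hypothetical. Code that indexes or unpacks a tuple keeps working with a namedtuple, while a data class needs an explicit conversion such as dataclasses.astuple:

```python
from collections import namedtuple
from dataclasses import astuple, dataclass

Time = namedtuple('Time', ['hour', 'minute'])


def get_time():
    # Hypothetical function that used to return a plain (12, 30) tuple
    return Time(12, 30)


# Existing tuple-style code keeps working with the namedtuple
hour, minute = get_time()
print(get_time()[0])  # prints 12


@dataclass
class TimeDC:
    hour: int
    minute: int


now = TimeDC(12, 30)

try:
    dc_hour, dc_minute = now
except TypeError:
    print('data class instances cannot be unpacked directly')

# astuple() converts the instance to a regular tuple first
dc_hour, dc_minute = astuple(now)
print(dc_hour, dc_minute)  # prints 12 30
```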