Data classes in Python with dataclasses
Batteries included is a blog series about the Python Standard Library. Each day, I share insights, ideas and examples for different parts of the library. Blaugust is an annual blogging festival in August where the goal is to write a blog post every day of the month.
PEP 557 introduced dataclasses into Python and described them as “mutable namedtuples with defaults”. I wrote about namedtuples a week ago and this Sunday I wanted to look into another option for creating these types of data structures in Python.
A data class is a class that contains data and setters/getters but otherwise has little additional functionality and does not operate on its own data. In their book Refactoring: Improving the Design of Existing Code, Martin Fowler and Kent Beck described it as a code smell:
These are classes that have fields, getting and setting methods for the fields, and nothing else. Such classes are dumb data holders and are almost certainly being manipulated in far too much detail by other classes - (Refactoring: Improving the Design of Existing Code, 1st edition)
In a purist view of object-oriented programming, this may very well be true, but personally I think there are definitely good use cases for these types of classes.
How to build data classes
To create a data class, we first import the dataclass decorator from the dataclasses module and decorate our class with it. We then define the fields using the variable annotations syntax defined in PEP 526:
from dataclasses import dataclass

@dataclass
class Coordinate:
    x: int
    y: int
    z: int
The decorator then adds methods like __init__, __eq__ and __repr__ (among others; see the docs for the full list), and the generated constructor maps x, y and z to attributes of the instance.
origin = Coordinate(x=0, y=0, z=0)
zero = Coordinate(x=0, y=0, z=0)
print(origin)
# prints Coordinate(x=0, y=0, z=0)
# thanks to generated __repr__
print(origin.x, origin.y, origin.z)
# prints 0 0 0
# thanks to generated __init__ that
# maps arguments to fields
print(origin == zero)
# prints True thanks to generated `__eq__`
# that compares the class and its values
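To make the generated methods less magical, here is a rough hand-written approximation of what the decorator saves us from typing. ManualCoordinate is a hypothetical name I made up for this sketch, and the real generated code differs in its details, so treat it as an illustration rather than the actual implementation:

```python
from dataclasses import dataclass, fields


@dataclass
class Coordinate:
    x: int
    y: int
    z: int


class ManualCoordinate:
    """Hand-written approximation of what @dataclass generates."""

    def __init__(self, x: int, y: int, z: int):
        self.x = x
        self.y = y
        self.z = z

    def __repr__(self):
        return f"ManualCoordinate(x={self.x!r}, y={self.y!r}, z={self.z!r})"

    def __eq__(self, other):
        # Only instances of the exact same class are compared field by field
        if other.__class__ is self.__class__:
            return (self.x, self.y, self.z) == (other.x, other.y, other.z)
        return NotImplemented


print(ManualCoordinate(0, 0, 0))
# prints ManualCoordinate(x=0, y=0, z=0)
print(ManualCoordinate(1, 2, 3) == ManualCoordinate(1, 2, 3))
# prints True

# fields() lists the fields the decorator discovered from the annotations
print([f.name for f in fields(Coordinate)])
# prints ['x', 'y', 'z']
```

Three fields already need about a dozen lines of boilerplate when written by hand, and every new field would have to be added in three places.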
When this is all we need, we get a very nice mutable data structure with just a few lines of code. And when we need more, we can customise the class by defining optional fields with defaults or overriding methods:
from dataclasses import dataclass

@dataclass
class Character:
    # Mandatory string field
    name: str
    # Integer field that can be omitted because it has a default value
    level: int = 1

    def __gt__(self, other):
        if isinstance(other, Character):
            return self.level > other.level
        # Let Python handle comparisons with other types
        return NotImplemented

bard = Character('Cacofonix')
wizard = Character('Harry', level=20)
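One thing to watch out for with defaults is mutable values such as lists. The dataclasses module refuses a plain list as a default and offers field(default_factory=...) instead. The Party class below is a made-up example for this sketch, not something from the original post's code:

```python
from dataclasses import dataclass, field


@dataclass
class Party:
    name: str
    # Immutable defaults can be assigned directly
    max_size: int = 4
    # Mutable defaults need default_factory so each instance gets its own list
    members: list[str] = field(default_factory=list)


fellowship = Party("Fellowship")
gauls = Party("Gauls", max_size=2)
fellowship.members.append("Frodo")

print(fellowship)
# prints Party(name='Fellowship', max_size=4, members=['Frodo'])
print(gauls.members)
# prints [] because the list is not shared between instances
```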
Data classes, like any other class, are mutable by default. Often, though, there’s value in making these types of data structures immutable, and dataclasses supports that with the frozen=True argument to the decorator:
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    id: str
    entry: str

r1 = Record('0001', 'First')
r1.id = '0003'
# raises dataclasses.FrozenInstanceError: cannot assign to field 'id'
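Since a frozen instance cannot be changed in place, a natural follow-up question is how to get an updated copy. A minimal sketch using dataclasses.replace, which also shows that frozen data classes can be used as dictionary keys (the lookup dictionary is just an illustration):

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Record:
    id: str
    entry: str


r1 = Record("0001", "First")

try:
    r1.id = "0003"
except FrozenInstanceError:
    print("r1 cannot be modified")

# replace() builds a new instance instead of mutating the original
r2 = replace(r1, id="0003")
print(r1)
# prints Record(id='0001', entry='First')
print(r2)
# prints Record(id='0003', entry='First')

# frozen=True together with the default eq=True also generates __hash__
lookup = {r1: "original"}
print(lookup[Record("0001", "First")])
# prints original
```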
Data classes with pattern matching
In the book Fluent Python (2nd edition), there’s a great example of keyword class patterns in pattern matching and I’ve adjusted that here to use data classes:
from dataclasses import dataclass

@dataclass
class City:
    continent: str
    name: str
    country: str

cities = [
    City('Asia', 'Tokyo', 'JP'),
    City('Asia', 'Delhi', 'IN'),
    City('North America', 'Mexico City', 'MX'),
    City('North America', 'New York', 'US'),
    City('South America', 'São Paulo', 'BR')
]

for city in cities:
    match city:
        case City(continent='Asia'):
            print(city)
# prints
# City(continent='Asia', name='Tokyo', country='JP')
# City(continent='Asia', name='Delhi', country='IN')
I’m personally very satisfied with how elegant and readable the Python code becomes when using these structures.
The differences between namedtuples and data classes
Eric V. Smith, the author of PEP 557 that introduced data classes into Python, described them in relation to namedtuples, so what are the differences and when should you choose one over the other?
PEP 557 describes the differences as:
- Since namedtuples are tuples, the equality comparison only compares values, so Point3D(2017, 6, 2) == Date(2017, 6, 2) would return True despite these being different namedtuples
- If the user of a namedtuple relies on tuple unpacking (for example: hour, minute = get_time()), modifying the underlying namedtuple, such as by adding a field, will cause the code to break
- namedtuples are always immutable
- namedtuples don’t allow default values
- One cannot control which fields are used in generated methods like __init__ and __repr__
- namedtuples don’t support combining fields by inheritance
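The first point in the list above is easy to try out. In this sketch, Point3D and Date come from the PEP's example, while PointData and DateData are hypothetical data class counterparts I added for comparison:

```python
from collections import namedtuple
from dataclasses import dataclass

Point3D = namedtuple("Point3D", ["x", "y", "z"])
Date = namedtuple("Date", ["year", "month", "day"])

# namedtuples compare as plain tuples, so the type is ignored
print(Point3D(2017, 6, 2) == Date(2017, 6, 2))
# prints True


@dataclass
class PointData:
    x: int
    y: int
    z: int


@dataclass
class DateData:
    year: int
    month: int
    day: int


# The generated __eq__ only compares instances of the same class
print(PointData(2017, 6, 2) == DateData(2017, 6, 2))
# prints False
```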
Fluent Python (2nd edition) has a nice section on data class style structures, comparing namedtuples, typing.NamedTuples and dataclasses. In it, the author Luciano Ramalho brings up a couple of main points:
- namedtuples are always immutable while dataclasses allow mutation (though they can also be defined as immutable)
- data classes’ class syntax makes it easier to add new methods and docstrings
- data classes are type hinted
In last week’s namedtuples blog post, I wrote about how I like namedtuples, and one reason for that is their backwards compatibility with regular tuples, which allows improving the code base one piece at a time. Data classes don’t have the same benefit but, as the differences listed above imply, they bring many benefits to the robustness and stability of the code base.
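To illustrate the compatibility trade-off, here is a small sketch. TimeTuple and TimeData are names I invented for the example. A namedtuple can stand in for a regular tuple anywhere, while a data class needs an explicit conversion with dataclasses.astuple:

```python
from collections import namedtuple
from dataclasses import astuple, dataclass

TimeTuple = namedtuple("TimeTuple", ["hour", "minute"])


@dataclass
class TimeData:
    hour: int
    minute: int


# A namedtuple still behaves like a regular tuple
hour, minute = TimeTuple(9, 30)
print(TimeTuple(9, 30) == (9, 30))
# prints True

# A data class instance is not iterable, so unpacking fails...
try:
    hour, minute = TimeData(9, 30)
except TypeError:
    print("TimeData cannot be unpacked directly")

# ...unless it is converted explicitly
hour, minute = astuple(TimeData(9, 30))
print(hour, minute)
# prints 9 30
```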