The Data in Data Class

Python has well-known data holders. If you need a sequence, you can use a list. If you need an immutable sequence, use a tuple. If you need a homogeneous array, use the array module. If you need a hash table, a dictionary is likely what you want.

However, sometimes we need more.

More control, more flexibility, and objects that fit the domain of the problem we are solving.

Named Tuples for the Win (or Almost)

Named tuples offer a friendly and clean way to define data holders that behave like objects from a class we defined.

I'm going to use a Payment instance to illustrate the next examples.


from collections import namedtuple


Payment = namedtuple("Payment", ["id_", "amount", "method"])

payment = Payment(1, amount=123, method="CC")

>>> print(payment)
Payment(id_=1, amount=123, method='CC')

If you like type hints, you can also use the typed version of the named tuple:

from typing import NamedTuple


class Payment(NamedTuple):
    id_: int
    amount: int
    method: str


payment = Payment(id_=2, amount=1234, method="ACH")

>>> print(payment)
Payment(id_=2, amount=1234, method='ACH')

>>> payment.id_, payment.amount, payment.method
(2, 1234, 'ACH')

That seems to solve the data holder quest quickly, right? Simple API, available in the standard library, no class boilerplate.

However, it is essential to remember that Payment, our NamedTuple, is still a tuple behind the scenes.

And why is that important? Because every instance of Payment, for good or bad, will behave just like a tuple.

Named Tuples Are Still Tuples

You can call len() on your named tuple instance:

>>> len(payment)
3

You can unpack it:

>>> id_, amount, method = payment
>>> id_, amount, method
(2, 1234, 'ACH')

Named tuples are immutable:

>>> payment.amount = 345

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: can't set attribute
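
When a changed value is needed, a named tuple can hand back a modified copy instead. The sketch below uses the standard _replace() method and reuses the Payment definition from above; the new amount is just an illustrative value.

```python
from typing import NamedTuple


class Payment(NamedTuple):
    id_: int
    amount: int
    method: str


payment = Payment(id_=2, amount=1234, method="ACH")

# _replace() builds a new instance; the original stays untouched.
updated = payment._replace(amount=345)

print(updated)         # Payment(id_=2, amount=345, method='ACH')
print(payment.amount)  # 1234
```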

This one can be very counterintuitive: if you compare a Payment named tuple instance with a plain tuple holding the same values, the comparison is true:

>>> payment = Payment(id_=2, amount=1234, method="ACH")

>>> payment == (2, 1234, "ACH")
True
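
The same thing happens between two unrelated named tuples. Here is a minimal sketch, assuming a hypothetical Refund type that is not part of the examples above:

```python
from typing import NamedTuple


class Payment(NamedTuple):
    id_: int
    amount: int
    method: str


class Refund(NamedTuple):  # hypothetical type, for illustration only
    id_: int
    amount: int
    method: str


payment = Payment(id_=2, amount=1234, method="ACH")
refund = Refund(id_=2, amount=1234, method="ACH")

# Both are plain tuples underneath, so equality only looks at the values.
print(isinstance(payment, tuple))  # True
print(payment == refund)           # True
```

If a payment must never be equal to a refund, that is a hint that a named tuple may be the wrong tool.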

Data Classes as Data Holders

If named tuples are almost what you want, but you are still missing some flexibility, data classes might be a good fit.

We can smoothly go from a typed named tuple to a data class. Remove the NamedTuple inheritance and decorate the class with @dataclass:

from dataclasses import dataclass


@dataclass
class Payment:
    id_: int
    amount: int
    method: str

As far as holding data is concerned, the Payment instance works almost like the named tuple:

>>> payment
Payment(id_=2, amount=1234, method='ACH')
>>> payment.id_, payment.amount, payment.method
(2, 1234, 'ACH')

However, the instance is now mutable, has no len(), and no longer compares equal to a tuple:

>>> assert payment == (2, 1234, "ACH")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
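
A small sketch of the three differences side by side. It also uses dataclasses.astuple(), which is one way to get the tuple behavior back when it is really needed; the new amount is just an illustrative value.

```python
from dataclasses import astuple, dataclass


@dataclass
class Payment:
    id_: int
    amount: int
    method: str


payment = Payment(id_=2, amount=1234, method="ACH")

# Allowed: data class instances are mutable by default.
payment.amount = 345

# Not allowed: a data class is not a sequence.
try:
    len(payment)
except TypeError as error:
    print(f"TypeError: {error}")

# Comparing against a tuple is False, unless we convert explicitly.
print(payment == (2, 345, "ACH"))           # False
print(astuple(payment) == (2, 345, "ACH"))  # True
```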

Field by Field Implementation

If we aim for more control and flexibility, the default fields in a data class may not be enough. That is why each field can be configured individually with field().

In the example below, I'm defining quite a few setup options in a few lines:

  • id_ is an auto-generated UUID field that is excluded from comparisons with other Payment instances and from the object's repr.
  • amount is a Decimal field that defaults to Decimal('0') if no value is passed.
  • method defaults to CC on every object.

from dataclasses import dataclass, field
from decimal import Decimal
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Payment:
    # default_factory runs on every instantiation, so each payment gets its own UUID.
    id_: UUID = field(repr=False, compare=False, default_factory=uuid4)
    # Decimal() called with no arguments returns Decimal('0').
    amount: Decimal = field(default_factory=Decimal)
    method: str = field(default='CC')
    

payment = Payment()

>>> payment
Payment(amount=Decimal('0'), method='CC')

>>> payment.id_
UUID('05d37e88-d069-484a-b454-42c2ca3594fc')
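
To see what compare=False buys us, here is a short sketch that creates two payments with the same definition and compares them. The variable names first and second are mine, not part of the example above.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Payment:
    id_: UUID = field(repr=False, compare=False, default_factory=uuid4)
    amount: Decimal = field(default_factory=Decimal)
    method: str = field(default="CC")


first = Payment()
second = Payment()

# Each instance got its own UUID from the default factory...
print(first.id_ != second.id_)  # True

# ...but id_ is left out of the comparison, so the payments are still equal.
print(first == second)          # True
```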


Why Not a Regular Class?

Still, this looks like something we could do with a regular class, right? Yes, we could. And data classes are indeed regular classes.

What makes them shine, besides their configurability, is how much code they save.

The Class in Data Classes

Data classes save considerable amounts of boilerplate code. They ship with pre-defined dunder methods that we would likely have to write ourselves.

Data classes generate a __repr__() for friendly printing, so you don't need to write __repr__() or __str__() yourself:

payment = Payment(id_=2, amount=1234, method="ACH")

# Default repr of a regular class:
<__main__.Payment object at 0x10ae98d00>

# Built-in repr of a data class:
Payment(id_=2, amount=1234, method='ACH')
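
For comparison, this is roughly what a hand-written equivalent could look like. It is a sketch of the idea, not the exact code that @dataclass generates:

```python
class Payment:
    def __init__(self, id_: int, amount: int, method: str) -> None:
        self.id_ = id_
        self.amount = amount
        self.method = method

    def __repr__(self) -> str:
        return (
            f"Payment(id_={self.id_!r}, amount={self.amount!r}, "
            f"method={self.method!r})"
        )

    def __eq__(self, other):
        # Only instances of the very same class are comparable.
        if other.__class__ is not self.__class__:
            return NotImplemented
        return (self.id_, self.amount, self.method) == (
            other.id_,
            other.amount,
            other.method,
        )


payment = Payment(id_=2, amount=1234, method="ACH")
print(payment)  # Payment(id_=2, amount=1234, method='ACH')
```

Every new field means touching three methods here, while the data class version only needs one extra line.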

Default Values

It is possible to initialize a data class with default values:

@dataclass
class Payment:
    id_: int
    amount: int
    method: str = 'CC'
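
One thing to keep in mind: fields with defaults have to come after fields without them, just like function arguments. The sketch below shows both cases; BrokenPayment is a hypothetical class that exists only to trigger the error.

```python
from dataclasses import dataclass


@dataclass
class Payment:
    id_: int
    amount: int
    method: str = "CC"


print(Payment(id_=1, amount=1234))  # Payment(id_=1, amount=1234, method='CC')

definition_error = None
try:

    @dataclass
    class BrokenPayment:  # hypothetical class, for illustration only
        id_: int
        method: str = "CC"
        amount: int  # no default, but it follows a field that has one

except TypeError as error:
    # The class body is rejected when the decorator runs.
    definition_error = error
    print(f"TypeError: {error}")
```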

Validation

The __post_init__() method runs right after the generated __init__(), which makes it a good place for extra initialization logic such as validation.

@dataclass(frozen=True)
class Payment:
    id_: int
    amount: int
    method: str

    def __post_init__(self):
        if self.method not in ("ACH", "CC"):
            raise ValueError("Payment method must be ACH or CC")


payment = Payment(id_=1, amount=1234, method="ACC")
            
"""
Traceback (most recent call last):
  File "...", line 28, in __post_init__
    raise ValueError("Payment method must be ACH or CC")
ValueError: Payment method must be ACH or CC
"""

Creating a Payment with any method other than ACH or CC raises a ValueError.
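
__post_init__() can also normalize values before validating them. On a frozen data class, regular assignment is blocked, so the sketch below goes through object.__setattr__() instead. The ALLOWED_METHODS constant and the strip/upper normalization are my own assumptions, added for illustration.

```python
from dataclasses import dataclass

ALLOWED_METHODS = ("ACH", "CC")  # assumed constant, same values as the check above


@dataclass(frozen=True)
class Payment:
    id_: int
    amount: int
    method: str

    def __post_init__(self):
        normalized = self.method.strip().upper()
        if normalized not in ALLOWED_METHODS:
            raise ValueError("Payment method must be ACH or CC")
        # A frozen instance rejects `self.method = ...`, so bypass it explicitly.
        object.__setattr__(self, "method", normalized)


payment = Payment(id_=1, amount=1234, method=" ach ")
print(payment.method)  # ACH
```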

Comparable by Default

Data classes implement __eq__() by default. Two instances of the same class are compared field by field, as if they were tuples of their values:

payment_1 = Payment(id_=1, amount=1234, method="ACH")
payment_2 = Payment(id_=1, amount=1234, method="ACH")

>>> payment_1 == payment_2
True
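
Unlike named tuples, the generated __eq__() also checks the class. A minimal sketch, assuming a hypothetical Refund data class with the same fields:

```python
from dataclasses import dataclass


@dataclass
class Payment:
    id_: int
    amount: int
    method: str


@dataclass
class Refund:  # hypothetical class, for illustration only
    id_: int
    amount: int
    method: str


payment = Payment(id_=1, amount=1234, method="ACH")
refund = Refund(id_=1, amount=1234, method="ACH")

# Same values, different classes: not equal.
print(payment == refund)  # False
```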

Making It Sortable

Passing order=True to the dataclass decorator implements __lt__(), __le__(), __gt__(), and __ge__(), making the instances fully sortable. Instances are compared as tuples of their fields, in the order the fields are defined.

This is undoubtedly one of my favorite capabilities of data classes. So much code saved!

@dataclass(order=True)
class Payment:
    id_: int
    amount: int
    method: str
 
>>> payment_1 = Payment(id_=1, amount=1234, method="ACH")
>>> payment_2 = Payment(id_=1, amount=3456, method="ACH")
 
>>> payment_2 > payment_1
True
>>> payment_1 > payment_2
False
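
With those methods in place, sorted() works out of the box. The sketch below sorts a small list of made-up payments, first by the default field order and then by amount using an explicit key:

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass(order=True)
class Payment:
    id_: int
    amount: int
    method: str


payments = [
    Payment(id_=2, amount=100, method="CC"),
    Payment(id_=1, amount=3456, method="ACH"),
    Payment(id_=1, amount=1234, method="ACH"),
]

# Default ordering: by id_, then amount, then method.
for payment in sorted(payments):
    print(payment)
# Payment(id_=1, amount=1234, method='ACH')
# Payment(id_=1, amount=3456, method='ACH')
# Payment(id_=2, amount=100, method='CC')

# A key function overrides the field order when another criterion is needed.
by_amount = sorted(payments, key=attrgetter("amount"))
print([payment.amount for payment in by_amount])  # [100, 1234, 3456]
```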

Making It Immutable (or Almost)

Python doesn't allow for genuinely immutable objects. However, you can create a frozen data class that emulates immutability pretty well.

Passing frozen=True to the dataclass decorator implements __setattr__() and __delattr__() in a way that raises a FrozenInstanceError when someone tries to update or delete a field:

@dataclass(frozen=True)
class Payment:
    id_: int
    amount: int
    method: str


payment = Payment(id_=1, amount=1234, method="ACH")

>>> payment.amount = 2345
"""
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 4, in __setattr__
dataclasses.FrozenInstanceError: cannot assign to field 'amount'
"""

>>> del payment.method
"""
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 4, in __delattr__
dataclasses.FrozenInstanceError: cannot delete field 'method'
"""

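Two things are worth knowing when working with frozen instances. The sketch below uses dataclasses.replace() to get a modified copy, and then shows why the immutability is only "almost": object.__setattr__() still goes around the guard.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Payment:
    id_: int
    amount: int
    method: str


payment = Payment(id_=1, amount=1234, method="ACH")

# replace() returns a new instance and leaves the original alone.
updated = replace(payment, amount=2345)
print(updated)         # Payment(id_=1, amount=2345, method='ACH')
print(payment.amount)  # 1234

# Regular assignment is rejected.
try:
    payment.amount = 2345
except FrozenInstanceError as error:
    print(error)       # cannot assign to field 'amount'

# The guard lives in the class, so the base implementation bypasses it.
object.__setattr__(payment, "amount", 0)
print(payment.amount)  # 0
```
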
Can It Be Hashable?

The short answer is most likely. However, this topic can get quite convoluted, so I'll defer it to a separate article.
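
Without getting ahead of that article, here is a minimal sketch of the two most common cases. The class names MutablePayment and FrozenPayment are hypothetical and only serve the comparison.

```python
from dataclasses import dataclass


@dataclass
class MutablePayment:  # hypothetical name, default settings
    id_: int


@dataclass(frozen=True)
class FrozenPayment:  # hypothetical name, frozen variant
    id_: int


# With the defaults (eq=True, frozen=False), hashing is switched off.
print(MutablePayment.__hash__ is None)  # True

# A frozen data class gets a __hash__() based on its fields.
unique_payments = {FrozenPayment(id_=1), FrozenPayment(id_=1)}
print(len(unique_payments))  # 1
```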


Did you find this article helpful? What are some of your use cases for data classes?
