New Python Dataclass
Dataclass - New for Python 3.7 and above
Dataclasses are not exactly new (they arrived with Python 3.7, released June 27, 2018), but they are worth noting. The Dataclass is a useful option for handling data.
Per the PEP 557 Abstract:
This PEP describes an addition to the standard library called Data Classes. Although they use a very different mechanism, Data Classes can be thought of as “mutable namedtuples with defaults”. Because Data Classes use normal class definition syntax, you are free to use inheritance, metaclasses, docstrings, user-defined methods, class factories, and other Python class features.
A class decorator is provided which inspects a class definition for variables with type annotations as defined in PEP 526, “Syntax for Variable Annotations”. In this document, such variables are called fields. Using these fields, the decorator adds generated method definitions to the class to support instance initialization, a repr, comparison methods, and optionally other methods as described in the Specification section. Such a class is called a Data Class, but there’s really nothing special about the class: the decorator adds generated methods to the class and returns the same class it was given.
What does this mean?
The initial example in PEP 557 walks through how it works, but the gist is that the @dataclass decorator writes the boilerplate methods for the class in the background. The constructor __init__() and other magic methods such as __repr__() and __eq__() are generated automatically; __hash__() is set according to the eq and frozen options rather than always being generated. As a result, data class instances can be instantiated, printed, and compared as soon as the class is defined.
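As a minimal sketch of what the decorator generates, consider the small class below. The class name Reading and its two fields are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """Hypothetical two-field record used only for illustration."""
    station: str
    magnitude: float = 0.0  # fields may carry defaults


a = Reading("ci", 1.29)
b = Reading("ci", 1.29)

print(a)               # Reading(station='ci', magnitude=1.29)
print(a == b)          # True: the generated __eq__ compares field by field
print(Reading("ak"))   # Reading(station='ak', magnitude=0.0)

# With the default options (eq=True, frozen=False) instances are unhashable
print(Reading.__hash__ is None)  # True
```

No __init__, __repr__, or __eq__ was written by hand here; all three come from the decorator.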
Where is it not appropriate to use Data Classes?
- API compatibility with tuples or dicts is required.
- Type validation beyond that provided by PEPs 484 and 526 is required, or value validation or conversion is required.
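The second point can be seen directly: the annotations are hints, and the generated __init__ does not check them. The sketch below uses two hypothetical classes, Depth and CheckedDepth, to show that any conversion or validation has to be written by hand, for example in __post_init__.

```python
from dataclasses import dataclass


@dataclass
class Depth:
    """Hypothetical record; the annotation is a hint, not a runtime check."""
    km: float


@dataclass
class CheckedDepth:
    """Same field, with a hand-written conversion step."""
    km: float

    def __post_init__(self):
        # float() raises ValueError for input that is not numeric
        self.km = float(self.km)


loose = Depth("eleven")            # accepted, no error is raised
print(type(loose.km).__name__)     # str

strict = CheckedDepth("11.008")    # converted by __post_init__
print(strict.km)                   # 11.008

try:
    CheckedDepth("eleven")
except ValueError as err:
    print("rejected:", err)
```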
Why use a Dataclass?
Below is an example that pulls USGS earthquake data from an API, uses a Dataclass to hold each record and add a derived field, and then loads the result into a Pandas DataFrame. This approach standardizes and cleans up the step from raw JSON to tabular data.
Code Example
Consider ingesting data from the USGS Earthquake API. (For full details on the USGS Earthquake API - LINK)
The API returns JSON, and the information is a collection of strings, integers, and floats. The full description of the data is available on the USGS website. For this example the variable of interest is time. The time is stored as an integer, for example 1596974857650, which is the number of milliseconds since the Unix epoch (1970-01-01 UTC), not an ISO 8601 date/time string. For ease of reading, it is converted to a date and time format such as YYYY-MM-DD HH:MM:SS. To do this, the Dataclass adds a new variable within the class and automatically converts the time when an instance of the class is created.
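The conversion itself is small. The sketch below assumes the value is in epoch milliseconds and formats it in UTC so the result is reproducible; the helper name to_readable is hypothetical, and the article's own code later uses local time instead.

```python
from datetime import datetime, timezone

raw_ms = 1596974857650  # example value from the text


def to_readable(ms: int) -> str:
    """Hypothetical helper: format epoch milliseconds as YYYY-MM-DD HH:MM:SS (UTC)."""
    # fromtimestamp() expects seconds, so divide the milliseconds by 1000
    moment = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(to_readable(raw_ms))  # 2020-08-09 12:07:37
```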
Input Data
The data input looks like this (shown on one line and truncated after the first feature):
{"type":"FeatureCollection","metadata":{"generated":1617631452000,"url":"https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2014-01-01&endtime=2014-01-02","title":"USGS Earthquakes","status":200,"api":"1.10.3","count":326},"features":[{"type":"Feature","properties":{"mag":1.29,"place":"10km SSW of Idyllwild, CA","time":1388620296020,"updated":1457728844428,"tz":-480,"url":"https://earthquake.usgs.gov/earthquakes/eventpage/ci11408890","detail":"https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=ci11408890&format=geojson","felt":null,"cdi":null,"mmi":null,"alert":null,"status":"reviewed","tsunami":0,"sig":26,"net":"ci","code":"11408890","ids":",ci11408890,","sources":",ci,","types":",cap,focal-mechanism,general-link,geoserve,nearby-cities,origin,phase-data,scitech-link,","nst":39,"dmin":0.06729,"rms":0.09,"gap":51,"magType":"ml","type":"earthquake","title":"M 1.3 - 10km SSW of Idyllwild, CA"},"geometry":{"type":"Point","coordinates":[-116.7776667,33.6633333,11.008]},"id":"ci11408890"},
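A response like the one above could be fetched roughly as follows. This is a sketch, not the article's code: it swaps in the standard-library urllib in place of requests to stay dependency-free, the helper names build_query_url and fetch_earthquakes are hypothetical, and the query parameters are the ones visible in the url field of the sample.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"


def build_query_url(start: str, end: str) -> str:
    """Hypothetical helper: build a GeoJSON query URL for a date range."""
    params = {"format": "geojson", "starttime": start, "endtime": end}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_earthquakes(start: str, end: str) -> dict:
    """Download and decode the response (requires network access)."""
    with urlopen(build_query_url(start, end), timeout=30) as resp:
        return json.load(resp)


print(build_query_url("2014-01-01", "2014-01-02"))
```

Calling fetch_earthquakes("2014-01-01", "2014-01-02") would return the decoded dictionary, with the events under its "features" key.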
Building a Dataclass
Using Python 3 and the requests library, the API is called and the returned data is then ingested into a Dataclass. The Dataclass is as follows.
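import datetime
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EarthquakeClassEvent:
    mag: float
    place: str
    time: int  # milliseconds since the Unix epoch
    updated: int
    tz: Optional[int]  # may be null in the API response
    url: str
    detail: str
    felt: Optional[int]
    cdi: Optional[float]
    mmi: Optional[float]
    alert: Optional[str]
    status: str
    tsunami: int
    sig: int
    net: str
    code: str
    ids: str
    sources: str
    types: str
    nst: int
    dmin: float
    rms: float
    gap: int
    magType: str
    ttype: str  # holds the API's "type" value
    title: str
    # derived in __post_init__, so it is excluded from the constructor
    readable_time: str = field(init=False)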
Note the last line, readable_time: it is not a data field from the API but a user-defined field declared with field(init=False), so it takes no value when an instance is created. To convert the raw time input to a readable time, __post_init__(self) is used as follows:
    def __post_init__(self):
        """Convert the raw millisecond timestamp into a readable
        time string and save it to readable_time."""
        # fromtimestamp() expects seconds and returns local time
        ts = int(self.time) / 1000
        self.readable_time = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
When data is loaded into the class, the method above (part of the Dataclass itself) converts time to the desired format and stores it in readable_time. The process of appending each earthquake event as a class object to a list is shown below:
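def parse_data_to_dataclass(data):
    """Build one EarthquakeClassEvent per feature in the decoded response."""
    # list to store the dataclass instances
    earthquake_list = []
    for feature in data['features']:
        props = feature['properties']
        # arguments are positional, so they must follow the field order
        earthquake_list.append(EarthquakeClassEvent(
            props['mag'], props['place'], props['time'], props['updated'],
            props['tz'], props['url'], props['detail'], props['felt'],
            props['cdi'], props['mmi'], props['alert'], props['status'],
            props['tsunami'], props['sig'], props['net'], props['code'],
            props['ids'], props['sources'], props['types'], props['nst'],
            props['dmin'], props['rms'], props['gap'], props['magType'],
            props['type'], props['title']))
    return earthquake_list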
The advantage of this approach is that the ingestion of the data and the operations on the data are all in one place. This helps keep the process clean and easier to use in different applications.
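To try the pattern without the full 26-field class, a trimmed-down sketch can help. QuakeSummary and from_properties below are hypothetical names; the sketch selects constructor arguments by field name instead of by position, and it formats the time in UTC (an assumption made here for reproducibility, where the class above uses local time). The sample values are copied from the input shown earlier.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime, timezone


@dataclass
class QuakeSummary:
    """Hypothetical trimmed-down event with three API fields."""
    mag: float
    place: str
    time: int                               # epoch milliseconds
    readable_time: str = field(init=False)  # derived, not passed in

    def __post_init__(self):
        # UTC is an assumption of this sketch
        moment = datetime.fromtimestamp(self.time / 1000, tz=timezone.utc)
        self.readable_time = moment.strftime("%Y-%m-%d %H:%M:%S")


def from_properties(props: dict) -> QuakeSummary:
    """Pick only the constructor fields out of a properties dict, by name."""
    wanted = [f.name for f in fields(QuakeSummary) if f.init]
    return QuakeSummary(**{name: props[name] for name in wanted})


# A few values from the sample feature; extra keys such as tz are ignored
sample = {"mag": 1.29, "place": "10km SSW of Idyllwild, CA",
          "time": 1388620296020, "tz": -480}

event = from_properties(sample)
print(event.readable_time)  # 2014-01-01 23:51:36
```

Selecting by name means the call does not depend on the order of the fields, at the cost of requiring the field names to match the API keys.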
Another useful feature is that a list of Dataclass instances can be loaded directly into a Pandas DataFrame. Below, the list of EarthquakeClassEvent objects is converted to a Pandas DataFrame.
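# load the dataclass instances into a pandas DataFrame
# (data is the decoded JSON response from the API)
import pandas as pd
earthquakes = parse_data_to_dataclass(data)
df = pd.DataFrame(earthquakes)
print(df.head(5))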
The list of Dataclass data is loaded into the Pandas Dataframe and the first five rows are displayed.
   mag                              place           time        updated   tz  ...  gap  magType       ttype                                      title        readable_time
0  6.4      26 km SW of Pocito, Argentina  1611024382380  1616879123040  NaN  ...   22      mww  earthquake      M 6.4 - 26 km SW of Pocito, Argentina  2021-01-18 21:46:22
1  5.5  52 km NE of Bandar-e Lengeh, Iran  1610746264660  1616879081040  NaN  ...   35      mww  earthquake  M 5.5 - 52 km NE of Bandar-e Lengeh, Iran  2021-01-15 16:31:04
2  5.5        7 km WNW of Sivrice, Turkey  1609051052897  1615072766040  NaN  ...   26      mww  earthquake        M 5.5 - 7 km WNW of Sivrice, Turkey  2020-12-27 01:37:32
3  7.6     99 km SE of Sand Point, Alaska  1603140878950  1615823316759  NaN  ...   36      mww  earthquake     M 7.6 - 99 km SE of Sand Point, Alaska  2020-10-19 16:54:38
4  6.6  13 km E of San Pedro, Philippines  1597709028566  1603570140040  NaN  ...   14      mww  earthquake  M 6.6 - 13 km E of San Pedro, Philippines  2020-08-17 20:03:48
[5 rows x 27 columns]
Now the data is converted as desired and held in a Pandas DataFrame. The readable_time column is the last one on the right. The full code is in GitHub LINK.
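One way to see why a list of dataclass instances maps so naturally onto rows and columns is to convert each instance to a dictionary with the standard-library asdict function. The sketch below uses a hypothetical two-field class, Row, in place of EarthquakeClassEvent; a list of such dictionaries is also a form that pd.DataFrame accepts.

```python
from dataclasses import dataclass, asdict


@dataclass
class Row:
    """Hypothetical two-field record standing in for EarthquakeClassEvent."""
    mag: float
    place: str


events = [Row(6.4, "26 km SW of Pocito, Argentina"),
          Row(5.5, "7 km WNW of Sivrice, Turkey")]

records = [asdict(e) for e in events]  # one plain dict per instance
columns = list(records[0])             # field names, in declaration order

print(columns)      # ['mag', 'place']
print(records[0])   # {'mag': 6.4, 'place': '26 km SW of Pocito, Argentina'}
```

Each field name becomes a column and each instance becomes a row, which matches the layout of the output above.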
Conclusions
The “New” Python Dataclass is a useful tool that standardizes how data is ingested, transformed, and handled. It keeps the data handling in one place, and a list of its instances can be easily converted to a Pandas DataFrame. This is just a brief and simple example, but there are many examples and resources that go into much greater depth.
References
- What’s New In Python 3.7
- PEP 557 – Data Classes
- dataclasses — Data Classes
- Data Classes in Python 3.7+ (Guide)
- Cool New Features in Python 3.7
- Classes in Python: Fundamentals for Data Scientists
- Understanding and using Python classes
- Object-oriented programming for data scientists: Build your ML estimator
- Earthquake Data API
- USGS API Documentation - Earthquake Catalog
- Python Dataclasses With Properties and Pandas
- Post-Init Processing In Python Data Class