During my career in data science, I’ve picked up a few things along the way that have made my work easier. Trust me, there are things about Python that are straight-up boss. Here’s one example.
Using Pandas to do Time Series Data Processing
If you work with time series data day to day, you have probably spent a huge chunk of time accounting for missing records or aggregating data at a specific granularity, either with SQL queries or with custom functions. Pandas has a very useful resample method that helps you process data at a specific frequency once you set the DataFrame index to the timestamp column.
I intend to use the data set for room occupancy to demonstrate this function. The data set can be obtained here.
First, I intend to demonstrate how to perform a simple aggregation to obtain hourly metrics.
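As a minimal sketch of that hourly aggregation, the snippet below builds a small synthetic frame instead of loading the real occupancy file. The column names (`date`, `Temperature`, `Occupancy`) and the values are assumptions for illustration only, so adjust them to match your copy of the data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the occupancy data: one reading per minute for 4 hours.
# Column names and values are assumptions made for this example.
rng = np.random.default_rng(0)
timestamps = pd.date_range("2015-02-04 13:00", periods=240, freq="min")
data = pd.DataFrame({
    "date": timestamps,
    "Temperature": rng.normal(22.0, 0.5, size=240),
    "Occupancy": rng.integers(0, 2, size=240),
})

# resample needs a datetime-like index, so move the timestamp column there.
data = data.set_index("date")

# Aggregate to hourly granularity: average temperature and the number of
# minutes per hour in which the room was flagged as occupied.
hourly = data.resample("h").agg({"Temperature": "mean", "Occupancy": "sum"})
print(hourly)
```

Each row of `hourly` is labeled with the start of its hour. Passing a dictionary to `agg` lets every column use the aggregation that makes sense for it.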
Even though this data isn’t sparse, in real life you typically find data with missing records. It’s crucial to take those gaps into account: you might fill them with 0 if a missing record means nothing happened, or you might impute them from the previous or following time step. It depends on what the data represents. Below, I remove the records at hour 15 to demonstrate how you can use hour 14 to impute the missing information:
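Here is a rough sketch of that idea, again on synthetic data rather than the real occupancy file. The `Temperature` column and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic per-minute readings covering hours 13 through 16 (assumed data).
timestamps = pd.date_range("2015-02-04 13:00", periods=240, freq="min")
data = pd.DataFrame(
    {"Temperature": np.linspace(21.0, 23.0, 240)},
    index=timestamps,
)

# Drop every record from hour 15 to simulate a gap in the data.
with_gap = data[data.index.hour != 15]

# resample still emits a row for the empty 15:00 bin; its mean is NaN.
hourly = with_gap.resample("h").mean()

# Option 1: carry the hour 14 value forward into hour 15.
filled_forward = hourly.ffill()

# Option 2: treat the missing hour as zero, if "no records" means "nothing happened".
filled_zero = hourly.fillna(0)

print(hourly)
print(filled_forward)
```

If the following time step is a better proxy than the previous one, `bfill()` works the same way in the other direction.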