I have a dataframe which has aggregated data for some days. I want to add in the missing days
I was following another post, Add missing dates to pandas dataframe, unfortunately, it overwrote my results (maybe functionality was changed slightly?)... the code is below
import random
import datetime as dt
import numpy as np
import pandas as pd
def generate_row(year, month, day):
while True:
date = dt.datetime(year=year, month=month, day=day)
data = np.random.random(size=4)
yield [date] + list(data)
# days I have data for
dates = [(2000, 1, 1), (2000, 1, 2), (2000, 2, 4)]
generators = [generate_row(*date) for date in dates]
# get 5 data points for each
data = [next(generator) for generator in generators for _ in range(5)]
df = pd.DataFrame(data, columns=['date'] + ['f'+str(i) for i in range(1,5)])
# df
groupby_day = df.groupby(pd.PeriodIndex(data=df.date, freq='D'))
results = groupby_day.sum()
idx = pd.date_range(min(df.date), max(df.date))
results.reindex(idx, fill_value=0)
Results before filling in missing date indices

Results after

You need to use period_range rather than date_range:
In [11]: idx = pd.period_range(min(df.date), max(df.date))
...: results.reindex(idx, fill_value=0)
...:
Out[11]:
f1 f2 f3 f4
2000-01-01 2.049157 1.962635 2.756154 2.224751
2000-01-02 2.675899 2.587217 1.540823 1.606150
2000-01-03 0.000000 0.000000 0.000000 0.000000
2000-01-04 0.000000 0.000000 0.000000 0.000000
2000-01-05 0.000000 0.000000 0.000000 0.000000
2000-01-06 0.000000 0.000000 0.000000 0.000000
2000-01-07 0.000000 0.000000 0.000000 0.000000
2000-01-08 0.000000 0.000000 0.000000 0.000000
2000-01-09 0.000000 0.000000 0.000000 0.000000
2000-01-10 0.000000 0.000000 0.000000 0.000000
2000-01-11 0.000000 0.000000 0.000000 0.000000
2000-01-12 0.000000 0.000000 0.000000 0.000000
2000-01-13 0.000000 0.000000 0.000000 0.000000
2000-01-14 0.000000 0.000000 0.000000 0.000000
2000-01-15 0.000000 0.000000 0.000000 0.000000
2000-01-16 0.000000 0.000000 0.000000 0.000000
2000-01-17 0.000000 0.000000 0.000000 0.000000
2000-01-18 0.000000 0.000000 0.000000 0.000000
2000-01-19 0.000000 0.000000 0.000000 0.000000
2000-01-20 0.000000 0.000000 0.000000 0.000000
2000-01-21 0.000000 0.000000 0.000000 0.000000
2000-01-22 0.000000 0.000000 0.000000 0.000000
2000-01-23 0.000000 0.000000 0.000000 0.000000
2000-01-24 0.000000 0.000000 0.000000 0.000000
2000-01-25 0.000000 0.000000 0.000000 0.000000
2000-01-26 0.000000 0.000000 0.000000 0.000000
2000-01-27 0.000000 0.000000 0.000000 0.000000
2000-01-28 0.000000 0.000000 0.000000 0.000000
2000-01-29 0.000000 0.000000 0.000000 0.000000
2000-01-30 0.000000 0.000000 0.000000 0.000000
2000-01-31 0.000000 0.000000 0.000000 0.000000
2000-02-01 0.000000 0.000000 0.000000 0.000000
2000-02-02 0.000000 0.000000 0.000000 0.000000
2000-02-03 0.000000 0.000000 0.000000 0.000000
2000-02-04 1.856158 2.892620 2.986166 2.793448
This is because your groupby uses PeriodIndex, rather than datetime:
df.groupby(pd.PeriodIndex(data=df.date, freq='D'))
You could have instead used a pd.Grouper:
df.groupby(pd.Grouper(key="date", freq='D'))
which would have give a datetime index.
所有评论(0)