Answer a question

I have a dataframe that looks like this:

Out[14]:
    impwealth  indweight
16     180000     34.200
21     384000     37.800
26     342000     39.715
30    1154000     44.375
31     421300     44.375
32    1210000     45.295
33    1062500     45.295
34    1878000     46.653
35     876000     46.653
36     925000     53.476

I want to calculate the weighted median of the column impwealth using the frequency weights in indweight. My pseudo code looks like this:

# Sort `impwealth` in ascending order 
df.sort('impwealth', 'inplace'=True)

# Find the 50th percentile weight, P
P = df['indweight'].sum() * (.5)

# Search for the first occurrence of `impweight` that is greater than P 
i = df.loc[df['indweight'] > P, 'indweight'].last_valid_index()

# The value of `impwealth` associated with this index will be the weighted median
w_median = df.ix[i, 'impwealth']

This method seems clunky, and I'm not sure it's correct. I didn't find a built in way to do this in pandas reference. What is the best way to go about finding weighted median?

Answers

If you want to do this in pure pandas, here's a way. It does not interpolate either. (@svenkatesh, you were missing the cumulative sum in your pseudocode)

df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
median = df.impwealth[cumsum >= cutoff].iloc[0]

This gives a median of 925000.

Logo

Python社区为您提供最前沿的新闻资讯和知识内容

更多推荐