Answer a question

Lists or numpy arrays can be unpacked to multiple variables if the dimensions match. For a 2xN array, the following will work:

import numpy as np 
a,b =          [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3],   b=[4,5,6]

How can I achieve a similar behaviour for the columns of a pandas DataFrame? Extending the above example:

import pandas as pd 
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C']    # Rename cols and
df.index = ['i', 'ii']        # rows for clarity

The following does not work as expected:

a,b = df.T
# result: a='i',   b='ii'
a,b,c = df
# result: a='A',   b='B',   c='C'

However, what I would like to get is the following:

a,b,c = unpack(df)
# result: a=df['A'], b=df['B'], c=df['C']

Is the function unpack already available in pandas? Or can it be mimicked in an easy way?

Answers

I just figured out that the following works, which is already close to what I am trying to achieve:

a,b,c = df.T.values        # Common
a,b,c = df.T.to_numpy()    # Recommended
# a,b,c = df.T.as_matrix() # Deprecated, removed in pandas 1.0

Details: As always, things are a little more complicated than one thinks. Note that a pd.DataFrame stores columns separately in Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns into a single ndarray, which likely involves copying and type conversions. Also, the resulting array has a single dtype that must accommodate all data in the data frame.
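To make the dtype point concrete, here is a minimal sketch. The frame `mixed` and its column names are made up for illustration and are not part of the question:

```python
import pandas as pd

# Hypothetical frame with one integer, one float and one string column.
mixed = pd.DataFrame({"n": [1, 2], "x": [0.5, 1.5], "s": ["a", "b"]})

# Each column keeps its own dtype inside the DataFrame.
per_column = mixed.dtypes

# Combining all columns forces one common dtype (object, because of the strings).
combined = mixed.to_numpy()

# Integer and float columns alone are upcast to a common float dtype.
numeric_only = mixed[["n", "x"]].to_numpy()

print(per_column)
print(combined.dtype, numeric_only.dtype)
```

So after unpacking via df.T.to_numpy(), the integer column of such a frame would no longer be an integer array.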

In summary, the above approach loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate the columns in one of the following ways (there are more options):

# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items())      # returns pd.Series
a,b,c = (df[c] for c in df)            # returns pd.Series

Note that the above creates views of the original data, not copies! Depending on the pandas version, modifying them will likely trigger a SettingWithCopyWarning.

a.iloc[0] = "blabla"    # may emit SettingWithCopyWarning

If you want to modify the unpacked variables, you have to copy the columns.

# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items())      # returns pd.Series
a,b,c = (df[c].copy() for c in df)            # returns pd.Series
a,b,c = (df[c].to_numpy() for c in df)        # returns np.ndarray

While this is cleaner, it requires more typing. I personally do not recommend this kind of unpacking for production code. But to save typing (e.g., in interactive shell sessions), it is still a fair option.

# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]
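As far as I know, pandas has no built-in unpack function, but the generator expressions above can be wrapped in a small helper. The following is only a sketch: the name `unpack` is taken from the question, and the `copy` parameter is my own addition.

```python
import pandas as pd

def unpack(df, copy=True):
    """Return the columns of df as a tuple of Series, in column order.

    Hypothetical helper, not part of pandas.
    """
    # df.items() yields (label, Series) pairs; the label is dropped here.
    return tuple(col.copy() if copy else col for _, col in df.items())

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  columns=["A", "B", "C"], index=["i", "ii"])

a, b, c = unpack(df)
a.iloc[0] = 100   # changes the copy only, df is left untouched
```

With copy=True (the default in this sketch) the unpacked Series can be modified without touching df. Pass copy=False if you only need to read the columns.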