Because
- I don't need double precision
- My machine has limited memory and I want to process bigger datasets
- I need to pass the extracted data (as a matrix) to BLAS libraries, and single-precision BLAS calls are about 2x faster than their double-precision equivalents.
Note that not all columns in the raw CSV file are floats; I only need to set float32 as the default for the float columns.
Try:
import numpy as np
import pandas as pd

# Sample 100 rows of data to determine dtypes.
df_test = pd.read_csv(filename, nrows=100)

# Collect the columns pandas inferred as float64 and map them to float32.
float_cols = [c for c in df_test if df_test[c].dtype == "float64"]
float32_cols = {c: np.float32 for c in float_cols}

# Read the whole file, forcing the float columns to float32.
df = pd.read_csv(filename, engine='c', dtype=float32_cols)
This first reads a sample of 100 rows (modify as required) so that pandas can infer the type of each column.
It then creates a list of the columns whose inferred dtype is float64, and uses a dictionary comprehension to build a dictionary with those columns as keys and np.float32 as the value for each key.
Finally, it reads the whole file using the 'c' engine (required for assigning dtypes to columns), passing the float32_cols dictionary to the dtype parameter.
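If you need this in more than one place, the two-pass read can be wrapped in a small helper. This is only a sketch: the function name read_csv_float32 and its sample_rows parameter are my own invention, not part of pandas.

import numpy as np
import pandas as pd

def read_csv_float32(filename, sample_rows=100, **kwargs):
    # Hypothetical convenience wrapper: sample the first rows to find
    # the float64 columns, then re-read the full file with an explicit
    # float32 mapping. Extra keyword arguments go to both read_csv calls.
    sample = pd.read_csv(filename, nrows=sample_rows, **kwargs)
    float32_cols = {c: np.float32 for c in sample.columns
                    if sample[c].dtype == "float64"}
    return pd.read_csv(filename, engine='c', dtype=float32_cols, **kwargs)

One caveat of any sampling approach: a column whose first sample_rows values all happen to parse as integers will not be detected as float, so raise sample_rows if your data is heterogeneous.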
A quick demo with a small file, reading the sample first:

df = pd.read_csv(filename, nrows=100)

>>> df
   int_col  float1 string_col  float2
0        1     1.2          a     2.2
1        2     1.3          b     3.3
2        3     1.4          c     4.4

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
int_col       3 non-null int64
float1        3 non-null float64
string_col    3 non-null object
float2        3 non-null float64
dtypes: float64(2), int64(1), object(1)
df32 = pd.read_csv(filename, engine='c', dtype={c: np.float32 for c in float_cols})

>>> df32.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
int_col       3 non-null int64
float1        3 non-null float32
string_col    3 non-null object
float2        3 non-null float32
dtypes: float32(2), int64(1), object(1)
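As a quick check that the downcast actually halves the float storage, you can compare the per-column byte counts of the two frames with DataFrame.memory_usage (illustrative; the exact numbers depend on your data):

# Each float32 column should report half the bytes of its float64
# counterpart; deep=True also counts the string column's contents.
print(df.memory_usage(deep=True))
print(df32.memory_usage(deep=True))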