Some Pointers for Handling Large Datasets in Python
Here are some tips collected from around the web on handling large datasets in Python.
-
Use categorical data types
Selectively assign the category dtype to columns that hold only a small number of distinct values. This saves memory when working with Pandas Series and DataFrames.
import io
import pandas as pd

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
df = pd.read_csv(io.StringIO(data), dtype={'col1': 'category'})
print(df.dtypes)
print(df.memory_usage(index=False, deep=True))
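As a rough sketch of the effect on an existing column (the DataFrame and its 'city' column below are hypothetical), converting a low-cardinality string column with .astype('category') typically cuts its memory footprint sharply:

import pandas as pd

# Hypothetical DataFrame with a repetitive string column
df = pd.DataFrame({'city': ['London', 'Paris', 'Berlin', 'Paris'] * 250000})

print(df['city'].memory_usage(deep=True))   # object dtype: large
df['city'] = df['city'].astype('category')  # convert to categorical
print(df['city'].memory_usage(deep=True))   # category dtype: much smaller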
-
Use the “chunksize” parameter in Pandas when reading CSV files.
chunks_df = pd.read_csv(datasource, chunksize=1000000)
chunk_list = []  # append each chunk df here

# Each chunk is in df format
for chunk in chunks_df:
    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk)

# Concat the list into a single dataframe
df_concat = pd.concat(chunk_list)
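If you only need part of each chunk, filtering inside the loop keeps the concatenated result manageable. A minimal sketch, assuming a hypothetical numeric column named 'value' to filter on:

chunk_list = []
for chunk in pd.read_csv(datasource, chunksize=1000000):
    # Keep only the rows you need from each chunk; 'value' and the
    # threshold are placeholders for your own filter condition
    filtered = chunk[chunk['value'] > 0]
    chunk_list.append(filtered)

df_concat = pd.concat(chunk_list, ignore_index=True)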
-
Filter out columns
Remove all unnecessary columns when preparing a DataFrame for processing.
df = df[['a', 'b']]
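If you know up front which columns you need, you can avoid loading the rest from disk at all by passing usecols to read_csv. A minimal sketch (the file name is a placeholder):

# Read only columns 'a' and 'b'; 'large_file.csv' is a placeholder path
df = pd.read_csv('large_file.csv', usecols=['a', 'b'])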
-
Free up some memory by changing column types (e.g. int64 to int32)
Use .astype() to convert your column types.
# Change the dtypes (int64 -> int32)
df[['a', 'b']] = df[['a', 'b']].astype('int32')

# Change the dtypes (float64 -> float32)
df[['a', 'b']] = df[['a', 'b']].astype('float32')
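Alternatively, pd.to_numeric with the downcast argument picks the smallest numeric type that can hold the data, and checking memory_usage before and after makes the saving visible. A short sketch using the same placeholder columns 'a' and 'b':

print(df.memory_usage(deep=True).sum())               # bytes before downcasting
df['a'] = pd.to_numeric(df['a'], downcast='integer')  # e.g. int64 -> int8/int16/int32
df['b'] = pd.to_numeric(df['b'], downcast='float')    # e.g. float64 -> float32
print(df.memory_usage(deep=True).sum())               # bytes after downcasting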