Some Pointers for Handling Large Datasets in Python
Here are some tips collected from around the web on handling large datasets in Python.
-
    Use categorical data types. Selectively assign a category datatype; this saves memory when working with pandas Series and DataFrames (a memory-comparison sketch follows this list).

    ```python
    import io

    import pandas as pd

    data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
    df = pd.read_csv(io.StringIO(data), dtype={'col1': 'category'})
    print(df.dtypes)
    print(df.memory_usage(index=False, deep=True))
    ```
-
    Use the “chunksize” parameter in pandas when reading CSV files (a sketch that filters each chunk follows this list).

    ```python
    import pandas as pd

    # Read the file in chunks of one million rows; each chunk is a DataFrame
    # (datasource is the path to the CSV file)
    chunks_df = pd.read_csv(datasource, chunksize=1000000)

    chunk_list = []  # collect each processed chunk here
    for chunk in chunks_df:
        # Filter or transform the chunk as needed, then append it to the list
        chunk_list.append(chunk)

    # Concatenate the chunks back into a single DataFrame
    df_concat = pd.concat(chunk_list)
    ```
-
    Filter out columns. Remove all unnecessary columns when preparing a DataFrame for processing (see the column-selection sketch after this list).

    ```python
    # Keep only the columns needed for processing
    df = df[['a', 'b']]
    ```
-
    Free up some memory by changing column types (e.g. int64 to int32). Use .astype() to convert your column types (a downcasting sketch follows this list).

    ```python
    # Downcast integer columns (int64 -> int32)
    df[['a', 'b']] = df[['a', 'b']].astype('int32')

    # Or downcast float columns (float64 -> float32)
    df[['a', 'b']] = df[['a', 'b']].astype('float32')
    ```
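A few expanded sketches of these tips follow. They use made-up data and column names purely for illustration, not code from the original sources. First, a minimal comparison of the memory footprint of a low-cardinality column stored as plain object strings versus as a category:

```python
import pandas as pd

# A column with millions of rows but only three distinct labels
s_object = pd.Series(['low', 'medium', 'high'] * 1_000_000)  # object dtype
s_category = s_object.astype('category')                     # category dtype

# deep=True measures the actual string storage, not just the pointers
print(s_object.memory_usage(deep=True))    # every string counted separately
print(s_category.memory_usage(deep=True))  # small integer codes plus one copy of each label
```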
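The chunking tip's comments mention filtering before appending; here is one way that might look, assuming a hypothetical `large.csv` with a numeric `value` column:

```python
import pandas as pd

chunk_list = []
# 'large.csv' and the 'value' column are placeholders for this sketch
for chunk in pd.read_csv('large.csv', chunksize=1_000_000):
    # Keep only the rows of interest so the final concatenation stays small
    chunk_list.append(chunk[chunk['value'] > 0])

df_filtered = pd.concat(chunk_list, ignore_index=True)
```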
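For column selection, dropping columns after loading already helps, but pandas' read_csv can also skip them at parse time via its usecols parameter, so they never occupy memory. A small sketch with inline data:

```python
import io

import pandas as pd

data = 'a,b,c\n1,2,3\n4,5,6'

# Load every column, then check the footprint
df_all = pd.read_csv(io.StringIO(data))
print(df_all.memory_usage(deep=True).sum())

# Or never load column 'c' in the first place
df_slim = pd.read_csv(io.StringIO(data), usecols=['a', 'b'])
print(df_slim.memory_usage(deep=True).sum())
```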
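Finally, a rough before-and-after for dtype downcasting. Keep in mind that int32 only holds values up to about +/- 2.1 billion, and float32 trades away some precision:

```python
import numpy as np
import pandas as pd

# Hypothetical frame that defaults to 64-bit dtypes
df = pd.DataFrame({'a': np.arange(1_000_000, dtype='int64'),
                   'b': np.random.rand(1_000_000)})
print(df.memory_usage(deep=True).sum())   # roughly 16 MB of column data

df['a'] = df['a'].astype('int32')
df['b'] = df['b'].astype('float32')
print(df.memory_usage(deep=True).sum())   # roughly half of that
```

pd.to_numeric with its downcast argument can also pick the smallest safe numeric type automatically instead of hard-coding the target dtype.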