Some Pointers for Handling Large Datasets in Python
Here are some tips collected from around the web on handling large datasets in Python.
-
    Use categorical data types. Selectively assign a category datatype; this saves memory when working with pandas Series and DataFrames (a memory-comparison sketch follows this list).

    ```python
    import io

    import pandas as pd

    data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
    df = pd.read_csv(io.StringIO(data), dtype={'col1': 'category'})
    print(df.dtypes)
    print(df.memory_usage(index=False, deep=True))
    ```
-
    Use the “chunksize” parameter in pandas when reading CSV files (a sketch that filters each chunk follows this list).

    ```python
    import pandas as pd

    # Read the file in chunks of one million rows; each chunk is a DataFrame
    # (datasource is the path to the CSV file)
    chunks_df = pd.read_csv(datasource, chunksize=1000000)

    chunk_list = []  # collect each processed chunk here
    for chunk in chunks_df:
        # Filter or transform the chunk as needed, then append it to the list
        chunk_list.append(chunk)

    # Concatenate the chunks back into a single DataFrame
    df_concat = pd.concat(chunk_list)
    ```
-
    Filter out columns. Remove all unnecessary columns when preparing a DataFrame for processing (see the column-selection sketch after this list).

    ```python
    # Keep only the columns needed for processing
    df = df[['a', 'b']]
    ```
-
    Free up some memory by changing column types (e.g. int64 to int32). Use .astype() to convert your column types (a downcasting sketch follows this list).

    ```python
    # Downcast integer columns (int64 -> int32)
    df[['a', 'b']] = df[['a', 'b']].astype('int32')

    # Or downcast float columns (float64 -> float32)
    df[['a', 'b']] = df[['a', 'b']].astype('float32')
    ```
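A few expanded sketches of these tips follow. They use made-up data and column names purely for illustration, not code from the original sources. First, a minimal comparison of the memory footprint of a low-cardinality column stored as plain object strings versus as a category:

```python
import pandas as pd

# A column with millions of rows but only three distinct labels
s_object = pd.Series(['low', 'medium', 'high'] * 1_000_000)  # object dtype
s_category = s_object.astype('category')                     # category dtype

# deep=True measures the actual string storage, not just the pointers
print(s_object.memory_usage(deep=True))    # every string counted separately
print(s_category.memory_usage(deep=True))  # small integer codes plus one copy of each label
```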
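The chunking tip's comments mention filtering before appending; here is one way that might look, assuming a hypothetical `large.csv` with a numeric `value` column:

```python
import pandas as pd

chunk_list = []
# 'large.csv' and the 'value' column are placeholders for this sketch
for chunk in pd.read_csv('large.csv', chunksize=1_000_000):
    # Keep only the rows of interest so the final concatenation stays small
    chunk_list.append(chunk[chunk['value'] > 0])

df_filtered = pd.concat(chunk_list, ignore_index=True)
```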
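For column selection, dropping columns after loading already helps, but pandas' read_csv can also skip them at parse time via its usecols parameter, so they never occupy memory. A small sketch with inline data:

```python
import io

import pandas as pd

data = 'a,b,c\n1,2,3\n4,5,6'

# Load every column, then check the footprint
df_all = pd.read_csv(io.StringIO(data))
print(df_all.memory_usage(deep=True).sum())

# Or never load column 'c' in the first place
df_slim = pd.read_csv(io.StringIO(data), usecols=['a', 'b'])
print(df_slim.memory_usage(deep=True).sum())
```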
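Finally, a rough before-and-after for dtype downcasting. Keep in mind that int32 only holds values up to about +/- 2.1 billion, and float32 trades away some precision:

```python
import numpy as np
import pandas as pd

# Hypothetical frame that defaults to 64-bit dtypes
df = pd.DataFrame({'a': np.arange(1_000_000, dtype='int64'),
                   'b': np.random.rand(1_000_000)})
print(df.memory_usage(deep=True).sum())   # roughly 16 MB of column data

df['a'] = df['a'].astype('int32')
df['b'] = df['b'].astype('float32')
print(df.memory_usage(deep=True).sum())   # roughly half of that
```

pd.to_numeric with its downcast argument can also pick the smallest safe numeric type automatically instead of hard-coding the target dtype.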