I have a large data set (about 10M rows) in Amazon Redshift that I want to load into a pandas DataFrame and then store in a pickle file. However, it raises an "Out of Memory" exception, presumably because of the size of the data. I have tried several other things, such as SQLAlchemy, but have not been able to crack the problem. Can anyone suggest a better way, or code, to get through it?
My current (simple) code snippet is as follows:
import psycopg2
import pandas as pd

# Values in angle brackets are placeholders for the real connection settings.
cnxn = psycopg2.connect(dbname='<mydatabase>', host='my_redshift_Server_Name', port='5439', user='<username>', password='<pwd>')
sql = "Select * from mydatabase.mytable"
# Loads the entire result set into memory at once.
df = pd.read_sql(sql, cnxn)
pd.to_pickle(df, 'Base_Data.pkl')
print(df.head(50))
cnxn.close()
1) Find the row count of the table and the largest chunk you can pull at once, by adding order by [column] limit [number] offset 0 to the query and increasing the limit number in reasonable steps.
2) Add a loop that generates the SQL with the limit you found and an increasing offset. For example, if you can pull 10k rows at a time, your statements would be:
... limit 10000 offset 0;
... limit 10000 offset 10000;
... limit 10000 offset 20000;
until you reach the table row count
3) In the same loop, append each newly obtained set of rows to your dataframe.
P.S. This will work assuming you don't run into memory/disk issues on the client end, which I can't guarantee, since you have such an issue on a cluster that is likely higher-grade hardware. To avoid the problem, you could write a new file on every iteration instead of appending.
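The paging loop described above, with one file written per iteration, could be sketched roughly as follows. This is only an illustration: the function name, the output file naming, and the choice to pickle plain rows rather than a DataFrame are assumptions, not something from the question. It relies only on the generic DB-API cursor interface, so a psycopg2 connection should work as `conn`.

```python
import pickle


def dump_table_in_chunks(conn, table, order_col, chunk_size, out_prefix):
    """Pull `table` in LIMIT/OFFSET pages and pickle each page to its own file.

    `conn` is any DB-API connection (for example from psycopg2.connect).
    `table` and `order_col` are assumed to be trusted identifiers, not user input.
    Returns the list of file names written.
    """
    files = []
    offset = 0
    cur = conn.cursor()
    while True:
        # A stable ORDER BY is needed, otherwise pages may overlap or skip rows.
        cur.execute(
            f"SELECT * FROM {table} ORDER BY {order_col} "
            f"LIMIT {int(chunk_size)} OFFSET {int(offset)}"
        )
        rows = cur.fetchall()
        if not rows:
            break  # offset has reached the table row count
        columns = [d[0] for d in cur.description]
        # Hypothetical naming scheme: <prefix>_00000.pkl, <prefix>_00001.pkl, ...
        path = f"{out_prefix}_{len(files):05d}.pkl"
        with open(path, "wb") as fh:
            pickle.dump({"columns": columns, "rows": rows}, fh)
        files.append(path)
        offset += len(rows)
        if len(rows) < chunk_size:
            break  # last, partially filled page
    cur.close()
    return files
```

Each file can later be turned into a DataFrame on its own with `pd.DataFrame(page["rows"], columns=page["columns"])`, so only one chunk has to fit in memory at a time.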
Also, the whole approach is probably not right. You would be better off unloading the table to S3, which is pretty quick because the data is copied from every node independently, and then doing whatever is needed against the flat file on S3 to transform it into the final format you need.
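For the S3 route, a minimal sketch of building a Redshift UNLOAD statement might look like this. The helper name, the bucket prefix, and the IAM role ARN are all hypothetical placeholders, and the output options (CSV with a header, gzip-compressed, overwriting existing files) are just one possible choice.

```python
def build_unload_sql(query, s3_prefix, iam_role):
    """Return a Redshift UNLOAD statement for `query`.

    Single quotes inside the query are doubled, because UNLOAD takes the
    SELECT statement as a quoted string literal.
    """
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV HEADER GZIP ALLOWOVERWRITE"
    )


# Placeholder values; replace with a real bucket prefix and role ARN.
unload_sql = build_unload_sql(
    "SELECT * FROM mydatabase.mytable",
    "s3://my-bucket/mytable/part_",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
)
print(unload_sql)
```

The resulting string would be run on the existing connection, for example with `cnxn.cursor().execute(unload_sql)`. The files that land under the S3 prefix can then be processed one at a time instead of pulling the whole table through a single query.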