Out of Memory: Transferring Large Data from Amazon Redshift to Pandas

debugcn Published at Dev

RC0706

I have a large dataset (about 10M rows) in Amazon Redshift that I want to load into a Pandas DataFrame and then store in a pickle file. However, the load fails with an "Out of Memory" exception, evidently because of the size of the data. I have tried several other approaches, such as SQLAlchemy, but have not been able to solve the problem. Can anyone suggest a better way, or code, to get through it?

My current (simple) code snippet is below:

import psycopg2
import pandas as pd
import numpy as np

# Connect to the Redshift cluster (values in <...> are placeholders)
cnxn = psycopg2.connect(dbname=<mydatabase>, host='my_redshift_Server_Name', port='5439', user=<username>, password=<pwd>)

sql = "Select * from mydatabase.mytable"

# Load the entire result set into a single DataFrame
df = pd.read_sql(sql, cnxn)

pd.to_pickle(df, 'Base_Data.pkl')
cnxn.close()

print(df.head(50))

AlexYes

1) Find the row count of the table and the largest chunk you can pull in one query: add order by [column] limit [number] offset 0 to the statement and raise the limit in reasonable steps until it stops fitting.

2) Add a loop that generates the SQL with the limit you found and an increasing offset. For example, if you can pull 10k rows at a time, the statements would be:

... limit 10000 offset 0;
... limit 10000 offset 10000;
... limit 10000 offset 20000;

and so on, until the offset reaches the table row count.

3) In the same loop, append each newly fetched set of rows to your dataframe.
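
As a rough illustration of steps 1) and 2), the statements can be generated in a small loop. This is only a sketch: the helper name paged_queries, the ordering column id, and the row counts are assumptions made up for the example, not something taken from the question.

```python
from typing import Iterator


def paged_queries(base_sql: str, order_by: str, total_rows: int,
                  page_size: int) -> Iterator[str]:
    """Yield one LIMIT/OFFSET statement per page until total_rows is covered."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for offset in range(0, total_rows, page_size):
        # A stable ORDER BY keeps pages from overlapping or skipping rows.
        yield (f"{base_sql} ORDER BY {order_by} "
               f"LIMIT {page_size} OFFSET {offset}")


# Hypothetical numbers: 25,000 rows pulled 10,000 at a time,
# ordered by an assumed column called "id".
statements = list(
    paged_queries("SELECT * FROM mydatabase.mytable", "id", 25_000, 10_000)
)
for statement in statements:
    print(statement)
```

Each generated statement can then be passed to pd.read_sql(statement, cnxn), and the resulting frames combined with pd.concat. Keep in mind that the combined dataframe still has to fit in client memory.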

P.S. This will work only if you don't run into memory or disk limits on the client side, which I can't guarantee, since you already hit this issue on a cluster that is likely higher-grade hardware. To avoid that problem, write a new file on every iteration instead of appending.
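
A minimal sketch of the file-per-iteration idea might look like the following. It uses an in-memory sqlite3 database purely as a stand-in so the example runs anywhere; the function name dump_pages, the table layout, and the part_NNNNN.pkl file naming are all assumptions for illustration. With Redshift you would pass the psycopg2 connection instead.

```python
import os
import pickle
import sqlite3
import tempfile


def dump_pages(conn, table, order_by, page_size, out_dir):
    """Write each LIMIT/OFFSET page of `table` to its own pickle file."""
    paths = []
    offset = 0
    while True:
        cur = conn.cursor()
        # table and order_by are assumed to be trusted identifiers,
        # not user input, since they are formatted into the SQL text.
        cur.execute(f"SELECT * FROM {table} ORDER BY {order_by} "
                    f"LIMIT {int(page_size)} OFFSET {int(offset)}")
        rows = cur.fetchall()
        cur.close()
        if not rows:
            break  # offset is past the last row
        path = os.path.join(out_dir, f"part_{len(paths):05d}.pkl")
        with open(path, "wb") as fh:
            pickle.dump(rows, fh)  # only one page is held in memory at a time
        paths.append(path)
        offset += page_size
    return paths


# Stand-in database with 25 rows (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(25)])
out_dir = tempfile.mkdtemp()
paths = dump_pages(conn, "mytable", "id", 10, out_dir)
print(len(paths))
conn.close()
```

If you prefer to keep pandas in the loop, the fetchall/pickle pair could be replaced by pd.read_sql(page_sql, conn).to_pickle(path), so that each part file holds a small dataframe.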

Also, the whole approach is probably not the right one. You would be better off unloading the table to S3, which is pretty quick because the data is copied from every node independently, and then doing whatever is needed against the flat files on S3 to transform them into the final format you need.
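
For the unload route, the statement to run is Redshift's UNLOAD command. The sketch below only builds the SQL text; the helper name build_unload, the bucket, the IAM role ARN, and the region filter column are hypothetical placeholders, and the Parquet output format is just one possible choice.

```python
def build_unload(select_sql: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Return a Redshift UNLOAD statement that writes the query result as Parquet."""
    # UNLOAD takes the query as a quoted string, so single quotes
    # inside the query are doubled.
    quoted_query = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{quoted_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET"
    )


# Hypothetical bucket, role ARN and filter column; replace with your own.
unload_sql = build_unload(
    "SELECT * FROM mydatabase.mytable WHERE region = 'EU'",
    "s3://my-bucket/exports/mytable_",
    "arn:aws:iam::123456789012:role/MyRedshiftUnloadRole",
)
print(unload_sql)
```

The resulting string would be executed through the existing connection, for example with a psycopg2 cursor's execute method. The files written under the S3 prefix can then be processed one at a time rather than as a single 10M-row result set.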


Edited at 2021-08-11
