How to load data in chunks from a pandas dataframe to a spark dataframe

Gaurav Dhama

I have read data in chunks over a pyodbc connection using something like this:

import pandas as pd
import pyodbc
conn = pyodbc.connect("Some connection Details")
sql = "SELECT * from TABLES;"
df1 = pd.read_sql(sql,conn,chunksize=10)

Now I want to read all these chunks into one single Spark DataFrame using something like:

i = 0
for chunk in df1:
    if i==0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2.unionAll(sqlContext.createDataFrame(chunk))
    i = i+1

The problem is that when I do a df2.count() I get the result as 10, which means only the i = 0 case is working. Is this a bug with unionAll, or am I doing something wrong here?

mechanical_meat

The documentation for .unionAll() states that it returns a new DataFrame, so you have to assign the result back to df2:

i = 0
for chunk in df1:
    if i==0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
    i = i+1

Furthermore, you can use enumerate() to avoid having to manage the i variable yourself:

for i,chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))

Furthermore, the documentation states that .unionAll() is deprecated (as of Spark 2.0), and that you should now use .union(), which acts like UNION ALL in SQL:

for i,chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.union(sqlContext.createDataFrame(chunk))

Edit:
Furthermore (I'll stop saying furthermore, but not before I say it one last time): as @zero323 says, let's not call .union() in a loop, since each call grows the DataFrame's lineage. Let's instead do something like:

def unionAll(*dfs):
    ' by @zero323 from here: http://stackoverflow.com/a/33744540/42346 '
    first, *rest = dfs  # Python 3.x; on 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )

df_list = []
for chunk in df1:
    df_list.append(sqlContext.createDataFrame(chunk))

df_all = unionAll(*df_list)  # note the unpacking: unionAll takes *dfs, not a list
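One more option worth mentioning: since pd.read_sql pulls every chunk onto the driver anyway, you can simply concatenate the chunks in pandas first and convert to Spark once, with no loop or union at all. This is a minimal sketch; the chunks list below is a hypothetical stand-in for the pd.read_sql(..., chunksize=10) iterator, and the createDataFrame call is left commented since it needs a running Spark context:

```python
import pandas as pd

# Hypothetical stand-ins for the chunks yielded by pd.read_sql(..., chunksize=10)
chunks = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})]

# Concatenate all chunks into one pandas DataFrame on the driver;
# ignore_index=True gives the result a clean 0..n-1 index
pdf = pd.concat(chunks, ignore_index=True)

# Then a single conversion is enough:
# df_all = sqlContext.createDataFrame(pdf)
```

This only makes sense when the full result fits in driver memory, but that is already a precondition of reading it through pandas in the first place.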
