How to load data in chunks from a pandas dataframe to a spark dataframe

Gaurav Dhama

I have read data in chunks over a pyodbc connection using something like this:

import pandas as pd
import pyodbc
conn = pyodbc.connect("Some connection Details")
sql = "SELECT * from TABLES;"
df1 = pd.read_sql(sql,conn,chunksize=10)

Now I want to read all these chunks into one single Spark DataFrame using something like:

i = 0
for chunk in df1:
    if i==0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2.unionAll(sqlContext.createDataFrame(chunk))
    i = i+1

The problem is that when I do a df2.count() I get the result as 10, which means only the i = 0 case is working. Is this a bug with unionAll, or am I doing something wrong here?

mechanical_meat

The documentation for .unionAll() states that it returns a new DataFrame, so you have to assign the result back to df2:

i = 0
for chunk in df1:
    if i==0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
    i = i+1

Furthermore, you can use enumerate() to avoid having to manage the i variable yourself:

for i,chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))

Furthermore, the documentation states that .unionAll() is deprecated (as of Spark 2.0), and that you should now use .union(), which acts like UNION ALL in SQL:

for i,chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.union(sqlContext.createDataFrame(chunk))

Edit:
Furthermore (I'll stop saying furthermore, but not before I say it one last time): as @zero323 says, let's not call .union() in a loop, since each call grows the DataFrame's lineage. Let's instead do something like:

def unionAll(*dfs):
    ' by @zero323 from here: http://stackoverflow.com/a/33744540/42346 '
    first, *rest = dfs  # Python 3.x; on 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )

df_list = []
for chunk in df1:
    df_list.append(sqlContext.createDataFrame(chunk))

df_all = unionAll(*df_list)  # note the unpacking: unionAll takes *dfs, not a list
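One more option worth mentioning: since pd.read_sql pulls every chunk onto the driver anyway, you can simply concatenate the chunks in pandas first and convert to Spark once, with no loop or union at all. This is a minimal sketch; the chunks list below is a hypothetical stand-in for the pd.read_sql(..., chunksize=10) iterator, and the createDataFrame call is left commented since it needs a running Spark context:

```python
import pandas as pd

# Hypothetical stand-ins for the chunks yielded by pd.read_sql(..., chunksize=10)
chunks = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})]

# Concatenate all chunks into one pandas DataFrame on the driver;
# ignore_index=True gives the result a clean 0..n-1 index
pdf = pd.concat(chunks, ignore_index=True)

# Then a single conversion is enough:
# df_all = sqlContext.createDataFrame(pdf)
```

This only makes sense when the full result fits in driver memory, but that is already a precondition of reading it through pandas in the first place.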
