To pass schema to a json file we do this:
from pyspark.sql.types import (StructField, StringType, StructType, IntegerType)
data_schema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]
final_struc = StructType(fields = data_schema)
df =spark.read.json('people.json', schema=final_struc)
The above code works as expected. However now, I have data in table which I display by:
df = sqlContext.sql("SELECT * FROM people_json")
But if I try to pass a new schema to it by using following command it does not work.
df2 = spark.sql("SELECT * FROM people_json", schema=final_struc)
It gives the following error:
sql() got an unexpected keyword argument 'schema'
NOTE: I am using Databrics Community Edition
You cannot apply a new schema to already created dataframe. However, you can change the schema of each column by casting to another datatype as below.
df.withColumn("column_name", $"column_name".cast("new_datatype"))
If you need to apply a new schema, you need to convert to RDD and create a new dataframe again as below
df = sqlContext.sql("SELECT * FROM people_json")
val newDF = spark.createDataFrame(df.rdd, schema=schema)
Hope this helps!
この記事はインターネットから収集されたものであり、転載の際にはソースを示してください。
侵害の場合は、連絡してください[email protected]
コメントを追加