How to split the input file name and add a specific value to a Spark DataFrame column

user7547751

This is how I load my CSV file into a Spark DataFrame:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{lit, udf}

// UDF that returns the fifth dot-separated token (index 4) of the file path
val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(4))

// df is the DataFrame read from the CSV input
// Replace dots in the column names with underscores
val df1With_ = df.toDF(df.columns.map(_.replace(".", "_")): _*)
// Keep only the columns whose names do not contain "^", "!" or "_c"
val column_to_keep = df1With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df1result = df1With_.select(column_to_keep.head, column_to_keep.tail: _*)
// Placeholder column that should eventually hold the value taken from the file name
val df1Final=df1result.withColumn("DataPartition", lit(null: String))

Here are two examples of my input file names:

Fundamental.FinancialLineItem.FinancialLineItem.SelfSourcedPrivate.CUS.1.2017-09-07-1056.Full

Fundamental.FinancialLineItem.FinancialLineItem.Japan.CUS.1.2017-09-07-1056.Full.txt

I want to read this file, split its name on the "." character, and add the CUS token as a new column in place of DataPartition.

Can I do that without a UDF?

This is the schema of the existing DataFrame:

root
 |-- LineItem_organizationId: long (nullable = true)
 |-- LineItem_lineItemId: integer (nullable = true)
 |-- StatementTypeCode: string (nullable = true)
 |-- LineItemName: string (nullable = true)
 |-- LocalLanguageLabel: string (nullable = true)
 |-- FinancialConceptLocal: string (nullable = true)
 |-- FinancialConceptGlobal: string (nullable = true)
 |-- IsDimensional: boolean (nullable = true)
 |-- InstrumentId: string (nullable = true)
 |-- LineItemSequence: string (nullable = true)
 |-- PhysicalMeasureId: string (nullable = true)
 |-- FinancialConceptCodeGlobalSecondary: string (nullable = true)
 |-- IsRangeAllowed: boolean (nullable = true)
 |-- IsSegmentedByOrigin: boolean (nullable = true)
 |-- SegmentGroupDescription: string (nullable = true)
 |-- SegmentChildDescription: string (nullable = true)
 |-- SegmentChildLocalLanguageLabel: string (nullable = true)
 |-- LocalLanguageLabel_languageId: integer (nullable = true)
 |-- LineItemName_languageId: integer (nullable = true)
 |-- SegmentChildDescription_languageId: integer (nullable = true)
 |-- SegmentChildLocalLanguageLabel_languageId: integer (nullable = true)
 |-- SegmentGroupDescription_languageId: integer (nullable = true)
 |-- SegmentMultipleFundbDescription: string (nullable = true)
 |-- SegmentMultipleFundbDescription_languageId: integer (nullable = true)
 |-- IsCredit: boolean (nullable = true)
 |-- FinancialConceptLocalId: integer (nullable = true)
 |-- FinancialConceptGlobalId: integer (nullable = true)
 |-- FinancialConceptCodeGlobalSecondaryId: string (nullable = true)
 |-- FFAction: string (nullable = true)

Updated code after the suggested answer:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

// Keep the returned UserDefinedFunction so it can be applied to a Column below
val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(4))

val df = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsdisu/SPARK/FinancialLineItem/MAIN")

val df1With_ = df.toDF(df.columns.map(_.replace(".", "_")): _*)
val column_to_keep = df1With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df1result = df1With_.select(column_to_keep.head, column_to_keep.tail: _*)

// withColumn returns a new DataFrame, so keep the result instead of discarding it
val df1Final = df1result.withColumn("cus_val", get_cus_val(input_file_name()))

df1Final.printSchema()

mrsrinivas

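You can get the file name with the predefined function input_file_name(). After that, either create a UDF to extract CUS, or stay with built-in functions only by combining it with regexp_extract.

Using regexp_extract with a regular expression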

import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract

// Backslashes must be doubled inside a regular Scala string literal
df.withColumn("cus_val",
  regexp_extract(input_file_name(), "\\.(\\w+)\\.[0-9]+\\.", 1))
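
To see what the pattern captures, it can be tried on the sample file names from the question in plain Scala, without Spark. This is only a sketch: extractCus is a hypothetical helper introduced for illustration, and it returns an Option, whereas regexp_extract returns an empty string when nothing matches.

```scala
// Same pattern as in the regexp_extract call, written as a raw string
val cusPattern = """\.(\w+)\.[0-9]+\.""".r

// Hypothetical helper: first capture group of the first match, if any
def extractCus(fileName: String): Option[String] =
  cusPattern.findFirstMatchIn(fileName).map(_.group(1))

// Sample file names taken from the question
val samples = Seq(
  "Fundamental.FinancialLineItem.FinancialLineItem.SelfSourcedPrivate.CUS.1.2017-09-07-1056.Full",
  "Fundamental.FinancialLineItem.FinancialLineItem.Japan.CUS.1.2017-09-07-1056.Full.txt"
)

// The token must be followed by ".<digits>.", so only CUS qualifies
val extracted = samples.map(extractCus)
```

The pattern relies on the token being immediately followed by a purely numeric segment, so it does not depend on the token's position in the name.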

Using a custom UDF

import org.apache.spark.sql.functions.udf

// The lambda parameter needs parentheses when it carries a type annotation
val get_cus_val = udf((filePath: String) => filePath.split("\\.")(4))

import org.apache.spark.sql.functions.input_file_name

df.withColumn("cus_val", get_cus_val(input_file_name))
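
Since the question asks for a way without a UDF, the positional split can probably also be written with the built-in split and getItem functions. The following is a minimal sketch under assumptions: it starts a local SparkSession, and the path column is a hypothetical stand-in for the value of input_file_name(), built from the S3 directory and one sample file name in the question. substring_index is used first to drop the directory part, so dots in the directory would not shift the index.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split, substring_index}

// Local session for illustration only; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[1]").appName("cus-sketch").getOrCreate()
import spark.implicits._

// Assumed example path, standing in for what input_file_name() would return
val paths = Seq(
  "s3://trfsdisu/SPARK/FinancialLineItem/MAIN/Fundamental.FinancialLineItem.FinancialLineItem.Japan.CUS.1.2017-09-07-1056.Full.txt"
).toDF("path")

val withCus = paths
  // Keep only the part after the last "/", i.e. the bare file name
  .withColumn("file_name", substring_index(col("path"), "/", -1))
  // Split on literal dots and take the fifth token (array index 4)
  .withColumn("cus_val", split(col("file_name"), "\\.").getItem(4))

val cusValues = withCus.select("cus_val").as[String].collect().toSeq
```

On a DataFrame read from files, col("path") would be replaced by input_file_name().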
