I cannot reproduce my results when I use a random forest model saved to disk to make predictions on the exact same dataset. In other words: I train a model on dataset A, save it to my local machine, then load it and use it to predict dataset B — and every time I predict dataset B I get different results.
I am aware of the randomness involved in a random forest classifier, but as far as I understand, that randomness comes into play during training. Once the model has been created, the predictions should not change if you predict on the same data.
The training script has the following structure:
df_train = spark.read.format("csv") \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .option('delimiter', ';') \
    .load(r"C:\2020_05.csv")
#The problem seems to be related to the StringIndexer/One-Hot Encoding
#If I remove all categorical variables the results can be reproduced
categorical_variables = []
for variable in df_train.dtypes:
    if variable[1] == 'string':
        categorical_variables.append(variable[0])
indexers = [StringIndexer(inputCol=col, outputCol=col+"_indexed") for col in categorical_variables]
for indexer in indexers:
    df_train = indexer.fit(df_train).transform(df_train)
    df_train = df_train.drop(indexer.getInputCol())
indexed_cols = []
for variable in df_train.columns:
    if variable.endswith("_indexed"):
        indexed_cols.append(variable)
encoders = []
for variable in indexed_cols:
    inputCol = variable
    outputCol = variable.replace("_indexed", "_encoded")
    one_hot_encoder_estimator_train = OneHotEncoderEstimator(inputCols=[inputCol], outputCols=[outputCol])
    encoder_model_train = one_hot_encoder_estimator_train.fit(df_train)
    df_train = encoder_model_train.transform(df_train)
    df_train = df_train.drop(inputCol)
inputCols = [x for x in df_train.columns if x != "id" and x != "churn"]
vector_assembler_train = VectorAssembler(
    inputCols=inputCols,
    outputCol='features',
    handleInvalid='keep'
)
df_train = vector_assembler_train.transform(df_train)
df_train = df_train.select('churn', 'features', 'id')
df_train_1 = df_train.filter(df_train['churn'] == 0).sample(withReplacement=False, fraction=0.3, seed=7)
df_train_2 = df_train.filter(df_train['churn'] == 1).sample(withReplacement=True, fraction=20.0, seed=7)
df_train = df_train_1.unionAll(df_train_2)
rf = RandomForestClassifier(labelCol="churn", featuresCol="features")
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [100]) \
    .addGrid(rf.maxDepth, [15]) \
    .addGrid(rf.maxBins, [32]) \
    .addGrid(rf.featureSubsetStrategy, ['onethird']) \
    .addGrid(rf.subsamplingRate, [1.0]) \
    .addGrid(rf.minInfoGain, [0.0]) \
    .addGrid(rf.impurity, ['gini']) \
    .addGrid(rf.minInstancesPerNode, [1]) \
    .addGrid(rf.seed, [10]) \
    .build()
evaluator = BinaryClassificationEvaluator(labelCol="churn")
crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
model = crossval.fit(df_train)
model.save("C:/myModel")
The test script is as follows:
df_test = spark.read.format("csv") \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .option('delimiter', ';') \
    .load(r"C:\2020_06.csv")
#The problem seems to be related to the StringIndexer/One-Hot Encoding
#If I remove all categorical variables the results can be reproduced
categorical_variables = []
for variable in df_test.dtypes:
    if variable[1] == 'string':
        categorical_variables.append(variable[0])
indexers = [StringIndexer(inputCol=col, outputCol=col+"_indexed") for col in categorical_variables]
for indexer in indexers:
    df_test = indexer.fit(df_test).transform(df_test)
    df_test = df_test.drop(indexer.getInputCol())
indexed_cols = []
for variable in df_test.columns:
    if variable.endswith("_indexed"):
        indexed_cols.append(variable)
encoders = []
for variable in indexed_cols:
    inputCol = variable
    outputCol = variable.replace("_indexed", "_encoded")
    one_hot_encoder_estimator_test = OneHotEncoderEstimator(inputCols=[inputCol], outputCols=[outputCol])
    encoder_model_test = one_hot_encoder_estimator_test.fit(df_test)
    df_test = encoder_model_test.transform(df_test)
    df_test = df_test.drop(inputCol)
inputCols = [x for x in df_test.columns if x != "id" and x != "churn"]
vector_assembler_test = VectorAssembler(
    inputCols=inputCols,
    outputCol='features',
    handleInvalid='keep'
)
df_test = vector_assembler_test.transform(df_test)
df_test = df_test.select('churn', 'features', 'id')
model = CrossValidatorModel.load("C:/myModel")
result = model.transform(df_test)
evaluator = BinaryClassificationEvaluator(labelCol="churn")  # the evaluator is not saved with the model, so it must be re-created here
areaUnderROC = evaluator.evaluate(result)
tp = result.filter("prediction == 1.0 AND churn == 1").count()
tn = result.filter("prediction == 0.0 AND churn == 0").count()
fp = result.filter("prediction == 1.0 AND churn == 0").count()
fn = result.filter("prediction == 0.0 AND churn == 1").count()
Every time I run the test script, the AUC and the confusion matrix come out different. I am using Spark 2.4.5 with Python 3.7 on a Windows 10 machine. Any suggestion or idea is greatly appreciated.
Edit: The problem is related to the StringIndexer/one-hot encoding steps. When I use only numerical variables, I can reproduce the results. The question remains open, since I cannot explain why this happens.
In my experience, this problem arises because you are re-fitting the OneHotEncoder on the test data.
This is how one-hot encoding works, per the docs:
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
Therefore, whenever the data is different (which is naturally the case for train vs. test), the values the one-hot encoder produces in the vectors are different.
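The same applies to the StringIndexer that feeds the encoder: by default it assigns indices by descending label frequency in whatever data it is fitted on, so fitting it separately on train and test can map the same category to different indices (and hence to different one-hot positions). A minimal pure-Python sketch of that indexing rule — the helper `fit_string_indexer` is hypothetical and only mimics the frequency ordering, it is not a Spark API:

```python
from collections import Counter

def fit_string_indexer(values):
    # Mimic StringIndexer's default: most frequent label gets index 0.0,
    # ties broken alphabetically.
    freq = Counter(values)
    ordered = sorted(freq, key=lambda label: (-freq[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

train_col = ["DSL", "DSL", "Fiber", "Cable"]    # "DSL" is most frequent here
test_col  = ["Fiber", "Fiber", "DSL", "Cable"]  # "Fiber" is most frequent here

print(fit_string_indexer(train_col)["Fiber"])  # 2.0 when fitted on train
print(fit_string_indexer(test_col)["Fiber"])   # 0.0 when fitted on test
```

Same category, two different indices — which is exactly why the assembled feature vectors (and therefore the predictions) differ between runs.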
You should put the StringIndexer/OneHotEncoder stages together with the model into a Pipeline, fit it on the training data and save it, then load it again in the test script. That way, every time you run data through the pipeline, it is guaranteed that the same category values map to the same one-hot encoded positions.
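A sketch of that approach using the same Spark 2.4 API as your scripts (column names and variables are taken from your code; the save path `C:/myPipelineModel` is a placeholder — treat this as an outline, not a drop-in replacement):

```python
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Build every preprocessing step as a pipeline stage instead of fitting ad hoc.
# handleInvalid="keep" lets the fitted indexers tolerate unseen test categories.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_indexed", handleInvalid="keep")
            for c in categorical_variables]
encoder = OneHotEncoderEstimator(
    inputCols=[c + "_indexed" for c in categorical_variables],
    outputCols=[c + "_encoded" for c in categorical_variables])
assembler = VectorAssembler(inputCols=inputCols, outputCol="features",
                            handleInvalid="keep")
rf = RandomForestClassifier(labelCol="churn", featuresCol="features", seed=10)

pipeline_model = Pipeline(stages=indexers + [encoder, assembler, rf]).fit(df_train)
pipeline_model.save("C:/myPipelineModel")

# Test script: the loaded model re-applies the *fitted* indexers and encoder,
# so identical input always yields identical feature vectors and predictions.
model = PipelineModel.load("C:/myPipelineModel")
result = model.transform(df_test)
```

If you still want cross-validation, pass the whole Pipeline (not the bare RandomForestClassifier) as the `estimator` of your CrossValidator, so the fitted preprocessing is saved inside the CrossValidatorModel.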
See the documentation for more details on saving and loading pipelines.