Interpreting random forest model results

Posted by debugcn in Dev

maa425

I would greatly appreciate feedback on how to interpret my RF model and how to evaluate the results in general.

57658 samples
   27 predictor
    2 classes: 'stayed', 'left' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 11531, 11531, 11532, 11532, 11532 
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec        
   2    gini        0.6273579  0.9999011  0.0006250729
   2    extratrees  0.6246980  0.9999197  0.0005667791
  14    gini        0.5968382  0.9324610  0.1116113149
  14    extratrees  0.6192781  0.9740323  0.0523004026
  27    gini        0.5584677  0.7546156  0.2977507092
  27    extratrees  0.5589923  0.7635036  0.2905489827

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.

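After several adjustments to the functional form of the Y variable and to the way I split the data, I got the following results. My ROC improved slightly but, interestingly, my Sens and Spec changed dramatically compared with the initial model.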

35000 samples
   27 predictor
    2 classes: 'stayed', 'left' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 7000, 7000, 7000, 7000, 7000 
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens          Spec     
   2    gini        0.6351733  0.0004618204  0.9998685
   2    extratrees  0.6287926  0.0000000000  0.9999899
  14    gini        0.6032979  0.1346653886  0.9170874
  14    extratrees  0.6235212  0.0753069696  0.9631711
  27    gini        0.5725621  0.3016414054  0.7575899
  27    extratrees  0.5716616  0.2998190728  0.7636219

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.

This time I split the data randomly rather than by time, and tried several mtry values using the following code:

```{r Cross Validation Part 1}
library(caret) # provides createFolds()

set.seed(1992) # set a seed for reproducibility

# Partition the data into 5 folds
folds <- createFolds(train_data$left_welfare, k = 5)

# Tuning grid of mtry, splitrule and min.node.size values
tune_mtry <- expand.grid(mtry = c(2, 10, 15, 20),
                         splitrule = c("variance", "extratrees"),
                         min.node.size = c(1, 5, 10))

sapply(folds, length) # number of rows in each fold
```

and got the following results:

Random Forest 

84172 samples
   14 predictor
    2 classes: 'stayed', 'left' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 16834, 16834, 16834, 16835, 16835 
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec     
   2    variance    0.5000000        NaN        NaN
   2    extratrees  0.7038724  0.3714761  0.8844723
   5    variance    0.5000000        NaN        NaN
   5    extratrees  0.7042525  0.3870192  0.8727755
   8    variance    0.5000000        NaN        NaN
   8    extratrees  0.7014818  0.4075797  0.8545012
  10    variance    0.5000000        NaN        NaN
  10    extratrees  0.6956536  0.4336180  0.8310368
  12    variance    0.5000000        NaN        NaN
  12    extratrees  0.6771292  0.4701687  0.7777730
  15    variance    0.5000000        NaN        NaN
  15    extratrees  0.5000000        NaN        NaN

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 5, splitrule = extratrees and min.node.size = 1.

戴维·ND

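It looks like your random forest has almost no predictive power for the second class, 'left'. The best scores all have extremely high sensitivity and very low specificity, which basically means your classifier is assigning everything to the 'stayed' class, which I assume is the majority class. Unfortunately this is quite bad, because it is not far from a naive classifier that says everything belongs to the first class.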
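To make the comparison with a naive classifier concrete, here is a minimal sketch in base R. The labels are synthetic (a 900/100 split is an assumption, not the asker's data), and sensitivity is computed for 'stayed' to match the convention in the output above.

```r
# Synthetic labels standing in for the real outcome (assumption).
truth <- factor(c(rep("stayed", 900), rep("left", 100)),
                levels = c("stayed", "left"))

# Naive classifier: always predict the majority class.
majority <- names(which.max(table(truth)))
pred <- factor(rep(majority, length(truth)), levels = levels(truth))

# 'stayed' is treated as the positive class, as in the output above.
sens <- mean(pred[truth == "stayed"] == "stayed")
spec <- mean(pred[truth == "left"] == "left")
acc  <- mean(pred == truth)

cat("Sens:", sens, " Spec:", spec, " Accuracy:", acc, "\n")
```

A model whose Sens is close to 1 and whose Spec is close to 0 behaves almost exactly like this baseline, however high its accuracy looks.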
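Also, I can't quite tell whether you only tried mtry values of 2, 14 and 27, but if so I strongly suggest trying the whole 3-25 range (the optimum is most likely somewhere in the middle).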
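A denser grid over that range could be built as sketched below. The grid itself only needs base R. The commented `train()` call is an assumed usage in which `ctrl` is a hypothetical `trainControl` object, and `train_data` and `left_welfare` are the names from the question.

```r
# Candidate mtry values across the suggested 3-25 range, crossed with
# the two split rules that appear in the first two result tables.
tune_grid <- expand.grid(
  mtry = 3:25,
  splitrule = c("gini", "extratrees"),
  min.node.size = 1,
  stringsAsFactors = FALSE
)

nrow(tune_grid)  # number of parameter combinations to evaluate
head(tune_grid)

# Assumed usage with caret (not run here; `ctrl` is a placeholder):
# ctrl <- caret::trainControl(method = "cv", number = 5, classProbs = TRUE,
#                             summaryFunction = caret::twoClassSummary)
# fit <- caret::train(left_welfare ~ ., data = train_data, method = "ranger",
#                     metric = "ROC", tuneGrid = tune_grid, trControl = ctrl)
```

With 46 combinations and 5 folds this fits 230 forests, so on a data set of this size it may be worth starting with a coarser step (for example `seq(3, 25, by = 2)`).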

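Beyond that, since performance looks fairly poor (judging by the ROC), I suggest doing more feature engineering to extract more information. Otherwise, if you are fine with what you have, or you think no more information can be extracted, just tune the probability threshold for classification so that your sensitivity and specificity reflect your requirements for the classes (you may care more about misclassifying 'left' as 'stayed' than the reverse, or vice versa; I don't know your problem).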
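Threshold tuning can be sketched in base R as follows. The helper `sens_spec_at` is hypothetical, and the probabilities are simulated rather than taken from the asker's model. In practice `prob_left` would be the predicted probability of 'left' on held-out data.

```r
# Hypothetical helper: sensitivity/specificity at a given cut-off.
# 'stayed' is treated as the positive class, as in the output above.
sens_spec_at <- function(prob_left, truth, threshold) {
  pred <- ifelse(prob_left >= threshold, "left", "stayed")
  c(threshold = threshold,
    Sens = mean(pred[truth == "stayed"] == "stayed"),
    Spec = mean(pred[truth == "left"] == "left"))
}

# Synthetic stand-in data (an assumption, not the asker's data):
# an imbalanced outcome with weakly informative probabilities.
set.seed(1992)
n <- 1000
truth <- sample(c("stayed", "left"), n, replace = TRUE, prob = c(0.8, 0.2))
prob_left <- ifelse(truth == "left", rbeta(n, 2, 4), rbeta(n, 1.5, 6))

# Evaluate a range of cut-offs instead of the default 0.5.
thresholds <- seq(0.05, 0.5, by = 0.05)
results <- t(sapply(thresholds,
                    function(th) sens_spec_at(prob_left, truth, th)))
print(round(results, 3))
```

Lowering the threshold flags more cases as 'left', which raises Spec at the cost of Sens. The cut-off to pick depends on which kind of error is more costly for the problem.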

Hope this helps!


Edited 2021-04-1
