I want to know whether my (binary) classification model suffers from overfitting, so I obtained learning curves. The data set has 6836 instances, 1006 of which belong to the positive class.

1) If I use SMOTE to balance the classes and RandomForest as the technique, I get this curve, with rates TPR = 0.887 and FPR = 0.041:

Note that the training error is flat and almost 0.

2) If I use the function "balanced_subsample" (attached at the end) to balance the classes and RandomForest as the technique, I get this curve, with rates TPR = 0.866 and FPR = 0.14:

Note that in this case it is the test error that is flat.
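For reference, the TPR/FPR values above come from the confusion matrix on the test set; a minimal sketch with toy placeholder labels, not my real data:

import numpy as np
from sklearn.metrics import confusion_matrix

y_test = np.array([0, 0, 0, 1, 1, 0, 1, 0])  # toy ground truth
y_pred = np.array([0, 0, 1, 1, 1, 0, 0, 0])  # toy predictions

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
tpr = tp / float(tp + fn)  # true positive rate (recall on the positive class)
fpr = fp / float(fp + tn)  # false positive rate
print(tpr, fpr)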
The function "balanced_subsample":
import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):
    # Collect the instances of each class and find the size of the
    # smallest class.
    class_xs = []
    min_elems = None
    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    # Optionally keep only a fraction of the minority-class size.
    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    # Draw use_elems instances from every class, shuffling the larger
    # classes first so the subsample is random.
    xs = []
    ys = []
    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)
        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)
        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)
    return xs, ys
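A toy usage example (assuming a NumPy feature matrix with the same class imbalance as the data set described above):

import numpy as np

X_toy = np.random.rand(6836, 4)                     # 6836 instances, 4 features
y_toy = np.array([1] * 1006 + [0] * (6836 - 1006))  # 1006 positives

X_bal, y_bal = balanced_subsample(X_toy, y_toy)
print(X_bal.shape)                     # (2012, 4): 1006 instances per class
print(np.bincount(y_bal.astype(int)))  # [1006 1006]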
EDIT1: More info about the code and the process
import numpy as np
from time import time
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X = data
y = X.pop('myclass')
# There are categorical and numerical attributes in my data set, so here I vectorize the categorical attributes
arrX = vectorize_attributes(X)
# Here I use some code to balance my classes using the SMOTE or "balanced_subsample" approach
X_train_balanced, y_train_balanced = mySMOTEfunc(arrX, y)
# X_train_balanced, y_train_balanced = balanced_subsample(arrX, y)
# TRAIN/TEST SPLIT (STRATIFIED K-FOLD is implicit)
X_train, X_test, y_train, y_test = train_test_split(X_train_balanced, y_train_balanced, test_size=0.25)
# Estimator
clf = RandomForestClassifier()
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}
# Grid search, scored with F1
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, scoring='f1')
start = time()
CV_clf.fit(X_train, y_train)
# FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
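A minimal sketch of how learning curves like the ones shown can be produced, assuming sklearn's learning_curve; model, X_train and y_train come from the snippet above, and the plotting details are my own assumption, not necessarily the exact code behind the figures:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Train/test error for increasing training-set sizes, averaged over 10 folds
train_sizes, train_scores, test_scores = learning_curve(
    model, X_train, y_train, cv=10,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring='accuracy')

plt.plot(train_sizes, 1 - train_scores.mean(axis=1), label='train error')
plt.plot(train_sizes, 1 - test_scores.mean(axis=1), label='test error')
plt.xlabel('training examples')
plt.ylabel('error')
plt.legend()
plt.show()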
EDIT2: In this case I try it with a Gradient Boosting Classifier (GBC) in 3 scenarios: 1) GBC + SMOTE, 2) GBC + SMOTE + feature selection, and 3) GBC + SMOTE + feature selection + normalization
import numpy as np
from time import time
from sklearn import metrics, preprocessing
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split, GridSearchCV

X = data
y = X.pop('myclass')
# There are categorical and numerical attributes in my data set, so here I vectorize the categorical attributes
arrX = vectorize_attributes(X)
# FOR SCENARIO 3: normalization
standardized_X = preprocessing.normalize(arrX)
# FOR SCENARIOS 2 and 3: remove all but the k highest-scoring features
arrX_features_selected = SelectKBest(chi2, k=5).fit_transform(standardized_X, y)
# Here I use some code to balance my classes using the SMOTE or "balanced_subsample" approach
X_train_balanced, y_train_balanced = mySMOTEfunc(arrX_features_selected, y)
# X_train_balanced, y_train_balanced = balanced_subsample(arrX_features_selected, y)
# TRAIN/TEST SPLIT (STRATIFIED K-FOLD is implicit)
X_train, X_test, y_train, y_test = train_test_split(X_train_balanced, y_train_balanced, test_size=0.25)
# Estimator
clf = GradientBoostingClassifier()
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}
# Grid search, scored with F1
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, scoring='f1')
start = time()
CV_clf.fit(X_train, y_train)
# FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
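As a side note, the scenario 2/3 preprocessing can also be wired into a Pipeline so that normalization and feature selection are fit only on the training folds of each CV split; a minimal sketch assuming the same sklearn components (Normalizer is the transformer form of preprocessing.normalize), not the code used for the curves below:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

pipe = Pipeline([
    ('normalize', Normalizer()),         # scenario 3 normalization step
    ('select', SelectKBest(chi2, k=5)),  # chi2 needs non-negative inputs
    ('clf', GradientBoostingClassifier()),
])
param_grid = {'clf__n_estimators': [10, 50, 100, 200, 300]}
CV_pipe = GridSearchCV(pipe, param_grid, cv=10, scoring='f1')
# CV_pipe.fit(X_train, y_train)  # fit exactly as with the bare estimator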
The learning curves of the 3 proposed scenarios are:
So, your first curve makes sense. You would expect the test error to decrease as the number of training examples grows, and with a forest of random trees with no maximum depth and 100% max samples you would expect a training error uniformly close to 0. You may well be overfitting, but it may not get any better with RandomForests (or, depending on the data set, with anything else).
Your second curve does not make sense. You should again see a training error close to 0, unless something has gone completely wrong (e.g. a truly corrupted input set). I can't see anything wrong in your code, and I ran your function; it seems to work fine. Without a complete working example with your code posted, there is little more I can do.
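To see why a training error near 0 is expected here, a quick synthetic check (toy data with pure-noise labels, sklearn defaults, i.e. no maximum depth):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 10)
y = rng.randint(0, 2, 1000)  # labels are pure noise

# With unlimited depth the forest essentially memorizes the training set
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(1 - rf.score(X, y))    # training error: close to 0 even on noise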