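I want to use StackingClassifier and VotingClassifier with StratifiedKFold and cross_val_score. With StackingClassifier or VotingClassifier, cross_val_score returns nan values. If I use any other algorithm in place of StackingClassifier or VotingClassifier, cross_val_score works fine. I am using Python 3.8.5 and sklearn 0.23.2.
I have updated the code to a working example. Please use the Parkinsons dataset from Kaggle. This is the dataset I have been working on, and below are the exact steps I followed.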
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn import metrics
from sklearn import model_selection
from sklearn import feature_selection
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
dataset = pd.read_csv('parkinsons.csv')
FS_X=dataset.iloc[:,:-1]
FS_y=dataset.iloc[:,-1:]
FS_X.drop(['name'],axis=1,inplace=True)
select_k_best = feature_selection.SelectKBest(score_func=feature_selection.f_classif,k=15)
X_k_best = select_k_best.fit_transform(FS_X,FS_y)
supportList = select_k_best.get_support().tolist()
p_valuesList = select_k_best.pvalues_.tolist()
toDrop=[]
for i in np.arange(len(FS_X.columns)):
    bool = supportList[i]
    if(bool == False):
        toDrop.append(FS_X.columns[i])
FS_X.drop(toDrop,axis=1,inplace=True)
smote = SMOTE(random_state=7)
Balanced_X,Balanced_y = smote.fit_resample(FS_X,FS_y)
before = pd.merge(FS_X,FS_y,right_index=True, left_index=True)
after = pd.merge(Balanced_X,Balanced_y,right_index=True, left_index=True)
b=before['status'].value_counts()
a=after['status'].value_counts()
print('Before')
print(b)
print('After')
print(a)
SkFold = model_selection.StratifiedKFold(n_splits=10, random_state=7, shuffle=False)
estimators_list = list()
KNN = KNeighborsClassifier()
RF = RandomForestClassifier(criterion='entropy',random_state = 1)
DT = DecisionTreeClassifier(criterion='entropy',random_state = 1)
GNB = GaussianNB()
LR = LogisticRegression(random_state = 1)
estimators_list.append(LR)
estimators_list.append(RF)
estimators_list.append(DT)
estimators_list.append(GNB)
SCLF = StackingClassifier(estimators = estimators_list,final_estimator = KNN,stack_method = 'predict_proba',cv=SkFold,n_jobs = -1)
VCLF = VotingClassifier(estimators = estimators_list,voting = 'soft',n_jobs = -1)
scores1 = model_selection.cross_val_score(estimator = SCLF,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('StackingClassifier Scores',scores1)
scores2 = model_selection.cross_val_score(estimator = VCLF,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('VotingClassifier Scores',scores2)
scores3 = model_selection.cross_val_score(estimator = DT,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('DecisionTreeClassifier Scores',scores3)
Output
Before
1 147
0 48
Name: status, dtype: int64
After
1 147
0 147
Name: status, dtype: int64
StackingClassifier Scores [nan nan nan nan nan nan nan nan nan nan]
VotingClassifier Scores [nan nan nan nan nan nan nan nan nan nan]
DecisionTreeClassifier Scores [0.86666667 0.9 0.93333333 0.86666667 0.96551724 0.82758621
0.75862069 0.86206897 0.86206897 0.93103448]
I checked several related posts on Stack Overflow but could not solve my problem. I cannot figure out where I am going wrong.
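The estimators_list, as it is passed to StackingClassifier or VotingClassifier, is incorrect. The sklearn documentation for StackingClassifier says:
Base estimators which will be stacked together. Each element of the list is defined as a tuple of string (i.e. name) and an estimator instance. An estimator can be set to 'drop' using set_params.
So the correct list should look like this: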
KNN = KNeighborsClassifier()
DT = DecisionTreeClassifier(criterion="entropy")
GNB = GaussianNB()
estimators_list = [("KNN", KNN), ("DT", DT), ("GNB", GNB)]
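The nan scores in the question do not say what went wrong, partly because cross_val_score records nan for a failed fit by default and the question's code silences warnings. A minimal sketch of how to surface the underlying error is to pass error_score="raise". The synthetic data from make_classification and the variable names below are assumptions for illustration, not part of the original post, and the exact exception type may differ between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Parkinsons data (assumption, not the real dataset).
X, y = make_classification(n_samples=200, n_features=15, random_state=7)

# Same mistake as in the question: bare estimators, not (name, estimator) tuples.
bare_estimators = [LogisticRegression(), DecisionTreeClassifier(), GaussianNB()]
vclf = VotingClassifier(estimators=bare_estimators, voting="soft")

captured_error = None
try:
    # error_score="raise" re-raises the fit failure instead of recording nan.
    cross_val_score(vclf, X, y, scoring="accuracy", cv=5, error_score="raise")
except Exception as exc:  # exact type depends on the scikit-learn version
    captured_error = exc

# Show which exception was hidden behind the nan scores.
print(type(captured_error).__name__, captured_error)
```

Removing warnings.filterwarnings('ignore') while debugging has a similar effect, since failed fits are otherwise reported as warnings.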
A complete minimal working example with the parkinsons data looks like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
dataset = pd.read_csv("parkinsons.csv")
FS_X = dataset.drop(["name", "status"], axis=1)
FS_y = dataset["status"]
estimators_list = [("KNN", KNeighborsClassifier()), ("DT", DecisionTreeClassifier(criterion="entropy")), ("GNB", GaussianNB())]
SCLF = StackingClassifier(estimators=estimators_list)
X_train, X_test, y_train, y_test = train_test_split(FS_X, FS_y)
SCLF.fit(X_train, y_train)
print("SCLF: ", accuracy_score(y_test, SCLF.predict(X_test)))
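Since the question is about cross_val_score with StratifiedKFold, the same fix can be sketched in that setting as well. This is only an illustration: it uses synthetic data from make_classification instead of parkinsons.csv, five folds instead of ten, and variable names that are assumptions. shuffle=True is set here because, as far as I know, newer scikit-learn releases reject a random_state when shuffle=False.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Parkinsons data (assumption, not the real dataset).
X, y = make_classification(n_samples=200, n_features=15, random_state=7)

# Each base estimator is a (name, estimator) tuple, as the documentation requires.
named_estimators = [
    ("LR", LogisticRegression(max_iter=1000, random_state=1)),
    ("DT", DecisionTreeClassifier(criterion="entropy", random_state=1)),
    ("GNB", GaussianNB()),
]

# Outer cross-validation splitter used for scoring both ensembles.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

sclf = StackingClassifier(
    estimators=named_estimators,
    final_estimator=KNeighborsClassifier(),
    stack_method="predict_proba",
    cv=5,  # inner folds used to build the meta-features
)
vclf = VotingClassifier(estimators=named_estimators, voting="soft")

stacking_scores = cross_val_score(sclf, X, y, scoring="accuracy", cv=skf)
voting_scores = cross_val_score(vclf, X, y, scoring="accuracy", cv=skf)

print("StackingClassifier Scores", stacking_scores)
print("VotingClassifier Scores", voting_scores)
```

With named tuples in place, both calls should return one accuracy value per fold rather than nan.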