scikit-learn StratifiedKFold implementation

Alexander

I'm having a hard time understanding scikit-learn's StratifiedKFold from https://scikit-learn.org/stable/modules/cross_validation.html#stratification

so I implemented the example from that page, adding a RandomOverSampler step:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))

ros = RandomOverSampler(sampling_strategy='minority', random_state=0)
X_ros, y_ros = ros.fit_resample(X, y)  # fit_sample was renamed to fit_resample in imblearn 0.4

skf = StratifiedKFold(n_splits=5, shuffle=True)

for train, test in skf.split(X_ros, y_ros):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y_ros[train]), np.bincount(y_ros[test])))
    print(f"y_ros_test  {y_ros[test]}")

Output:

train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]

My questions:

  1. Where do we define the train/test split (the 80%/20% thing) in StratifiedKFold? From the StratifiedKFold documentation I can see that n_splits defines the number of folds, but not the split ratio, I think. This part confuses me.

  2. Why am I getting a y_ros_test with nine 0's and nine 1's when I have n_splits=5? By my calculation it should be 50/5 = 10 samples per split, so shouldn't it be five 1's and five 0's in each split?

desertnaut

Regarding your first question: there is no separate train-test split when using cross-validation (CV); what happens is that, in each CV round, one fold is used as the test set and the rest as the training set. As a result, when n_splits=5, as here, in each round 1/5 (i.e. 20%) of the data is used as the test set while the remaining 4/5 (i.e. 80%) is used for training. So yes, setting the n_splits argument uniquely determines the split, and there is no need for any further specification (n_splits=4 would give a 75/25 split).
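To illustrate the point, here is a quick sketch (using the same toy X and y as in the question, without oversampling) that collects the test-fold sizes for two different n_splits values; the variable names are my own:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# same toy data as in the question: 45 samples of class 0, 5 of class 1
X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))

fold_sizes = {}
for n_splits in (4, 5):
    skf = StratifiedKFold(n_splits=n_splits)
    fold_sizes[n_splits] = [len(test) for _, test in skf.split(X, y)]

# with n_splits=5, each test fold holds 50/5 = 10 samples (20%);
# with n_splits=4, each holds roughly 50/4 samples (25%)
print(fold_sizes)
```

The fold size always follows directly from n_splits: each test fold holds roughly len(y)/n_splits samples.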

Regarding your second question: you seem to have forgotten that, prior to splitting, you oversampled your data. Running your code with the initial X and y (i.e. without oversampling) indeed gives a y_test of size 50/5 = 10, although it is not balanced (balancing is the result of oversampling) but stratified (each fold retains the class proportions of the original data):

skf = StratifiedKFold(n_splits=5, shuffle=True)

for train, test in skf.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))
    print(f"y_test  {y[test]}")

Result:

train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]

Since oversampling the minority class increases the size of the dataset, it is only expected that y_ros_test comes out larger than y_test (here 18 samples instead of 10).
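The arithmetic can be checked directly without running the sampler: oversampling the minority class up to the majority count gives 45 + 45 = 90 samples, so each of the 5 test folds holds 90/5 = 18 samples, 9 per class:

```python
import numpy as np

y = np.hstack(([0] * 45, [1] * 5))
counts = np.bincount(y)              # [45, 5]
# RandomOverSampler('minority') duplicates minority samples
# until they match the majority count
n_after = 2 * counts.max()           # 45 + 45 = 90
fold_size = n_after // 5             # 90 / 5 = 18 samples per test fold
per_class = fold_size // 2           # 9 of each class, since classes are now balanced
print(n_after, fold_size, per_class)  # 90 18 9
```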

Methodologically speaking, you don't actually need stratified sampling if you have already oversampled your data to balance the class representation.
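As a quick illustration of that point (a sketch I'm adding, not part of the original answer): once the labels are balanced, even plain shuffled KFold produces test folds with roughly equal class counts, so stratification buys little here. The stand-in labels below mimic the oversampled y_ros:

```python
import numpy as np
from sklearn.model_selection import KFold

# stand-in for the oversampled labels: 45 of each class, shuffled
rng = np.random.default_rng(0)
y_ros = rng.permutation(np.hstack(([0] * 45, [1] * 45)))

# plain (non-stratified) KFold on the balanced labels
kf = KFold(n_splits=5, shuffle=True, random_state=0)
test_counts = [np.bincount(y_ros[test]) for _, test in kf.split(y_ros)]
print(test_counts)  # each fold is close to [9, 9], though not exactly stratified
```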
