scikit-learn StratifiedKFold implementation

Alexander

I'm having a hard time understanding scikit-learn's StratifiedKFold from https://scikit-learn.org/stable/modules/cross_validation.html#stratification

so I implemented the example from that page, adding a RandomOverSampler step:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))

ros = RandomOverSampler(sampling_strategy='minority', random_state=0)
X_ros, y_ros = ros.fit_resample(X, y)  # fit_sample was renamed to fit_resample in imblearn 0.4

skf = StratifiedKFold(n_splits=5, shuffle=True)

for train, test in skf.split(X_ros, y_ros):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y_ros[train]), np.bincount(y_ros[test])))
    print(f"y_ros_test  {y_ros[test]}")

Output:

train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]
train -  [36 36]   |   test -  [9 9]
y_ros_test  [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1]

My questions:

  1. Where do we define the train/test split (the 80%/20% thing) in StratifiedKFold? From the StratifiedKFold documentation I can see that n_splits defines the number of folds, but not the split ratio, I think. This part confuses me.

  2. Why am I getting a y_ros_test with nine 0's and nine 1's when I have n_splits=5? By my calculation it should be 50/5 = 10 samples per split, so shouldn't it be five 1's and five 0's in each split?

desertnaut

Regarding your first question: there is no separate train-test split when using cross-validation (CV); what happens is that, in each CV round, one fold is used as the test set and the rest as the training set. As a result, when n_splits=5, as here, in each round 1/5 (i.e. 20%) of the data is used as the test set while the remaining 4/5 (i.e. 80%) is used for training. So yes, setting the n_splits argument uniquely determines the split, and there is no need for any further specification (n_splits=4 would give a 75/25 split).
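To illustrate the point, here is a quick sketch (using the same toy X and y as in the question, without oversampling) that collects the test-fold sizes for two different n_splits values; the variable names are my own:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# same toy data as in the question: 45 samples of class 0, 5 of class 1
X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))

fold_sizes = {}
for n_splits in (4, 5):
    skf = StratifiedKFold(n_splits=n_splits)
    fold_sizes[n_splits] = [len(test) for _, test in skf.split(X, y)]

# with n_splits=5, each test fold holds 50/5 = 10 samples (20%);
# with n_splits=4, each holds roughly 50/4 samples (25%)
print(fold_sizes)
```

The fold size always follows directly from n_splits: each test fold holds roughly len(y)/n_splits samples.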

Regarding your second question: you seem to have forgotten that, prior to splitting, you oversampled your data. Running your code with the initial X and y (i.e. without oversampling) indeed gives a y_test of size 50/5 = 10, although it is not balanced (balancing is the result of oversampling) but stratified (each fold retains the class proportions of the original data):

skf = StratifiedKFold(n_splits=5, shuffle=True)

for train, test in skf.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))
    print(f"y_test  {y[test]}")

Result:

train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]
train -  [36  4]   |   test -  [9 1]
y_test  [0 0 0 0 0 0 0 0 0 1]

Since oversampling the minority class increases the size of the dataset, it is only expected that y_ros_test comes out larger than y_test (here 18 samples instead of 10).
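The arithmetic can be checked directly without running the sampler: oversampling the minority class up to the majority count gives 45 + 45 = 90 samples, so each of the 5 test folds holds 90/5 = 18 samples, 9 per class:

```python
import numpy as np

y = np.hstack(([0] * 45, [1] * 5))
counts = np.bincount(y)              # [45, 5]
# RandomOverSampler('minority') duplicates minority samples
# until they match the majority count
n_after = 2 * counts.max()           # 45 + 45 = 90
fold_size = n_after // 5             # 90 / 5 = 18 samples per test fold
per_class = fold_size // 2           # 9 of each class, since classes are now balanced
print(n_after, fold_size, per_class)  # 90 18 9
```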

Methodologically speaking, you don't actually need stratified sampling if you have already oversampled your data to balance the class representation.
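As a quick illustration of that point (a sketch I'm adding, not part of the original answer): once the labels are balanced, even plain shuffled KFold produces test folds with roughly equal class counts, so stratification buys little here. The stand-in labels below mimic the oversampled y_ros:

```python
import numpy as np
from sklearn.model_selection import KFold

# stand-in for the oversampled labels: 45 of each class, shuffled
rng = np.random.default_rng(0)
y_ros = rng.permutation(np.hstack(([0] * 45, [1] * 45)))

# plain (non-stratified) KFold on the balanced labels
kf = KFold(n_splits=5, shuffle=True, random_state=0)
test_counts = [np.bincount(y_ros[test]) for _, test in kf.split(y_ros)]
print(test_counts)  # each fold is close to [9, 9], though not exactly stratified
```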
