Need a better understanding of the Python scikit-learn fit/predict loop vs. linear results

bsoplinger

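Here is my block of Python (2.7; I learned Python 3, so I import print_function from __future__ to get the print formatting I'm used to). It is learning code built on some revision of scikit-learn that I'm locked into because of corporate IT policy, and it uses the SVC engine. What I don't understand is why the +/- 1 results differ between the first case (using simple_clf) and the second. Structurally I think they are the same: the first processes the complete data array in one pass, while the second works through the data one element at a time. Yet the results don't match. The values generated for the average (mean) scores should be decimal fractions (0.0 to 1.0). In some cases the difference is small, but in others it is big enough for me to ask my question.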

from __future__ import print_function
import os
import numpy as np
from numpy import array, loadtxt
from sklearn import cross_validation, datasets, svm, preprocessing, grid_search
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_score

GRADES = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'M']

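# Initial processing
# FEATUREVECFILE, SCORESFILE and test_type are assumed to be set elsewhere in the full script (not shown)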
featurevecs = loadtxt( FEATUREVECFILE )
f = open( SCORESFILE )
scorelines = f.readlines()[ 1: ] # Skip header line
f.close()
scorenums = [ GRADES.index( l.split( '\t' )[ 1 ] ) for l in scorelines ]
scorenums = array( scorenums )

# Need this step to normalize the feature vectors
scaler = preprocessing.Scaler()
scaler.fit( featurevecs )
featurevecs = scaler.transform( featurevecs )

# Break up the vector into a training and testing vector
# Need to keep the training set somewhat large to get enough of the
# scarce results in the training set or the learning fails
X_train, X_test, y_train, y_test = train_test_split(
    featurevecs, scorenums, test_size = 0.333, random_state = 0 )

# Define a range of parameters we can use to do a grid search
# for the 'best' ones.
CLFPARAMS = {'gamma':[.0025, .005, 0.09, .01, 0.011, .02, .04],
             'C':[200, 300, 400, 500, 600]}

# do a simple cross validation
simple_clf = svm.SVC()
simple_clf = grid_search.GridSearchCV( simple_clf, CLFPARAMS, cv = 3 )
simple_clf.fit( X_train, y_train )
y_true, y_pred = y_test, simple_clf.predict( X_test )
match = 0
close = 0
count = 0
deviation = []
for i in range( len( y_true ) ):
    count += 1
    delta = np.abs( y_true[ i ] - y_pred[ i ] )
    if( delta == 0 ):
        match += 1
    elif( delta == 1 ):
        close += 1
    deviation = np.append( deviation, 
                           float( np.sum( np.abs( delta ) <= 1 ) ) )
avg = float( match ) / float( count )
close_avg = float( close ) / float( count )
#deviation.mean() = avg + close_avg
print( '{0} Accuracy (+/- 0) {1:0.4f} Accuracy (+/- 1) {2:0.4f} (+/- {3:0.4f}) '.format( test_type, avg, deviation.mean(), deviation.std() / 2.0, ), end = "" )

# "Original" code
# do LeaveOneOut item by item
clf = svm.SVC()
clf = grid_search.GridSearchCV( clf, CLFPARAMS, cv = 3 )
toleratePara = 1;
thecurrentScoreGraded = []
loo = cross_validation.LeaveOneOut( n = len( featurevecs ) )
for train, test in loo:
    try:
        clf.fit( featurevecs[ train ], scorenums[ train ] )
        rawPredictionResult = clf.predict( featurevecs[ test ] )

        errorVec = scorenums[ test ] - rawPredictionResult;
        print( len( errorVec ), errorVec )
        thecurrentScoreGraded = np.append( thecurrentScoreGraded, float( np.sum( np.abs( errorVec ) <= toleratePara ) ) / len( errorVec ) )
    except ValueError:
        pass
print( '{0} Accuracy (+/- {1:d}) {2:0.4f} (+/- {3:0.4f})'.format( test_type, toleratePara, thecurrentScoreGraded.mean(), thecurrentScoreGraded.std() / 2 ) )
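
Both loops above apply the same scoring rule: a prediction counts as correct when it lands within a given number of grade steps of the true grade. Only the samples that are trained on and scored differ. A minimal sketch of that rule as a single helper follows. The name tolerance_accuracy and the sample grade indices are hypothetical and not part of the original script.

```python
def tolerance_accuracy(y_true, y_pred, tolerance=0):
    """Fraction of predictions within `tolerance` grade steps of the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one sample")
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / float(len(y_true))


# Hypothetical grade indices into GRADES (0 = 'A+', 1 = 'A', ...).
example_true = [0, 1, 2, 3, 4]
example_pred = [0, 2, 2, 5, 3]

exact = tolerance_accuracy(example_true, example_pred, tolerance=0)
within_one = tolerance_accuracy(example_true, example_pred, tolerance=1)
print("Accuracy (+/- 0) {0:0.4f}  Accuracy (+/- 1) {1:0.4f}".format(exact, within_one))
```

Using one helper for both the hold-out predictions and the leave-one-out predictions would make it easier to confirm that any remaining difference comes from the data split rather than from the scoring.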

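Here are my results, and you can see they don't match. My actual work task is to find out whether changing exactly which kind of data is collected for the learning engine helps accuracy, or whether combining the data into a larger teaching vector helps, so you'll see I'm working through a bunch of combinations. Each pair of lines is for one type of learning data. The first line is my result; the second is the result from the "original" code.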

original Accuracy (+/- 0) 0.2771 Accuracy (+/- 1) 0.6024 (+/- 0.2447) 
                        original Accuracy (+/- 1) 0.6185 (+/- 0.2429)
upostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384) 
                        upostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398)
npostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384) 
                        npostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398)
tancurv Accuracy (+/- 0) 0.2330 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
                        tancurv Accuracy (+/- 1) 0.5831 (+/- 0.2465)
npostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199) 
                        npostan Accuracy (+/- 1) 0.7003 (+/- 0.2291)
nposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
                        nposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453)
upostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199) 
                        upostan Accuracy (+/- 1) 0.7003 (+/- 0.2291)
uposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
                        uposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453)
upos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293) 
                        upos Accuracy (+/- 1) 0.6450 (+/- 0.2393)
npos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293) 
                        npos Accuracy (+/- 1) 0.6450 (+/- 0.2393)
curv Accuracy (+/- 0) 0.1553 Accuracy (+/- 1) 0.4854 (+/- 0.2499) 
                        curv Accuracy (+/- 1) 0.5570 (+/- 0.2484)
tan Accuracy (+/- 0) 0.3107 Accuracy (+/- 1) 0.7184 (+/- 0.2249) 
                        tan Accuracy (+/- 1) 0.7231 (+/- 0.2237)

Andreas Mueller

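What do you mean by "structurally they are the same"? You train and test on different subsets, and those subsets have different sizes. If the training data isn't exactly the same, I don't see why you would expect the results to be the same.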
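
To make the size difference concrete, here is a rough sketch that only counts rows and does not train anything. The total of 309 samples is a hypothetical figure, chosen because it gives a 103-row test set, which is consistent with several of the reported hold-out fractions (for example 0.2718 is about 28/103). Rounding the test set size up is also an assumption about how the split is computed.

```python
import math

N_SAMPLES = 309        # hypothetical total number of feature vectors
TEST_FRACTION = 0.333  # test_size used in the question's train_test_split call


def holdout_sizes(n_samples, test_fraction):
    """Return (n_train, n_test) for one hold-out split, rounding the test set up."""
    n_test = int(math.ceil(n_samples * test_fraction))
    return n_samples - n_test, n_test


def leave_one_out_sizes(n_samples):
    """Return (n_train per fit, n_test per fit, number of fits) for leave-one-out."""
    return n_samples - 1, 1, n_samples


holdout_train, holdout_test = holdout_sizes(N_SAMPLES, TEST_FRACTION)
loo_train, loo_test, loo_fits = leave_one_out_sizes(N_SAMPLES)

print("hold-out:      1 fit, {0} training rows, {1} test rows".format(
    holdout_train, holdout_test))
print("leave-one-out: {0} fits, {1} training rows, {2} test row each".format(
    loo_fits, loo_train, loo_test))
```

Under these assumptions the hold-out model is fitted once on about two thirds of the data, while each leave-one-out model sees all rows but one. Because the grid search with cv = 3 runs inside every fit, it also chooses gamma and C from a different amount of data and may settle on different values from one fit to the next.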

By the way, see also the note on LOO in the documentation. LOO can have high variance.
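
A related point about the "(+/- ...)" figures in the question: in both loops every scored value is either 0 or 1, and for such values the standard deviation is fully determined by the mean, as sqrt(p * (1 - p)). The printed spread is therefore not an independent measure of uncertainty. The sketch below checks this against four of the reported pairs and then computes a rough standard error for a proportion. The test set size of 103 is an assumption inferred from the reported fractions, not a figure stated in the question.

```python
import math


def binary_std(p):
    """Population standard deviation of 0/1 values whose mean is p."""
    return math.sqrt(p * (1.0 - p))


def standard_error(p, n):
    """Approximate standard error of a proportion p estimated from n samples."""
    return math.sqrt(p * (1.0 - p) / n)


# (mean, printed spread) pairs copied from the results in the question.
reported = [(0.6024, 0.2447), (0.6185, 0.2429), (0.7379, 0.2199), (0.7003, 0.2291)]
for mean, spread in reported:
    # The question's code prints std / 2, so compare against half the binary std.
    print("mean {0:0.4f}: reported {1:0.4f}, sqrt(p(1-p))/2 = {2:0.4f}".format(
        mean, spread, binary_std(mean) / 2.0))

ASSUMED_TEST_SIZE = 103  # assumption: inferred from fractions such as 28/103
se = standard_error(0.7379, ASSUMED_TEST_SIZE)
print("rough standard error of the 0.7379 hold-out estimate: {0:0.4f}".format(se))
```

If the assumed test size is right, a gap such as 0.7379 versus 0.7003 is smaller than one standard error of the hold-out estimate, so differences of this size can plausibly come from the split alone.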

Edited on 2021-06-08