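I am using RandomForestClassifier on the CPU with SKLearn and on the GPU with RAPIDS. I am benchmarking the two libraries against each other for speedup and score using the Iris dataset (this is a first try; in the future I will change the dataset for a better benchmark, but I am starting with these two libraries).
The problem is that when I measure the score on the CPU I always get 1.0, but when I measure the score on the GPU I get a value that varies between 0.2 and 1.0, and I don't understand why this happens.
First of all, the library versions I am using are: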
NumPy Version: 1.17.5
Pandas Version: 0.25.3
Scikit-Learn Version: 0.22.1
cuPY Version: 6.7.0
cuDF Version: 0.12.0
cuML Version: 0.12.0
Dask Version: 2.10.1
DaskCuda Version: 0+unknown
DaskCuDF Version: 0.12.0
MatPlotLib Version: 3.1.3
SeaBorn Version: 0.10.0
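The code I use for the SKLearn RandomForestClassifier is:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as skRandomForestClassifier
from sklearn.metrics import accuracy_score as sk_accuracy_score
from sklearn.model_selection import train_test_split as sk_train_test_split
# Read data in host memory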
host_s_csv = pd.read_csv('./DataSet/iris.csv', header = 0, delimiter = ',') # Get complete CSV
host_s_data = host_s_csv.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
host_s_labels = host_s_csv.iloc[:, 4].astype('category').cat.codes # Get labels column
# Plot data
#sns.pairplot(host_s_csv, hue = 'variety');
# Split train and test data
host_s_data_train, host_s_data_test, host_s_labels_train, host_s_labels_test = sk_train_test_split(host_s_data, host_s_labels, test_size = 0.2, random_state = 0)
# Create RandomForest model
sk_s_random_forest = skRandomForestClassifier(n_estimators = 40,
                                              max_depth = 16,
                                              max_features = 1.0,
                                              random_state = 10,
                                              n_jobs = 1)
# Fit data in RandomForest
sk_s_random_forest.fit(host_s_data_train, host_s_labels_train)
# Predict data
sk_s_random_forest_labels_predicted = sk_s_random_forest.predict(host_s_data_test)
# Check score
print('accuracy_score: ', sk_accuracy_score(host_s_labels_test, sk_s_random_forest_labels_predicted))
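The label column above is encoded with `.astype('category').cat.codes`, which turns each distinct class name into an integer. The following is a minimal pure-Python sketch of that mapping, assuming the categories are ordered by sorting the distinct names (which is what pandas does for sortable values). The helper name `category_codes` and the sample class names are illustrative assumptions, not taken from the actual file.

```python
def category_codes(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    categories = sorted(set(labels))
    lookup = {name: code for code, name in enumerate(categories)}
    return [lookup[name] for name in labels], categories


# Illustrative values only (assumed class names); the real file has 150 rows.
sample = ["Setosa", "Virginica", "Versicolor", "Setosa"]
codes, categories = category_codes(sample)
print(codes)       # [0, 2, 1, 0]
print(categories)  # ['Setosa', 'Versicolor', 'Virginica']
```

With three distinct codes, the task is a multi-class classification rather than a binary one.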
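The code I use for the RAPIDS RandomForestClassifier is:
import cudf
from cuml.ensemble import RandomForestClassifier as cusRandomForestClassifier
from cuml.metrics import accuracy_score as cu_accuracy_score
from cuml.preprocessing.model_selection import train_test_split as cu_train_test_split
# Read data in device memory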
device_s_csv = cudf.read_csv('./DataSet/iris.csv', header = 0, delimiter = ',') # Get complete CSV
device_s_data = device_s_csv.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
device_s_labels = device_s_csv.iloc[:, 4].astype('category').cat.codes # Get labels column
# Plot data
#sns.pairplot(device_s_csv.to_pandas(), hue = 'variety');
# Split train and test data
device_s_data_train, device_s_data_test, device_s_labels_train, device_s_labels_test = cu_train_test_split(device_s_data, device_s_labels, train_size = 0.8, shuffle = True, random_state = 0)
# Use same data as host
#device_s_data_train = cudf.DataFrame.from_pandas(host_s_data_train)
#device_s_data_test = cudf.DataFrame.from_pandas(host_s_data_test)
#device_s_labels_train = cudf.Series.from_pandas(host_s_labels_train).astype('int32')
#device_s_labels_test = cudf.Series.from_pandas(host_s_labels_test).astype('int32')
# Create RandomForest model
cu_s_random_forest = cusRandomForestClassifier(n_estimators = 40,
                                               max_depth = 16,
                                               max_features = 1.0,
                                               n_streams = 1)
# Fit data in RandomForest
cu_s_random_forest.fit(device_s_data_train, device_s_labels_train)
# Predict data
cu_s_random_forest_labels_predicted = cu_s_random_forest.predict(device_s_data_test)
# Check score
print('accuracy_score: ', cu_accuracy_score(device_s_labels_test, cu_s_random_forest_labels_predicted))
A sample of the Iris dataset I am using: (image not reproduced here)
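Do you know why this happens? Both models are set up identically, with the same parameters, ... I don't know why there is such a big difference between the scores.
Thank you.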
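This is caused by a known issue in our predict code. It was corrected in 0.13, which now issues a warning and falls back to the CPU for multi-class classification. In 0.12 there is neither a warning nor a fallback, so if you did not know to use predict_model="CPU" for multi-class classification, your prediction score would be far lower than it should be for the model that was just fit.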
See the issue here: https://github.com/rapidsai/cuml/issues/1623
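As a rough illustration of the workaround, the choice of prediction backend can be made from the number of distinct classes in the training labels. This is only a sketch: `choose_predict_model` is a hypothetical helper, not part of cuML, and it assumes the labels have already been copied to host memory (for example with `to_pandas()`) and that `"GPU"` and `"CPU"` are the two values `predict_model` accepts.

```python
def choose_predict_model(host_labels):
    """Return "CPU" for multi-class problems, "GPU" for binary ones.

    host_labels is any host-side iterable of class codes. The rule mirrors
    the advice above for 0.12; it is an assumption, not library behavior.
    """
    n_classes = len(set(host_labels))
    return "CPU" if n_classes > 2 else "GPU"


# Iris has three classes, so the CPU path is selected.
print(choose_predict_model([0, 1, 2, 1, 0]))  # CPU
# A two-class problem can stay on the GPU path.
print(choose_predict_model([0, 1, 1, 0]))     # GPU
```

The returned string could then be passed as `predict_model=...` in the `predict` call, so the same script works for both binary and multi-class datasets.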
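Here is some code to help you and others. It has been modified so that it will be a bit easier for others to use in the future. I get ~0.9333 on a GV100 with RAPIDS 0.12 stable.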
import cudf as cu
from cuml.ensemble import RandomForestClassifier as cusRandomForestClassifier
from cuml.metrics import accuracy_score as cu_accuracy_score
from cuml.preprocessing.model_selection import train_test_split as cu_train_test_split
import numpy as np
# data link: https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv
# Read data
df = cu.read_csv('./iris.csv', header = 0, delimiter = ',') # Get complete CSV
# Prep data
X = df.iloc[:, [0, 1, 2, 3]].astype(np.float32) # Get data columns. Must be float32 for our Classifier
y = df.iloc[:, 4].astype('category').cat.codes # Get labels column. Will convert to int32
cu_s_random_forest = cusRandomForestClassifier(
    n_bins = 16,
    n_estimators = 40,
    max_depth = 16,
    max_features = 1.0,
    n_streams = 1)
train_data, test_data, train_label, test_label = cu_train_test_split(X, y, train_size=0.8)
# Fit data in RandomForest
cu_s_random_forest.fit(train_data,train_label)
# Predict data
predict = cu_s_random_forest.predict(test_data, predict_model="CPU") # use CPU to do multi-class classifications
print(predict)
# Check score
print('accuracy_score: ', cu_accuracy_score(test_label, predict))
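When a single accuracy number looks suspicious, a per-class breakdown can show whether one class is being systematically mispredicted. The sketch below is plain Python and assumes the true labels and predictions have been copied to host lists. The helper name `per_class_accuracy` and the data are made up for illustration, and nothing here claims to reproduce the exact predictions cuML 0.12 returns.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    total = Counter()
    correct = Counter()
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in sorted(total)}


# Made-up example: a predictor that never outputs class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 0]
print(per_class_accuracy(y_true, y_pred))  # {0: 1.0, 1: 1.0, 2: 0.0}
```

In this made-up case the overall accuracy is 4/6, but the breakdown makes it obvious that every class 2 sample is wrong, which is easier to act on than the aggregate score.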