I am trying to feed the following data into a random forest algorithm using sklearn.
The data (shown as CSV):
id,CAP,astroturf,fake_follower,financial,other,overall,self-declared,labels
3039154799,0.7828265255249504,0.1,1.8,1.4,3.2,1.4,0.4,1
390617262,1.0,0.8,1.4,1.0,5.0,5.0,0.2,0
4611389296,0.7334998320027682,0.2,0.6,0.1,1.8,1.1,0.0,1
My code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
master_training_set_path = "data_bank/cleaning_data/master_training_data_id/master_train_one_hot.csv"
df = pd.read_csv(master_training_set_path)
labels = np.array(df["labels"].values)
train, test, train_labels, test_labels = train_test_split(df, labels,
stratify=labels,
test_size=0.3)
model = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features='sqrt')
# this is the problematic line
model.fit(train, train_labels)
The problematic line is the last one. When I run it, it returns the following traceback:
Traceback (most recent call last):
File "path\random_forest.py", line 39, in <module>
model.fit(train, train_labels)
File "path\sklearn\ensemble\forest.py", line 247, in fit
X = check_array(X, accept_sparse="csc", dtype=DTYPE)
File "path\sklearn\utils\validation.py", line 434, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'self-declared'
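I tried making sure that the 'train' and 'train_label' variables are 2D NumPy arrays, but I still get the same error.
What confuses me is that 'self-declared' is not a value but the name of one of the features in the dataset. Why doesn't sklearn drop the header before training on the data?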
The code works with scikit-learn version 0.23.1. If you are using an older version, you could try updating:
conda install scikit-learn=0.23.1
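Before updating, it may help to confirm which version is actually installed. The following is a minimal sketch using only the standard library (Python 3.8+); `installed_version` is a hypothetical helper name and not part of scikit-learn.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the installed scikit-learn version, or None if it is missing.
print(installed_version("scikit-learn"))
```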
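The problem may be that you are passing df to train_test_split. That works, but it can cause a problem for the model, because train and test are then created as DataFrames (with headers) rather than as feature matrices. So you could try replacing: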
train, test, train_labels, test_labels = train_test_split(df, labels,
stratify=labels,
test_size=0.3)
with this:
df.drop(['labels'],axis=1,inplace=True) #you have labels in the training set as well.
train, test, train_labels, test_labels = train_test_split(df.values, labels,
stratify=labels,
test_size=0.3)
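If the error persists after that change, one possibility (an assumption, not something the question confirms) is that the CSV file itself contains a repeated header line among the data rows, which would put the string 'self-declared' into the data. The sketch below shows one way to look for such cells using only the standard library. The sample data and the helper name `find_non_numeric_cells` are hypothetical.

```python
import csv
import io

# Hypothetical sample: the header line appears a second time inside the data,
# as can happen when several CSV files are concatenated.
SAMPLE_CSV = """id,CAP,self-declared,labels
3039154799,0.78,0.4,1
id,CAP,self-declared,labels
390617262,1.0,0.2,0
"""


def find_non_numeric_cells(csv_text):
    """Return (line_number, column_name, value) for each cell that float() rejects."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    problems = []
    # Data starts on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        for column_name, value in zip(header, row):
            try:
                float(value)
            except ValueError:
                problems.append((line_number, column_name, value))
    return problems


for problem in find_non_numeric_cells(SAMPLE_CSV):
    print(problem)
```

With pandas, a roughly equivalent check would be to apply pd.to_numeric with errors="coerce" to each column and look for the NaN values that result.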