I'm trying to build a prediction model for apartment prices, using the Python scikit-learn toolkit. My dataset contains each apartment's total floor area and its location, which I have converted into dummy features. So the dataset looks like this. I then build a learning curve to see how the model is doing. I build the learning curve this way:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

model = LinearRegression()
training_sizes, training_scores, validation_scores = learning_curve(
    estimator=model,
    X=X_train,
    y=y_train,
    train_sizes=np.linspace(5, len(X_train) * 0.8, dtype=int),
    cv=5,
)
line1, line2 = plt.plot(
    training_sizes, training_scores.mean(axis=1), 'g',
    training_sizes, validation_scores.mean(axis=1), 'r')
plt.legend((line1, line2), ('Training', 'Cross-validation'))
plt.show()
Is this normal?
Also, I tried adding polynomial features of 2nd degree, but this didn't make the model perform any differently. And because I have a lot of categorical features (106 in total), fitting even a 2nd-degree polynomial takes quite a long time, so I didn't try higher degrees.
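Roughly like this (a sketch with synthetic placeholder data instead of my real X_train; with 0/1 dummy columns, `interaction_only=True` is enough, since the square of a dummy equals the dummy itself):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))  # placeholder for the real features
y_train = X_train @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# interaction_only=True skips pure powers; for 0/1 dummies x**2 == x,
# so plain squares of dummies add no information and only slow things down.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)
scores = cross_val_score(poly_model, X_train, y_train, cv=5)  # R^2 per fold
print(scores.mean())
```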
Also, I tried building a model in Octave using the simplest possible cost function and gradient descent. The result showed the same weird error.
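The Octave code isn't shown here, but an equivalent NumPy sketch of what I mean by "simplest possible cost function and gradient descent" (squared-error cost, batch updates, synthetic placeholder data) is:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent for linear regression, cost (1/2m)*||X@theta - y||^2."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m  # gradient of the squared-error cost
        theta -= lr * grad
    return theta

# Tiny check on noiseless synthetic data: recover y = 1 + 2*x.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 1.0 + 2.0 * x[:, 0]
theta = gradient_descent(x, y)
print(theta)  # should approach [1, 2]
```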
Update: Thanks to tolik I made several amendments:
Data preparation: the categorical values are independent, so I can't combine them into one feature. The features were scaled using StandardScaler(). Thank you for that.
Feature extraction: after transforming the features with PCA, I found that one new feature has an explained variance ratio of over 99%. Although this seems strange, I used only that one component. That also let me increase the polynomial degree, though it didn't improve performance.
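A sketch of why a single component can end up with a 99% ratio (synthetic placeholder data, not my real dataset): if PCA is run before scaling, one large-scale column like the floor area can dominate the first component, which may be what happened here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder data: one large-scale column (area) next to 0/1 dummy columns.
area = rng.uniform(30, 200, size=(500, 1))
dummies = rng.integers(0, 2, size=(500, 5)).astype(float)
X = np.hstack([area, dummies])

# Without scaling, the large-variance column dominates the first component.
ratio_raw = PCA().fit(X).explained_variance_ratio_[0]
print(ratio_raw)  # close to 1.0

# After standardizing, the variance spreads across the components.
X_std = StandardScaler().fit_transform(X)
ratio_std = PCA().fit(X_std).explained_variance_ratio_[0]
print(ratio_std)
```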
Model selection: I tried several different models, but none seem to perform better than LinearRegression. An interesting thing: all models perform worse on the full data set. That is probably because I sorted the data by price, and the highest prices are mostly outliers. So when I start training on 1000 samples and go up to the maximum, I get this picture (for nearly all models):
My explanation has 3 steps: data preparation, feature extraction, and model selection.
Data preparation: scale your features (for example with StandardScaler()) and, where possible, combine related categorical columns into fewer features.
Feature extraction: transform the features with PCA and keep only the components whose explained_variance_ratio_ sums to 99%. Now you have way fewer features.
Model selection: you don't really know in advance what a good model is, because of the No Free Lunch theorem, but for this kind of problem the best results that don't use deep learning come from XGBoost (XGBRegressor), RandomForestRegressor, and AdaBoost.
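A minimal comparison sketch on placeholder data (XGBoost needs the separate xgboost package, so only scikit-learn estimators appear here; the scores are 5-fold cross-validated R^2):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder regression data standing in for the apartment features.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "ada_boost": AdaBoostRegressor(random_state=0),
}
results = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```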
The most important thing is data preparation!!!