Scikit-Learn custom Imputer with random values around the mean


Kevin

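I want to create a custom Imputer that replaces each NaN value in the data with a random value in the range from mean - std to mean + std, where mean and std are those of the column the NaN is in.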

This is the code I have for the Imputer so far:

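import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class GroupImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # allow NaN in the input; they are what we want to impute
        X = check_array(X, force_all_finite=False)
        # per-column statistics, ignoring NaN entries
        self.means = np.nanmean(X, axis=0)
        self.stds = np.nanstd(X, axis=0)
        return self

    def transform(self, X, y=None):
        check_is_fitted(self, 'means')
        check_is_fitted(self, 'stds')
        X = check_array(X, force_all_finite=False)
        # how do I apply this to each row of the data?
        return 0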

self.means contains the mean of each column.

self.stds contains the standard deviation of each column.
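As a small illustration of what those two attributes hold, the sketch below runs the same NumPy calls on a made-up 3x2 array (the array and its values are assumptions chosen for easy arithmetic, not data from the question). Note that np.nanstd uses ddof=0 by default, so it returns the population standard deviation.

```python
import numpy as np

# Toy data (hypothetical): one NaN in each column
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# Column-wise statistics that skip NaN entries
means = np.nanmean(X, axis=0)  # column 0: mean of (1, 3); column 1: mean of (4, 8)
stds = np.nanstd(X, axis=0)    # population std (ddof=0) of the same values

print(means)
print(stds)
```

Both results are 1-D arrays with one entry per column, so they line up with the columns of X when broadcast against it.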

How do I fill each NaN in the rows of the data with a random value between mean - std and mean + std?

Do I need to iterate over the data (for row in X:) and then pick the right mean and std by column index? Or is there a way to do this without a loop?

Tchaikovsky

No, you don't have to iterate over the data. Suppose the data has 5 rows and 4 columns:

import numpy as np

num_rows, num_cols = 5, 4

# just fake two arrays of column means and stds
column_means = np.random.uniform(1, 8, num_cols)
column_stds = np.random.rand(num_cols)

# the low/high bounds broadcast across rows, so each column gets its own range
disp = np.random.uniform(column_means - column_stds, column_means + column_stds, size=(num_rows, num_cols))

The array disp looks like this:

array([[6.29377845, 6.56185572, 5.32590954, 2.14719305],
       [6.36648777, 6.97781432, 4.89773801, 2.21909144],
       [5.38109603, 6.70649396, 5.50100582, 2.26518757],
       [5.59764259, 6.90297057, 5.65199988, 2.25340505],
       [5.80928963, 6.4976407 , 5.23792109, 1.99580784]])

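Each column of this array is sampled uniformly from the range (column mean - column std, column mean + column std). The NaN entries of the original array can therefore be replaced with the corresponding entries of disp.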
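Putting the pieces together, one way the transform step might look is sketched below. This is not code from the thread: the class name RandomAroundMeanImputer, the random_state parameter (assumed to be an int or None), the trailing-underscore attribute names, and the use of np.asarray in place of check_array are all assumptions made for illustration. The replacement itself uses np.where to take values from disp only where the input is NaN.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class RandomAroundMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: fill NaN with uniform draws in [mean - std, mean + std]."""

    def __init__(self, random_state=None):
        # Assumed parameter: an int seed or None
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Per-column statistics, ignoring NaN entries
        self.means_ = np.nanmean(X, axis=0)
        self.stds_ = np.nanstd(X, axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self, "means_")
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(self.random_state)
        # One candidate value per cell; bounds broadcast column-wise
        disp = rng.uniform(self.means_ - self.stds_,
                           self.means_ + self.stds_,
                           size=X.shape)
        # Keep observed values, use the random draw only where X is NaN
        return np.where(np.isnan(X), disp, X)


# Toy data (hypothetical) with one NaN per column
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

filled = RandomAroundMeanImputer(random_state=0).fit_transform(X)
print(filled)
```

Drawing a full array and discarding most of it is wasteful for very large inputs with few NaN values, but it keeps the code free of Python loops, which is the point of the answer above.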
