K-Means GridSearchCV hyperparameter tuning

Posted by debugcn in Dev

Arun

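I am trying to perform hyperparameter tuning for a spatio-temporal K-Means clustering by using it in a pipeline with a decision tree classifier. The idea is to use the K-Means clustering algorithm to generate a cluster-distance space matrix and cluster labels, which are then passed to the decision tree classifier. For hyperparameter tuning, only the parameters of the K-Means algorithm are tuned.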

I am using Python 3.8 and sklearn 0.22.

The data I am interested in has 3 columns/attributes: 'time', 'x' and 'y' (x and y are spatial coordinates).
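To make the idea concrete, here is a minimal sketch on synthetic data. The real dataset is not shown in this post, so the random `data` frame, the row count of 200 and the choice of 5 clusters are assumptions made only for illustration. It uses plain `KMeans` on the raw columns, without the spatio-temporal filtering of the class below.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: one time column, two spatial columns.
data = pd.DataFrame({
    "time": rng.integers(0, 20, size=200).astype(float),
    "x": rng.random(200),
    "y": rng.random(200),
})

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)

# fit_transform returns the distance of every sample to every cluster centre.
cluster_distances = kmeans.fit_transform(data[["time", "x", "y"]].to_numpy())
cluster_labels = kmeans.labels_

# The distances become the features and the cluster labels become the target.
dtc = DecisionTreeClassifier(random_state=0)
dtc.fit(cluster_distances, cluster_labels)

print(cluster_distances.shape)  # (200, 5)
print(cluster_labels.shape)     # (200,)
```

The distance matrix always has one column per cluster, so changing the number of clusters changes the number of features the classifier sees.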

The code is:

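import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_array


class ST_KMeans(BaseEstimator, TransformerMixin):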
    """
    Note that K-means clustering algorithm is designed for Euclidean distances.
    It may stop converging with other distances, when the mean is no longer a
    best estimation for the cluster 'center'.

    The 'mean' minimizes squared differences (or, squared Euclidean distance).
    If you want a different distance function, you need to replace the mean with
    an appropriate center estimation.


    Parameters:

    k:  number of clusters

    eps1 : float, default=0.5
        The spatial density threshold (maximum spatial distance) between 
        two points to be considered related.

    eps2 : float, default=10
        The temporal threshold (maximum temporal distance) between two 
        points to be considered related.

    metric : string default='euclidean'
        The used distance metric - more options are
        ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’,
        ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’,
        ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘rogerstanimoto’, ‘sqeuclidean’,
        ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘yule’.

    n_jobs : int or None, default=1
        The number of processes to start; -1 means use all processors (BE AWARE)


    Attributes:

    labels : array, shape = [n_samples]
        Cluster labels for the data - noise is defined as -1
    """

    def __init__(self, k, eps1 = 0.5, eps2 = 10, metric = 'euclidean', n_jobs = 1):
        self.k = k
        self.eps1 = eps1
        self.eps2 = eps2
        # self.min_samples = min_samples
        self.metric = metric
        self.n_jobs = n_jobs


    def fit(self, X, Y = None):
        """
        Apply the ST K-Means algorithm 

        X : 2D numpy array. The first attribute of the array should be time attribute
            as float. The following positions in the array are treated as spatial
            coordinates.
            The structure should look like this [[time_step1, x, y], [time_step2, x, y]..]

            For example 2D dataset:
            array([[0,0.45,0.43],
            [0,0.54,0.34],...])


        Returns:

        self
        """

        # check if input is correct
        X = check_array(X)

        # type(X)
        # numpy.ndarray

        # Check that the distance thresholds are valid-
        if not self.eps1 > 0.0 or not self.eps2 > 0.0:
            raise ValueError('eps1 and eps2 must be positive')

        # Get dimensions of 'X'-
        # n - number of rows
        # m - number of attributes/columns-
        n, m = X.shape


        # Compute square-form distance matrices for the 'time' and spatial attributes-
        time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric = self.metric))
        euc_dist = squareform(pdist(X[:, 1:], metric = self.metric))

        '''
        Filter the euclidean distance matrix using time distance matrix. The code snippet gets all the
        indices of the 'time_dist' matrix in which the time distance is smaller than 'eps2'.
        Afterward, for the same indices in the euclidean distance matrix the 'eps1' is doubled which results
        in the fact that the indices are not considered during clustering - as they are bigger than 'eps1'.
        '''
        # filter 'euc_dist' matrix using 'time_dist' matrix-
        dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)


        # Initialize K-Means clustering model-
        self.kmeans_clust_model = KMeans(
            n_clusters = self.k, init = 'k-means++',
            n_init = 10, max_iter = 300,
            precompute_distances = 'auto', algorithm = 'auto')

        # Train model-
        self.kmeans_clust_model.fit(dist)


        self.labels = self.kmeans_clust_model.labels_
        self.X_transformed = self.kmeans_clust_model.fit_transform(X)

        return self


    def transform(self, X):
        if not isinstance(X, np.ndarray):
            # Convert to numpy array-
            X = X.values

        # Get dimensions of 'X'-
        # n - number of rows
        # m - number of attributes/columns-
        n, m = X.shape


        # Compute square-form distance matrices for the 'time' and spatial attributes-
        time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric = self.metric))
        euc_dist = squareform(pdist(X[:, 1:], metric = self.metric))

        # filter 'euc_dist' matrix using 'time_dist' matrix-
        dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)

        # return self.kmeans_clust_model.transform(X)
        return self.kmeans_clust_model.transform(dist)


# Initialize ST-K-Means object-
st_kmeans_algo = ST_KMeans(
    k = 5, eps1=0.6,
    eps2=9, metric='euclidean',
    n_jobs=1
    )

Y = np.zeros(shape = (501,))

# Train on a chunk of dataset-
st_kmeans_algo.fit(data.loc[:500, ['time', 'x', 'y']], Y)

# Get clustered data points labels-
kmeans_labels = st_kmeans_algo.labels

kmeans_labels.shape
# (501,)


# Get labels for points clustered using trained model-
# kmeans_transformed = st_kmeans_algo.X_transformed
kmeans_transformed = st_kmeans_algo.transform(data.loc[:500, ['time', 'x', 'y']])

kmeans_transformed.shape
# (501, 5)

dtc = DecisionTreeClassifier()

dtc.fit(kmeans_transformed, kmeans_labels)

y_pred = dtc.predict(kmeans_transformed)

# Get model performance metrics-
accuracy = accuracy_score(kmeans_labels, y_pred)
precision = precision_score(kmeans_labels, y_pred, average='macro')
recall = recall_score(kmeans_labels, y_pred, average='macro')

print("\nDT model metrics are:")
print("accuracy = {0:.4f}, precision = {1:.4f} & recall = {2:.4f}\n".format(
    accuracy, precision, recall
    ))

# DT model metrics are:
# accuracy = 1.0000, precision = 1.0000 & recall = 1.0000




# Hyper-parameter Tuning:

# Define steps of pipeline-
pipeline_steps = [
    ('st_kmeans_algo' ,ST_KMeans(k = 5, eps1=0.6, eps2=9, metric='euclidean', n_jobs=1)),
    ('dtc', DecisionTreeClassifier())
    ]

# Instantiate a pipeline-
pipeline = Pipeline(pipeline_steps)

kmeans_transformed.shape, kmeans_labels.shape
# ((501, 5), (501,))

# Train pipeline-
pipeline.fit(kmeans_transformed, kmeans_labels)




# Specify parameters to be hyper-parameter tuned-
params = [
    {
        'st_kmeans_algo__k': [3, 5, 7]
    }
    ]

# Initialize GridSearchCV object-
grid_cv = GridSearchCV(estimator=pipeline, param_grid=params, cv = 2)

# Train GridSearch on computed data from above-
grid_cv.fit(kmeans_transformed, kmeans_labels)

The 'grid_cv.fit()' call produces the following error:

ValueError                                Traceback (most recent call last)
      6 # Train GridSearch on computed data from above-
----> 7 grid_cv.fit(kmeans_transformed, kmeans_labels)

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
--> 710         self._run_search(evaluate_candidates)

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
-> 1151         evaluate_candidates(ParameterGrid(self.param_grid))

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params)
--> 682         out = parallel(delayed(_fit_and_score)(clone(base_estimator),
    683                                                X, y,
    684                                                train=train, test=test,

~/.local/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
-> 1004         if self.dispatch_one_batch(iterator):

~/.local/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
--> 835         self._dispatch(tasks)

~/.local/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
--> 754         job = self._backend.apply_async(batch, callback=cb)

~/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
--> 209         result = ImmediateResult(func)

~/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
--> 590         self.results = batch()

~/.local/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
--> 255         return [func(*args, **kwargs)
    256                 for func, args, kwargs in self.items]

~/.local/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
--> 255         return [func(*args, **kwargs)
    256                 for func, args, kwargs in self.items]

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
--> 544         test_scores = _score(estimator, X_test, y_test, scorer)

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer)
--> 591         scores = scorer(estimator, X_test, y_test)

~/.local/lib/python3.8/site-packages/sklearn/metrics/_scorer.py in __call__(self, estimator, *args, **kwargs)
---> 89         score = scorer(estimator, *args, **kwargs)

~/.local/lib/python3.8/site-packages/sklearn/metrics/_scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
--> 371     return estimator.score(*args, **kwargs)

~/.local/lib/python3.8/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)

~/.local/lib/python3.8/site-packages/sklearn/pipeline.py in score(self, X, y, sample_weight)
--> 619         return self.steps[-1][-1].score(Xt, y, **score_params)

~/.local/lib/python3.8/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
--> 369         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)

~/.local/lib/python3.8/site-packages/sklearn/metrics/_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
--> 185     y_type, y_true, y_pred = _check_targets(y_true, y_pred)

~/.local/lib/python3.8/site-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
---> 80     check_consistent_length(y_true, y_pred)

~/.local/lib/python3.8/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    209     uniques = np.unique(lengths)
    210     if len(uniques) > 1:
--> 211         raise ValueError("Found input variables with inconsistent numbers of"
    212                          " samples: %r" % [int(l) for l in lengths])

ValueError: Found input variables with inconsistent numbers of samples: [251, 250]

The different dimensions/shapes are:

kmeans_transformed.shape, kmeans_labels.shape, data.loc[:500, ['time', 'x', 'y']].shape                                       
# ((501, 5), (501,), (501, 3))

I don't understand how the error arrives at "samples: [251, 250]".

What is going wrong?

Thanks!

Marco Cerliani

250 and 251 are, respectively, the sizes of the training and validation splits inside GridSearchCV.

Look at your custom estimator...
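Where those two numbers come from can be checked with a small sketch. It assumes the usual scikit-learn behaviour that an integer `cv` on a classifier is turned into a `StratifiedKFold`. The zero-filled `X` and the cyclic placeholder labels in `y` are made up for illustration; only the row count of 501 and `cv = 2` come from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same number of rows as in the question; the values themselves are irrelevant here.
X = np.zeros((501, 5))
y = np.arange(501) % 5  # placeholder labels with 5 classes

# Two folds over 501 rows cannot be equal: one fold gets 251 rows, the other 250.
cv = StratifiedKFold(n_splits=2)
sizes = []
for train_idx, test_idx in cv.split(X, y):
    sizes.append((len(train_idx), len(test_idx)))
    print("train:", len(train_idx), "validation:", len(test_idx))
```

In each split the scorer compares the validation labels with the pipeline's predictions on the validation rows, so a transformer that always returns an array of training-set length produces exactly this kind of length mismatch.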

def transform(self, X):

    return self.X_transformed

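The original transform method does not apply any kind of operation; it simply returns the training data. We need an estimator that can transform new data flexibly (in our case, the validation data inside the grid search). Change the transform method in this way: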

def transform(self, X):

    return self.kmeans_clust_model.transform(X)
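As a self-contained illustration of this fix, the sketch below wraps plain `KMeans` in a transformer whose `transform` works on any input and runs it through the same kind of grid search. The class name `KMeansDistances`, the random data and the placeholder target are hypothetical, and the spatio-temporal filtering from the question is deliberately left out, so this is a simplified stand-in rather than the asker's estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class KMeansDistances(BaseEstimator, TransformerMixin):
    """Hypothetical minimal transformer: K-Means cluster distances as features."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y=None):
        self.kmeans_clust_model = KMeans(n_clusters=self.k, n_init=10, random_state=0)
        self.kmeans_clust_model.fit(X)
        return self

    def transform(self, X):
        # Returns one row per row of X, so it also works on validation folds.
        return self.kmeans_clust_model.transform(X)


rng = np.random.default_rng(0)
X = rng.random((501, 3))         # synthetic [time, x, y] rows
y = (X[:, 1] > 0.5).astype(int)  # placeholder target, for illustration only

pipeline = Pipeline([
    ("st_kmeans_algo", KMeansDistances(k=5)),
    ("dtc", DecisionTreeClassifier(random_state=0)),
])

grid_cv = GridSearchCV(pipeline, param_grid={"st_kmeans_algo__k": [3, 5, 7]}, cv=2)
grid_cv.fit(X, y)
print(grid_cv.best_params_)
```

Note that the sketch fits the grid search on the raw rows: the pipeline applies the clustering step itself, so the input should not be transformed beforehand.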


Edited on 2021-04-2
