Python: using pandas to append a continuous attribute, converted to categories, to the original dataset

Anshul Vyas

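I'm new to Python and want to implement a Naive Bayes classifier using the pandas library. To do that, I want to convert either all continuous attributes to categorical ones or vice versa.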

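I'm going with the first strategy and am trying to convert the continuous attributes to categorical ones to keep the data consistent. The dataset I'm using is the income (Adult) dataset, available at: http://archive.ics.uci.edu/ml/datasets/Adult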

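Right now I convert a continuous attribute using the

pandas.cut(X, bins, labels=None)

method. I use equal-width binning to assign a label to each bin (example below) and store the result in the variable cat_age. Now I want to replace the age attribute in the dataset with this categorical age attribute.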

cat_age = pd.cut(age, [0, 25, 45, 65, 95], labels = ["Young", "Middle-aged", "Senior", "Old"], right = True , include_lowest = True)
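To make the edge behaviour of that call concrete, here is a minimal sketch. The ages are made up (they are not taken from the Adult data) and were picked to sit on or next to the bin edges; the bins and labels are the ones from the call above.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges.
age = pd.Series([0, 25, 26, 45, 65, 66, 95, 96])

cat_age = pd.cut(
    age,
    [0, 25, 45, 65, 95],
    labels=["Young", "Middle-aged", "Senior", "Old"],
    right=True,           # intervals are closed on the right, e.g. (25, 45]
    include_lowest=True,  # the first interval also includes its left edge, 0
)

# One label per input value; 96 lies above the last edge and becomes NaN.
print(cat_age.tolist())
print(cat_age.dtype)  # category
```

Two things to note: a value outside the outermost edges (96 here) is left as NaN rather than raising an error, and the result has the pandas category dtype, with the labels as its categories.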

However, I can't replace the old AGE attribute values in the dataset. I tried the DataFrame.replace() and DataFrame.assign() methods.

pandas.DataFrame.replace(to_replace='AGE', value = cat_age)   #AGE is the column name in the dataset.

and

pandas.DataFrame.assign(AGE = cat_age)

But neither of these lets me replace the column in the dataset or assign different values to it.

The DataFrame.replace() method doesn't give any error but doesn't show the new values in the dataset either.

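I'm sure I'm making some naive mistake. Can anyone suggest a way either to add a new AGE column holding the categorical values or to replace the old column with these new values?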

The old values are of type int64, while the new values are of type str.

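This is just one column; I want to convert other columns in the dataset the same way: hours-per-week, capital-gain, capital-loss. Any help would be appreciated.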

Mark Graph

I can't reproduce your problem. Your assignment approach looks overly complicated. I think you just need to say df['new_col'] = Series
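As a rough sketch of the difference between the three approaches, using a made-up four-row frame rather than the real dataset (the AGE_CAT column name and the "anything" replacement value are placeholders of mine):

```python
import pandas as pd

# Hypothetical miniature dataset; only the AGE column matters here.
df = pd.DataFrame({"AGE": [22, 38, 51, 70]})
cat_age = pd.cut(df["AGE"], [0, 25, 45, 65, 95],
                 labels=["Young", "Middle-aged", "Senior", "Old"],
                 right=True, include_lowest=True)

# replace() searches the cells for the value 'AGE'; it does not look up a
# column by that name. No cell matches, so the returned frame is unchanged.
unchanged = df.replace(to_replace="AGE", value="anything")

# assign() returns a new DataFrame; df itself stays as it was unless the
# result is bound to a name.
df_new = df.assign(AGE=cat_age)

# Plain column assignment modifies df in place. Here the categories are
# added next to the original column; use df["AGE"] = cat_age to overwrite it.
df["AGE_CAT"] = cat_age

print(df)
```

Both replace() and assign() hand back a new object by default, which would explain seeing no error and no change when the return value is discarded.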

Anyway, the following works...

import pandas as pd

# get data
from urllib.request import urlopen  # Python 3; on Python 2 this was "from urllib import urlopen"
page = urlopen('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')
df = pd.read_csv(page, header=None, index_col=None)
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain',
    'capital-loss', 'hours-per-week', 'native-country', 'income']

# put ages in categories and add them as a new column of the DataFrame
# (upper edge of 150 so that no realistic age falls outside the bins)
age_bins = [0, 25, 45, 65, 150]
age_labels = ["Young", "Middle-aged", "Senior", "Old"]
df['age_cat'] = pd.cut(df['age'], age_bins, labels=age_labels,
    right=True, include_lowest=True)
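The same pattern should carry over to the other columns mentioned in the question. Below is a minimal sketch on a made-up three-row frame; the bin edges, the labels, and the "_cat" suffix are illustrative assumptions of mine, not values from the question, so pick edges that suit the real distributions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the Adult data.
df = pd.DataFrame({
    "hours-per-week": [20, 40, 60],
    "capital-gain": [0, 2174, 99999],
    "capital-loss": [0, 1902, 0],
})

# Illustrative bin edges and labels (assumptions, not taken from the thread).
# The (-1, 0] bin gives the many zero values their own category.
binning = {
    "hours-per-week": ([0, 30, 45, np.inf], ["Part-time", "Full-time", "Overtime"]),
    "capital-gain": ([-1, 0, 5000, np.inf], ["Zero", "Low", "High"]),
    "capital-loss": ([-1, 0, 2000, np.inf], ["Zero", "Low", "High"]),
}

# Add one categorical column per continuous column, keeping the originals.
for col, (bins, labels) in binning.items():
    df[col + "_cat"] = pd.cut(df[col], bins, labels=labels,
                              right=True, include_lowest=True)

print(df)
```

Keeping the edges in one dictionary makes it easy to adjust them per column; using np.inf as the last edge avoids NaN for unexpectedly large values.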