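I am working with a dataset in CSV format. It has 22,255 observations and 35 variables (columns).
Here is a sample of two columns from the dataset (shown as a data frame):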
df = pd.DataFrame({
'sector': ['Art & Entertainment',
'Art & Entertainment',
'Communication Services',
'Art & Entertainment',
'Consumer Discretionary'],
'region': ['Oregon',
'SF Bay Area',
'SF Bay Area',
'New York City',
'Los Angeles']
})
I would like to prepare the dataset as follows:
Art & Entertainment   Communication Services   Consumer Discretionary
1                     0                        0
1                     0                        0
0                     1                        0
1                     0                        0
0                     0                        1

Portland, Oregon   SF Bay Area   New York City   Los Angeles
1                  0             0               0
0                  1             0               0
0                  1             0               0
0                  0             1               0
0                  0             0               1
Here is my code:
# Import packages
import pandas as pd

# Read the dataset
df = pd.read_csv("C:/Fall 2020 - Clarkson University/Capestone Analytics project/Internship - SeedStages/Sales dataset - Vijay.csv",
                 engine='python')

ArtEntertainment = []
Technology = []
CommunicationServices = []
ConsumerDiscretionary = []
###
PortlandOregon = []
SFBayArea = []
NewYorkCity = []
LosAngeles = []
###
for line in df['sector']:
    if line == "Art & Entertainment":
        ArtEntertainment.append(1)
    if line != "Art & Entertainment":
        ArtEntertainment.append(0)
    if line == "Communication Services":
        CommunicationServices.append(1)
    if line != "Communication Services":
        CommunicationServices.append(0)
    if line == "Consumer Discretionary":
        ConsumerDiscretionary.append(1)
    if line != "Consumer Discretionary":
        ConsumerDiscretionary.append(0)

for line in df['region']:
    if line == "Portland, Oregon":
        PortlandOregon.append(1)
    if line != "Portland, Oregon":
        PortlandOregon.append(0)
    if line == "SF Bay Area":
        SFBayArea.append(1)
    if line != "SF Bay Area":
        SFBayArea.append(0)
    if line == "New York City":
        NewYorkCity.append(1)
    if line != "New York City":
        NewYorkCity.append(0)
    if line == "Los Angeles":
        LosAngeles.append(1)
    if line != "Los Angeles":
        LosAngeles.append(0)

# Collect all the lists into a dataframe
zippedList = list(zip(ArtEntertainment, CommunicationServices, ConsumerDiscretionary,
                      PortlandOregon, SFBayArea, NewYorkCity, LosAngeles))
df1 = pd.DataFrame(zippedList, columns=["ArtEntertainment", "CommunicationServices", "ConsumerDiscretionary",
                                        "PortlandOregon", "SFBayArea", "NewYorkCity", "LosAngeles"])
df = pd.concat([df, df1], axis=1, sort=False)
Is there a way to write the same thing more professionally and in fewer lines of code? I would really appreciate your help.
If this is your starting data:
import pandas as pd
df = pd.DataFrame({
'sector': ['Art & Entertainment',
'Art & Entertainment',
'Communication Services',
'Art & Entertainment',
'Consumer Discretionary'],
'region': ['Oregon',
'SF Bay Area',
'SF Bay Area',
'New York City',
'Los Angeles']
})
then your data frame looks like this:
sector region
0 Art & Entertainment Oregon
1 Art & Entertainment SF Bay Area
2 Communication Services SF Bay Area
3 Art & Entertainment New York City
4 Consumer Discretionary Los Angeles
The function you are looking for is pandas.get_dummies:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
dummies = pd.get_dummies(df)
The resulting dummies data frame gives you the desired result:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>sector_Art & Entertainment</th>
<th>sector_Communication Services</th>
<th>sector_Consumer Discretionary</th>
<th>region_Los Angeles</th>
<th>region_New York City</th>
<th>region_Oregon</th>
<th>region_SF Bay Area</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>2</th>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>3</th>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<th>4</th>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
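For the full 35-column dataset you will probably want to encode only the two categorical columns and leave the rest untouched. The following is a minimal sketch, assuming the columns are named sector and region as in the sample; the amount column is a hypothetical stand-in for the other variables. The columns and dtype arguments are documented parameters of pandas.get_dummies. dtype=int is passed because pandas 2.0 and later return boolean (True/False) columns by default rather than 1/0.

```python
import pandas as pd

df = pd.DataFrame({
    "sector": ["Art & Entertainment", "Art & Entertainment",
               "Communication Services", "Art & Entertainment",
               "Consumer Discretionary"],
    "region": ["Oregon", "SF Bay Area", "SF Bay Area",
               "New York City", "Los Angeles"],
    "amount": [10, 20, 30, 40, 50],  # hypothetical extra column, not in the original data
})

# Option 1: replace the two categorical columns with their indicator columns.
encoded = pd.get_dummies(df, columns=["sector", "region"], dtype=int)

# Option 2: keep the original columns and append the indicators,
# which mirrors the pd.concat step in the question's code.
dummies = pd.get_dummies(df[["sector", "region"]], dtype=int)
result = pd.concat([df, dummies], axis=1)
```

Option 1 is usually what a model-fitting step needs; Option 2 is closer to what the original loop produced.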
If you wish, you can remove the sector_ and region_ prefixes from the column names as follows:
dummies.columns = [col[col.find("_") + 1:] for col in dummies.columns]
This finds the index of the first _ character, adds 1, and slices the string from that position onward.
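As an alternative to renaming afterwards, the prefix can be suppressed when the indicators are created. This is a sketch rather than the only way to do it: get_dummies accepts prefix and prefix_sep arguments, and passing an empty string for both yields bare category names. It assumes that no category value occurs in both sector and region; if one did, the result would contain duplicate column names.

```python
import pandas as pd

df = pd.DataFrame({
    "sector": ["Art & Entertainment", "Art & Entertainment",
               "Communication Services", "Art & Entertainment",
               "Consumer Discretionary"],
    "region": ["Oregon", "SF Bay Area", "SF Bay Area",
               "New York City", "Los Angeles"],
})

# An empty prefix and separator leave only the category value as the column name.
# dtype=int gives 1/0 instead of the True/False default of pandas 2.0 and later.
dummies = pd.get_dummies(df, prefix="", prefix_sep="", dtype=int)

print(list(dummies.columns))
```

Within each source column the categories come out in sorted order, so the region columns appear as Los Angeles, New York City, Oregon, SF Bay Area.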