I have the following dataframe as an example.
df_test = pd.DataFrame(data=0, index=["green","yellow","red"], columns=["bear","dog","cat"])
I have the following dictionary, whose keys correspond to index labels of my dataframe and whose values are lists of its column labels.
d = {"green":["bear","dog"], "yellow":["bear"], "red":["bear"]}
I filled my dataframe according to the keys and values of the dictionary, using:
for k, v in d.items():
    for x in v:
        df_test.loc[k, x] = 1
My problem is that the real dataframe and dictionary I'm working with are very large, and this loop takes too long to run. Is there a more efficient way to do it? Maybe iterating over the rows of the dataframe instead of the keys and values of the dictionary?
Because performance is important, use MultiLabelBinarizer:
d = {"green":["bear","dog"], "yellow":["bear"], "red":["bear"]}
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(list(d.values())),
                  columns=mlb.classes_,
                  index=list(d.keys()))
print(df)
        bear  dog
green      1    1
yellow     1    0
red        1    0
Then add the missing columns and index labels with DataFrame.reindex:
df_test = df.reindex(columns=df_test.columns, index=df_test.index, fill_value=0)
print(df_test)
        bear  dog  cat
green      1    1    0
yellow     1    0    0
red        1    0    0
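If adding scikit-learn as a dependency is not desirable, a similar result can probably be obtained with pandas and NumPy alone. The following is only a sketch, not a benchmarked replacement: it assumes the index and column labels of df_test are unique, and the names rows, cols, row_pos, col_pos, keep, values and result are introduced here purely for illustration. Labels from the dictionary that do not exist in the dataframe are skipped.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame(data=0, index=["green", "yellow", "red"],
                       columns=["bear", "dog", "cat"])
d = {"green": ["bear", "dog"], "yellow": ["bear"], "red": ["bear"]}

# Flatten the dictionary into two parallel lists of row and column labels.
rows = [k for k, v in d.items() for _ in v]
cols = [x for v in d.values() for x in v]

# Translate labels into integer positions; get_indexer returns -1 for
# labels that are not present (requires unique index/column labels).
row_pos = df_test.index.get_indexer(rows)
col_pos = df_test.columns.get_indexer(cols)

# Keep only the pairs where both labels were found.
keep = (row_pos >= 0) & (col_pos >= 0)

# Set all matching cells to 1 in one vectorized assignment on a copy.
values = df_test.to_numpy(copy=True)
values[row_pos[keep], col_pos[keep]] = 1

result = pd.DataFrame(values, index=df_test.index, columns=df_test.columns)
print(result)
```

The only Python-level loops left are the two list comprehensions that flatten the dictionary; the assignment itself happens in a single NumPy indexing operation instead of one .loc call per cell.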