I am working with a fairly large dataset that contains many rows with similar names.
Here is the code I have been using so far:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset and index it by the small-molecule ID
df = pd.read_csv("dataset_20001_20180801113759.csv")
df = df.set_index(["Small Molecule HMS LINCS ID"])

# Select every row belonging to one small molecule
Chosen_SmallMoleculeName="10104-101-1"
df2 = df.loc[Chosen_SmallMoleculeName, ["Cell count", "% Apoptotic cells"]]
df3 = df2.loc[Chosen_SmallMoleculeName, "Cell count"]

# Print each column of interest for that molecule
df4 = df.loc[Chosen_SmallMoleculeName, "Cell count"]
print("Cell count")
print(df4.values)
df5 = df.loc[Chosen_SmallMoleculeName, "% Apoptotic cells"]
print("% Apoptotic cells")
print(df5.values)
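With this, it prints the entire "Cell count" and "% Apoptotic cells" columns, which are too large to copy and paste here. From the image above, I would like to get only the specific data in rows 2-7.
The dataset is available here: http://lincs.hms.harvard.edu/db/datasets/20001/results
Question 1: How do I select only rows 2 to 7 of "Cell count" and "% Apoptotic cells"?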
Question 2 (not as important, but I am wondering): Is it possible to do this "dynamically"? That is, instead of manually looking at each row to find the unique or related ones, is it possible to write code that prints rows 2-7 but can also pick out, say, rows 14 to 19 on its own? I feel this would be delving into machine learning territory...
I have looked at the Python API and have not found a similar question.
To retrieve rows 2 to 7 you can use slicing, keeping in mind that you have to subtract 1 for the header and another 1 because indexing starts from 0:
result = df[:6][["Cell count", "% Apoptotic cells"]]
With the result being:
   Cell count  % Apoptotic cells
0         576              60.59
1         373              79.09
2         436              56.19
3         654              43.88
4         284              58.10
5         574              41.81
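The row arithmetic in this answer can be spelled out with `iloc`. What follows is only a minimal sketch: the helper name `rows_by_sheet_number` is hypothetical, the ID column values and the last two data rows are made up, and it assumes the frame still has its default integer index (that is, the slice is taken before `set_index` is called).

```python
import pandas as pd

# Stand-in for the real CSV. The first six rows reuse the values printed above;
# the IDs and the last two rows are made up for illustration.
df = pd.DataFrame({
    "Small Molecule HMS LINCS ID": ["A"] * 3 + ["B"] * 5,
    "Cell count": [576, 373, 436, 654, 284, 574, 610, 498],
    "% Apoptotic cells": [60.59, 79.09, 56.19, 43.88, 58.10, 41.81, 50.00, 47.30],
})


def rows_by_sheet_number(frame, first, last, columns):
    """Hypothetical helper: select spreadsheet rows first..last (inclusive).

    Spreadsheet row 1 is the header, so data row n sits at position n - 2.
    """
    start = first - 2  # subtract 1 for the header and 1 for zero-based positions
    stop = last - 1    # exclusive end of the slice: (last - 2) + 1
    return frame.iloc[start:stop][columns]


result = rows_by_sheet_number(df, 2, 7, ["Cell count", "% Apoptotic cells"])
print(result)
```

Calling the same helper with `14, 19` would return rows 14 to 19 of a frame that has that many rows, so the row range does not need to be hard-coded in the slice.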
Now, if you explain more thoroughly which properties you are interested in extracting from this dataset, we can help you with that as well.
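For Question 2, one possible reading is "give me the block of rows belonging to each small molecule without counting rows by hand". That reading is a guess at the intent. Under that assumption, `groupby` on the ID column does the bookkeeping and no machine learning is needed. The data below is made up (including the second ID, `10105-101-1`); only the column names come from the question.

```python
import pandas as pd

# Made-up data; only the column names and the first ID come from the question.
df = pd.DataFrame({
    "Small Molecule HMS LINCS ID": ["10104-101-1"] * 3 + ["10105-101-1"] * 2,
    "Cell count": [576, 373, 436, 654, 284],
    "% Apoptotic cells": [60.59, 79.09, 56.19, 43.88, 58.10],
})

columns = ["Cell count", "% Apoptotic cells"]

# Each group holds all rows that share one ID, wherever they sit in the file.
blocks = {
    molecule_id: group[columns]
    for molecule_id, group in df.groupby("Small Molecule HMS LINCS ID")
}

for molecule_id, block in blocks.items():
    print(molecule_id)
    print(block)
```

If the rows you want are defined by something other than the ID (a dose, a time point, a cell line), the same pattern should apply with that column passed to `groupby` instead.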