Large dataset - choosing specific rows after having chosen columns

Question

I am using a fairly large dataset in which many rows have similar names.

This is the code I am using so far:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("dataset_20001_20180801113759.csv")
df = df.set_index(["Small Molecule HMS LINCS ID"])

Chosen_SmallMoleculeName="10104-101-1"
df2 = df.loc[Chosen_SmallMoleculeName, ["Cell count", "% Apoptotic cells"]]
df3 = df2.loc[Chosen_SmallMoleculeName, "Cell count"]

df4 = df.loc[Chosen_SmallMoleculeName, "Cell count"]
print("Cell count")
print(df4.values)

df5 = df.loc[Chosen_SmallMoleculeName, "% Apoptotic cells"]
print("% Apoptotic cells")
print(df5.values)

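With this, it prints the entire "Cell count" and "% Apoptotic cells" columns, which is too much to copy and paste here. From the image above, I want to get only the specific data in rows 2-7.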

The dataset can be obtained here: http://lincs.hms.harvard.edu/db/datasets/20001/results

Question 1: How do I select the specific data in rows 2 through 7 of "Cell count" and "% Apoptotic cells"?

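Question 2 (less important, but I would like to know): Is it possible to do this "dynamically"? That is, without my having to look through every row manually to find the unique or relevant ones, is it possible to write code that picks rows 2-7 to print, but also intuitively picks rows 14 to 19? I feel this would be getting into machine learning territory...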

I have looked at the Python API and have not found a similar question.

Answer 1

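To retrieve rows 2 through 7 you can use slicing, once you account for subtracting 1 for the header row and then another 1 because array indexing starts at 0. File row 2 becomes index 0 and file row 7 becomes index 5, and since the end of a slice is exclusive, the slice is [:6]: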

result = df[:6][["Cell count", "% Apoptotic cells"]]

The result is:

          Cell count       % Apoptotic cells
0         576              60.59
1         373              79.09
2         436              56.19
3         654              43.88
4         284              58.10
5         574              41.81
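
The same offset arithmetic can be wrapped in a small helper. This is only a minimal sketch and not part of the original answer: the function name rows_by_file_line and the toy DataFrame are assumptions made for illustration. The first six rows reuse the values printed above, and the last two rows are made up. It relies on DataFrame.iloc, which always slices by position, so it should behave the same after set_index has replaced the default integer index.

```python
import pandas as pd


def rows_by_file_line(df, first_line, last_line, columns):
    """Return the rows found on CSV lines first_line..last_line (1-based, inclusive).

    Assumes line 1 of the file is the header, so file line n is positional row n - 2.
    """
    start = first_line - 2     # -1 for the header, -1 for zero-based indexing
    stop = last_line - 2 + 1   # slice ends are exclusive, so add 1 back
    return df.iloc[start:stop][columns]


# Toy stand-in for the real CSV (hypothetical; the last two rows are invented)
toy = pd.DataFrame({
    "Small Molecule HMS LINCS ID": ["10104-101-1"] * 8,
    "Cell count": [576, 373, 436, 654, 284, 574, 500, 610],
    "% Apoptotic cells": [60.59, 79.09, 56.19, 43.88, 58.10, 41.81, 50.0, 47.5],
}).set_index("Small Molecule HMS LINCS ID")

result = rows_by_file_line(toy, 2, 7, ["Cell count", "% Apoptotic cells"])
print(result)
```

Because the helper slices by position, a call such as rows_by_file_line(df, 14, 19, [...]) would pick a different block of rows without any change to the arithmetic.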

Now, if you explain more thoroughly which properties you are interested in extracting from this dataset, we can help with that too.
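
In the meantime, one common way to avoid picking rows by hand (Question 2) is to select on a condition over the data instead of on row positions. The following is a rough sketch under stated assumptions: the toy DataFrame, the second ID "10105-101-1", and the 58.0 threshold are all invented for illustration, and the real selection criterion depends on what you want to extract.

```python
import pandas as pd

# Toy stand-in for the real CSV (hypothetical IDs and grouping)
toy = pd.DataFrame({
    "Small Molecule HMS LINCS ID": ["10104-101-1"] * 3 + ["10105-101-1"] * 3,
    "Cell count": [576, 373, 436, 654, 284, 574],
    "% Apoptotic cells": [60.59, 79.09, 56.19, 43.88, 58.10, 41.81],
})

# Build boolean masks that describe the rows of interest
is_chosen = toy["Small Molecule HMS LINCS ID"] == "10104-101-1"
is_high = toy["% Apoptotic cells"] > 58.0   # arbitrary threshold, an assumption

# .loc with a boolean mask keeps only the rows where the mask is True
selected = toy.loc[is_chosen & is_high, ["Cell count", "% Apoptotic cells"]]
print(selected)
```

This does not need machine learning: the rows are chosen by whatever rule you can express as a comparison on the columns.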

python

numpy

bioinformatics