Is it possible to replace iterrows with something faster (e.g. apply) in my case?

Question

I have a table that looks like this:

0	1	2	3	4	5
rs10000911	4	144136193	100.000000	-	AC
rs10000988	4	76010255	99.173554	-	AG
rs10002181	4	142250415	100.000000	+	AG
rs10005140	4	22365603	99.173554	+	AG
rs10005242	4	5949558	100.000000	+	AG

Now I want to create an extra column or a Series containing the combination of columns 1 and 2, like this: 4:144136193, 4:76010255, 4:142250415, and so on. Right now I am using an iterrows solution:

new_column = pd.Series([])
for index, line in table.iterrows():
    new_column = new_column.append(pd.Series(str(line[1])+':'+str(line[2])))

Because my table contains 800,000 rows, iterrows is very slow. Is there any way to speed this up?

Answer 1

You can do this:

new_col = table.apply(lambda line: pd.Series(str(line[1])+':'+str(line[2])),axis=1)

This gives you a new DataFrame, new_col:

             0
0  4:144136193
1   4:76010255
2  4:142250415
3   4:22365603
4    4:5949558

(If you only want a Series rather than a DataFrame, new_col[0] will give you one.)
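
If a Series is the goal, one possible shortcut is to have the lambda return a plain string instead of a one-element pd.Series, since apply with axis=1 then yields a Series directly. The sketch below rebuilds the five sample rows by hand; the integer column labels 0 to 5 are an assumption based on the table shown in the question.

```python
import pandas as pd

# Sample rows from the question; columns get integer labels 0..5 (assumed).
table = pd.DataFrame([
    ["rs10000911", 4, 144136193, 100.000000, "-", "AC"],
    ["rs10000988", 4, 76010255, 99.173554, "-", "AG"],
    ["rs10002181", 4, 142250415, 100.000000, "+", "AG"],
    ["rs10005140", 4, 22365603, 99.173554, "+", "AG"],
    ["rs10005242", 4, 5949558, 100.000000, "+", "AG"],
])

# Returning a string per row makes apply produce a Series, not a DataFrame.
new_col = table.apply(lambda line: f"{line[1]}:{line[2]}", axis=1)
print(new_col)
```

This still calls a Python function once per row, so it mainly avoids the repeated append from the original loop rather than the per-row overhead itself.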

Answer 2

df["new"] = df[["1", "2"]].apply(lambda x: ":".join(map(str, x)), axis=1)
print(df)

Or:

df["new"] = df[["1", "2"]].astype(str).apply(":".join, axis=1)
print(df)

Prints:

            0  1          2           3  4   5          new
0  rs10000911  4  144136193  100.000000  -  AC  4:144136193
1  rs10000988  4   76010255   99.173554  -  AG   4:76010255
2  rs10002181  4  142250415  100.000000  +  AG  4:142250415
3  rs10005140  4   22365603   99.173554  +  AG   4:22365603
4  rs10005242  4    5949558  100.000000  +  AG    4:5949558
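
Note that this answer selects the columns by the string labels "1" and "2", while the other answers use the integers 1 and 2. Which form applies depends on how the table was loaded. A minimal sketch, assuming the data comes from tab-separated text with no header row, in which case pandas assigns integer labels:

```python
import io

import pandas as pd

# Tab-separated sample data without a header line (an assumption about the input).
raw = (
    "rs10000911\t4\t144136193\t100.000000\t-\tAC\n"
    "rs10000988\t4\t76010255\t99.173554\t-\tAG\n"
    "rs10002181\t4\t142250415\t100.000000\t+\tAG\n"
    "rs10005140\t4\t22365603\t99.173554\t+\tAG\n"
    "rs10005242\t4\t5949558\t100.000000\t+\tAG\n"
)

# header=None makes pandas label the columns 0, 1, 2, ... as integers.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# With integer labels, select [1, 2] rather than ["1", "2"].
df["new"] = df[[1, 2]].astype(str).apply(":".join, axis=1)
print(df)
```

If df[["1", "2"]] raises a KeyError on your data, checking df.columns should show whether the labels are integers or strings.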

Answer 3

You can do this:

df['new_column'] = df[1].astype(str) + ":" + df[2].astype(str)
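
This version works on whole columns at once, so it avoids calling a Python function for every row. To check the difference on your own machine, a rough timing sketch along these lines may help. The synthetic data, the row count n, and the timed helper are all assumptions made for illustration, and the timings will vary by machine and pandas version. Series.str.cat is included as another column-wise option.

```python
import time

import numpy as np
import pandas as pd

n = 100_000  # hypothetical size; raise to 800_000 to mirror the question
rng = np.random.default_rng(0)
df = pd.DataFrame({
    1: rng.integers(1, 23, size=n),           # made-up chromosome numbers
    2: rng.integers(1, 250_000_000, size=n),  # made-up positions
})


def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Row-wise: one Python call per row.
via_apply, t_apply = timed(
    lambda: df[[1, 2]].astype(str).apply(":".join, axis=1)
)
# Column-wise: string concatenation on whole Series.
via_concat, t_concat = timed(
    lambda: df[1].astype(str) + ":" + df[2].astype(str)
)
# Column-wise alternative using Series.str.cat.
via_cat, t_cat = timed(
    lambda: df[1].astype(str).str.cat(df[2].astype(str), sep=":")
)

print(f"apply:   {t_apply:.3f} s")
print(f"concat:  {t_concat:.3f} s")
print(f"str.cat: {t_cat:.3f} s")
```

All three approaches should produce the same strings; only the way the work is dispatched differs.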

Tags: python, bioinformatics, dataframe, pandas