Speeding up Python when using nested for loops and if statements

Question

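I have a CSV file with a column named "Authors". Each row in that column holds several authors, separated by commas. In the code below, the function getAuthorNames collects all the author names in that column and returns an array containing all of them.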

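The function authCount then counts how many times each individual name appears in the "Authors" column. At first I ran it on a few hundred rows without any problems. Now I am trying to run it on more than 20,000 rows; it has been running for several hours and still has not produced a result. I believe the nested for loops and the if statement are what make it take so long. Any suggestions on how to speed this up would be helpful. Should I be using a lambda? Is there a built-in pandas function that could help?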

This is what the input data looks like:

Title,Authors,ID
XXX,"Wang J, Wang H",XXX
XXX,"Wang J,Han H",XXX

This is what the output should look like:

Author,Count
Wang J,2
Wang H,1
Han H,1

Here is the code:

    import pandas as pd


    df = pd.read_csv (r'C:\Users\amos.epelman\Desktop\Pubmedpull3GC.csv')


    def getAuthorNames(dataFrame):
        arrayOfAuthors = []
        numRows = dataFrame.shape[0]

        cleanDF = dataFrame.fillna("0")

        for i in range (0,numRows):
            miniArray = cleanDF.at[i,"Authors"].split(",")
            arrayOfAuthors += miniArray
    
        return arrayOfAuthors


    def authCount(dataFrame):
        authArray = getAuthorNames(dataFrame)
        numAuthors = len(authArray)
        countOfAuth = [0] * numAuthors

        newDF = pd.DataFrame({"Author Name": authArray, "Count": countOfAuth})
        refDF = dataFrame.fillna("0")


        numRows= refDF.shape[0]


        for i in range (0,numAuthors):
            for j in range (0,numRows):
                if newDF.at[i, "Author Name"] in refDF.at[j,"Authors"]:
                    newDF.at[i,"Count"] += 1
            
        sortedDF = newDF.sort_values(["Count"], ascending = False)

        noDupsDF = sortedDF.drop_duplicates(subset ="Author Name", keep = False)

        return noDupsDF




    finalDF = authCount(df)
    file_name = 'GC Pubmed Pull3 Author Names with Count.xlsx'
    finalDF.to_excel(file_name)

Answer 1

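You could try using a Counter and a lambda function to eliminate the nested for loops over the two DataFrames, which seems like a slow way to add the new column.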

from collections import Counter

Then build the "Count" column:

author_counts = Counter(list(refDF["Authors"]))

newDF["Count"] = newDF.apply(lambda r: author_counts[r["Author Name"]], axis=1)
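One caveat worth noting: a Counter built directly from the column counts whole cell strings such as "Wang J, Wang H", not individual names, so a lookup by a single name may come back as 0. The following is a minimal sketch, not the answerer's code, of splitting each cell before counting. It assumes names are separated by commas as in the question's sample; the list `author_cells` is a hypothetical stand-in for the values of the "Authors" column.

```python
from collections import Counter

# Hypothetical stand-in for the "Authors" column values (sample from the question).
author_cells = ["Wang J, Wang H", "Wang J,Han H"]

author_counts = Counter()
for cell in author_cells:
    # Split each cell into individual names and trim surrounding spaces.
    author_counts.update(name.strip() for name in cell.split(","))

print(author_counts)
```

With a DataFrame, something like `df["Authors"].dropna()` could presumably be iterated in place of `author_cells`. This makes a single pass over the rows instead of comparing every author against every row.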

Answer 2

# take series of authors and split at comma and expand into dataframe
authors = df['author'].str.split(pat=',', expand=True)
authors.melt().value_counts()

I'm not sure whether it's faster, but this should give you the unique values and their counts.

Input:

x y z author book
0 0 0 aa,bb,cc l
0 0 0 a,b,c l
0 0 0 aa,bb,c l
0 0 0 aa,b,c l

Output:

variable  value
0         aa       3
2         c        3
1         b        2
          bb       2
0         a        1
2         cc       1
dtype: int64

Update:
This solution, which sorts the output but does not save it to a file, gives the following with %%timeit:
7.03 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@ZachYoung's solution, without sorting and without saving the output, gives the following with %%timeit:
5.64 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I ran this on a test file containing 8,000 names.
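In the output shown above, the counts are grouped by column position (`variable`) as well as by name, so a name that appears at different positions in different rows would be split across several entries. As a hedged alternative sketch, assuming the same `author` column and comma separator, the names can be counted regardless of position with `Series.str.split`, `Series.explode`, and `Series.value_counts`. The sample frame below is made up to mirror the input above.

```python
import pandas as pd

# Hypothetical sample mirroring the "author" column of the input shown above.
df = pd.DataFrame({"author": ["aa,bb,cc", "a,b,c", "aa,bb,c", "aa,b,c"]})

name_counts = (
    df["author"]
    .dropna()
    .str.split(",")   # each cell becomes a list of names
    .explode()        # one row per individual name
    .str.strip()      # drop stray spaces around names
    .value_counts()   # counts per name, highest first
)
print(name_counts)
```

The result is a Series indexed by name, which could be turned into a two-column table with `name_counts.reset_index()` before saving.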

Answer 3

You can do this with the csv reader and the Counter class from collections, both part of Python's standard library.

I made a sample CSV with 20K rows of randomly generated names, as you described, random_names.csv:

Authors
"Darnel D, Blythe B"
"Wang H, Darnel D, Alice A"
"Wang J, Wang H, Darnel D, Blythe B"
"Han H, Wang J"
"Clarice C, Wang H, Darnel D, Alice A"
"Clarice C, Han H, Blythe B, Wang J"
"Clarice C, Darnel D, Blythe B"
"Clarice C, Wang H, Blythe B"
"Blythe B, Wang J, Darnel D"
...

My code doesn't sort, but it marks where a sort could be inserted. It runs in under a second (on my M1 MacBook Air):

import csv
from collections import Counter

author_counts = Counter()

with open('random_names.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header

    for row in reader:
        authors = row[0]  # !! adjust for your data
        for author in authors.split(','):
            author_counts.update([author.strip()])

# Sort here
print(author_counts.items())

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Author','Count'])
    writer.writerows(author_counts.items())

It prints this debug line:

dict_items([('Darnel D', 10690), ('Blythe B', 10645), ('Wang H', 10881), ('Alice A', 10750), ('Wang J', 10613), ('Han H', 10814), ('Clarice C', 10724)])

And saves this as output.csv:

Author,Count
Darnel D,10690
Blythe B,10645
Wang H,10881
Alice A,10750
Wang J,10613
Han H,10814
Clarice C,10724
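For the sorting step marked with "# Sort here", one option (a sketch, not necessarily what the answerer had in mind) is `Counter.most_common()`, which returns the (name, count) pairs ordered from highest to lowest count. The counts below are copied from the debug line above.

```python
from collections import Counter

# Counts copied from the debug line printed by the answer's code.
author_counts = Counter({
    "Darnel D": 10690, "Blythe B": 10645, "Wang H": 10881,
    "Alice A": 10750, "Wang J": 10613, "Han H": 10814, "Clarice C": 10724,
})

# most_common() returns (name, count) pairs, highest count first.
sorted_counts = author_counts.most_common()
print(sorted_counts[:3])
```

Passing `sorted_counts` to `writer.writerows(...)` in place of `author_counts.items()` should then write the rows in descending order of count.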
