Python parsing large CSV file for usernames

Question

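I have a very large CSV file (50k+ lines).

This file contains IRC logs, in the following format:

Column 1: message type (1 for a message, 2 for system)
Column 2: timestamp (number of seconds since a specific date)
Column 3: username of the person who wrote the message
Column 4: the message

Here is a sample of the data: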

1,1382445487956,"bob","i don't know how to do such a task"
1,1382025765196,"alice","bro ask stackoverflow"
1,1382454875476,"_XxCoder_killerxX_","I'm pretty sure it can be done with python, bob"
2,1380631520410,"helloman","helloman_ join the chan."

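Here, for example, _XxCoder_killerxX_ mentions bob.

Knowing all this, I want to find out which pair of usernames mention each other the most.

I only want to count messages, so I only need to process the lines that start with the number "1" (there are plenty of lines starting with "2" and other irrelevant numbers).

I know this can be done with Python's csv module, but I have never worked with a file this large, so I really don't know where to start.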

Answer 1

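You should make two passes over the CSV: a first pass to collect all sender usernames, and a second pass to find those usernames mentioned in the messages.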

import csv

users = set()

with open("test.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for line in reader:
        if not line or line[0] != "1":
            continue  # only rows of type 1 are messages
        users.add(line[2])

mentions = {}

with open("test.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for line in reader:
        if not line or line[0] != "1":
            continue  # only rows of type 1 are messages
        sender, message = line[2], line[3]
        for recipient in users:
            if recipient == sender:
                continue  # can't mention yourself
            if recipient in message:
                key = (sender, recipient)
                mentions[key] = mentions.get(key, 0) + 1

for mention, times in mentions.items():
    print(f"{mention[0]} mentioned {mention[1]} {times} time(s)")


totals = {}

for mention, times in mentions.items():
    key = tuple(sorted(mention))
    totals[key] = totals.get(key, 0) + times

for names, times in totals.items():
    print(f"{names[0]} and {names[1]} mentioned each other {times} time(s)")
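The loop above prints every pair. To answer the question as asked (which pair mentions each other the most), the totals dictionary can be reduced with max(). A minimal sketch, using a small hard-coded dictionary in place of the real one; the counts here are invented purely for illustration:

```python
# Hypothetical totals, shaped like the dictionary built above:
# keys are sorted (name, name) tuples, values are combined mention counts.
totals = {
    ("alice", "bob"): 3,
    ("_XxCoder_killerxX_", "bob"): 5,
}

# max() with key=totals.get returns the key whose value is largest.
top_pair = max(totals, key=totals.get)
print(f"{top_pair[0]} and {top_pair[1]}: {totals[top_pair]} mention(s)")
```

If several pairs tie for first place, max() returns only the first one it encounters.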

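This example is naive because it does simple substring matching. So if someone is called "foo" and someone else writes "food" in a message, it counts as a match.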
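One possible way to reduce such false positives is to match usernames only as whole tokens, using a regular expression instead of the in operator. The sketch below is an assumption about how this could be done, not part of the original answer; build_matcher is a hypothetical helper name.

```python
import re

def build_matcher(username):
    """Compile a pattern that matches username only as a whole token."""
    # re.escape guards against regex metacharacters in nicknames;
    # the lookarounds require that no word character touches the name.
    return re.compile(r"(?<!\w)" + re.escape(username) + r"(?!\w)")

foo = build_matcher("foo")
print(bool(foo.search("I like food")))      # False
print(bool(foo.search("thanks foo, bye")))  # True
```

In the second pass, `if recipient in message` would then become a call to the compiled pattern's search() method. Matching stays case-sensitive here; passing re.IGNORECASE to re.compile would relax that if nicknames should match regardless of case.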

Answer 2

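Here is a solution using pandas and sets. pandas greatly simplifies importing and manipulating the CSV data, and using sets lets {'alice', 'bob'} and {'bob', 'alice'} be counted as two occurrences of the same combination.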

import pandas as pd

df = pd.read_csv('sample.csv', header=None)
df.columns = ['id','timestamp','username','message']
df = df[df['id'] == 1]  # keep only message rows (type 1)

lst = []
for name in df.username.unique():  # check each username once
    for author, m in zip(df.username, df.message):
        if name != author and name in m:  # skip self-mentions
            lst.append({author, name})
most_freq = max(lst, key=lst.count)

print(most_freq)
#{'bob', '_XxCoder_killerxX_'}
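Note that max(lst, key=lst.count) rescans the whole list for every element, which gets slow as the number of mentions grows. A possible alternative, sketched here as an assumption rather than as part of the original answer, is to store each pair as a frozenset and count with collections.Counter. The pairs in found are made up for illustration.

```python
from collections import Counter

# Hypothetical (author, mentioned) pairs, as the loop above would find them.
found = [("bob", "alice"), ("alice", "bob"), ("_XxCoder_killerxX_", "bob")]

# frozenset is hashable, so each unordered pair can serve as a Counter key.
pair_counts = Counter(frozenset(pair) for pair in found)

pair, count = pair_counts.most_common(1)[0]
print(sorted(pair), count)  # ['alice', 'bob'] 2
```

A plain set cannot be used as a dictionary key, which is why frozenset is used here; the counting is then done in a single pass over the pairs.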
