How to count the frequency of specific Tags from the Stack Overflow data dump's CSV file in Python

Question

I recently downloaded the stackoverflow.com-Posts.7z file from the Stack Exchange Data Dump. Upon extracting the .7z file, I was left with a Posts.xml file, which I converted to a Posts.csv file using the "stackexchange-xml-converter" tool from GitHub. The Posts.csv file contains every post made across the entire Stack Overflow website. Its total size is about 67 GB, so it is too large to open in Microsoft Excel, Visual Studio Code, Notepad, and similar programs.

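Each row in the CSV file (except the first row, which is the header row) holds all the data associated with one, and only one, particular post. For example, here are just some of the data categories associated with each post: Title, Tags, ContentLicense, ViewCount, CommentCount, CreationDate, and so on. Each data category is its own column in the CSV file. Here is a picture of what it looks like:

[screenshot of Posts.csv not reproduced]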

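My problem is that I am trying to count, in Python, the frequency of specific tags of interest in the Posts.csv file, given a list. For example, suppose I have the following list in Python: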

tagsOfInterest = ['version-control', 'git', 'git-merge', 'bash', 'microservices']

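Looking only at the Tags column of the CSV file, I want to count how many times the tag version-control appears, how many times the tag git appears, how many times the tag git-merge appears, and so on.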

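I have been struggling with this because, as you will notice, each row in the Tags column is formatted as one continuous string, with each tag enclosed in angle brackets (< and >). For example, in the first row, the post is tagged <version-control><projects-and-solutions><monorepo>.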
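As a minimal sketch of how one such Tags string can be split into individual tags (the helper name split_tags is hypothetical and not part of the original post):

```python
def split_tags(tag_string):
    """Split a Tags field such as '<git><bash>' into ['git', 'bash']."""
    # Drop surrounding whitespace, then the outermost '<' and '>'.
    inner = tag_string.strip().strip('<>')
    # An empty field yields no tags; otherwise split on the '><' boundaries.
    return inner.split('><') if inner else []


print(split_tags('<version-control><projects-and-solutions><monorepo>'))
# ['version-control', 'projects-and-solutions', 'monorepo']
```

Because tag names themselves never contain < or >, stripping those characters from both ends and splitting on >< is enough; no regular expression is needed.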

My initial attempt was to first read the Posts.csv file and then add each row in the Tags column to a list, like so:

from pandas import *
import csv

# Read data
data = read_csv("Posts.csv")

# Add each row in the "Tags" column to a list:
tags_col = data['Tags'].tolist()

My idea was then to tokenize each tag. However, the Posts.csv file is so large that my computer ran out of memory just creating the list!

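Therefore, my question is: given a list of tags of interest, such as tagsOfInterest = ['version-control', 'git', 'git-merge', 'bash', 'microservices'], how can I count the frequency of each element of that list in the Tags column of the Posts.csv file?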

Answer 1

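import csv
from collections import Counter

counts = Counter()
# Stream the file row by row so it never has to fit in memory.
# The encoding is an assumption about the converter's output.
with open('Posts.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        # row[1] is assumed to be the Tags column, e.g. "<git><bash>"
        tags = row[1].strip('<>')
        if tags:  # skip rows whose Tags field is empty
            counts.update(tags.split('><'))
print(counts)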

You can use DictReader if you prefer, with row['Tags'] instead of row[1].
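Building on that suggestion, here is a hedged sketch of a DictReader variant that reports only the tags of interest. The function name count_tags_of_interest and its parameters are illustrative, the sample rows are made up, and the code assumes the header row names the column Tags, as in the question.

```python
import csv
import io
from collections import Counter


def count_tags_of_interest(csv_file, tags_of_interest):
    """Count how often each tag of interest appears in the Tags column.

    csv_file is any open text file (or file-like object) whose header row
    contains a 'Tags' column; rows are read one at a time.
    """
    wanted = set(tags_of_interest)
    # Start every tag of interest at 0 so absent tags are still reported.
    counts = Counter({tag: 0 for tag in tags_of_interest})
    for row in csv.DictReader(csv_file):
        tags = (row.get('Tags') or '').strip('<>')
        if not tags:
            continue  # nothing to count for an empty Tags field
        for tag in tags.split('><'):
            if tag in wanted:
                counts[tag] += 1
    return counts


# Small in-memory sample standing in for Posts.csv (made-up rows).
sample = io.StringIO(
    'Id,Tags\n'
    '1,<version-control><projects-and-solutions><monorepo>\n'
    '2,<git><git-merge>\n'
    '3,\n'
    '4,<git><bash>\n'
)
tagsOfInterest = ['version-control', 'git', 'git-merge', 'bash', 'microservices']
result = count_tags_of_interest(sample, tagsOfInterest)
for tag in tagsOfInterest:
    print(tag, result[tag])
```

For the real file, one would pass open('Posts.csv', newline='', encoding='utf-8') instead of the sample; the encoding is an assumption about the converter's output. Memory use stays small because only the counts for the listed tags are kept, not the rows themselves.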

python

csv

list

frequency

stackexchange-api