Save Tweets as .csv, Contains String Literals and Entities

Question

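I have tweets saved in JSON text files. A friend wants the tweets that contain certain keywords, and the tweets need to be saved in a .csv. Finding the tweets is easy, but I ran into two problems and am struggling to find a good solution.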

Sample data is here. I've included the .csv file that doesn't work as well as a file in which each line is a tweet in JSON format.

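To get the tweets into a DataFrame, I use pd.io.json.json_normalize. It works smoothly and handles the nested dictionaries well, but pd.to_csv doesn't work because, as far as I can tell, it doesn't handle string literals well. Some tweets contain '\n' in the text field, and pandas writes a new line when that happens.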

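No problem, I process pd['text'] to remove '\n'. The resulting file still has too many rows: 1863 compared to the 1388 it should have. I then modified my code to replace all string literals: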

tweets['text'] = [item.replace('\n', '') for item in tweets['text']]
tweets['text'] = [item.replace('\r', '') for item in tweets['text']]
tweets['text'] = [item.replace('\\', '') for item in tweets['text']]
tweets['text'] = [item.replace('\'', '') for item in tweets['text']]
tweets['text'] = [item.replace('\"', '') for item in tweets['text']]
tweets['text'] = [item.replace('\a', '') for item in tweets['text']]
tweets['text'] = [item.replace('\b', '') for item in tweets['text']]
tweets['text'] = [item.replace('\f', '') for item in tweets['text']]
tweets['text'] = [item.replace('\t', '') for item in tweets['text']]
tweets['text'] = [item.replace('\v', '') for item in tweets['text']]

Same result: pd.to_csv saves a file with more rows than there are tweets. I could replace the string literals in all columns, but that is clunky.
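One possible explanation for the row-count mismatch, offered as a sketch rather than a diagnosis of the actual data: a CSV writer encloses a field that contains a line break in quotes, so a single record can span several physical lines. Counting lines in a text editor then overstates the number of records, while a CSV-aware reader still returns the right count. The sample rows below are made up for illustration, and the standard csv module stands in for pandas.

```python
import csv
import io

# Hypothetical sample rows; the second one has a line break inside the text field.
rows = [
    {"id": "1", "text": "plain tweet"},
    {"id": "2", "text": "first line\nsecond line"},
]

# Write the rows to an in-memory CSV "file".
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "text"])
writer.writeheader()
writer.writerows(rows)

raw = buffer.getvalue()

# Count physical lines the way a text editor would (header included).
physical_lines = raw.splitlines()

# Parse the same text with a CSV-aware reader, which honors the quotes.
records = list(csv.DictReader(io.StringIO(raw, newline="")))

print("physical lines:", len(physical_lines))
print("records:", len(records))
```

Here the text has four physical lines, header included, but only two records, and the embedded line break survives the round trip.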

Fine, don't use pandas. with open(outpath, 'w') as f: and so on creates a .csv file with the correct number of rows. However, reading the file back fails, whether with pd.read_csv or line by line.

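It fails because of how Twitter handles entities. If a tweet's text contains a URL, mention, hashtag, media, or link, Twitter returns a dictionary that contains commas. When pandas flattens the tweet, the commas are preserved within a single column, which is good. But when the data is read back in, pandas splits what should be one column into multiple columns. For example, a column might look like [{'screen_name': 'ProfOsinbajo', 'name': 'Prof Yemi Osinbajo', 'id': 2914442873, 'id_str': '2914442873', 'indices': [0, 13]}], so splitting on commas creates too many columns: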

 [{'screen_name': 'ProfOsinbajo',
 'name': 'Prof Yemi Osinbajo',
 'id': 2914442873",
 'id_str': '2914442873'",
 'indices': [0,
 13]}]

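I get the same result with with open(outpath) as f:. With that approach I have to split each line myself, so I split on commas. Same problem: I don't want to split on commas that appear inside a list.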
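The column-splitting problem can be illustrated with the standard csv module. This is a sketch assuming the file was written by a CSV writer, which quotes any field that contains the delimiter. The value `entity` is made up and modeled on the example above, and `ast.literal_eval` is just one way to turn the stored string back into a list.

```python
import ast
import csv
import io

# Made-up entity value, stored as the string representation of a list of dicts.
entity = str([{"screen_name": "ProfOsinbajo", "id": 2914442873, "indices": [0, 13]}])

# Write one row with two columns: the tweet text and the entity string.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(["some tweet text", entity])
line = buffer.getvalue()

# Naive splitting breaks the entity apart at every comma.
naive_fields = line.split(",")

# A CSV-aware reader honors the quotes and returns exactly two fields.
parsed_fields = next(csv.reader(io.StringIO(line, newline="")))

# Recover the original Python structure from the second column.
mentions = ast.literal_eval(parsed_fields[1])

print(len(naive_fields), len(parsed_fields))
print(mentions[0]["screen_name"])
```

The point of the comparison is that the commas inside the entity are not a problem in themselves; they only cause trouble when the line is split without regard to the quoting.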

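I want this data to be treated as one column when it is saved to a file or read from a file. What am I missing? In terms of the data in the repository above, I want to convert forWhosebug2.txt into a .csv with as many rows as there are tweets. Call this file A.csv, and say it has 100 columns. When opened, A.csv should also have 100 columns.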

I'm sure I've left out some details, so please let me know.

Answer 1

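Using the csv module works. The script below writes the file as a .csv while counting the lines, then reads it back and counts the lines again.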

The results match, and opening the .csv in Excel also shows 191 columns and 1338 rows of data.

import json
import csv

with open('forWhosebug2.txt') as f,\
     open('out.csv','w',encoding='utf-8-sig',newline='') as out:
    data = json.loads(next(f))
    print('columns',len(data))
    writer = csv.DictWriter(out,fieldnames=sorted(data))
    writer.writeheader() # write header
    writer.writerow(data) # write the first line of data
    for i,line in enumerate(f,2): # start line count at two
        data = json.loads(line)
        writer.writerow(data)
    print('lines',i)

with open('out.csv',encoding='utf-8-sig',newline='') as f:
    r = csv.DictReader(f)
    lines = list(r)
    print('readback columns',len(lines[0]))
    print('readback lines',len(lines))

Output:

columns 191
lines 1338
readback columns 191
readback lines 1338
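The script above takes the column names from the first tweet only. If a later tweet carried a key that the first one lacks, csv.DictWriter would raise a ValueError by default. A possible variation, assuming the data fits in memory, collects the union of keys first and fills the gaps with an empty string via restval. The function name `write_rows` and the sample rows are hypothetical.

```python
import csv
import io

def write_rows(rows, out):
    """Write dicts with differing keys to `out` as CSV; return the column names."""
    # Union of all keys, sorted for a stable column order.
    fieldnames = sorted({key for row in rows for key in row})
    # restval supplies the value for keys that a given row does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames

# Hypothetical flattened tweets; the second has an extra key.
sample = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": "hi, there", "user.location": "Lagos"},
]

out = io.StringIO(newline="")
columns = write_rows(sample, out)

# Read the result back to confirm the row count and the filled-in gap.
readback = list(csv.DictReader(io.StringIO(out.getvalue(), newline="")))
print(columns, len(readback))
```

Whether this is needed depends on the data; the output reported above suggests every tweet in the sample file has the same 191 keys.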

Answer 2

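@Mark Tolonen's answer was helpful, but I ended up going a different route. When saving the tweets to file, I removed all \r, \n, \t, and \0 characters anywhere in the JSON. Then I saved the file as tab-separated so that commas in fields like location or text don't confuse a read function.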
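A rough sketch of that approach, not the author's actual code. It assumes the fourth character in the list is the NUL character `\0`; the helper `strip_control` and the sample record are made up, and the cleaning is applied to every string value before writing with a tab delimiter.

```python
import csv
import io

# Map the four characters to None so that str.translate deletes them.
CONTROL_CHARS = str.maketrans("", "", "\r\n\t\0")

def strip_control(value):
    """Remove CR, LF, tab and NUL from strings; leave other values untouched."""
    return value.translate(CONTROL_CHARS) if isinstance(value, str) else value

# Hypothetical flattened tweet with a line break, a tab and commas.
tweet = {
    "id": 1,
    "text": "line one\nline two\twith, commas",
    "user.location": "Abuja, Nigeria",
}
cleaned = {key: strip_control(value) for key, value in tweet.items()}

# Write tab-separated output; commas no longer act as delimiters.
out = io.StringIO(newline="")
writer = csv.DictWriter(out, fieldnames=sorted(cleaned), delimiter="\t")
writer.writeheader()
writer.writerow(cleaned)

tsv_lines = out.getvalue().splitlines()
print(len(tsv_lines))
print(tsv_lines[1].split("\t"))
```

Because tabs and line breaks are removed before writing, each record occupies exactly one physical line and a plain split on the tab character recovers the columns. The trade-off is that the original whitespace in the text is lost.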


python-3.x

csv

text

twitter

string-literals