Python: processing a string from each row of loaded JSON

I have JSON tweet data in which each tweet usually starts with a Twitter handle.

import pandas as pd
data = pd.DataFrame(pd.read_json(filename, orient=columnName),columns=columnName)

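I can load and index the tweet data with pandas, but I would like to know how to process each row intelligently so that the handle is removed only when it sits at the start of the tweet (every other occurrence of the handle should be left alone).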

data['full_text']

Example tweets:

@ABC hi there, how much for an apple
@ABC hi there, how much for an orange
@ABC hi there, how much @ABC for an pineapple
hi there @ABC, how much for an car
@ABC hi there, how much for an tree

would become:

hi there, how much for an apple
hi there, how much for an orange
hi there, how much @ABC for an pineapple
hi there @ABC, how much for an car
hi there, how much for an tree

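There is the iterrows() command, but from what I have read it is not recommended for modifying data; it is meant more for things like printing rows, e.g.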

===================

import re

for datum in data['full_text']:
    print(datum)
    datum=re.sub("@ABC", "",datum,1)
    print(datum)

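I also have the loop above, but isn't that bad practice? The examples I see in the console look fine, although I cannot verify that it holds up if I have a million records.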

You can use replace, where ^ matches the start of the string and \s+ matches one or more whitespace characters:

data = pd.read_json(filename, orient=columnName)
# raw string so that \s reaches the regex engine unchanged
data['full_text'] = data['full_text'].replace(r'^@ABC\s+', '', regex=True)
print(data)
                                  full_text
0           hi there, how much for an apple
1          hi there, how much for an orange
2  hi there, how much @ABC for an pineapple
3        hi there @ABC, how much for an car
4            hi there, how much for an tree
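
To try this without the JSON file, here is a minimal sketch that builds the DataFrame by hand from the sample tweets in the question (the in-line list and the name `cleaned` are assumptions for illustration, not part of the original code):

```python
import pandas as pd

# Sample tweets copied from the question, built in-line instead of via read_json.
tweets = [
    "@ABC hi there, how much for an apple",
    "@ABC hi there, how much for an orange",
    "@ABC hi there, how much @ABC for an pineapple",
    "hi there @ABC, how much for an car",
    "@ABC hi there, how much for an tree",
]
data = pd.DataFrame({"full_text": tweets})

# ^ anchors the match to the start of each string, so any later
# @ABC inside the tweet is left untouched.
data["full_text"] = data["full_text"].replace(r"^@ABC\s+", "", regex=True)

cleaned = data["full_text"].tolist()
print(cleaned)
```

Because the replacement is applied to the whole column at once, no explicit loop over rows is needed.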
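Alternatively, to strip any leading handle rather than only @ABC, use str.replace with a more general pattern (regex=True is required on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):

data['full_text'] = data['full_text'].str.replace(r'^(?:\@[^\s]+)\s*', '', regex=True)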
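
The general-handle pattern can be checked in the same way. This is only a sketch: the tweets and the handle @XYZ_1 below are made up for illustration, and `\S+` is used as a shorter spelling of `[^\s]+`.

```python
import pandas as pd

# Hypothetical tweets with different leading handles, for illustration only.
s = pd.Series([
    "@ABC hi there, how much for an apple",
    "@XYZ_1 hello @ABC again",
    "hi there @ABC, how much for an car",
])

# regex=True is passed explicitly so the pattern is treated as a regular
# expression regardless of the pandas version's default.
result = s.str.replace(r"^@\S+\s*", "", regex=True).tolist()
print(result)
```

Only a handle at position 0 is removed; handles that appear later in the text stay in place.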