python: processing string from each row of loaded json
I have tweet data in JSON, and each tweet usually starts with a Twitter handle.
import pandas as pd
data = pd.DataFrame(pd.read_json(filename, orient=columnName),columns=columnName)
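I can load and index the tweet data with pandas, but I would like to know how to process each row so that a handle at the start of a tweet is removed, while every other occurrence of it is left alone.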
data['full_text']
Example tweets:
@ABC hi there, how much for an apple
@ABC hi there, how much for an orange
@ABC hi there, how much @ABC for an pineapple
hi there @ABC, how much for an car
@ABC hi there, how much for an tree
would become:
hi there, how much for an apple
hi there, how much for an orange
hi there, how much @ABC for an pineapple
hi there @ABC, how much for an car
hi there, how much for an tree
There is the iterrows() command, but from what I have read it is not recommended for modifying data; it is meant more for things like printing rows, e.g.
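import re

for datum in data['full_text']:
    print(datum)
    # Removes the first "@ABC" wherever it occurs; the result is not written back to data
    datum = re.sub("@ABC", "", datum, count=1)
    print(datum)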
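I also have the loop above, but isn't it bad practice? The examples I see in the console look fine, although I cannot verify that when I have a million records.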
You can use replace, where ^ matches the start of the string and \s+ matches one or more whitespace characters:
data = pd.read_json(filename, orient=columnName)
# Raw string so that \s reaches the regex engine unchanged
data['full_text'] = data['full_text'].replace(r'^@ABC\s+', '', regex=True)
print(data)
full_text
0 hi there, how much for an apple
1 hi there, how much for an orange
2 hi there, how much @ABC for an pineapple
3 hi there @ABC, how much for an car
4 hi there, how much for an tree
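# Removes any leading handle, not only @ABC; regex=True is needed since pandas 2.0,
# where str.replace treats the pattern as a literal string by default
data['full_text'] = data['full_text'].str.replace(r'^(?:\@[^\s]+)\s*', '', regex=True)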
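To compare the two patterns without a JSON file, here is a minimal sketch that builds the DataFrame inline from the sample tweets in the question. The variable names fixed_handle and any_handle are only illustrative, and the second pattern uses \S as a shorthand for [^\s].

```python
import pandas as pd

# Sample tweets copied from the question; built inline instead of
# being loaded with pd.read_json (an assumption made for this sketch).
data = pd.DataFrame({
    "full_text": [
        "@ABC hi there, how much for an apple",
        "@ABC hi there, how much for an orange",
        "@ABC hi there, how much @ABC for an pineapple",
        "hi there @ABC, how much for an car",
        "@ABC hi there, how much for an tree",
    ]
})

# Remove only a leading "@ABC" and the whitespace that follows it.
fixed_handle = data["full_text"].str.replace(r"^@ABC\s+", "", regex=True)

# Remove any leading handle: "@" followed by non-whitespace characters.
any_handle = data["full_text"].str.replace(r"^@\S+\s*", "", regex=True)

print(fixed_handle.tolist())
print(any_handle.tolist())
```

Because both patterns are anchored with ^, a handle in the middle of a tweet is left untouched, and the replacement runs over the whole column at once rather than row by row.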