How to extract hashtags from tweets in a pandas DataFrame?
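I have a dataset of tweets with several variables (columns), and I want to extract all hashtags from the tweets (the text column) and put the result in a new column (hashtags). Here is what I am trying: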
import pandas as pd
data = pd.read_csv("Sample.csv", lineterminator='\n')
def hashtags(string):
    Hash = data.text.str.findall(r'#.*?(?=\s|$)')
    return Hash
data['hashtags'] = data['text'].apply(lambda x: hashtags(x))
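However, when I run the hashtags function, my notebook hangs (it neither finishes executing nor raises an error). My file has only about 10k rows.
Also, if this code did run successfully, I would expect a result like this:
[#asd, #fer, #gtr]
But I want the result column to contain only the tag names, like [asd, fer, gtr]. Please suggest what changes I should make to the code.
I tried looking for a solution in previously asked questions, but most of them use regex, and I am looking for a solution that uses pandas.
Thanks in advance.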
I downloaded some sample Twitter data in .csv format from https://twitter-sentiment-csv.herokuapp.com/. In this example I use part of the first 10 rows.
import pandas as pd

def find_tags(row_string):
    # use a list comprehension to find list items that start with #
    tags = [x for x in row_string if x.startswith('#')]
    return tags
df = pd.DataFrame({'sentiment': {0: 'neutral',
1: 'neutral',
2: 'neutral',
3: 'neutral',
4: 'neutral',
5: 'neutral',
6: 'neutral',
7: 'positive',
8: 'neutral',
9: 'neutral'},
'text': {0: 'RT @fakeTakeDump: TRAMS STELARA BICYCLE PINOCHLE JUMBO INDEX SEPTAVALENT TYPEWRITER HOMEBREWING AND ANTI-LOCK HULLO KITTY IN FORTUNE COOKIE…',
1: 'RT @fauzanzain: Hi warga twitter, sekarang aku lagi cari career coach nih yang punya latar belakang tech recruiter / mid to senior digital…',
2: 'RT @fakeTakeDump: WOODWORKING THE FORUM SHOPS LIKENESS SPECTROHELIOSCOPE CHEEMS FLAVONOIDS ROCKET IS NEITHER SUGAR DADDY CANNED TUNA HANDMA…',
3: 'WOODWORKING THE FORUM SHOPS LIKENESS SPECTROHELIOSCOPE CHEEMS FLAVONOIDS ROCKET IS NEITHER SUGAR DADDY CANNED TUNA…',
4: 'RT @KirkDBorne: Recap of 60 days of #DataScience and #MachineLearning — days 1 through 60: by @NainaChaturved8 \n———…',
5: 'Recap of 60 days of #DataScience and #MachineLearning — days 1 through 60: by… ',
6: 'RT @IBAConservative: @dax_christensen The truth is out! They can’t hold it back. \n#CrimesAgainstHumanity \n#TrudeauTyranny \n#TrudeauMustResi…',
7: "RT @drmwarner: As per these children's health organizations, keeping masks on in schools 2wks post March break would have made much more se…",
8: 'RT @cryptotommy88: TL;DR\n✅ Collective analytics business \n✅ Draw power from data science & crowd-sourced knowledge\n✅ 1st product PFPscore:…',
9: 'RT @cryptotommy88: TL;DR\n✅ Collective analytics business \n✅ Draw power from data science & crowd-sourced knowledge\n✅ 1st product PFPscore:…'},
'user': {0: 'BotDuran',
1: 'ezash',
2: 'BlkHwk0ps',
3: 'fakeTakeDump',
4: 'RobotProud',
5: 'KirkDBorne',
6: 'cloudcnworld',
7: 'NeuroTeck',
8: 'BIGwinCutiejoy8',
9: 'luckbigw1n'}})
# split each tweet on single spaces into a list of tokens
df['split'] = df['text'].str.split(' ')
df['tags'] = df['split'].apply(lambda row : find_tags(row))
# replace # as requested in OP, replace for new lines and \ as needed.
df['tags'] = df['tags'].apply(lambda x : str(x).replace('#', '').replace('\n', ',').replace('\\', '').replace("'", ""))
Output of df['tags']:
0 []
1 []
2 []
3 []
4 [DataScience, MachineLearning]
5 [DataScience, MachineLearning]
6 []
7 []
8 []
9 []
Name: tags, dtype: object
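As a possible alternative, here is a minimal sketch that stays inside pandas' own string methods. The code in the question most likely hangs because hashtags() ignores its argument and calls data.text.str.findall on the whole column once per row, so every cell ends up holding an entire Series. Calling Series.str.findall once on the column avoids that. The sample rows and the pattern #(\w+) are assumptions for illustration, not taken from the original data: the capture group drops the leading #, and \w+ keeps letters, digits and underscores only. Unlike splitting on single spaces, this also picks up tags that directly follow a newline, which is why row 6 comes out empty in the output above.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real tweet data.
df = pd.DataFrame({
    "text": [
        "Recap of 60 days of #DataScience and #MachineLearning",
        "The truth is out! \n#CrimesAgainstHumanity \n#TrudeauTyranny",
        "no tags here",
    ]
})

# One vectorized call over the whole column. Because the pattern has a
# single capture group, each match is returned without the leading '#'.
df["hashtags"] = df["text"].str.findall(r"#(\w+)")

print(df["hashtags"].tolist())
```

Each cell in the new column is a real Python list rather than a string, so it can be passed on to something like df.explode("hashtags") if one row per tag is needed.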