Parse input and structure the output: keywords from tweets
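I am trying to pull all the #keywords out of tweetText into a separate column, alongside the other columns. I have not listed the other columns because they would only add clutter.
Rows whose tweetText has no #keywords should be dropped; for the rest, the hashtags should be extracted and placed in a separate column.
I am a bit lost on the part where the #keywords need to be filtered out of tweetText.
Input: TweetsID, Tweets (there are more columns)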
714602054988275712,I'm at MK Appartaments in Dobele
714600471512670212,"Baana bicycle counter.Today: 9 Same time last week: 7 Trend: ↑28% This year: 60 811 Last year: 802 079 #Helsinki #pyöräily #cycling"
714598616703320065,"Just posted a photo @ Moscow, Russia"
714593900053180416,We're #hiring! Read about our latest #job opening here: CRM Specialist #lifeinspiringcareers #Moscow #Sales
714591942949138434,Just posted a photo @ Kfc
714591380660731904,Homeless guide on my festival of tours from locals for locals #открытаякарта. Shot by Alexandr
714591338977579009,"Who we are? #edmonton #edm #edmlife #edms #edmlifestyle #edmfamily #edmgirls #edmlov"
Expected output: tweetId, hashKey (there will be other columns as well)
714600471512670212,#Helsinki #pyöräily #cycling
714593900053180416,#hiring! #lifeinspiringcareers #Moscow #Sales
714591380660731904,#открытаякарта
714591338977579009,#edmonton #edm #edmlife #edms #edmlifestyle #edmfamily #edmgirls #edmlov"
Code:
import pandas as pd
df1 = pd.read_csv('Turkey_28.csv')
key_word = df1[['tweetID', 'tweetText']].set_index('tweetID')['tweetText']
key_word = key_word.dropna().apply(lambda x: eval(x))
key_word = key_word[key_word.apply(type) == dict]
#I am lost in this section on how to select the hash keywords?
def get_key_words(x):
    return pd.Series(x['tweetText'],
key_word = key_word.apply(get_key_word)
df2 = pd.concat([coords, df1.set_index('tweetID').reindex(coords.index)], axis=1)
df2.to_csv('Turkey_key_word.csv', index=True)
Thanks for your suggestions.
Edit 1:
I ran into a syntax error when parsing the input with the code from the accepted answer.
Code:
import re
import pandas as pd
df = pd.readcsv('Turkey_Text.csv')
tweet_column = ['tweetText']
for idx in range(len(tweet_column)):
    tweet = tweet_column[idx]
    hashtag_list = re.findall(r('#\w+)', tweet)
    tweet_column[idx] = " ".join(hashtag_list)
    print tweet_column[idx]
Error:
File "keyword_split.py", line 9
tweet_column[idx] = " ".join(hashtag_list)
^
SyntaxError: invalid syntax
Expected output:
714600471512670212,#Helsinki
714600471512670212,#pyöräily
714600471512670212,#cycling
714593900053180416,#hiring!
714593900053180416,#lifeinspiringcareers
714593900053180416,#Moscow
714593900053180416,#Sales
714591380660731904,#открытаякарта
714591338977579009,#edmonton
714591338977579009,#edm
714591338977579009,#edmlife
714591338977579009,#edms
714591338977579009,#edmlifestyle
714591338977579009,#edmfamily
714591338977579009,#edmgirls
714591338977579009,#edmlov"
Use Python and regular expressions. It will make your life easier.
The regular expression r'#(\w+)' works well in this case.
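As a small illustration of how the capture group changes what re.findall returns, here is a sketch assuming Python 3, where \w also matches non-ASCII letters. The variable names are made up for the example, and the tweet is taken from the question's input.

```python
import re

tweet = ("We're #hiring! Read about our latest #job opening here: "
         "CRM Specialist #lifeinspiringcareers #Moscow #Sales")

# With a capture group after '#', findall returns only the captured part,
# so the leading '#' is dropped.
without_hash = re.findall(r'#(\w+)', tweet)

# With no group, findall returns the whole match, so the '#' is kept.
with_hash = re.findall(r'#\w+', tweet)

print(without_hash)
print(with_hash)
```

If the output should keep the leading #, as in the expected output above, the second form is probably the one to use. Note that the trailing "!" in "#hiring!" is not a word character, so neither pattern includes it.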
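I don't fully understand the flow of your code, since I don't have much experience searching CSVs with pandas, but if you are isolating the tweets and returning the keywords/hashtags for that column, then based on my understanding of regular Python logic it might look something like this...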
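import re

# tweet_column is assumed to be a list of tweet strings
for idx in range(len(tweet_column)):
    tweet = tweet_column[idx]
    # the r prefix goes directly before the opening quote: r'(#\w+)'
    hashtag_list = re.findall(r'(#\w+)', tweet)
    tweet_column[idx] = " ".join(hashtag_list)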
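For the pandas side, the same idea could be sketched roughly as follows. This is only an illustration under assumptions: the column names tweetID and tweetText come from the question's code, the in-memory sample (abbreviated from the question's input) stands in for the real CSV file, and the names tagged, per_tweet and per_tag are invented here.

```python
import io
import re

import pandas as pd

# Hypothetical in-memory sample standing in for the CSV file;
# the tweets are abbreviated versions of the question's input.
sample = io.StringIO(
    'tweetID,tweetText\n'
    "714602054988275712,I'm at MK Appartaments in Dobele\n"
    '714600471512670212,"Baana bicycle counter. #Helsinki #pyöräily #cycling"\n'
    '714591380660731904,Homeless guide on my festival of tours #открытаякарта. Shot by Alexandr\n'
)
# Rows with a missing tweetText are dropped so re.findall only sees strings.
df = pd.read_csv(sample).dropna(subset=['tweetText'])

# One list of hashtags per tweet; '#' is kept because the pattern has no group.
df['hashKey'] = df['tweetText'].apply(lambda text: re.findall(r'#\w+', text))

# Keep only the tweets that contain at least one hashtag.
tagged = df[df['hashKey'].str.len() > 0]

# First layout: one row per tweet, hashtags joined by spaces.
per_tweet = tagged.assign(hashKey=tagged['hashKey'].str.join(' '))[['tweetID', 'hashKey']]

# Second layout (as in Edit 1): one row per hashtag.
per_tag = tagged.explode('hashKey')[['tweetID', 'hashKey']]

print(per_tweet)
print(per_tag)
```

Either frame can then be written out with to_csv(..., index=False). DataFrame.explode needs pandas 0.25 or newer.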