Why is my sentiment analysis running so slow?
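I'm trying to make a GUI application where you can enter Twitter hashtags for two different things and compare them using sentiment analysis (I'm using movies as an example for now). My code isn't finished yet, since so far it only handles one hashtag. The end result should be a graph showing the polarity of the tweets (so far it only shows the polarity for one movie). My code does run and a graph pops up, but most of the time it takes forever. Sometimes it loads as quickly as I'd expect, but any other time it takes so long that I get impatient and rerun the program. Is the way the code is arranged, or the way the modules are used, causing this? Or is sentiment analysis generally slow? This is my first sentiment analysis project, so I'm not sure. Here is my code; I've taken out the Twitter keys and tokens because I wasn't sure whether I could leave them in: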
import tweepy as tw
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''
# authenticate twitter
auth = tw.OAuthHandler(consumer_key,consumer_secret)
auth.set_access_token(access_token,access_token_secret)
api = tw.API(auth,wait_on_rate_limit= True)
# GET TWEETS HERE
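# q must be a plain string, not a one-element tuple
hashtag = "#GreenKnight"
# Note: Tweepy 4.x renamed api.search to api.search_tweets
query = tw.Cursor(api.search, q=hashtag).items(1000)
tweets = [{'Tweets': tweet.text, 'Timestamp': tweet.created_at} for tweet in query]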
# put tweets in pandas dataframe
df = pd.DataFrame.from_dict(tweets)
df.head()
# green knight movie references
green_knight_references = ["GreenKnight", "Green Knight", "green knight", "greenknight", "'The Green Knight'"]
def identify_subject(tweet, refs):
    # Return 1 if the tweet contains any of the reference strings, else 0
    flag = 0
    for ref in refs:
        if tweet.find(ref) != -1:
            flag = 1
    return flag
df['Green Knight'] = df['Tweets'].apply(lambda x: identify_subject(x, green_knight_references))
df.head(10)
# time for stop words, to clear out the language not needed
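import re
import nltk
from nltk.corpus import stopwords
from textblob import Word, TextBlob
# A set makes the "word not in stop_words" test below much cheaper than a list
stop_words = set(stopwords.words("english"))
custom_stopwords = ['RT']
def preprocess_tweets(tweet, custom_stopwords):
    # Strip punctuation. str.replace() does not understand regexes and its
    # result was being discarded, so use re.sub() and keep the return value.
    preprocessed_tweet = re.sub(r'[^\w\s]', '', tweet)
    # Drop English stop words, then the custom ones
    preprocessed_tweet = " ".join(word for word in preprocessed_tweet.split() if word not in stop_words)
    preprocessed_tweet = " ".join(word for word in preprocessed_tweet.split() if word not in custom_stopwords)
    # Reduce each remaining word to its lemma
    preprocessed_tweet = " ".join(Word(word).lemmatize() for word in preprocessed_tweet.split())
    return preprocessed_tweet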
df['Processed Tweet'] = df['Tweets'].apply(lambda x: preprocess_tweets(x, custom_stopwords))
df.head()
#visualize
# Build each TextBlob once and reuse its sentiment for both columns
sentiment = df['Processed Tweet'].apply(lambda x: TextBlob(x).sentiment)
df['polarity'] = sentiment.apply(lambda s: s.polarity)
df['subjectivity'] = sentiment.apply(lambda s: s.subjectivity)
df.head()
# String aggregation names avoid the warning newer pandas emits for np.mean etc.
(df[df['Green Knight']==1][['Green Knight','polarity','subjectivity']].groupby('Green Knight').agg(['mean', 'max', 'min', 'median']))
green_knight = df[df['Green Knight']==1][['Timestamp', 'polarity']]
green_knight = green_knight.sort_values(by='Timestamp', ascending=True)
green_knight['MA Polarity'] = green_knight.polarity.rolling(10, min_periods=3).mean()
green_knight.head()
fig, axes = plt.subplots(2, 1, figsize=(13, 10))
axes[0].plot(green_knight['Timestamp'], green_knight['MA Polarity'])
axes[0].set_title("\n".join(["Green Knight Tweets"]))
fig.suptitle("\n".join(["Movie tweet polarity"]), y=0.98)
plt.show()
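I've worked with tweepy before, and the slowest part was Twitter's API. Its rate limit gets exhausted very quickly, and unless you pay them, that gets frustrating :(.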
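To put rough numbers on that, the sketch below estimates how many search requests one run of the question's code (items(1000)) makes and how many such runs fit into one rate-limit window. The page size of 15 tweets per request and the limit of 180 requests per 15-minute window are assumptions based on the standard v1.1 search endpoint with user authentication, not something stated in the question; check Twitter's current documentation for your access tier. The helper names are made up for illustration.

```python
import math

# Assumed values (not from the question): standard v1.1 search limits.
TWEETS_PER_REQUEST = 15      # assumed default page size when no count is passed
REQUESTS_PER_WINDOW = 180    # assumed user-auth request limit per window
WINDOW_MINUTES = 15          # assumed window length


def requests_per_run(n_tweets, per_request=TWEETS_PER_REQUEST):
    """Number of paged search requests needed to collect n_tweets."""
    return math.ceil(n_tweets / per_request)


def full_runs_per_window(n_tweets, per_request=TWEETS_PER_REQUEST):
    """How many complete runs fit into one rate-limit window."""
    return REQUESTS_PER_WINDOW // requests_per_run(n_tweets, per_request)


print(requests_per_run(1000))            # 67
print(full_runs_per_window(1000))        # 2
print(requests_per_run(1000, 100))       # 10
print(full_runs_per_window(1000, 100))   # 18
```

Under these assumptions, a third run within the same 15 minutes would hit the limit, and because the API object is created with wait_on_rate_limit=True, tweepy sleeps until the window resets instead of raising an error. That would match the "sometimes fast, sometimes forever" behaviour. Asking for larger pages, for example by passing count=100 to the tw.Cursor(api.search, ...) call, should reduce the number of requests per run.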
Sentiment analysis with TextBlob shouldn't be very slow.
However, your best bet is to use cProfile, as @osint_alex mentioned in the comments, or, for a simple solution, just put some print statements between the main 'blocks' of your code.
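Both suggestions can be sketched with the standard library alone. The timed helper and the two stage functions below are hypothetical stand-ins (fetch_tweets fakes the network wait with time.sleep, and analyse is a trivial placeholder for the TextBlob step); in the real script you would wrap the tweepy download, the preprocessing and the sentiment step instead.

```python
import cProfile
import io
import pstats
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) how long the wrapped block takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


# Hypothetical stand-ins for the real stages of the script.
def fetch_tweets():
    time.sleep(0.2)  # pretend we are waiting on the network
    return ["RT I loved the Green Knight"] * 1000


def analyse(tweets):
    return [len(tweet.split()) for tweet in tweets]


# Option 1: "print statements between blocks", made a little tidier.
timings = {}
with timed("fetch", timings):
    tweets = fetch_tweets()
with timed("analyse", timings):
    scores = analyse(tweets)

# Option 2: cProfile, showing the five most expensive calls by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
analyse(fetch_tweets())
profiler.disable()
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

If the fetch stage dominates in your own measurements, the time is going into waiting on Twitter rather than into the sentiment analysis. You can also profile the whole script without editing it by running python -m cProfile -s cumulative on the script file.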