Analyzing Twitter data using R

Question

I am trying to analyze Twitter data using R by plotting the number of tweets over a period of time. When I write

plot(tweet_df$created_at, tweet_df$text)

I get this error message:

Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf
6: In max(x) : no non-missing arguments to max; returning -Inf

Here is the code I am using:

library("rjson")
json_file <- "tweet.json"
json_data <- fromJSON(file=json_file)
library("streamR")
tweet_df <- parseTweets(tweets = json_file)
#using the twitter data frame
tweet_df$created_at
tweet_df$text
plot(tweet_df$created_at, tweet_df$text)

Answer 1

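You have a few issues here, but nothing insurmountable. If you want to track tweets over time, what you are really asking for is the number of tweets created per time frame (tweets per minute, per second, and so on). That means you only need the created_at column, and you can build the graph with R's hist function.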

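If you want to split by words mentioned in the text or by anything else, that is doable too, but you should probably use ggplot2 for that, and perhaps ask a separate question. Anyway, it looks like parseTweets converts the Twitter timestamp to a character field, so you need to convert it to a POSIXct timestamp field that R can understand. Suppose you have a data frame that looks like this: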

❥ head(tweet_df[,c("id_str","created_at")])
              id_str                     created_at
1 597862782101561346 Mon May 11 20:36:09 +0000 2015
2 597862782097346560 Mon May 11 20:36:09 +0000 2015
3 597862782105694208 Mon May 11 20:36:09 +0000 2015
4 597862782105694210 Mon May 11 20:36:09 +0000 2015
5 597862782076198912 Mon May 11 20:36:09 +0000 2015
6 597862782114078720 Mon May 11 20:36:09 +0000 2015

You can do this:

❥ dated_tweets <- as.POSIXct(tweet_df$created_at, format = "%a %b %d %H:%M:%S +0000 %Y", tz = "UTC")
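
As a quick sanity check on the conversion, here is a minimal sketch using two of the timestamps shown above. It makes two assumptions that are not in the original answer: it switches LC_TIME to the "C" locale so that %a and %b match English day and month names, and it passes tz = "UTC" to match the literal +0000 offset. The variable names created_at and parsed are only for illustration.

```r
# Sample timestamps copied from the data frame shown above.
created_at <- c("Mon May 11 20:36:09 +0000 2015",
                "Mon May 11 20:36:09 +0000 2015")

# Character timestamps cannot be coerced to numbers; this is what the
# "NAs introduced by coercion" warnings in the question point at.
stopifnot(all(is.na(suppressWarnings(as.numeric(created_at)))))

# Assumption: use the "C" locale for dates so that %a and %b match the
# English names "Mon" and "May". This changes the setting for the session.
invisible(Sys.setlocale("LC_TIME", "C"))

# Assumption: tz = "UTC" because the format string treats +0000 as literal text.
parsed <- as.POSIXct(created_at,
                     format = "%a %b %d %H:%M:%S +0000 %Y",
                     tz = "UTC")

# Any NA here would mean a timestamp did not match the format string.
stopifnot(inherits(parsed, "POSIXct"), !anyNA(parsed))
print(format(parsed[1], "%Y-%m-%d %H:%M:%S", tz = "UTC"))
```

If the second stopifnot fails on real data, the likely causes are a non-English locale or a timestamp with an offset other than +0000.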

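This gives you a vector of tweet dates in R's timestamp format. You can then plot them like this. I left a sample Twitter feed open for 15 minutes or so. Here is the result: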

❥ hist(dated_tweets, breaks = "secs", freq = TRUE)
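
If one bar per second is too fine-grained, the same counts can be computed explicitly per minute. The following is only a sketch of that idea, not part of the original answer: it uses cut and table instead of hist, and the timestamps are randomly generated stand-ins for dated_tweets, not real Twitter data.

```r
# Hypothetical sample: 200 timestamps spread over 15 minutes, standing in
# for the dated_tweets vector built above.
set.seed(1)
start <- as.POSIXct("2015-05-11 20:36:09", tz = "UTC")
dated_tweets <- start + sort(runif(200, min = 0, max = 15 * 60))

# cut() assigns each timestamp to the minute it falls in; table() counts them.
per_minute <- table(cut(dated_tweets, breaks = "min"))

# Every tweet lands in exactly one bin.
stopifnot(sum(per_minute) == length(dated_tweets))

# Draw the counts as a bar chart, one bar per minute.
barplot(per_minute, las = 2, cex.names = 0.6,
        ylab = "Tweets per minute")
```

Having the counts in a table also makes it easy to inspect them directly, for example with head(per_minute), before plotting.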

