Streamlining cleaning Tweet text with Stringr

I am learning text mining and rtweet, and at the moment I am brainstorming the simplest way to clean the text obtained from tweets. I have been using the approach recommended at this link: remove URLs, remove anything other than English letters or spaces, remove stop words, remove extra white space, remove numbers, and remove punctuation.

That approach uses both gsub and tm_map(), and I was wondering whether the cleaning could be streamlined with stringr by simply chaining the steps into one cleaning pipeline. The following function was recommended, but for some reason I can't run it.

library(stringr)  # str_remove_all(), str_replace_all(), etc.
library(dplyr)    # the %>% pipe

clean_tweets <- function(x) {
  x %>%
    str_remove_all(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
    str_replace_all("&amp;", "and") %>%
    str_remove_all("[[:punct:]]") %>%
    str_remove_all("^RT:? ") %>%
    str_remove_all("@[[:alnum:]]+") %>%
    str_remove_all("#[[:alnum:]]+") %>%
    str_replace_all("\\n", " ") %>%
    str_to_lower() %>%
    str_trim("both")
}

Cleaning solution:

tweetsClean <- df %>% 
  mutate(clean = clean_tweets(text))

Finally, is it possible to keep the emojis so that I can count how often each emoji is used and possibly create a custom sentiment for each one?

Emoji solution:

library(emo)    # remotes::install_github("hadley/emo")
library(tidyr)  # unnest()

TopEmoji <- tweetsClean %>%
  mutate(emoji = ji_extract_all(text)) %>%
  unnest(cols = c(emoji)) %>%
  count(emoji, sort = TRUE) %>%
  top_n(5)

Once the text values are clean, my process is to select the relevant columns, add row numbers so that each word keeps track of which tweet it belongs to, and unnest the tokens:

library(tidytext)  # unnest_tokens(), stop_words

tweetsClean <- tweets %>%
  select(created_at, text) %>%
  mutate(linenumber = row_number()) %>%
  select(linenumber, everything()) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
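For a sense of what unnest_tokens() produces, here is a tiny self-contained sketch; the three-word string is just an illustration (note that unnest_tokens() lower-cases tokens by default):

```r
library(dplyr)
library(tibble)
library(tidytext)

mini <- tibble(linenumber = 1, text = "Massive bull flag")

mini %>%
  unnest_tokens(word, text)
#> # A tibble: 3 x 2
#>   linenumber word
#>        <dbl> <chr>
#> 1          1 massive
#> 2          1 bull
#> 3          1 flag
```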

Next I attach the desired sentiments and give each row a value based on the sum of the sentiment scores obtained with AFINN:

sentiment_bing <- get_sentiments("bing") 
sentiment_AFINN <- get_sentiments("afinn")

tweetsValue <- tweetsClean %>%
  inner_join(sentiment_bing) %>%
  inner_join(sentiment_AFINN) %>%
  group_by(linenumber,created_at) %>%
  mutate(TweetValue = sum(value))

Thanks for any pointers!

Test data:

df <- structure(list(created_at = structure(c(1622854597, 1622853904, 
1622853716, 1622778852, 1622448379, 1622450951, 1622777623, 1622853561, 
1622466544, 1622853192), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), text = c("@elonmusk can the dogefather ride @CumRocketCrypto into the night. #SpaceX @dogecoin https://twitter.com/", 
"@CryptoCrunchApp @CumRocketCrypto @vergecurrency @InuSanshu @Mettalex @UniLend_Finance @NuCypher @Chiliz @JulSwap @CurveFinance @PolyDoge Wrong this twitt shansu", 
"9am AEST Sunday morning!!!\nI will be hosting on the @CumRocketCrypto twitch channel!\n\nSo cum say Hi! https://twitter.com/", 
"@SamInCrypt1 @IamMars34147875 @DylanMcKitten @elonmusk @CumRocketCrypto Cumrocket <U+0001F4A6> https://twitter.com/", 
"@DK19663019 @CumRocketCrypto Oh hey, that's me! Did you grab one?", 
"@DK19663019 @CumRocketCrypto Thank you! <U+2764><U+FE0F>", "@CumRocketInfo @elonmusk @CumRocketCrypto Maybe he'd like to meet the CUMrocket models? https://twitter.com/", 
"@AerotyneToken @CumRocketCrypto Is there a way to make sure ones wallet ID is on the list?", 
"@AerotyneToken @CumRocketCrypto Does one have to attend the giveaway stream, or just hold 0.2 BNB of #CUMMIES and #ATYNE?\nAnd what happens if I bought about 0.2BNB each and the BNB price rises? Do I have to check every day if they're still worth at least 0.2?", 
"@Don_Santino1 @brandank_cr @PAWGcoinbsc @Tyga @CumRocketCrypto Massive bull flag. 10x is imminent!"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))

To answer your main question: the clean_tweets() function does not work in the line `Clean <- tweets %>% clean_tweets`, presumably because you are feeding it a data frame, whereas the internals of the function (i.e. the str_ functions) expect a character vector (strings).

Cleaning question

I say "presumably" here because I am not sure what your tweets object looks like, so I can't be certain. At least on your test data, though, the following fixes the problem:

df %>% 
  mutate(clean = clean_tweets(text))

If you just want the character vector back, you can also do:

clean_tweets(df$text)
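To illustrate the vector-vs-data-frame point with a toy string (a minimal sketch; the pattern is one of the ones used inside clean_tweets()):

```r
library(stringr)

x <- "RT @user Check this out #crypto"

# str_ functions operate on character vectors, so this works:
str_remove_all(x, "#[[:alnum:]]+")
#> [1] "RT @user Check this out "

# A whole tibble is not a character vector, which is why clean_tweets()
# should be given a character column (df$text, or text inside mutate()),
# not the data frame itself.
```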

Emoji question

As for keeping the emojis and assigning sentiments to them: yes, I think you would basically proceed the same way you handle the rest of the text: tokenize them, assign a numeric value to each emoji, and then aggregate.
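A minimal sketch of that idea, assuming you already have one emoji per row (e.g. the output of ji_extract_all() plus unnest() as above). The emoji_sentiment values here are made-up placeholders you would curate yourself, since neither bing nor AFINN scores emojis:

```r
library(dplyr)
library(tibble)

# Hypothetical hand-curated per-emoji sentiment values (placeholders)
emoji_sentiment <- tibble(
  emoji = c("\U0001F4A6", "\u2764\uFE0F", "\U0001F680"),
  value = c(1, 2, 3)
)

# Tokenized emojis per tweet, one row per emoji occurrence
tweet_emojis <- tibble(
  linenumber = c(1, 1, 2),
  emoji      = c("\U0001F680", "\u2764\uFE0F", "\U0001F680")
)

# Same pattern as with words: join the lookup, then aggregate per tweet
emoji_value <- tweet_emojis %>%
  inner_join(emoji_sentiment, by = "emoji") %>%
  group_by(linenumber) %>%
  summarise(EmojiValue = sum(value))
```

This mirrors the inner_join() / group_by() / sum() flow you already use for AFINN word scores, so the emoji values can be added to TweetValue or kept as a separate column.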