Identify meaningless or gibberish text from a data frame in R. Is there a way to partially match string/words to a dictionary?

Question

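I want to create a variable (column) in my data frame that flags suspected meaningless text (e.g. "asdkjhfas"), or the reverse, flags text that is meaningful. This is part of a general-purpose script to help my team clean survey data.

A function I found on Stack Overflow (link and credit below) lets me match a single word against a dictionary, but it does not recognize strings that contain multiple words.

Is there any way to do a partial match (rather than a strict match) against a dictionary?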

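library(qdapDictionaries) # install.packages("qdap")

# TRUE when the whole string x is an entry in the GradyAugmented word list
is.word <- function(x) x %in% GradyAugmented

x <- c(1, 2, 3, 4, 5, 6)
y <- c("this is text", "word", "random", "Coca-cola", "this is meaningful asdfasdf", "sadfsdf")
df <- data.frame(x, y)

# flag rows whose full text is a dictionary entry; all other rows stay NA
df$z[is.word(df$y)] <- TRUE
df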

In an ideal world I would get a column: df$z TRUE TRUE TRUE TRUE TRUE NA

My actual result is: df$z NA TRUE TRUE NA NA NA

I would be very happy with: df$z TRUE TRUE TRUE NA TRUE NA

I found the function is.word here, thanks to user parth

Answer 1

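This works with dplyr and tidytext. It is a bit longer than I expected; there may be a shortcut somewhere.

I split each sentence into words, check each word against the dictionary, and count the number of TRUE values per row. If the count is greater than 0, the row contains real text; otherwise it does not.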

library(tidytext)
library(dplyr)
df %>% unnest_tokens(words, y) %>% 
  mutate(text = words %in% GradyAugmented) %>% 
  group_by(x) %>% 
  summarise(z = sum(text)) %>% 
  inner_join(df) %>% 
  mutate(z = if_else(z > 0, TRUE, FALSE))


Joining, by = "x"
# A tibble: 6 x 3
      x z     y                          
  <dbl> <lgl> <chr>                      
1     1 TRUE  this is text               
2     2 TRUE  word                       
3     3 TRUE  random                     
4     4 TRUE  Coca-cola                  
5     5 TRUE  this is meaningful asdfasdf
6     6 FALSE sadfsdf

Answer 2

Here is a solution using purrr (along with dplyr and stringr):

library(tidyverse)

your_data <- tibble(text = c("this is text", "word", "random", "Coca-cola", "this is meaningful asdfasdf", "sadfsdf"))

your_data %>%
 # split the text on spaces and punctuation
 mutate(text_split = str_split(text, "\\s|[:punct:]")) %>% 
 # see if some element of the provided text is an element of your dictionary
 mutate(meaningful = map_lgl(text_split, some, is.element, GradyAugmented)) 

# A tibble: 6 x 3
  text                        text_split meaningful
  <chr>                       <list>     <lgl>     
1 this is text                <chr [3]>  TRUE      
2 word                        <chr [1]>  TRUE      
3 random                      <chr [1]>  TRUE      
4 Coca-cola                   <chr [2]>  TRUE      
5 this is meaningful asdfasdf <chr [4]>  TRUE      
6 sadfsdf                     <chr [1]>  FALSE

Answer 3

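Thanks @Ben G and @phiver

Both solutions work. One thing to note is that tidytext only works with tibbles. I made some minor adjustments to get the result back into a data frame, and thought I would share them in case anyone else needs that format.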

x <- c(1, 2, 3, 4, 5, 6)
y <- c("this is text", "word", "random", "Coca-cola", "this is meaningful asdfasdf", 
"sadfsdf")
my_tibble <- tibble(x,y)

my_tibble_new = my_tibble %>%
   unnest_tokens(output=word, input="y", token = "words") %>%
   mutate(text = word %in% GradyAugmented) %>%
   group_by(x) %>%
   summarise(z = sum(text)) %>%
   inner_join(my_tibble) %>%
   mutate(z = if_else(z > 0, TRUE, FALSE))

df = as.data.frame(my_tibble_new)
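
For anyone who would rather stay in base R and skip the tibble round trip, the same "any token is in the dictionary" idea can be sketched as below. This is only a minimal sketch under stated assumptions: `toy_dictionary` is a small hypothetical stand-in for `GradyAugmented` so the example runs without qdapDictionaries, and `has_dictionary_word` is a helper name made up for illustration, not part of any package.

```r
# Hypothetical stand-in for qdapDictionaries::GradyAugmented (assumption:
# replace with the real vector after library(qdapDictionaries)).
toy_dictionary <- c("this", "is", "text", "word", "random",
                    "coca", "cola", "meaningful")

# Hypothetical helper: TRUE if at least one token of each string
# appears in the dictionary, FALSE otherwise.
has_dictionary_word <- function(strings, dictionary) {
  # Lower-case, then split on runs of whitespace or punctuation
  tokens <- strsplit(tolower(strings), "[[:space:][:punct:]]+")
  # One logical per input string
  vapply(tokens, function(tok) any(tok %in% dictionary), logical(1))
}

df <- data.frame(
  x = 1:6,
  y = c("this is text", "word", "random", "Coca-cola",
        "this is meaningful asdfasdf", "sadfsdf"),
  stringsAsFactors = FALSE
)

df$z <- has_dictionary_word(df$y, toy_dictionary)
df
```

With the toy dictionary above, rows 1 to 5 come out TRUE and row 6 ("sadfsdf") comes out FALSE, so gibberish rows are marked FALSE rather than NA. Results with the real GradyAugmented list depend on which tokens it contains.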
