Error in Removing regex, Split Text into Paragraph, and then apply ifelse in R

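I am struggling to strip unwanted escape characters such as \t and \n, split the text into paragraphs, and then apply ifelse to a data frame. I would appreciate your help. Thank you.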

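For each text in the data frame, I want to search for words in the first paragraph only. I have a set of search terms: if a term appears, enter 1, otherwise enter 0.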

Below is the table.

data<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "\n\t\t\t\t \n\t\t\t\t\tPublication Date: October 31, 2017\n\t\t\t\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\n\t\t\t\t\t \n \n   The soccer world cup is entralling. \nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor")), .Names = c("ID", "Text"), row.names = c(NA, 
-15L), class = "data.frame")

For each entry in the Text column, I am searching for the following words:

library(stringr)
library(stringi)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(dplyr)
words<-c("field", "ocean", "glamor showcases")

I have tried the following:

Removing the unwanted characters.

When I try to remove "\t" and "\n", I get the following warning:

data1<-data %>% mutate(Text=gsub("\t",Text,""))

Warning message: In gsub("\t", Text, "") : argument 'replacement' has length > 1 and only the first element will be used
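The warning comes from the argument order: the signature is gsub(pattern, replacement, x), so in the call above the whole Text column is passed as the replacement. A minimal sketch of the corrected order, using a small made-up character vector (`text` is a stand-in, not the real data):

```r
# Hypothetical stand-in for the Text column (not the full data from the question)
text <- c("\n\t\t Publication Date\n\t October 31, 2017",
          "\n \n  The soccer world cup")

# gsub(pattern, replacement, x): the vector to clean goes last
no_tabs <- gsub("\t", "", text, fixed = TRUE)

# Replace every tab or newline with a space, collapse runs of spaces, then trim
clean <- trimws(gsub(" +", " ", gsub("[\t\n]", " ", text)))
clean
```

Inside a pipeline the same fix would read mutate(Text = gsub("\t", "", Text)). If paragraphs are delimited by "\n\n", it is probably safer to remove only the tabs before splitting, since replacing the newlines also removes the paragraph breaks.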

Splitting by paragraph

data1<-data %>% mutate(Text2=Text) %>% unnest_tokens("Text3",Text2,token="paragraphs")
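As a base R alternative for this step, the first paragraph can be pulled out with strsplit(). This is only a sketch: it assumes paragraphs are separated by a blank line ("\n\n"), and `text` is a made-up sample rather than the real column.

```r
# Hypothetical sample; paragraphs are assumed to be separated by "\n\n"
text <- c("First paragraph of doc one.\n\nSecond paragraph.",
          "Only one paragraph here.")

# strsplit() needs a character vector, so a factor column would
# first have to be converted with as.character()
paragraphs <- strsplit(text, "\n\n", fixed = TRUE)

# Keep the first paragraph of each text
first_par <- vapply(paragraphs, function(p) p[1], character(1))
first_par
```

An empty string splits into zero paragraphs, so its first paragraph comes back as NA. That matters here because most rows in the table are empty.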

1 if the word is present, otherwise 0. The final table should look like this:

finaldata<-structure(list(ID = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(c(2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "\n\t\t\t\t \n\t\t\t\t\tPublication Date: October 31, 2017\n\t\t\t\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\n\t\t\t\t\t \n \n   The soccer world cup is entralling. \nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor"), field = structure(c(2L, 3L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor"), country = structure(c(3L, 2L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor"), glamor.showcases = structure(c(2L, 
    3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "0", "1"), class = "factor")), .Names = c("ID", "Text", "field", 
"country", "glamor.showcases"), row.names = c(NA, -15L), class = "data.frame")

Any help would be greatly appreciated. Thank you.

I have looked at the following resources:

  1. Count word occurrences in R

  2. Split text file into paragraph files in R

Assuming that a new paragraph in df$Text starts with \n\n, you can try this:
#search df$Text to find if it contains strings present in 'words' vector in its first paragraph
words_df <- do.call(cbind, lapply(words, function(x) 
  as.numeric(grepl(x, gsub("\n\n.*$", "", df$Text), ignore.case = TRUE))))
colnames(words_df) <- words

#above outcome is combined with original dataframe to have the final result
final_df <- cbind(df, words_df)

This gives

> final_df[, -(1:2)]
  field country glamor showcases
1     0       1                0
2     1       0                1


Sample data:

df <- structure(list(ID = structure(2:3, .Label = c("", "1", "2"), class = "factor"), 
    Text = structure(2:3, .Label = c("", "\n\t\t\t\t \n\t\t\t\t\tPublication Date: October 31, 2017\n\t\t\t\t October 31, 2017  he world is an amazing place. It is filled with wonders. Not just in one country but in any country you live in.\n\nYou just must open yourself to seeing it. It is in the architecture. It is in the ocean. It is in the people. It is in the animals.", 
    "\n\t\t\t\t\t \n \n   The soccer world cup is entralling. \nEveryone  acknowledge ieach other on the field. \nIt is only going to get better. The glitz and glamor showcases reflects the spirit the game is played in."
    ), class = "factor")), .Names = c("ID", "Text"), row.names = 1:2, class = "data.frame")

words<-c("field", "country", "glamor showcases")