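How do I parse out a specific section of text?
My goal is to pull specific sections out of a set of Word documents based on keywords. I'm having trouble parsing specific sections of text out of a larger dataset of text files. The dataset originally looks like this, where "title one" and "title two" mark the start and end of the text I'm interested in, and "unimportant words" stands for the part of the text file I'm not interested in: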
**Text**             **Text File**
title one            Text file 1
sentence one         Text file 1
sentence two         Text file 1
title two            Text file 1
unimportant words    Text file 1
title one            Text file 2
sentence one         Text file 2
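I then converted the data to character with as.character and tidied it with unnest_tokens:
library(dplyr)     # for %>%
library(tidytext)  # for unnest_tokens()
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
tidy_df <- df %>% unnest_tokens(word, Text, token = "words")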
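I now want to look only at the sentences in the dataset and exclude the unimportant words. Title one and title two are the same in every text file, but the sentences between them differ. I tried the code below, but it doesn't seem to work.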
filtered_resume <- lapply(tidy_resume, (tidy_resume %>% select(Name) %>% filter(title:two)))
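I'm not familiar with the tidytext package, so here's an alternative base R solution. It uses this expanded example data (the code to create it is included at the bottom):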
> df
Text File
1 title one Text file 1
2 sentence one Text file 1
3 sentence two Text file 1
4 title two Text file 1
5 unimportant words Text file 1
6 title one Text file 2
7 sentence one Text file 2
8 sentence two Text file 2
9 sentence three Text file 2
10 title two Text file 2
11 unimportant words Text file 2
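Create a function that adds a separate column, based on the values in the Text column, indicating whether a given row should be kept or dropped. Details are in the comments: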
get_important_sentences <- function(df_) {
  # Running state: 0 until the first 'title one' is seen, so any
  # rows before it are dropped
  val = 0
  keep = c()
  # For every text row
  for (x in df_$Text) {
    # Double the current val: 1 becomes 2, 4, 8, ... while 0 stays 0
    val = val * 2
    # If the current text includes "title", reset val to 1 for
    # 'title one' and to 0 for 'title two'
    if (grepl("title", x)) {
      val = ifelse(grepl("one", x), 1, 0)
    }
    # Append val to keep each time
    keep = c(keep, val)
  }
  # keep is now a numeric vector; add it to the data frame
  df_$keep = keep
  # Exclude any rows where 'keep' is 1 (for 'title one') or 0 (for
  # 'title two' or any unimportant words). Also, drop the 'keep' column
  return(df_[df_$keep > 1, c("Text", "File")])
}
Then you can call it on the entire data frame:
> get_important_sentences(df)
Text File
2 sentence one Text file 1
3 sentence two Text file 1
7 sentence one Text file 2
8 sentence two Text file 2
9 sentence three Text file 2
Or on a per-file basis with lapply:
> lapply(split(df, df$File), get_important_sentences)
$`Text file 1`
Text File
2 sentence one Text file 1
3 sentence two Text file 1
$`Text file 2`
Text File
7 sentence one Text file 2
8 sentence two Text file 2
9 sentence three Text file 2
Data:
df <-
  data.frame(
    Text = c(
      "title one",
      "sentence one",
      "sentence two",
      "title two",
      "unimportant words",
      "title one",
      "sentence one",
      "sentence two",
      "sentence three",
      "title two",
      "unimportant words"
    ),
    File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
    stringsAsFactors = FALSE
  )
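As a loop-free variant of the same idea, the sketch below marks rows with cumsum() instead of a running counter. The helper name between_titles and its start/end arguments are hypothetical (they are not part of the answer above), and the sketch assumes every "title one" is eventually followed by a matching "title two" with no nesting.

```r
# Example data, copied from the answer above
df <- data.frame(
  Text = c(
    "title one", "sentence one", "sentence two", "title two",
    "unimportant words", "title one", "sentence one", "sentence two",
    "sentence three", "title two", "unimportant words"
  ),
  File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows that fall between a start marker
# and the next end marker, excluding the marker rows themselves
between_titles <- function(df_, start = "title one", end = "title two") {
  is_start <- grepl(start, df_$Text, fixed = TRUE)
  is_end <- grepl(end, df_$Text, fixed = TRUE)
  # A row is inside a section when more start markers than end markers
  # have been seen up to and including that row
  inside <- cumsum(is_start) > cumsum(is_end)
  # The start marker row itself counts as "inside", so drop it explicitly
  df_[inside & !is_start, c("Text", "File")]
}

# Whole data frame at once
result <- between_titles(df)

# Or one result per source file
per_file <- lapply(split(df, df$File), between_titles)
```

Because the counts are cumulative, rows that appear before the first "title one" are never selected, and the per-file call keeps a section from leaking from one file into the next.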
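If you want a tidyverse option with very few lines of code, take a look at this. You can use case_when() and str_detect() to find the rows in your data frame that contain the important/not important signals.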
library(tidyverse)
df1 <- df %>%
  mutate(important = case_when(str_detect(Text, "title one") ~ TRUE,
                               str_detect(Text, "title two") ~ FALSE))
df1
#> # A tibble: 11 x 3
#> Text File important
#> <chr> <chr> <lgl>
#> 1 title one Text file 1 TRUE
#> 2 sentence one Text file 1 NA
#> 3 sentence two Text file 1 NA
#> 4 title two Text file 1 FALSE
#> 5 unimportant words Text file 1 NA
#> 6 title one Text file 2 TRUE
#> 7 sentence one Text file 2 NA
#> 8 sentence two Text file 2 NA
#> 9 sentence three Text file 2 NA
#> 10 title two Text file 2 FALSE
#> 11 unimportant words Text file 2 NA
Now you can use fill() from tidyr to fill those values down.
df1 %>%
  fill(important, .direction = "down")
#> # A tibble: 11 x 3
#> Text File important
#> <chr> <chr> <lgl>
#> 1 title one Text file 1 TRUE
#> 2 sentence one Text file 1 TRUE
#> 3 sentence two Text file 1 TRUE
#> 4 title two Text file 1 FALSE
#> 5 unimportant words Text file 1 FALSE
#> 6 title one Text file 2 TRUE
#> 7 sentence one Text file 2 TRUE
#> 8 sentence two Text file 2 TRUE
#> 9 sentence three Text file 2 TRUE
#> 10 title two Text file 2 FALSE
#> 11 unimportant words Text file 2 FALSE
Created on 2018-08-14 by the reprex package (v0.2.0).
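At this point, you can filter(important) to keep only the text you want (the "title one" rows are flagged TRUE as well, so filter those out too if you only want the sentences), and then use functions from tidytext to do text mining on the important text that remains.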
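Putting the steps together, a minimal end-to-end sketch might look like the following. It assumes dplyr, tidyr, stringr and tidytext are installed. The group_by(File) step and the extra filter that drops the "title one" rows are additions of mine rather than part of the answer above, and the object names sentences and tidy_sentences are placeholders.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)

# Example data, copied from the first answer
df <- data.frame(
  Text = c(
    "title one", "sentence one", "sentence two", "title two",
    "unimportant words", "title one", "sentence one", "sentence two",
    "sentence three", "title two", "unimportant words"
  ),
  File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
  stringsAsFactors = FALSE
)

sentences <- df %>%
  # Assumption: fill within each file so a flag never carries over
  # from the end of one file into the start of the next
  group_by(File) %>%
  mutate(important = case_when(
    str_detect(Text, "title one") ~ TRUE,
    str_detect(Text, "title two") ~ FALSE
  )) %>%
  fill(important, .direction = "down") %>%
  ungroup() %>%
  # Keep flagged rows, but drop the "title one" marker rows themselves;
  # rows whose flag is still NA are dropped by filter() as well
  filter(important, !str_detect(Text, "title one")) %>%
  select(Text, File)

# One row per word, ready for text mining with tidytext
tidy_sentences <- sentences %>%
  unnest_tokens(word, Text, token = "words")
```

Filtering before tokenizing matters here: once unnest_tokens() has split the text into words, the "title one" and "title two" markers no longer sit on a single row, which makes the section boundaries much harder to detect.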