R tidytext: Count String Patterns, Not Words
I want to count the occurrences of blog tags in a vector (column). Here is the column:
> head(df$tags)
[1] "blog / thank you / NSW / ndoa / " "election / WA / blog / voting system / "
[3] "blog / " "euthanasia / media / Labor / Qld / assisted suicide / "
[5] "abortion / SA / blog / abortion-to-birth / "
[6] "euthanasia / media / Tas / assisted suicide / mike gaffney / "
Each tag has the form 'tag / '. I can count words using tidytext with the following code:
wordCount <- df %>%
unnest_tokens(word, tags) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
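However, this obviously breaks the tags apart and counts only individual words. I need to count occurrences of the tags themselves, not of the single words inside them.
I really don't know where to start, so any help is appreciated.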
A simple strsplit will do it:
# split each row on "/" (with optional surrounding whitespace), one tag per row
df %>%
  mutate( word = strsplit( gdata::trim(tags), "\\s*/\\s*" ) ) %>%
  unnest( cols=word )
This gives me:
> df
tags
1 blog / thank you / NSW / ndoa /
2 election / WA / blog / voting system /
3 blog /
4 euthanasia / media / Labor / Qld / assisted suicide /
5 abortion / SA / blog / abortion-to-birth /
6 euthanasia / media / Tas / assisted suicide / mike gaffney /
> df %>%
+   mutate( word = strsplit( gdata::trim(tags), "\\s*/\\s*" ) ) %>%
+   unnest( cols=word )
# A tibble: 23 x 2
tags word
<chr> <chr>
1 "blog / thank you / NSW / ndoa / " blog
2 "blog / thank you / NSW / ndoa / " thank you
3 "blog / thank you / NSW / ndoa / " NSW
4 "blog / thank you / NSW / ndoa / " ndoa
5 "election / WA / blog / voting system / " election
6 "election / WA / blog / voting system / " WA
7 "election / WA / blog / voting system / " blog
8 "election / WA / blog / voting system / " voting system
9 "blog / " blog
10 "euthanasia / media / Labor / Qld / assisted suicide / " euthanasia
# … with 13 more rows
>
You can probably plug this into the rest of your pipeline.
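To get from the split tags to actual counts, something like the following should work. This is only a sketch: it rebuilds the sample data from the question, uses base R's trimws() in place of gdata::trim() (my substitution, to avoid the extra dependency), and the names tag and tag_counts are just placeholders.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question
df <- tibble(tags = c("blog / thank you / NSW / ndoa / ",
                      "election / WA / blog / voting system / ",
                      "blog / ",
                      "euthanasia / media / Labor / Qld / assisted suicide / ",
                      "abortion / SA / blog / abortion-to-birth / ",
                      "euthanasia / media / Tas / assisted suicide / mike gaffney / "))

tag_counts <- df %>%
  # split on "/" with optional whitespace on either side
  mutate(tag = strsplit(trimws(tags), "\\s*/\\s*")) %>%
  # one row per tag
  unnest(cols = tag) %>%
  # defensive: drop empty strings if a row has stray separators
  filter(tag != "") %>%
  # count whole tags, most frequent first
  count(tag, sort = TRUE)

tag_counts
```

On the six sample rows, "blog" comes out on top with 4 occurrences, and multi-word tags such as "thank you" stay intact.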
You can tokenize with a regular expression in tidytext, which may be exactly what you are looking for:
library(tidyverse)
library(tidytext)
df <- tibble(tags = c("blog / thank you / NSW / ndoa / ",
"election / WA / blog / voting system / ",
"blog / ",
"euthanasia / media / Labor / Qld / assisted suicide / ",
"abortion / SA / blog / abortion-to-birth / ",
"euthanasia / media / Tas / assisted suicide / mike gaffney / "))
df %>%
  unnest_tokens(tag, tags, token = "regex", pattern = "\\s*/\\s*")
#> # A tibble: 23 x 1
#> tag
#> <chr>
#> 1 blog
#> 2 thank you
#> 3 nsw
#> 4 ndoa
#> 5 election
#> 6 wa
#> 7 blog
#> 8 voting system
#> 9 blog
#> 10 euthanasia
#> # … with 13 more rows
Created on 2021-03-11 by the reprex package (v1.0.0)
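Note that unnest_tokens() lowercases tokens by default, which is why "NSW" shows up as "nsw" above. If the original case matters, passing to_lower = FALSE and piping into count() might look like this. It is a sketch on the sample data from the question; tag_counts is just a placeholder name.

```r
library(dplyr)
library(tidytext)

# Sample data copied from the question
df <- tibble(tags = c("blog / thank you / NSW / ndoa / ",
                      "election / WA / blog / voting system / ",
                      "blog / ",
                      "euthanasia / media / Labor / Qld / assisted suicide / ",
                      "abortion / SA / blog / abortion-to-birth / ",
                      "euthanasia / media / Tas / assisted suicide / mike gaffney / "))

tag_counts <- df %>%
  # tokenize on the " / " separator and keep the original case
  unnest_tokens(tag, tags, token = "regex", pattern = "\\s*/\\s*",
                to_lower = FALSE) %>%
  # defensive: drop empty tokens, should any appear
  filter(tag != "") %>%
  # count whole tags, most frequent first
  count(tag, sort = TRUE)

tag_counts
```

Leaving to_lower at its default is probably the better choice if the same tag is sometimes typed with different capitalisation.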