How to separate a sentence into words
In R, I am currently working with a dialogue dataset. The data currently looks like this:
Mike, "Hello how are you"
Sally, "Good you"
I plan to eventually create a word cloud from this data, and need it to look like this:
Mike, Hello
Mike, how
Mike, are
Mike, you
Sally, good
Sally, you
Perhaps something like this, using reshape2::melt?
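# Sample data
df <- read.csv(text =
'Mike, "Hello how are you"
Sally, "Good you"', header = FALSE)
# Trim each sentence, then split it on whitespace
# (the regex \s must be written "\\s" inside an R string)
lst <- strsplit(trimws(as.character(df[, 2])), "\\s")
# Name each list entry after its speaker
names(lst) <- trimws(df[, 1])
# Reshape into a long data.frame and put the speaker column first
library(reshape2)
df.long <- (melt(lst))[2:1]
df.long
#      L1 value
# 1  Mike Hello
# 2  Mike   how
# 3  Mike   are
# 4  Mike   you
# 5 Sally  Good
# 6 Sally   you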
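Explanation: Trim leading/trailing whitespace from the entries in the second column (trimws), split them on whitespace (the regex \s, written "\\s" in an R string), and store the result in a list. Take the list entry names from the first column, then reshape into a long data.frame using reshape2::melt.
I leave turning this into a comma-separated data.frame up to you...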
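The same steps can also be sketched in base R without reshape2, which makes the intermediate list easier to inspect. This is only an illustrative alternative, not the method used above; the names `speaker`, `sentence`, `words`, and `long` are made up for this sketch, and the input is typed in directly rather than read with read.csv.

```r
# Hypothetical input, with the stray leading space that read.csv would leave
speaker  <- c("Mike", "Sally")
sentence <- c(" Hello how are you", " Good you")

# Trim leading/trailing whitespace, then split each sentence on whitespace
words <- strsplit(trimws(sentence), "\\s+")
names(words) <- speaker

# Repeat each speaker once per word and flatten the list into one column
long <- data.frame(
  speaker = rep(names(words), lengths(words)),
  word    = unlist(words, use.names = FALSE)
)
long
```

Here `lengths()` gives the number of words per speaker, so `rep()` lines the speaker names up with the flattened words.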
Use a tokenizer, e.g. via tidytext::unnest_tokens:
library(tidyverse)
library(tidytext)
dialogue <- read_csv(
'Mike, "Hello how are you"
Sally, "Good you"',
col_names = c('speaker', 'sentence')
)
dialogue %>% unnest_tokens(word, sentence)
#> # A tibble: 6 x 2
#>   speaker word
#>   <chr>   <chr>
#> 1 Mike    hello
#> 2 Mike    how
#> 3 Mike    are
#> 4 Mike    you
#> 5 Sally   good
#> 6 Sally   you
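Note that the tokens above come back in lower case ("hello", "good"). If the original capitalisation matters, a minimal sketch like the following should keep it, using the `to_lower` argument of `unnest_tokens`. The tibble is built by hand here, and the name `tokens` is introduced only for illustration.

```r
library(tibble)
library(tidytext)

# Same dialogue as above, typed in directly
dialogue <- tibble(
  speaker  = c("Mike", "Sally"),
  sentence = c("Hello how are you", "Good you")
)

# to_lower = FALSE keeps each token's original capitalisation
tokens <- unnest_tokens(dialogue, word, sentence, to_lower = FALSE)
tokens
```

For a word cloud, the default lowercasing is usually what you want, since it merges "You" and "you" into a single word.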