Using DocumentTermMatrix on a Vector of First and Last Names
My data frame (df) has a column like this:
> people = df$people
> people[1:3]
[1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
[2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
[3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
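The column contains 4k+ unique first/last/nick names, stored in each row as a comma-separated list of full names, as shown above. I would like to create a DocumentTermMatrix for this column in which matches are made on full names and only the most frequently occurring names are used as columns. I tried the following code: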
> people_list = strsplit(people, ", ")
> corp = Corpus(VectorSource(people_list))
> dtm = DocumentTermMatrix(corp, people_dict)
where people_dict is a list of the most frequently occurring people in people_list (about 150 full names), like this:
> people_dict[1:3]
[[1]]
[1] "Christian Slater"
[[2]]
[1] "Tara Reid"
[[3]]
[1] "Stephen Dorff"
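However, the DocumentTermMatrix function does not seem to use people_dict at all, because I end up with far more columns than there are entries in people_dict. I also think DocumentTermMatrix is splitting each name string into several strings. For example, "Danny Devito" becomes separate columns for "Danny" and "Devito".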
> inspect(actors_dtm[1:5,1:10])
<<DocumentTermMatrix (documents: 5, terms: 10)>>
Non-/sparse entries: 0/50
Sparsity : 100%
Maximal term length: 9
Weighting : term frequency (tf)
Terms
Docs 'g. 'jojo' 'ole' 'piolin' 'rampage' 'spank' 'stevvi' a.d. a.j. aaliyah
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
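I have read all the tm documentation I could find and spent hours searching Stack Overflow for a solution. Please help!
The default tokenizer splits the text into individual words. You need to supply a custom tokenizer function: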
commasplit_tokenizer <- function(x)
unlist(strsplit(as.character(x), ", "))
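A quick way to check that the tokenizer keeps full names together is to call it on a plain string before tm is involved. This is a minimal sketch in base R only; the sample string is one row of the data from the question.

```r
# Custom tokenizer from the answer: split on ", " instead of on whitespace.
commasplit_tokenizer <- function(x)
  unlist(strsplit(as.character(x), ", "))

# One row of the example data, passed as an ordinary character string.
tokens <- commasplit_tokenizer("Ice Cube, Nia Long, Aleisha Allen, Philip Bolden")
print(tokens)
```

Each element of the result is a full name, which is the unit the dictionary entries are compared against.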
Note that the actors are not split apart before the corpus is created.
people <- character(3)
people[1] <- "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
people[2] <- "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
people[3] <- "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")
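The dictionary above is typed by hand for the example. For the real column it would presumably be derived from the data, so here is a rough sketch of one way to do that in base R. The cutoff `top_n` and the variable names `all_names` and `name_counts` are assumptions for illustration; the question only says "about 150" names.

```r
people <- c(
  "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner",
  "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden",
  "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
)

# Split every row into full names and pool them into one character vector.
all_names <- unlist(strsplit(people, ", ", fixed = TRUE))

# Count occurrences per full name, most frequent first.
name_counts <- sort(table(all_names), decreasing = TRUE)

# Assumed cutoff; keep fewer names if the data has fewer unique ones.
top_n <- 150
keep <- seq_len(min(top_n, length(name_counts)))

# A plain character vector, not a list, so it can be used as a dictionary.
people_dict <- names(name_counts)[keep]
```

Note that this yields a character vector, whereas the people_dict printed in the question is a list; calling unlist() on that list would give the same kind of object.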
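The control options did not work with Corpus, so I used VCorpus:
library(tm)
corp = VCorpus(VectorSource(people))
# tokenize on ", ", keep only the dictionary names, preserve case
dtm = DocumentTermMatrix(corp, control = list(
  tokenize = commasplit_tokenizer, dictionary = people_dict, tolower = FALSE))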
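All options are passed inside the control list, including:
- tokenize: the tokenizer function
- dictionary: the names to keep as columns
- tolower = FALSE: keeps the capitalisation, so the names still match the dictionary
In the question, people_dict was passed as the second argument, which is control itself, so it was never treated as a dictionary.
Result: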
as.matrix(dtm)
    Terms
Docs Nia Long Stephen Dorff Uma Thurman
   1        0             1           0
   2        1             0           0
   3        0             0           1
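As a sanity check that does not depend on tm, the same kind of document-by-name count matrix can be built in base R. This is a different technique from DocumentTermMatrix (plain strsplit plus table on a factor), shown only as a sketch for cross-checking; the name `counts` is an assumption, and the columns follow the order of people_dict rather than whatever order tm uses.

```r
people <- c(
  "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner",
  "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden",
  "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
)
people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")

# One character vector of full names per row.
tokens <- strsplit(people, ", ", fixed = TRUE)

# For each row, count how often each dictionary name occurs.
# factor(..., levels = people_dict) drops every name not in the dictionary.
counts <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = people_dict))),
  integer(length(people_dict))
))
dimnames(counts) <- list(Docs = as.character(seq_along(people)),
                         Terms = people_dict)
print(counts)
```

If the numbers here disagree with as.matrix(dtm), the usual suspects are a misspelled dictionary entry or a lowercased corpus.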
Hope this helps.