List of common first names for text analysis in R?
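When analyzing text, it can be useful to identify people's names in the text data.
Objects that come pre-packaged with tidytext include:
- English negators, modals, and adverbs (nma_words)
- Parts of speech (parts_of_speech)
- Sentiments (sentiments), and
- Stop words (see: ?stop_words)
Is there an analogous object in R (or some other accessible format elsewhere) containing a canonical list of names?
For reference, here are the existing data frames that come with tidytext: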
library(tidytext)
nma_words
# # A tibble: 44 x 2
# word modifier
# <chr> <chr>
# 1 cannot negator
# 2 could not negator
# 3 did not negator
# 4 does not negator
# 5 had no negator
# 6 have no negator
# 7 may not negator
# 8 never negator
# 9 no negator
# 10 not negator
# # … with 34 more rows
parts_of_speech
# # A tibble: 208,259 x 2
# word pos
# <chr> <chr>
# 1 3-d Adjective
# 2 3-d Noun
# 3 4-f Noun
# 4 4-h'er Noun
# 5 4-h Adjective
# 6 a' Adjective
# 7 a-1 Noun
# 8 a-axis Noun
# 9 a-bomb Noun
# 10 a-frame Noun
# # … with 208,249 more rows
sentiments
# # A tibble: 6,786 x 2
# word sentiment
# <chr> <chr>
# 1 2-faces negative
# 2 abnormal negative
# 3 abolish negative
# 4 abominable negative
# 5 abominably negative
# 6 abominate negative
# 7 abomination negative
# 8 abort negative
# 9 aborted negative
# 10 aborts negative
# # … with 6,776 more rows
stop_words
# # A tibble: 1,149 x 2
# word lexicon
# <chr> <chr>
# 1 a SMART
# 2 a's SMART
# 3 able SMART
# 4 about SMART
# 5 above SMART
# 6 according SMART
# 7 accordingly SMART
# 8 across SMART
# 9 actually SMART
# 10 after SMART
# # … with 1,139 more rows
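To make the intended use concrete, here is a minimal sketch of how a names table shaped like stop_words could be joined against tokenized text. It assumes dplyr and tidytext are installed; first_names and docs are hypothetical toy objects made up for illustration, not objects shipped with any package.

```r
library(dplyr)
library(tidytext)

# Hypothetical names lexicon with one lower-case name per row,
# using the same "word" column as stop_words
first_names <- tibble(word = c("alice", "bob"))

# Toy document to tokenize
docs <- tibble(id = 1, text = "Alice met Bob in Paris.")

# One token per row; unnest_tokens() lower-cases tokens by default
tokens <- docs %>%
  unnest_tokens(word, text)

# Tokens that appear in the names lexicon
found <- tokens %>%
  semi_join(first_names, by = "word")

# Tokens left after dropping names, as is commonly done with stop_words
without_names <- tokens %>%
  anti_join(first_names, by = "word")
```

Because the tokens are lower-cased, names that double as ordinary words (for example "will" or "may") cannot be told apart from those words by a join like this.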
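Datasets like this are quite complicated and must be used with care. One source for this kind of data is the genderdata package, which contains several datasets of names, a few of which come from the US Social Security Administration.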
library(genderdata)
head(ssa_national)
#> name year female male
#> 1 aaban 2007 0 5
#> 2 aaban 2009 0 6
#> 3 aaban 2010 0 9
#> 4 aaban 2011 0 11
#> 5 aaban 2012 0 11
#> 6 aabha 2011 7 0
Created on 2020-04-27 by the reprex package (v0.3.0)
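As a rough illustration of turning name-by-year counts like these into a plain list of first names, the sketch below sums the counts per name and keeps the names above a cutoff. It uses only base R. ssa_sample is a stand-in built from the six rows printed above (with genderdata installed, ssa_national could take its place), and min_count is an arbitrary threshold chosen for illustration, not a recommended value.

```r
# Stand-in with the same columns as the head(ssa_national) output above
ssa_sample <- data.frame(
  name   = c("aaban", "aaban", "aaban", "aaban", "aaban", "aabha"),
  year   = c(2007, 2009, 2010, 2011, 2012, 2011),
  female = c(0, 0, 0, 0, 0, 7),
  male   = c(5, 6, 9, 11, 11, 0),
  stringsAsFactors = FALSE
)

# Combined count per row, then total per name across all years
ssa_sample$n <- ssa_sample$female + ssa_sample$male
name_counts <- aggregate(n ~ name, data = ssa_sample, FUN = sum)

# Keep names at or above an illustrative (hypothetical) cutoff,
# most frequent first
min_count <- 10
common_names <- name_counts[name_counts$n >= min_count, ]
common_names <- common_names[order(-common_names$n), ]
```

With the sample rows, only aaban (42 in total) passes the cutoff, while aabha (7) is dropped. On the full data, the cutoff controls how rare a name may be and still count as a name, which is one of the places where care is needed.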