List of common first names for text analysis in R?

Question

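When analyzing text, it can be useful to identify people's names in the text data.

Pre-packaged objects in tidytext include:

English negators, modals, and adverbs (nma_words)
Parts of speech (parts_of_speech)
Sentiments (sentiments), and
Stop words (see: ?stop_words)

Is there a similar object in R (or in an accessible format elsewhere) containing a list of canonical names?

For reference, here are the existing data.frames that come with tidytext: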

nma_words
# # A tibble: 44 x 2
#    word      modifier
#    <chr>     <chr>   
#  1 cannot    negator 
#  2 could not negator 
#  3 did not   negator 
#  4 does not  negator 
#  5 had no    negator 
#  6 have no   negator 
#  7 may not   negator 
#  8 never     negator 
#  9 no        negator 
# 10 not       negator 
# # … with 34 more rows


parts_of_speech
# # A tibble: 208,259 x 2
#    word    pos      
#    <chr>   <chr>    
#  1 3-d     Adjective
#  2 3-d     Noun     
#  3 4-f     Noun     
#  4 4-h'er  Noun     
#  5 4-h     Adjective
#  6 a'      Adjective
#  7 a-1     Noun     
#  8 a-axis  Noun     
#  9 a-bomb  Noun     
# 10 a-frame Noun     
# # … with 208,249 more rows


sentiments
# # A tibble: 6,786 x 2
#    word        sentiment
#    <chr>       <chr>    
#  1 2-faces     negative 
#  2 abnormal    negative 
#  3 abolish     negative 
#  4 abominable  negative 
#  5 abominably  negative 
#  6 abominate   negative 
#  7 abomination negative 
#  8 abort       negative 
#  9 aborted     negative 
# 10 aborts      negative 
# # … with 6,776 more rows


stop_words
# # A tibble: 1,149 x 2
#    word        lexicon
#    <chr>       <chr>  
#  1 a           SMART  
#  2 a's         SMART  
#  3 able        SMART  
#  4 about       SMART  
#  5 above       SMART  
#  6 according   SMART  
#  7 accordingly SMART  
#  8 across      SMART  
#  9 actually    SMART  
# 10 after       SMART  
# # … with 1,139 more rows

Answer 1

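Datasets like these are complicated and must be used with care. One source for this kind of data is the genderdata package, which contains several datasets of names, a few of which come from the U.S. Social Security Administration.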

library(genderdata)

head(ssa_national)
#>    name year female male
#> 1 aaban 2007      0    5
#> 2 aaban 2009      0    6
#> 3 aaban 2010      0    9
#> 4 aaban 2011      0   11
#> 5 aaban 2012      0   11
#> 6 aabha 2011      7    0

Created on 2020-04-27 by the reprex package (v0.3.0)
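
As a rough illustration of how a table shaped like ssa_national could be turned into a name lexicon and matched against tokens, here is a minimal base R sketch. It does not load genderdata: names_raw is a hypothetical stand-in with the same four columns (name, year, female, male), its counts are made up, and the min_count threshold is an arbitrary assumption rather than a recommended value.

```r
# Hypothetical stand-in with the same columns as ssa_national
# (name, year, female, male); the counts are invented for illustration.
names_raw <- data.frame(
  name   = c("mary", "mary", "john", "will", "aaban"),
  year   = c(2010, 2011, 2010, 2010, 2010),
  female = c(2800, 2700, 10, 0, 0),
  male   = c(5, 6, 11500, 300, 9),
  stringsAsFactors = FALSE
)

# Collapse to one row per name, summing counts across years and sexes.
names_raw$n <- names_raw$female + names_raw$male
name_lexicon <- aggregate(n ~ name, data = names_raw, FUN = sum)

# Keep only reasonably common names (assumed, arbitrary threshold).
min_count <- 100
name_lexicon <- name_lexicon[name_lexicon$n >= min_count, ]

# Split a sentence into lower-case word tokens (a crude tokeniser).
text <- "Mary said that John will call Aaban tomorrow"
tokens <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
tokens <- tokens[nzchar(tokens)]

# Flag the tokens that appear in the name lexicon.
matches <- data.frame(
  word    = tokens,
  is_name = tokens %in% name_lexicon$name,
  stringsAsFactors = FALSE
)
matches
```

In this toy example "will" is flagged as a name even though it is a verb in the sentence, and "aaban" is missed because it falls below the threshold. Both are the kind of error that makes such lists risky to apply blindly. With the real package, names_raw would be replaced by ssa_national.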

