Most commonly mentioned countries in a corpus; extracting country names from abstracts in R

I have a corpus of a few thousand documents, and I'm trying to find the most commonly mentioned countries in the abstracts.

The countrycode package seems to have a comprehensive list of country names that I can match against:

# country.name.alt shows multiple potential namings for 'Congo' (yay!):
install.packages("countrycode")
library(dplyr)  # provides filter()
countrycode::countryname_dict |> filter(grepl('congo', tolower(country.name.alt)))
# Also seems to work for ones like "China"/"People's Republic of China"

The data looks like this:

df <- data.frame(entry_number = 1:5,
                 text = c("a few paragraphs that might contain the country name congo or democratic republic of congo",
                          "More text that might contain myanmar or burma, as well as thailand",
                          "sentences that do not contain a country name can be returned as NA",
                          "some variant of U.S or the united states",
                          "something with an accent samóoa"))

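I want to reduce each entry in the "text" column so that it contains only a country name. Ideally something like this (note the repeated entry number):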

desired_df <- data.frame(entry_number = c(1, 2, 2, 3, 4, 5),
                     text = c("congo",
                              "myanmar",
                              "thailand",
                              NA,
                              "united states",
                              "samoa"))

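I've tried str_extract and various other approaches, all unsuccessful! The corpus is in English, but the international characters in countrycode::countryname_dict$country.name.alt do throw regex errors. countrycode::countryname_dict$country.name.alt contains all the alternatives that countrycode::countryname_dict$country.name.en doesn't...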

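I'm open to any approach (dplyr, data.table, ...) that answers my initial question of how many times each country is mentioned in the corpus. The only requirement is that it is as robust as possible to different potential country names, accents, and any other hidden pitfalls!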

Thanks, community!

P.S. I have reviewed the following questions, but had no success with my own example:

This seems to work for the sample data.

library(tidyverse)

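# Keep only alternative names that contain Latin letters, lower-cased
all_country <- countrycode::countryname_dict %>%
                  filter(grepl('[A-Za-z]', country.name.alt)) %>%
                  pull(country.name.alt) %>%
                  tolower()
# One big alternation: any of the names may match
pattern <- str_c(all_country, collapse = '|')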

df %>%
  mutate(country = str_extract_all(tolower(text), pattern)) %>%
  select(-text) %>%
  unnest(country, keep_empty = TRUE)

#  entry_number country                     
#         <int> <chr>                       
#1            1 congo                       
#2            1 democratic republic of congo
#3            2 myanma                      
#4            2 burma                       
#5            2 thailand                    
#6            3 NA                          
#7            4 united states               
#8            5 samóoa
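
To get from the extracted names to the original question (how often each country is mentioned), something along these lines might help. It is only a base R sketch under some assumptions: `lookup` is a tiny hand-made stand-in for countrycode::countryname_dict (same two column names, but the canonical names in it are made up for illustration), `escape_regex` is a hypothetical helper, and accents are not handled at all. It adds word boundaries so that "myanma" no longer matches inside "myanmar", escapes regex metacharacters such as the dot in "u.s", maps each alternative name to one canonical name, and counts each country at most once per entry.

```r
# Sample data from the question (first four entries only).
df <- data.frame(
  entry_number = 1:4,
  text = c("a few paragraphs that might contain the country name congo or democratic republic of congo",
           "More text that might contain myanmar or burma, as well as thailand",
           "sentences that do not contain a country name can be returned as NA",
           "some variant of U.S or the united states"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for countrycode::countryname_dict; the canonical
# names here are illustrative, not the ones the package uses.
lookup <- data.frame(
  country.name.en  = c("DR Congo", "Myanmar", "Myanmar", "Myanmar",
                       "Thailand", "United States", "United States"),
  country.name.alt = c("democratic republic of congo", "myanmar", "myanma", "burma",
                       "thailand", "united states", "u.s"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: backslash-escape regex metacharacters in literal names.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Longest names first, wrapped in word boundaries so partial words do not match.
alts    <- lookup$country.name.alt[order(-nchar(lookup$country.name.alt))]
pattern <- paste0("\\b(?:", paste(escape_regex(alts), collapse = "|"), ")\\b")

# All matches per entry (character(0) where nothing matches).
txt  <- tolower(df$text)
hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))

# Long format: one row per (entry, matched alternative name).
mentions <- data.frame(
  entry_number = rep(df$entry_number, lengths(hits)),
  alt          = unlist(hits),
  stringsAsFactors = FALSE
)

# Map alternatives to a canonical name, then count each country once per entry.
mentions$country <- lookup$country.name.en[match(mentions$alt, lookup$country.name.alt)]
mentions <- unique(mentions[, c("entry_number", "country")])
counts   <- sort(table(mentions$country), decreasing = TRUE)
print(counts)
```

Entries without any match simply drop out of `mentions` here; if the NA rows from the desired output are needed, they would have to be merged back in from `df$entry_number`. Swapping the real dictionary in for `lookup` should be possible in principle, but I have not checked how its alternative names behave with this pattern, so treat that as untested.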