Is it possible to get R to identify countries in a dataframe?
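This is what my dataset currently looks like. I would like to add a column containing the country names that appear in the corresponding entry of the 'paragraph' column, but I don't even know where to start. Should I upload a list of all country names and then use a matching function?
Any suggestions for a more efficient approach would be much appreciated! Thanks.
The output of dput(head(dataset, 20)) is as follows: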
structure(list(category = c("State Ownership and Privatization;...row.names = c(NA, 20L), class = "data.frame")
Use the `countrycode` package.
Toy data:
df <- data.frame(entry_number = 1:5,
                 text = c("a few paragraphs that might contain the country name congo or democratic republic of congo",
                          "More text that might contain myanmar or burma, as well as thailand",
                          "sentences that do not contain a country name can be returned as NA",
                          "some variant of U.S or the united states",
                          "something with an accent samóoa"))
Here is one way to match the country names into a separate column:
library(dplyr)
library(stringr)
#install.packages("countrycode")
library(countrycode)

all_country <- countryname_dict %>%
  # keep only names that contain at least one ASCII letter
  # (drops names written entirely in other scripts):
  filter(grepl('[A-Za-z]', country.name.alt)) %>%
  # extract column `country.name.alt` as an atomic vector:
  pull(country.name.alt) %>%
  # change to lower-case:
  tolower()

# define alternation pattern of all country names:
pattern <- str_c(all_country, collapse = '|') # A huge alternation pattern!

df %>%
  # extract country name matches (returns a list column)
  mutate(country = str_extract_all(tolower(text), pattern))
entry_number text
1 1 a few paragraphs that might contain the country name congo or democratic republic of congo
2 2 More text that might contain myanmar or burma, as well as thailand
3 3 sentences that do not contain a country name can be returned as NA
4 4 some variant of U.S or the united states
5 5 something with an accent samóoa
country
1 congo, democratic republic of congo
2 myanma, burma, thailand
3
4 united states
5 samóoa
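
Two things in the output above may be worth tightening. Row 2 shows `myanma` and not `myanmar`, which is consistent with a shorter alternative in the pattern winning before the longer one is tried, and row 3 comes back empty instead of `NA`. The following is only a sketch of one possible refinement, not part of the original answer: the short `all_country` vector is a hypothetical stand-in for the names pulled from `countryname_dict`, and `pattern_wb` is an illustrative name. It sorts names longest-first, wraps the alternation in `\\b` word boundaries, and collapses each set of matches into a single string.

```r
library(stringr)

# Hypothetical stand-in for the lower-cased names pulled from countryname_dict
all_country <- c("myanma", "myanmar", "burma", "thailand", "congo",
                 "democratic republic of congo", "united states")

# Longest names first, so "myanmar" is tried before "myanma"
sorted <- all_country[order(nchar(all_country), decreasing = TRUE)]

# \\b requires every match to start and end at a word boundary
pattern_wb <- str_c("\\b(?:", str_c(sorted, collapse = "|"), ")\\b")

text <- c("More text that might contain myanmar or burma, as well as thailand",
          "sentences that do not contain a country name can be returned as NA",
          "the democratic republic of congo and the congolese franc")

matches <- str_extract_all(tolower(text), pattern_wb)

# Collapse each match vector into one string; rows without a match become NA
country <- vapply(
  matches,
  function(m) if (length(m) == 0) NA_character_ else str_c(unique(m), collapse = ", "),
  character(1)
)

print(country)
```

With the real dictionary, names that contain regex metacharacters (a dot or parentheses, for instance) would presumably need escaping before being pasted into the pattern; that step is left out here.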