Extract characters from a string by a succession of colons

Question

I am trying to extract some information from a variable in a data frame. I am using R 3.3.3.

The information is formatted as follows:

t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

I would like to break each section out into a separate variable, like this:

w = "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."

x = "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."

y = "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."

z = "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

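I have run into some difficulty extracting this information. Questions such as this and this were very helpful. From them I gathered that some form of stringr/gsub could be used to extract the information, but I can't figure out how to specify a range in a gsub statement.

I have been able to figure out how to pull out the first part:

> test4 <- gsub("(.*{1})(:.*)", "\\1", t)

This gives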

[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC"

My overall problem is that the result runs all the way to the last colon, so it still ends with the label of the final section. It would be nice if I did not have to clean up the "DOMINICAN REPUBLIC" part at the end of the string.

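To summarize:

1. How can I extract characters from a string by a succession of colons? (1st to 2nd colon, 2nd to 3rd, etc.)

2. Is there a way to get the word(s) preceding each colon as well?

Any information or guidance would be greatly appreciated.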

Answer 1

How about the following in base R?

# Your sample string
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region.";

# Get the start position of each "LABEL:" match.
# Backslashes must be doubled inside an R string literal.
idx <- unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t))
matches <- data.frame(
    idx = idx,
    len = c(diff(idx), nchar(t))  # distance to the next match; the last entry runs to the end
)

# Get substrings based on positions and store them in a character vector
lst <- apply(matches, 1, function(x) {
    trimws(substr(t, x[1], sum(x) - 1))
})
lst

#[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
#[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
#[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
#[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

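Note: the regular expression for matching the countries is a bit awkward, because your sample contains all-caps multi-word countries (DOMINCAN REPUBLIC), all-caps single-word countries (e.g. GUAM) and "first-letter-caps" countries (China).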
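To see what that alternation actually picks up, here is a minimal sketch that pulls out only the labels. The short string `s` is a made-up stand-in for your sample (an assumption, used for brevity), and `labels` is just an illustrative name.

```r
# Shortened stand-in for the sample string (hypothetical, for brevity)
s <- "China: A. GUAM: B. DOMINCAN REPUBLIC: C."

# Same alternation as above: all-caps labels (optionally two words),
# otherwise any single word followed by a colon
pattern <- "([A-Z]*\\s*[A-Z]+:|\\w+:)"

# regmatches() returns the matched text; trimws() drops the leading
# space that \\s* can swallow in front of an all-caps label
labels <- trimws(regmatches(s, gregexpr(pattern, s))[[1]])
print(labels)
```

Each of the three casing styles should come back as one label. A label with three or more all-caps words would probably need a different pattern.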

Answer 2

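You can use strsplit with a suitable regular expression:

strsplit(t, "\\.\\s(?=[\\w\\s]+:)", perl=TRUE)

or

stringr::str_split(t, "\\.\\s(?=[\\w\\s]+:)")

Notes:

\.\s matches a literal dot and a whitespace character. (Inside an R string literal each backslash has to be doubled, hence "\\.\\s".)
(?=[\w\s]+:) is a positive lookahead that matches one or more word characters or spaces followed by a colon.
\.\s(?=[\w\s]+:) therefore matches a dot and a space only when they are immediately followed by one or more word characters or spaces and then a colon. That is the end of each paragraph.
Because I use the regex in strsplit, the string is split on whatever the regex matches. This results in a split at the end of each paragraph.
perl=TRUE is required to enable lookaheads/lookbehinds.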
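A small sketch of the lookahead at work, on a made-up string `s` (an assumption, not your data): the period after "first" is followed by another period before any colon appears, so it is left alone, while the period in front of "B C:" triggers a split.

```r
# Hypothetical two-section string; the first section contains an inner period
s <- "A: first. Still first. B C: second."

# Split on ". " only when a run of word characters/spaces and a colon follows
parts <- strsplit(s, "\\.\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]
print(parts)
# [1] "A: first. Still first" "B C: second."
```

Note that the matched ". " is consumed by the split, which is why every piece except the last loses its final period.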

Result:

[[1]]
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"                                         
[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"                                                                                                         
[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
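If you also want the word(s) in front of each colon, one possible follow-up is to split every piece at its first colon. This is only a sketch: `parts` is a shortened, made-up stand-in for the pieces above, and `label`, `description` and `sections` are illustrative names.

```r
# Shortened stand-in for the pieces returned by the split (hypothetical)
parts <- c("China: A country", "GUAM: A territory", "DOMINCAN REPUBLIC: A country.")

# Put back the final period that the split consumed, where it is missing
parts <- ifelse(grepl("\\.$", parts), parts, paste0(parts, "."))

# Everything before the first colon is the label, the rest is the description
label <- sub(":.*$", "", parts)
description <- trimws(sub("^[^:]*:", "", parts))

# A named character vector, so each section can be looked up by country
sections <- setNames(description, label)
print(sections)
```

With this, `sections[["GUAM"]]` gives the GUAM text, which may be handier than four separate variables.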
