Extract characters from a string by a succession of colons

I am trying to extract some information from a variable in a data frame. I am using R 3.3.3.

The information is formatted as follows:

t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

I would like to break each section out into its own variable, like this:

w = "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."

x = "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."

y = "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."

z = "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

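I have had some difficulty extracting this information. Questions such as this and this were very helpful. From them I gathered that some form of stringr/gsub can be used to extract this information, but I can't figure out how to specify the range in the gsub statement.

I have been able to figure out how to pull out the first part: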

test4 <- gsub("(.*{1})(:.*)", "\\1", t)

This gives:

[1] "CHINA: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC"

My overall problem is that this returns everything up to the last colon rather than a single section. It would also be nice if I didn't have to clean up the "DOMINCAN REPUBLIC" part at the end of the string.

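To summarize:

1. How can I extract characters from a string by successive colons? (From the 1st to the 2nd colon, the 2nd to the 3rd, and so on.)

2. Is there a way to include the words preceding the colon as well?

Any information or guidance would be greatly appreciated.

How about the following in base R?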

# Your sample string
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region.";

# Get the start position of each "NAME:" header
# (backslashes must be doubled inside an R string literal)
idx <- unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t))
matches <- data.frame(
    idx = idx,
    len = c(diff(idx), nchar(t))  # distance to the next header; substr() clamps the last one
)

# Get substrings based on positions and collect them
lst <- apply(matches, 1, function(x) {
    trimws(substr(t, x[1], sum(x) - 1))
})
lst

#[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
#[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
#[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
#[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."

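Note: the regex for matching the countries is a bit awkward, because your sample contains all-caps multi-word countries (DOMINCAN REPUBLIC), all-caps single-word countries (e.g. GUAM), and "first-letter-caps" countries (China).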
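The position-based idea above can also be sketched more compactly with the vectorised substring(). This is only an illustration on a made-up toy string with a simplified header pattern ([A-Z]+:), not the sample data; the names s, starts, ends and parts are hypothetical.

```r
# Toy input (made up for illustration): all-caps single-word headers only
s <- "A: one. B: two. C: three."

# Start position of every "NAME:" header
starts <- unlist(gregexpr("[A-Z]+:", s))

# Each piece ends one character before the next header; the last one ends at the end of the string
ends <- c(starts[-1] - 1, nchar(s))

# substring() is vectorised over start/end, so no apply() is needed
parts <- trimws(substring(s, starts, ends))
print(parts)
```

The simplified pattern would need the same alternation as above to cope with multi-word or mixed-case headers.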

You can use strsplit with an appropriate regex:

strsplit(t, "\\.\\s(?=[\\w\\s]+:)", perl=TRUE)

stringr::str_split(t, "\\.\\s(?=[\\w\\s]+:)")

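Notes:

  1. \.\s matches a literal dot and a space. (Inside an R string literal each backslash is doubled, as in the calls above.)
  2. (?=[\w\s]+:) is a positive lookahead that matches one or more word characters or spaces followed by a colon.
  3. \.\s(?=[\w\s]+:) therefore matches a dot and a space only when they are immediately followed by one or more word characters or spaces and then a colon. That is the end of each section.
  4. Because the regex is used in strsplit, the string is split on whatever the regex matches, which results in a split at the end of each section.
  5. perl=TRUE is required to enable lookaheads/lookbehinds.

Result: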

[[1]]
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"                                         
[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"                                                                                                         
[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
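To address the second part of the question (getting at the words before the colon), one possible follow-up is to separate each piece into a header and a description and store them as a named vector. The sketch below uses a made-up toy string with the same layout rather than the full sample, and the names s, pieces, headers, bodies and sections are hypothetical; it assumes headers never contain a colon themselves.

```r
# Toy string with the same "NAME: text. NAME: text." layout (made up for illustration)
s <- "A: one. Two more. B C: three."

# Split on ". " only where word characters/spaces and then a colon follow
pieces <- strsplit(s, "\\.\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]

# Everything before the first colon is the header; the rest is the description
headers <- sub(":.*$", "", pieces)
bodies  <- trimws(sub("^[^:]*:", "", pieces))

# Named character vector: look up a section by its header
sections <- setNames(bodies, headers)
print(sections[["B C"]])
```

Note that ". Two more" does not trigger a split, because the text after it reaches another dot before any colon. A named vector may also be more convenient than separate variables such as w, x, y and z when the number of sections is not known in advance.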