Extract characters from a string by a succession of colons
I am trying to extract some information from a variable in a data frame. I am using R 3.3.3.
The information is formatted as follows:
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
I would like to break each section out into a separate variable, like this:
w = "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
x = "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
y = "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
z = "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
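I have had some trouble extracting this information. Questions such as this and this were very helpful. From them I gathered that some form of stringr/gsub can be used to extract the information, but I cannot figure out how to specify the range in a gsub statement.
I have been able to figure out how to pull out the first part: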
> test4 <- gsub("(.*{1})(:.*)", "\\1", t)
This gives:
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC"
My overall problem is that the result runs all the way to the last colon. It would be nice if I did not have to clean up the "DOMINICAN REPUBLIC" part at the end of the string.
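To summarize:
1. How can I extract characters from a string by successive colons? (1st to 2nd colon, 2nd to 3rd, and so on.)
2. Is there a way to include the word preceding the colon as well?
Any information or guidance would be greatly appreciated.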
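For reference, the gsub attempt above runs to the last colon because `.*` is greedy: it takes as much text as it can while still leaving a colon for the second group. The sketch below shows this on a hypothetical toy string (`txt` is not the original data) and contrasts it with a lazy `.*?`. It uses `perl = TRUE` for both calls so the quantifiers behave identically.

```r
# Hypothetical toy string with the same "label: text." layout
txt <- "A: one. B: two. C: three."

# Greedy: the first group grows up to the LAST colon
greedy <- sub("(.*)(:.*)", "\\1", txt, perl = TRUE)

# Lazy: the first group stops at the FIRST colon
lazy <- sub("(.*?)(:.*)", "\\1", txt, perl = TRUE)

print(greedy)  # "A: one. B: two. C"
print(lazy)    # "A"
```

Neither form returns every section on its own, which is why the answers below split the string instead.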
How about the following in base R?
# Your sample string
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region.";
# Get the start position of each regexp match, and the distance to the next one
matches <- data.frame(
  idx = unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t)),
  len = c(diff(unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t))), nchar(t))
)
# Get substrings based on positions; apply() returns a character vector here
lst <- apply(matches, 1, function(x) {
  trimws(substr(t, x[1], sum(x) - 1))
})
lst
#[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
#[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
#[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
#[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
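Note: the regexp for matching the countries is a bit awkward, because your sample contains all-caps multi-word countries (DOMINCAN REPUBLIC), all-caps single-word countries (e.g. GUAM) and "first-letter-caps" countries (China).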
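The same position-based idea can be sketched more compactly by calling gregexpr once and letting the vectorised `substring` do the slicing. This is only an illustration on a hypothetical shortened string; the names `txt`, `starts`, `ends` and `parts` are assumptions, not part of the original answer.

```r
# Hypothetical shortened input with the same three label styles
txt <- "China: Big country. GUAM: Small island. It has beaches. DOMINCAN REPUBLIC: Caribbean nation."

# Start position of every "label:" match (attributes dropped)
starts <- as.integer(gregexpr("([A-Z]*\\s*[A-Z]+:|\\w+:)", txt)[[1]])

# Each section ends just before the next label starts; the last one ends with the string
ends <- c(starts[-1] - 1, nchar(txt))

# Slice all sections at once and strip surrounding whitespace
parts <- trimws(substring(txt, starts, ends))
print(parts)
```

Running the regexp once avoids repeating the pattern, and `substring` removes the need for the row-wise `apply`.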
You can use strsplit with an appropriate regex:
strsplit(t, "\\.\\s(?=[\\w\\s]+:)", perl=TRUE)
or
stringr::str_split(t, "\\.\\s(?=[\\w\\s]+:)")
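Notes:
- \\.\\s matches a literal dot followed by a whitespace character.
- (?=[\\w\\s]+:) is a positive lookahead that requires one or more word or whitespace characters followed by a colon.
- \\.\\s(?=[\\w\\s]+:) therefore matches a dot and a space only if they are immediately followed by one or more word or whitespace characters and then a colon. That is the end of each section.
- Since I am using the regex in strsplit, the string is split on whatever the regex matches. This results in a split at the end of each section.
- perl=TRUE is needed to enable lookaheads/lookbehinds.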
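To make the lookahead behaviour concrete, here is a small sketch on a hypothetical toy string (`txt`, `dropped` and `kept` are illustrative names). It also shows a variant, not part of the answer above, that adds a lookbehind so only the whitespace is consumed and each piece keeps its trailing period.

```r
# Hypothetical toy input: "label: text" sections separated by ". "
txt <- "A: one. Still one. B C: two. D: three."

# Split on ". " only where a label and a colon follow; the dot is consumed
dropped <- strsplit(txt, "\\.\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]

# Variant: split on the whitespace alone, so each piece keeps its period
kept <- strsplit(txt, "(?<=\\.)\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]

print(dropped)
print(kept)
```

Note that ". Still one" is not a split point in either case, because the text after it reaches another dot before it reaches a colon.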
Result:
[[1]]
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"
[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"
[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
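For the second part of the question (getting the word before the colon as well), one possible follow-up is to separate each piece into a label and a description. A minimal sketch, assuming the pieces are already in a character vector; `pieces`, `country`, `description` and `sections` are hypothetical names and the sentences are shortened for illustration:

```r
# Hypothetical shortened pieces, as they would come out of the split
pieces <- c(
  "China: Is a country in East Asia",
  "GUAM: Is a territory in Micronesia",
  "DOMINCAN REPUBLIC: Is a country on Hispaniola."
)

# Everything before the first colon is the label
country <- sub(":.*$", "", pieces)

# Everything after the first colon (and any whitespace) is the description
description <- sub("^[^:]*:\\s*", "", pieces)

sections <- data.frame(country, description, stringsAsFactors = FALSE)
print(sections)
```

Keeping the two parts in a data frame avoids creating one variable per section by hand.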