Finding the cause of an unwanted deletion within an lapply function
I read a .txt file into R as follows: Election_Parties <- readr::read_lines("Election_Parties.txt")
The file contains the following text: pastebin link.
The text looks roughly like this (please rely on the actual file!):
BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento
Nacionalista Revolucionario [MNR])
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])
COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas
de Colombia)
I would like to have all the information about a party on a single line, no matter how long it is.
Desired output:
BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento Nacionalista Revolucionario [MNR])
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])
COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colombia)
I have a solution by @JBGruber that almost entirely works; it can be found here:
lines <- readr::read_lines("https://pastebin.com/raw/jSrvTa7G")
head(lines)
entries <- split(lines, cumsum(grepl("^$|^ $", lines)))
library(stringr)
library(dplyr)
df <- lapply(entries, function(entry) {
  entry <- entry[!grepl("^$|^ $", entry)] # remove empty elements
  header <- entry[1] # first non-empty element is the header
  entry <- tail(entry, -1) # remove header from entry
  desc <- str_extract(entry, "^P\\d+-") # extract description
  for (l in which(is.na(desc))) { # collapse lines that go over 2 elements
    entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
  }
  entry <- entry[!is.na(desc)]
  desc <- desc[!is.na(desc)]
  # turn into nice format
  df <- tibble::tibble(
    header,
    desc,
    entry
  )
  df$entry <- str_replace_all(df$entry, fixed(df$desc), "") # remove description from entry
  return(df)
}) %>%
  bind_rows() # turn list into one data.frame
But it somehow deletes information. For example, this information:
P1-Movement for a Prosperous Czechoslovakia (Hnutie za prosperujúce Česko + Slovensko
[HZPČS])
P2-Social Democracy (Sociálna demokracia [SD])
P3-Association for Workers in Slovakia (Združenie robotníkov Slovenska [ZRS])
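I don't understand the code well enough to see where this deletion might happen, or how to check step by step where it happens (since everything happens inside lapply). Can anyone help?
Note that a solution using data.table is equally welcome.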
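EDIT:
The reason the answer no longer works correctly is that the file has changed slightly. The original answer relied on the entries being separated by empty lines. Those lines are gone, but the entries are now separated by lines containing only "P00-". We can use that as the separator.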
lines <- readr::read_lines("https://pastebin.com/raw/KKu9FmF6")
entries <- split(lines, cumsum(grepl("P00-$", lines)))
library(stringr)
library(dplyr)
df <- lapply(entries, function(entry) {
  entry <- entry[!grepl("P00-$", entry)] # remove the "P00-" separator lines
  header <- entry[1] # first remaining element is the header
  entry <- tail(entry, -1) # remove header from entry
  desc <- str_extract(entry, "^P\\d+-") # extract description
  for (l in which(is.na(desc))) { # collapse lines that go over 2 elements
    entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
  }
  entry <- entry[!is.na(desc)]
  desc <- desc[!is.na(desc)]
  # turn into nice format
  df <- tibble::tibble(
    header,
    desc,
    entry
  )
  df$entry <- str_replace_all(df$entry, fixed(df$desc), "") # remove description from entry
  return(df)
}) %>%
  bind_rows() # turn list into one data.frame
I checked whether the information you listed above is still missing, and it is not:
df %>%
  filter(str_detect(entry, "Movement for a Prosperous Czechoslovakia|Sociálna demokraci|Association for Workers in Slovakia"))
#> # A tibble: 3 x 3
#>   header      desc  entry
#>   <chr>       <chr> <chr>
#> 1 P00-SLOVAK… P1-   Movement for a Prosperous Czechoslovakia (Hnutie za prosper…
#> 2 P00-SLOVAK… P2-   Social Democracy (Sociálna demokracia [SD])
#> 3 P00-SLOVAK… P3-   Association for Workers in Slovakia (Združenie robotníkov S…
Created on 2019-12-16 by the reprex package (v0.3.0)
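I tried to make the answer as clear as possible, but I know it is often hard to understand someone else's code. One thing that always helps me is to run the solution line by line and check how the objects change. Since most of the important stuff is hidden inside the loop, you can simulate a single run of lapply by creating an example entry: entry <- entries[[1]]. Now you can run the lines inside lapply one at a time.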
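As a rough sketch of that approach, the snippet below steps through one entry by hand. The `lines` vector is made-up toy data that only mimics the layout described above (a "P00-" separator, a "P00-COUNTRY" header, then party lines). The "Example Party" entry is hypothetical and deliberately wrapped over three lines; none of this comes from the real file.

```r
library(stringr)

# Toy data (hypothetical), mimicking the layout of the file
lines <- c(
  "P00-",
  "P00-BOLIVIA",
  "P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento",
  "Nacionalista Revolucionario [MNR])",
  "P2-Example Party (Partido",
  "de",
  "Ejemplo [PE])",
  "P00-",
  "P00-COLOMBIA",
  "P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])"
)
entries <- split(lines, cumsum(grepl("P00-$", lines)))

# Simulate a single run of the function passed to lapply
entry <- entries[[1]]
entry <- entry[!grepl("P00-$", entry)]   # drop the separator line
header <- entry[1]
entry <- tail(entry, -1)
desc <- str_extract(entry, "^P\\d+-")
print(desc)                              # NA marks a continuation line

# Same loop as in the answer; inspect `entry` after each iteration if needed
for (l in which(is.na(desc))) {
  entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
}
entry <- entry[!is.na(desc)]
print(entry)
```

In this toy run the three-line entry ends up as "P2-Example Party (Partido de": the loop appends each continuation line only to the line directly before it, so the third fragment is dropped along with the other `NA` rows. Whether this is what happens with the real file is an assumption that would need checking the same way. Iterating over `rev(which(is.na(desc)))` instead would merge the fragments from the bottom up.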
A pure base R alternative to @JBGruber's answer:
txt <- readLines("https://pastebin.com/raw/KKu9FmF6")
txtgrps <- split(txt, cumsum(grepl("P00-$", txt)))
l <- lapply(txtgrps, function(grp) {
  grp <- tail(grp, -1) # drop the "P00-" separator line
  country <- gsub("^P\\d+-", "", grp[1]) # header line without its "P00-" prefix
  grp <- tail(grp, -1) # drop the header line
  # start a new group at every line that begins with a party code and paste each group together
  grp <- tapply(grp, cumsum(grepl("^P\\d+-", grp)), paste, collapse = " ")
  code <- sub("(P\\d+)-.*", "\\1", grp)
  party <- gsub("^P\\d+-", "", grp)
  df <- data.frame(country, code, party)
  return(df)
})
df <- do.call(rbind, l)
which gives:
> head(df)
country code party
1.1 ALBANIA P1 Democratic Alliance Party (Partia Aleanca Democratike [AD])
1.2 ALBANIA P2 National Unity Party (Partia Uniteti Kombëtar [PUK])
1.3 ALBANIA P3 Social Spectrum Parties-Party of National Unity (Partitë e Spektrit Social-Partia e Unitetit Kombëtar [PSHS-PUK])
1.4 ALBANIA P4 Alliance Party for Solidarity and Welfare (Partia Aleanca për Mirëqenie dhe Solidaritet [AMS])
1.5 ALBANIA P5 Albanian Democratic Union-Alliance for Freedom, Justice and Welfare (Partia Bashkimi Demokrat Shqiptar-Aleanca për Liri, Drejtësi dhe Mirëqenie [BDSH])
1.6 ALBANIA P6 Liberal Democrat Party (Partia Bashkimi Liberal Demokrat [BLD])
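The key step in this version is the `tapply()` call. `cumsum(grepl(...))` gives every line that starts with a party code a new group number, continuation lines inherit the number of the line before them, and `paste(collapse = " ")` joins each group. Here is a minimal sketch on made-up lines (the "Example Party" entry is hypothetical, not from the file):

```r
# Toy party lines for one country; the second entry wraps over three lines
grp <- c(
  "P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento",
  "Nacionalista Revolucionario [MNR])",
  "P2-Example Party (Partido",
  "de",
  "Ejemplo [PE])"
)

# TRUE for lines that start a new party; the running sum is the group id
id <- cumsum(grepl("^P\\d+-", grp))
print(id)

# Paste all lines that share a group id into one string
merged <- as.vector(tapply(grp, id, paste, collapse = " "))

code <- sub("(P\\d+)-.*", "\\1", merged)   # keep only the code, e.g. "P1"
party <- sub("^P\\d+-", "", merged)        # strip the code prefix
print(data.frame(code, party))
```

Because grouping is driven by where a party code starts rather than by looking one line back, an entry that wraps over any number of lines is still joined in full.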
For the new input, you can adapt the solution to:
txt <- readLines("https://pastebin.com/raw/FTV3Gded")
txtgrps <- split(txt, cumsum(grepl("^$|^ $", txt)))
# based on:
l <- lapply(txtgrps, function(grp) {
  grp <- tail(grp, -1) # drop the empty separator line
  country <- grp[1]
  grp <- tail(grp, -1) # drop the header line
  grp <- tapply(grp, cumsum(grepl("^P\\d+", grp)), paste, collapse = " ")
  code <- sub("(P\\d+).*", "\\1", grp)
  party <- substring(sub("^P\\d+", "", grp), 2) # strip the code, then the one character after it
  df <- data.frame(country, code, party)
  return(df)
})
df <- do.call(rbind, l)