Extracting values from a messy bulk of data
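I have a large amount of messy data that I want to extract information from. So far I have not found a convenient way to do it, and I am hoping you can help. My data looks like this: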
"\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\
n\r\nDates\r\nSeptember 25th 2016 To September 26th
2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited
States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited
States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"
What I want to get out of it is:
Channels -
Dates September 25th 2016 To September 26th 2016
Platform Idea
Country United States
Restricted Countries United States
Initial Price $0.0692
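I will need to do this for a large number of observations and then store each variable as a vector across all observations. So I do not really need to store the variable name (e.g. "Platform"), only the value ("Idea"). I assume I still need the variable name "Platform" as an identifier, though, because the position of a variable within the text changes from observation to observation (as does the number of variables, though only slightly).
I think the stringr package is a good way to go, but I have not found a convenient way to do this with it.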
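The following regular expression extracts the values you want. They are stored in columns 2-7 of the resulting matrix. The code works on a vector of inputs (each entry becomes a new row of the matrix).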
library(stringr)
input <- "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"
str_match(input, paste0("[[:space:]]*Channels[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Dates[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Platform[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Country[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Restricted Countries[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Initial Price[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*"))
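Edit: Sorry, I overlooked that the position of the variables in the text can change between inputs. In that case you cannot easily extract all variables at once with this approach. You can still extract them one at a time, though, by using the corresponding line of the regular expression above. If a variable is not present (like "Channels" in the example), that is not a problem: it will show up as NA.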
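Extracting the variables one at a time could look like the sketch below. It is written in base R so that it runs without extra packages, and `extract_field` is a hypothetical helper name, not a stringr function. One assumption goes beyond the text above: when a single line of the regular expression is used on its own, a keyword that is present but has no value would capture the next keyword as its value, so the sketch adds a negative lookahead that refuses to capture another keyword. The two short inputs are made up for illustration.

```r
keys <- c("Channels", "Dates", "Platform", "Country",
          "Restricted Countries", "Initial Price")

# Hypothetical helper: the value that follows `key` in each element of x,
# or NA when the keyword is absent or has no value.
extract_field <- function(key, x, all_keys) {
  stop_at <- paste(all_keys, collapse = "|")
  # keyword, then line breaks/tabs, then printable text that is not a keyword
  pattern <- paste0(key, "[[:cntrl:]]+(?!", stop_at, ")([[:print:]]+)")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(hits,
         function(h) if (length(h) == 2) h[2] else NA_character_,
         character(1))
}

# Made-up inputs: the keywords appear in a different order in each one
input <- c(
  "\r\nPlatform\r\nIdea\r\n\r\nCountry\r\nUnited States\r\n",
  "\r\nCountry\r\nCanada\r\n\r\nChannels\r\n\r\n\r\nPlatform\r\nWeb\r\n"
)

# One column per keyword, one row per input (a matrix because length(input) > 1)
result <- sapply(keys, extract_field, x = input, all_keys = keys)
result[, "Platform"]
```

Each column of `result` is then the per-variable vector across observations that the question asks for.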
A base R solution:
yourstring1 <- "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\
n\r\nDates\r\nSeptember 25th 2016 To September 26th
2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited
States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited
States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"
# make a placeholder (useful when manipulating strings for easier regex)
yourstring2 <- gsub("\r|\t|\nn|\n", "@", yourstring1, perl = TRUE) # note the double "nn": a newline character is added when the string is copied from here into R
# split on placeholder if it appears twice or more
yourstring2 <- unlist(strsplit(yourstring2, split = "@{2,}"))
# little cleaning needed
yourstring2 <- gsub(" @", " ", yourstring2)
# hard-coded fix that works for this particular example; if you have many strings
# with arbitrarily missing values, you may want to add a small condition for that
yourstring2[1:2] <- c(yourstring2[2], "-")
# prepare your columns by indexing the character vector
variables <- yourstring2[seq(from = 1, to = length(yourstring2), by = 2)]
values <- yourstring2[seq(from = 2, to = length(yourstring2), by = 2)]
# bind them to dataframe
df <- data.frame(variables, values)
The resulting df:
df
             variables                                     values
1             Channels                                          -
2                Dates September 25th 2016 To September 26th 2016
3             Platform                                       Idea
4              Country                              United States
5 Restricted Countries                              United States
6        Initial Price                                    $0.0692
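The "little condition" for arbitrarily missing values could be sketched as follows. This is not the code from the answer above but a variation on it: the string is split into tokens, and a keyword whose next token is another keyword (or nothing) gets NA. `parse_record` is a hypothetical helper name, the input is a shortened version of the example without "Country", and the sketch assumes that a value is never identical to one of the keywords.

```r
keys <- c("Channels", "Dates", "Platform", "Country",
          "Restricted Countries", "Initial Price")

# Hypothetical helper: named character vector of values for one raw string.
parse_record <- function(x, keys) {
  # split on runs of line breaks and tabs, then drop empty tokens
  tokens <- trimws(unlist(strsplit(x, "[\r\n\t]+")))
  tokens <- tokens[nzchar(tokens)]
  is_key <- tokens %in% keys
  # the token following each position (NA after the last one)
  nxt <- c(tokens[-1], NA_character_)
  nxt_is_key <- c(is_key[-1], TRUE)
  values <- nxt[is_key]
  # a keyword followed by another keyword (or by nothing) has no value
  values[nxt_is_key[is_key]] <- NA_character_
  names(values) <- tokens[is_key]
  values
}

x <- "\r\n\r\nChannels\r\n\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\tUnited States\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n"

rec <- parse_record(x, keys)
rec[keys]  # fixed keyword order; keywords absent from x come back as NA
```

Because the values are looked up by name, the order of the keywords in the raw string no longer matters.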
Edit: I only now read properly that the desired result may be a vector ordered by position rather than a data frame. Here is a two-line solution:
yourstring2 <- gsub("\r|\t|\nn|\n", "", yourstring1, perl = TRUE) # clean the original string (see yourstring1 above)
yourvector <- unlist(strsplit(yourstring2, split = "Channels|Dates|Platform|Country|Restricted Countries|Initial Price", perl = TRUE))[-1] # extract
The resulting vector:
> yourvector
[1] ""
[2] "September 25th 2016 To September 26th 2016"
[3] "Idea"
[4] "United States"
[5] "United States"
[6] "$0.0692"
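Since the question says that the order and number of variables can differ between observations, a purely positional vector may be hard to line up across inputs. A minimal sketch of one way around that, assuming the same keyword list, is to record which keywords were found and use them as names. The variable names and the short input are made up for illustration.

```r
key_pattern <- "Channels|Dates|Platform|Country|Restricted Countries|Initial Price"

# Made-up input: keywords in a different order, "Channels" last and empty
x <- "\r\nPlatform\r\nIdea\r\n\r\nCountry\r\nUnited States\r\n\r\nChannels\r\n\r\n"
clean <- gsub("[\r\n\t]+", "", x)

# keywords in the order they occur, and the text between them
found  <- unlist(regmatches(clean, gregexpr(key_pattern, clean)))
values <- unlist(strsplit(clean, key_pattern))[-1]

# strsplit() drops an empty piece at the very end, so pad with NA if needed
length(values) <- length(found)
values[values %in% ""] <- NA_character_
names(values) <- found

values[c("Platform", "Country", "Channels")]
```

Indexing by name instead of by position makes the result independent of where each keyword appeared in the text.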
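With a as your input string(s), the result is a data frame with one variable per keyword (keywords that are not used get missing values) and one row per input:
# collapse each run of line breaks (with surrounding tabs) into one "/"
a <- gsub("\t*(\r\n)+\t*", "/", a)
# drop the leading and trailing separator
a <- gsub("(^/|/$)", "", a)
# wrap every keyword in <...>; the backreference must be written "\\1" in R
a <- gsub("(Channels|Dates|Platform|Country|Restricted Countries|Initial Price)", "<\\1>", a)
# a keyword directly followed by another keyword has no value: insert an empty one
a <- gsub(">/<", ">//<", a)
b <- strsplit(a, "/")
c <- purrr::map(b,
                function(x) {
                  # reshape into key/value pairs, then spread them into one row
                  dim(x) <- c(2, length(x) / 2)
                  tidyr::spread(as.data.frame(t(x), stringsAsFactors = FALSE), V1, V2)
                })
replyr::replyr_bind_rows(c)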