Disappearing row after string split

I have a column of coordinates that I split with strsplit() after removing the unwanted characters with gsub(). Note that there are 3034 rows:

> head(bike_parking$Geom)
[1] "(37.7606289177, -122.410647009)" "(37.752476948, -122.410625009)" 
[3] "(37.7871729481, -122.402401009)" "(37.7776039475, -122.422764009)"
[5] "(37.7658325695, -122.46649784)"  "(37.7693399479, -122.432820008)"

> length(bike_parking$Geom)
[1] 3034

> sum(is.na(bike_parking$Geom))
[1] 0

For some reason, after I run

dat <- data.frame(do.call(rbind, strsplit(as.vector(gsub("[()]", "", bike_parking$Geom)), split = ",")))

I am left with 3033 rows. How can this happen, and what steps should I take to track down the problem?

> head(dat)
             X1              X2
1 37.7606289177  -122.410647009
2  37.752476948  -122.410625009
3 37.7871729481  -122.402401009
4 37.7776039475  -122.422764009
5 37.7658325695   -122.46649784
6 37.7693399479  -122.432820008

> nrow(dat)
[1] 3033

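Your strings do not seem to have the same structure everywhere. To split them correctly, you need to know what structure they all share. From the comments below the question, I gather that some strings may not contain the comma that separates the coordinates. You could remove all commas and split the strings at the space instead. I will post a base R solution and one that uses the stringr package.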
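To find the rows that break the pattern, one approach is to count how many pieces each string splits into. The sketch below uses a small made-up vector (`geom` and its values are assumptions, not your data). It also shows one plausible way a row can vanish: strsplit() returns character(0) for an empty string, and rbind() ignores zero-length vectors. Whether that is what happens in your data is something you would need to check.

```r
# Hypothetical sample: one well-formed string, one without a comma, one empty
geom <- c("(37.7606289177, -122.410647009)",
          "(37.752476948 -122.410625009)",
          "")

pieces <- strsplit(gsub("[()]", "", geom), split = ",")

# Number of pieces per element; a well-formed string gives exactly 2
n_pieces <- lengths(pieces)

# Indices and raw values of the strings that do not split into two parts
bad <- which(n_pieces != 2)
geom[bad]

# rbind() ignores zero-length vectors, so the empty string costs a row here
nrow(do.call(rbind, pieces))
```

Running the same check on bike_parking$Geom (replace `geom` with that column) should point to the strings worth inspecting.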

Option 1: base R. We can use gsub() to remove the parentheses and commas from your strings, then use strsplit() to split them at the space. The result will be:

splitted <- strsplit(gsub("[(),]", "", bike_parking$Geom), " ")
# [[1]]
# [1] "37.7606289177"  "-122.410647009"
# [[2]]
# [1] "37.752476948"   "-122.410625009"
# [[3]]
# [1] "37.7871729481"  "-122.402401009"
# [[4]]
# [1] "37.7776039475"  "-122.422764009"
# [[5]]
# [1] "37.7658325695" "-122.46649784"
# [[6]]
# [1] "37.7693399479"  "-122.432820008"

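We have to reorganize these results a little so that you end up with two columns (the call below returns a character matrix, which as.data.frame() can turn into a data.frame):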

sapply(1:2, function(x) sapply(splitted, `[[`, x))
#      [,1]            [,2]            
# [1,] "37.7606289177" "-122.410647009"
# [2,] "37.752476948"  "-122.410625009"
# [3,] "37.7871729481" "-122.402401009"
# [4,] "37.7776039475" "-122.422764009"
# [5,] "37.7658325695" "-122.46649784" 
# [6,] "37.7693399479" "-122.432820008"
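If some strings might still split into fewer than two pieces, `[[` will stop with a subscript error. A more forgiving variant, sketched here on made-up data, uses `[` so that missing pieces become NA and every input keeps its row. The column names `lat` and `lon` are my assumption about what the two values mean.

```r
# Hypothetical sample; the second string has no comma, the third is empty
geom <- c("(37.7606289177, -122.410647009)",
          "(37.752476948 -122.410625009)",
          "")

splitted <- strsplit(gsub("[(),]", "", geom), " ")

# `[` returns NA for a missing position, whereas `[[` would raise an error
dat <- data.frame(
  lat = as.numeric(sapply(splitted, `[`, 1)),
  lon = as.numeric(sapply(splitted, `[`, 2))
)

nrow(dat)                                # one row per input string
which(is.na(dat$lat) | is.na(dat$lon))   # rows that need a closer look
```

Because no row is dropped, nrow(dat) stays equal to length(geom), and the NA rows tell you which inputs were malformed.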

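Option 2: stringr. This package contains a function str_split() (not strsplit()!) that lets you skip the last step of the base R solution: with simplify = TRUE it returns a character matrix directly instead of a list of vectors: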

library(stringr)
str_split(gsub("[(),]", "", bike_parking$Geom), " ", simplify = TRUE)
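As a follow-up, here is a small sketch of turning that result into a data.frame with numeric columns. It assumes stringr is installed; the sample vector `geom` and the column names `lat` and `lon` are made up for illustration.

```r
library(stringr)

# Hypothetical sample of well-formed coordinate strings
geom <- c("(37.7606289177, -122.410647009)",
          "(37.752476948, -122.410625009)")

# With simplify = TRUE, str_split() returns a character matrix
m <- str_split(gsub("[(),]", "", geom), " ", simplify = TRUE)

# Convert each column to numeric and collect them in a data.frame
dat <- data.frame(lat = as.numeric(m[, 1]),
                  lon = as.numeric(m[, 2]))
str(dat)
```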