Extracting sets of coordinates from a string in R

Question

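I am trying to extract sets of coordinates from strings and change their format.

I have tried some functions from the stringr package but got nowhere with pattern extraction. This is my first time working with regular expressions, and building patterns is still a bit confusing.

I have a data frame with a column that contains one or more sets of coordinates. The only pattern (in most cases) separating Lat from Long is (-), and one set of coordinates is separated from another by (/).

Here is an example of some of the data: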

ID  Coordinates
1   3438-5150
2   3346-5108/3352-5120 East island, South port
3   West coast (284312 472254)
4   28.39.97-47.05.62/29.09.13-47.44.03
5   2843-4722/3359-5122(1H-2H-3H-4F)

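Most of the data are in decimal degrees, e.g. (ID 1 is lat 34.38, long 51.50); some others are in 00º00'00'' format, e.g. (ID 4 is lat 28º 39' 97'', long 47º 05' 62'').

I need to do this in a few steps:

1 - Extract all coordinate sets, creating a new row for each set;

2 - Extract the text labels of each record into a new column, concatenating them;

3 - Convert the coordinates from 00º00'00'' (28.39.97) to 00.0000º (28.6769, decimal degrees) so that all coordinates are in the same format. If they were numeric, I could convert them easily.

4 - Add a point (.) to separate the decimal part (from 3438 to 34.38) and add (-) to mark the value as south-west hemisphere (-34.38). All values must have the (-) sign.

I would like to get something like this:

Steps 1 and 2 - extract coordinate sets and names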

ID  x           y          label
1   3438        5150      
2   3346        5108      East island, South port
2   3352        5120      East island, South port
3   284312      472254    West coast
4   28.39.97    47.05.62    
4   29.09.13    47.44.03
5   2843        4722      1H-2H-3H-4F
5   3359        5122      1H-2H-3H-4F

Step 3 - convert the coordinate format to decimal degrees (ID 4)

ID  x           y       label
1   3438        5150    
2   3346        5108    East island, South port
2   3352        5120    East island, South port
3   284312      472254  West coast
4   286769      471005  
4   291536      470675
5   2843        4722      1H-2H-3H-4F
5   3359        5122      1H-2H-3H-4F

Step 4 - change the display format

ID   x          y         label
1   -34.38      -51.50    
2   -33.46      -51.08    East island, South port
2   -33.52      -51.20    East island, South port
3   -28.43      -47.22    West coast
4   -28.6769    -47.1005    
4   -29.1536    -47.0675
5   -28.43      -47.22    1H-2H-3H-4F
5   -33.59      -51.22    1H-2H-3H-4F

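I have edited the question to better clarify my problem and to change some of my requirements. I realised it was hard to follow.

So, has anyone worked with something similar? Any other suggestions would be a great help.

Thanks again for taking the time to help.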

Answer 1

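Note: the first answer addresses the question as originally asked, and the last one addresses its current state. The data in data1 should be set appropriately for each solution.

Based on the data you provided and the expected output, the following (using dplyr and tidyr) should solve your first problem.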

library(dplyr)
library(tidyr)

### Load Data
data1 <- structure(list(ID = 1:4, Coordinates = c("3438-5150", "3346-5108/3352-5120", 
"2843-4722/3359-5122(1H-2H-3H-4F)", "28.39.97-47.05.62/29.09.13-47.44.03"
)), .Names = c("ID", "Coordinates"), class = "data.frame", row.names = c(NA, 
-4L))

### This is a helper function to transform data that is like '1234'
### but should be '12.34', and leaves alone '12.34'.
### You may have to change this based on your use case.
div100 <- function(x) { return(ifelse(x > 100, x / 100, x)) }

### Remove items like "(...)" and change "12.34.56" to "12.34"
### Split into 4 columns and xform numeric value.
data1 %>%
    mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
           Coordinates = gsub('(\\d+[.]\\d+)[.]\\d+', '\\1', Coordinates)) %>%
    separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
    mutate_at(vars(matches('^[xy][.]')), div100) # xform columns x.N and y.N
##   ID   x.1   y.1   x.2   y.2
## 1  1 34.38 51.50    NA    NA
## 2  2 33.46 51.08 33.52 51.20
## 3  3 28.43 47.22 33.59 51.22
## 4  4 28.39 47.05 29.09 47.44

The call to mutate modifies Coordinates twice to make the substitutions easier.

Edit

A variant that uses another regex substitution instead of mutate_at.
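To see what the two substitutions do on their own, here is a minimal base R sketch applied to two strings from the sample data. The variable names (`with_label`, `dms`, `step1`, `step2`) are illustrative and not part of the answer's code.

```r
# Two values taken from the sample data; the names are illustrative.
with_label <- "2843-4722/3359-5122(1H-2H-3H-4F)"
dms        <- "28.39.97-47.05.62/29.09.13-47.44.03"

# Substitution 1: drop a parenthesised annotation such as "(1H-2H-3H-4F)".
step1 <- gsub("\\([^)]+\\)", "", with_label)

# Substitution 2: truncate "dd.dd.dd" to "dd.dd" by keeping only the
# first capture group.
step2 <- gsub("(\\d+[.]\\d+)[.]\\d+", "\\1", dms)

print(step1)
print(step2)
```

Note that backslashes must be doubled inside R string literals, so the regex `\d` is written as `"\\d"`.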

data1 %>%
    mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
           Coordinates = gsub('(\\d{2}[.]\\d{2})[.]\\d{2}', '\\1', Coordinates),
           Coordinates = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coordinates)) %>%
    separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE)
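The last substitution in this variant inserts a dot into every run of four digits. A small sketch of its behaviour follows; the input vector `x` is made up for illustration from values that appear in the question.

```r
# Illustrative inputs: a plain 4-digit pair, an already dotted pair,
# and a 6-digit value like the one in ID 3 of the question.
x <- c("3438-5150", "28.39-47.05", "284312")

# Insert a dot between the first and second pair of digits in each
# run of four consecutive digits.
dotted <- gsub("(\\d{2})(\\d{2})", "\\1.\\2", x)
print(dotted)
```

Already dotted values are left alone because they contain no run of four digits. A six-digit value is only partly handled (the first four digits are split and the last two are left attached), so this variant assumes four-digit inputs.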

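Edit 2: the following solution addresses the updated version of the question

The following solution applies a number of transformations to convert the data. They are kept separate so that they are easier to reason about (relatively speaking).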

library(dplyr)
library(tidyr)
library(stringr)  # provides str_split(), used in hms_convert()

data1 <- structure(list(ID = 1:5, Coordinates = c("3438-5150", "3346-5108/3352-5120 East island, South port", 
"East coast (284312 472254)", "28.39.97-47.05.62/29.09.13-47.44.03", 
"2843-4722/3359-5122(1H-2H-3H-4F)")), .Names = c("ID", "Coordinates"
), class = "data.frame", row.names = c(NA, -5L))

### Function for converting to numeric values; it also
### handles the case of "12.34.56" (degrees/min/sec)
hms_convert <- function(llval) {
  nres <- rep(0, length(llval))
  coord3_match_idx <- grepl('^\\d{2}[.]\\d{2}[.]\\d{2}$', llval)
  # vapply() returns numeric(0) when nothing matches, so nres stays numeric
  nres[coord3_match_idx] <- vapply(str_split(llval[coord3_match_idx], '[.]', 3),
                                   function(x) sum(as.numeric(x) / c(1, 60, 3600)),
                                   numeric(1))
  nres[!coord3_match_idx] <- as.numeric(llval[!coord3_match_idx])
  nres
}

### Each mutate works to transform the various data formats
### into a single format.  The 'separate' commands then split
### the data into the appropriate columns.  The action of each
### 'mutate' can be seen by progressively viewing the results
### (i.e. adding one 'mutate' command at a time).
data1 %>%
  mutate(Coordinates_new = Coordinates) %>%
  mutate(Coordinates_new = gsub('\\([^) ]+\\)', '', Coordinates_new)) %>%
  mutate(Coordinates_new = gsub('(.*?)\\(((\\d{6})[ ](\\d{6}))\\).*', '\\3-\\4 \\1', Coordinates_new)) %>%
  mutate(Coordinates_new = gsub('(\\d{2})(\\d{2})(\\d{2})', '\\1.\\2.\\3', Coordinates_new)) %>%
  mutate(Coordinates_new = gsub('(\\S+)[\\s]+(.+)', '\\1|\\2', Coordinates_new, perl = TRUE)) %>%
  separate(Coordinates_new, c('Coords', 'label'), fill = 'right', sep = '[|]', convert = TRUE) %>%
  mutate(Coords = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coords)) %>%
  separate(Coords, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
  mutate_at(vars(matches('^[xy][.]')), hms_convert) %>%
  mutate_at(vars(matches('^[xy][.]')), function(x) ifelse(!is.na(x), -x, x))

##   ID                                 Coordinates       x.1       y.1       x.2       y.2                   label
## 1  1                                   3438-5150 -34.38000 -51.50000        NA        NA                    <NA>
## 2  2 3346-5108/3352-5120 East island, South port -33.46000 -51.08000 -33.52000 -51.20000 East island, South port
## 3  3                  East coast (284312 472254) -28.72000 -47.38167        NA        NA             East coast 
## 4  4         28.39.97-47.05.62/29.09.13-47.44.03 -28.67694 -47.10056 -29.15361 -47.73417                    <NA>
## 5  5            2843-4722/3359-5122(1H-2H-3H-4F) -28.43000 -47.22000 -33.59000 -51.22000                    <NA>
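The conversion inside hms_convert divides the three dot-separated parts by 1, 60 and 3600 and sums them. The sketch below isolates that arithmetic in base R; `dms_to_decimal` is a hypothetical helper name, and `strsplit` stands in for `str_split` so that no package is needed.

```r
# Hypothetical helper: convert "dd.mm.ss" strings to decimal degrees.
dms_to_decimal <- function(x) {
  # Split on a literal dot into degrees, minutes and seconds.
  parts <- strsplit(x, ".", fixed = TRUE)
  # degrees + minutes / 60 + seconds / 3600 for each element.
  vapply(parts, function(p) sum(as.numeric(p) / c(1, 60, 3600)), numeric(1))
}

res <- dms_to_decimal(c("28.39.97", "47.05.62"))
print(round(res, 4))
```

For "28.39.97" this is 28 + 39/60 + 97/3600, about 28.6769, which matches the value given in step 3 of the question and the magnitude shown for ID 4 in the output above.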

Answer 2

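We can use stringi. We insert a . between the digits of the 4-digit numbers with gsub, then use stri_extract_all_regex (from stringi) to extract two digits followed by a dot followed by two digits (\d{2}\.\d{2}), which returns a list. Since the list elements are of unequal length, stri_list2matrix pads the shorter elements with NA at the end and converts the result to a matrix. After converting to a data.frame, we change the character columns to numeric and cbind the result with the 'ID' column of the original dataset.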

library(stringi)
d1 <- as.data.frame(stri_list2matrix(stri_extract_all_regex(gsub("(\\d{2})(\\d{2})", 
  "\\1.\\2", data1$Coordinates), "\\d{2}\\.\\d{2}"), byrow=TRUE), stringsAsFactors=FALSE)
d1[] <- lapply(d1, as.numeric)
colnames(d1) <-  paste0(c("x.", "y."), rep(1:2,each = 2))

cbind(data1[1], d1)
#  ID   x.1   y.1   x.2   y.2
#1  1 34.38 51.50    NA    NA
#2  2 33.46 51.08 33.52 51.20
#3  3 28.43 47.22 33.59 51.22
#4  4 28.39 47.05 29.09 47.44

However, this can also be done with base R.

#Create the dots for the 4-digit numbers
str1 <- gsub("(\\d{2})(\\d{2})", "\\1.\\2", data1$Coordinates)
#extract the numbers in a list with gregexpr/regmatches
lst <- regmatches(str1, gregexpr("\\d{2}\\.\\d{2}", str1))
#convert to numeric
lst <- lapply(lst, as.numeric)
#pad with NA's at the end and convert to data.frame
d1 <- do.call(rbind.data.frame, lapply(lst, `length<-`, max(lengths(lst))))
#change the column names
colnames(d1) <-  paste0(c("x.", "y."), rep(1:2,each = 2))
#cbind with the first column of 'data1'
cbind(data1[1], d1)
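The padding step in the base R version relies on the replacement function `length<-`, which extends a vector with NA when it is given a larger length. A minimal sketch on a made-up two-element list (the values are borrowed from the sample data, the names `lst_demo`, `padded` and `m` are illustrative):

```r
# Illustrative list with elements of unequal length.
lst_demo <- list(c(34.38, 51.50), c(33.46, 51.08, 33.52, 51.20))

# Pad every element with NA up to the length of the longest one.
n <- max(lengths(lst_demo))
padded <- lapply(lst_demo, `length<-`, n)

# Row-bind the now equal-length vectors into a matrix.
m <- do.call(rbind, padded)
print(m)
```

Binding to a matrix here is only for display; the answer binds with rbind.data.frame to get a data frame directly.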
