Convert a Data Frame of Coordinates found in different units to one unit
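I want to modify my data frame columns so that all coordinates are in the same units. They currently occur in these units: dec_deg, deg_dec_min, or NA. Here is a reproducible example: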
Long <- c("","E 9.64740","E 9°35.988'","","-16.5708666666667","109.395389",
"-16.6455666666667","W047 22.044", "-16.5437166666667")
Lat <- c("","S 2.40889","N 00°27.799","14.0011","","-0.632361",
"13.9622333333333","S00 37.952", "14.0532")
Date <- as.Date(c('2010-11-1','2008-3-25','2007-3-14', '2010-11-1','2008-3-25','2007-3-14','2010-11-1','2008-3-25','2007-3-14'))
Site.ID <- c("MWA-S", "MWA-S","MWA-S","BAM","BAM","BAM","BAM","BAM","BAO")
No.ID <- c(34, 5,16,46,2,85,60,1,30)
DF <- data.frame(No.ID, Site.ID, Date, Lat, Long)
I referenced this question to clean up my columns by using the measurements library and removing unwanted characters. But that fails because the coordinates are not all in the same units. I want to create a function like this one that performs the conversion conditionally.
library(measurements)
coord2dec <- function(x) {
  x <- as.character(x)
  x <- do.call(rbind, strsplit(x, split='N'|'E'|'S'|'W'|'°'))#maybe where to #apply my conditions
  x <- apply(x, col, function(y) {
    y <- as.numeric(y)
    measurements::conv_unit(y$col, from = 'deg_dec_min', to = 'dec_deg')
  })
  return(x)
}
new_df <- apply(DF2, coord2dec)
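The above fails because I am probably missing other conditions and formats. My goal is to create a function that recognizes whether a coordinate is in deg_dec_min (dd mm.mmmm) or dec_deg (dd.ddddd). It should then convert W/S to -, remove "NSEW" and whitespace, and replace the degree symbol with a space. The desired output would convert the example data frame to the following.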
  No.ID Site.ID       Date         Lat              Long
1    34   MWA-S 2010-11-01          NA                NA
2     5   MWA-S 2008-03-25   -2.408890          9.647400
3    16   MWA-S 2007-03-14    0.463317          9.599800
4    46     BAM 2010-11-01     14.0011       -16.5708667
5     2     BAM 2008-03-25          NA                NA
6    85     BAM 2007-03-14   -0.632361        109.395389
7    60     BAM 2010-11-01 13.96223333 -16.6455666666667
8     1     BAM 2008-03-25   -0.632533        -47.367400
9    30     BAO 2007-03-14     14.0532 -16.5437166666667
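As a quick sanity check on the table above, the deg_dec_min entries (rows 3 and 8) should equal degrees plus minutes divided by 60, with S and W made negative. A minimal sketch of that arithmetic; the variable names are only for illustration:

```r
# decimal degrees = sign * (degrees + minutes / 60); S and W are negative
lat3  <-  (0 + 27.799 / 60)    # N 00°27.799
long3 <-  (9 + 35.988 / 60)    # E 9°35.988'
lat8  <- -(0 + 37.952 / 60)    # S00 37.952
long8 <- -(47 + 22.044 / 60)   # W047 22.044

round(c(lat3, long3, lat8, long8), 6)
# 0.463317  9.599800 -0.632533 -47.367400
```

These agree with the Lat and Long values listed for rows 3 and 8.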
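Your problem looks well suited to a regex solution, but I think a simpler approach might be to:
- make S and W negative
- remove S/N/E/W
- replace ° with a space
- split at any space and assume everything to the right is in minutes
- combine the sign, degrees, and minutes/60.
I use pivot_longer so that I can put the latitude and longitude values into one column and apply the same transformations to both at once, then use pivot_wider to put them back.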
library(tidyverse)
DF %>%
  pivot_longer(Lat:Long) %>%
  mutate(sign = if_else(str_detect(value, "S|W"), -1, 1)) %>%
  mutate(value = value %>%
           str_replace_all(c("S|N|W|E" = "", "°" = " ")) %>%
           str_trim()) %>%
  separate(value, c("deg", "min"), sep = " ", fill = "right") %>%
  mutate(deg2 = parse_number(deg),
         min2 = coalesce(parse_number(min)/60, 0),
         result = sign * (deg2 + min2)) %>%
  select(-c(deg:min2)) %>%
  pivot_wider(names_from = name, values_from = result)
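I can't guarantee this will work for all of your data, but it seems to work for the sample data here. (It looks like there is a typo in your DF input that swaps a value and a blank, so rows 4 and 5 differ from your suggested output.)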
# A tibble: 9 x 5
  No.ID Site.ID Date          Lat   Long
  <dbl> <chr>   <date>      <dbl>  <dbl>
1    34 MWA-S   2010-11-01 NA      NA
2     5 MWA-S   2008-03-25 -2.41    9.65
3    16 MWA-S   2007-03-14  0.463   9.60
4    46 BAM     2010-11-01 14.0    NA
5     2 BAM     2008-03-25 NA     -16.6
6    85 BAM     2007-03-14 -0.632 109.
7    60 BAM     2010-11-01 14.0   -16.6
8     1 BAM     2008-03-25 -0.633 -47.4
9    30 BAO     2007-03-14 14.1   -16.5
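If you would rather not reshape the data, the same five steps can also be sketched as a standalone base R function applied to each column. This is only a sketch under the same assumptions as above (anything after the first space is minutes); the name to_dec_deg is hypothetical, not something from a package:

```r
# Hypothetical helper: converts a character vector of mixed
# dec_deg / deg_dec_min strings to numeric decimal degrees.
to_dec_deg <- function(x) {
  sgn <- ifelse(grepl("[SW]", x), -1, 1)    # S and W become negative
  x <- gsub("[NSEW]", "", x)                 # remove hemisphere letters
  x <- trimws(gsub("°", " ", x))             # degree symbol -> space
  parts <- strsplit(x, " +")                 # split at any run of spaces
  vapply(seq_along(parts), function(i) {
    p <- parts[[i]]
    if (length(p) == 0) return(NA_real_)     # empty string -> NA
    deg <- as.numeric(p[1])
    # everything right of the space is assumed to be minutes
    mins <- if (length(p) > 1) as.numeric(sub("'", "", p[2])) else 0
    sgn[i] * (deg + mins / 60)
  }, numeric(1))
}

to_dec_deg(c("", "S 2.40889", "N 00°27.799", "W047 22.044", "-16.5708666666667"))
```

With the sample DF this could be applied as DF$Lat <- to_dec_deg(DF$Lat), and likewise for Long. Values that are already signed decimal degrees pass through unchanged because they contain no letters and no space.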
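First, I want to thank the Stack Overflow community and Jon. After this was solved, I also received a solution from a colleague; I'm not sure whether he is on here, and I will update to give credit if needed. Jon Spring's approach works very well, and so does this one.
- Function 1: clean the coordinates by replacing the S/W characters with -, removing all other characters and whitespace, and replacing the degree symbol with a space
- Function 2: identify which coordinates are in deg_dec_min rather than dec_deg, allowing for the possible range of digits, then convert them with the conv_unit function
- Apply the functions in a dplyr pipeline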
library(dplyr)
library(measurements)

# gets rid of characters, leaving formatted dd.ddddd or dd mm.mmm
clean_coords <- function(x) {
  v <- gsub("[WwSs] ?", "-", x)   # convert W/S (and any space right after) to -
  v2 <- gsub("[EeNn] ?", "", v)   # remove remaining N/E and any space right after
  v3 <- gsub("°|'", " ", v2)      # replace degree and minute symbols with a space
  return(v3)
}
# finds elements that are in dd mm.mmmm format and converts them to dd.ddddd
ddmmm_to_dd <- function(x) {
  ind <- grep("[0-9]{1,3} [0-9]", x)   # digits, a space, then a digit
  x[ind] <- conv_unit(x[ind], from = 'deg_dec_min', to = 'dec_deg')
  return(x)
}
# apply functions in pipeline
DF2 <- DF %>%
  mutate(across(c(Lat, Long), clean_coords)) %>%
  mutate(across(c(Lat, Long), ddmmm_to_dd))
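To illustrate how the detection step in ddmmm_to_dd tells the two formats apart, here is a small check of its pattern against a few already-cleaned strings. The sample strings are made up for illustration; only the pattern comes from the function above:

```r
# Already-cleaned example strings (illustrative, not taken from DF2)
cleaned <- c("00 27.799", "-47 22.044", "9.64740", "-16.5708666666667", "")

# Same pattern as in ddmmm_to_dd: 1-3 digits, a space, then a digit
is_ddm <- grepl("[0-9]{1,3} [0-9]", cleaned)
is_ddm
# TRUE  TRUE FALSE FALSE FALSE
```

Only strings with a space between two digit groups are treated as deg_dec_min; plain decimal degrees and empty strings are left alone. Note that both functions work on character vectors, so Lat and Long stay character after the pipeline; a final as.numeric() on those columns would be needed to get numbers, with "" becoming NA.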