Collapsing four data frame columns into two interleaved columns
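I am drawing lines on a leaflet map using latitude and longitude data (see below). Ideally, the lines would be stored in the lat and lng columns of a data frame. In the lat column, each start lat value is followed by an end lat value, and then by the start lat value of the next line (the line_id column distinguishes the lines). The lng data is arranged the same way. Ideally, the data frame would look like this: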
> df.better
line_id lat lng
1 ABC 51.50995 -0.1345093
2 ABC 51.51074 -0.1345093
3 XYZ 51.50991 -0.1345193
4 XYZ 51.51079 -0.1351200
The problem is that it comes out of the data store in this format:
> df.wide
line_id start_lat end_lat start_lng end_lng
1 ABC 51.50995 51.51074 -0.1345093 -0.13519
2 XYZ 51.50991 51.51079 0.1351900 0.13512
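This looks a bit like the classic "wide to long" data-tidying problem, for which there are plenty of questions and answers, but the standard "long" format collapses the lat and lng data into a single column, and I need two columns. I tried a tidyverse solution as follows: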
df2 <- df.wide %>% pivot_longer(cols = start_lat:end_lng,
names_to="variable",
values_to="value")
Then I clean up the variable column:
df2$variable <- gsub(".*_lat","lat",df2$variable)
df2$variable <- gsub(".*_lng","lng",df2$variable)
This is the result; at least the data appear to be in the right order:
> df2
# A tibble: 8 x 3
line_id variable value
<fct> <chr> <dbl>
1 ABC lat 51.50995
2 ABC lat 51.51074
3 ABC lng -0.1345093
4 ABC lng -0.13519
5 XYZ lat 51.50991
6 XYZ lat 51.51079
7 XYZ lng 0.13519
8 XYZ lng 0.135120
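The last step would seem to involve spreading the data out again, but using pivot_wider produces a complaint that the values are not uniquely identified: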
df2 %>% pivot_wider(names_from = variable,values_from = value)
# A tibble: 2 x 3
line_id lat lng
<fct> <list<dbl>> <list<dbl>>
1 ABC [2] [2]
2 XYZ [2] [2]
Warning message:
Values in `value` are not uniquely identified; output will contain list-cols.
I can (I think) see why this happens, but supplying a unique identifier in variable just takes me back to where I started. How can/should I approach this?
library(magrittr)
library(tidyr)
library(dplyr)
library(leaflet)  # needed for the map code at the end
options(pillar.sigfig = 7)
df.better <- data.frame(
line_id = c("ABC","ABC","XYZ","XYZ"),
lat = c(51.509950,51.510736,51.509910,51.510786),
lng = c(-0.1345093,-0.1345093,-0.1345193,-0.135120)
)
df.wide <- data.frame(
line_id = c("ABC","XYZ"),
start_lat = c(51.509950,51.509910),
end_lat = c(51.510736,51.510786),
start_lng = c(-0.1345093,0.135190),
end_lng = c(-0.135190,0.135120)
)
df2 <- df.wide %>% pivot_longer(cols = start_lat:end_lng,
names_to="variable",
values_to="value")
df2$variable <- gsub(".*_lat","lat",df2$variable)
df2$variable <- gsub(".*_lng","lng",df2$variable)
df2 %>% pivot_wider(names_from = variable,values_from = value)
m <- leaflet() %>% setView(lng = -0.1345093, lat = 51.510090, zoom = 18) %>% addTiles()
for (i in unique(df.better$line_id)) { # HT:
m <- m %>%
addPolylines(data = df.better[df.better$line_id == i, ],
lng = ~lng, lat = ~lat, color = "Green",
opacity = 0.5, weight = 2, dashArray = 5)
}
m
If I understand correctly, you are looking for something like this:
df.wide <- data.frame(
line_id = c("ABC","XYZ"),
start_lat = c(51.509950,51.509910),
end_lat = c(51.510736,51.510786),
start_lng = c(-0.1345093,0.135190),
end_lng = c(-0.135190,0.135120)
)
df.wide %>%
pivot_longer(-line_id,
names_to = c("set", ".value"),
names_pattern = "(.+)_(.+)"
)
# line_id set lat lng
# <fct> <chr> <dbl> <dbl>
#1 ABC start 51.50995 -0.1345093
#2 ABC end 51.51074 -0.13519
#3 XYZ start 51.50991 0.13519
#4 XYZ end 51.51079 0.135120
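A note on why this works, with a small follow-up sketch: the special name ".value" in names_to tells pivot_longer that the second captured part of each column name (lat or lng) should become an output column name rather than a value, so the two coordinates stay in separate columns. If the set column is not needed, it can presumably be dropped to get the df.better layout from the question. The name df.long below is just an illustrative choice, not something from the question.

```r
library(tidyr)
library(dplyr)

# Same wide data as in the question
df.wide <- data.frame(
  line_id   = c("ABC", "XYZ"),
  start_lat = c(51.509950, 51.509910),
  end_lat   = c(51.510736, 51.510786),
  start_lng = c(-0.1345093, 0.135190),
  end_lng   = c(-0.135190, 0.135120)
)

# "(.+)_(.+)" splits e.g. "start_lat" into "start" and "lat";
# "start"/"end" go into `set`, while "lat"/"lng" become column names (.value)
df.long <- df.wide %>%
  pivot_longer(-line_id,
               names_to = c("set", ".value"),
               names_pattern = "(.+)_(.+)") %>%
  select(-set)  # drop the start/end marker; row order already start, end per line

df.long
```

The rows come out as start then end for each line_id, so the result should be usable directly in the addPolylines loop from the question in place of df.better.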
Something like this might work:
library(data.table)
dt <- data.table::fread("line_id start_lat end_lat start_lng end_lng
ABC 51.50995 51.51074 -0.1345093 -0.13519
XYZ 51.50991 51.51079 0.1351900 0.13512")
dt.melt <- melt( dt,
id.vars = "line_id",
measure.vars = patterns( lon = "_lng$",
lat = "_lat$" ),
variable.name = "point_id" )
# line_id point_id lon lat
# 1: ABC 1 -0.1345093 51.50995
# 2: XYZ 1 0.1351900 51.50991
# 3: ABC 2 -0.1351900 51.51074
# 4: XYZ 2 0.1351200 51.51079
library( sf )
library(dplyr)
library(leaflet)
dt.points <- st_as_sf( dt.melt, coords = c("lon", "lat"), crs = 4326)
dt.lines <- dt.points %>%
group_by( line_id ) %>%
summarise( geometry = st_combine( geometry ) ) %>%
st_cast( "LINESTRING" )
leaflet() %>% addTiles() %>% addPolylines( data = dt.lines, popup = ~line_id )
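One thing to be aware of with the melt step above: its output lists all start points first and then all end points, rather than interleaving the points of each line as the question asks. The sf route does not care about that, since it groups by line_id, but if the plain data frame layout is wanted, a sketch like the following could reorder it. It builds the table with data.table() instead of fread() purely to keep the example self-contained.

```r
library(data.table)

# Same wide data as in the question
dt <- data.table(
  line_id   = c("ABC", "XYZ"),
  start_lat = c(51.509950, 51.509910),
  end_lat   = c(51.510736, 51.510786),
  start_lng = c(-0.1345093, 0.135190),
  end_lng   = c(-0.135190, 0.135120)
)

# Melt the lng and lat column pairs at the same time;
# point_id 1 is the start point and 2 is the end point
dt.melt <- melt(dt,
                id.vars = "line_id",
                measure.vars = patterns(lon = "_lng$", lat = "_lat$"),
                variable.name = "point_id")

# Sort in place so that each line's start and end rows are adjacent
setorder(dt.melt, line_id, point_id)

dt.melt
```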