Summarize the number of unique rows in a data frame in R
I need your best advice. I am trying to plot a map of bike routes in New York.
library(tidyverse)
library(ggmap) # provides route() and qmap() used below
bikes <- read.csv("August.csv", header = TRUE)
str(bikes) # 1557663 obs. of 15 variables
summary(bikes)
names(bikes)
This is what a single route looks like:
# Sample route (example)
rt <- route(from = "Clark St & Henry St, New York, NY",
  to = "Queens Plaza North & Crescent St, New York, NY")
nyc <- qmap("New York, NY", color = 'bw', zoom = 12)
# Refer to the columns of rt by name; data = rt supplies them
nyc + geom_path(aes(x = startLon, y = startLat),
  colour = "red", data = rt, alpha = 1, size = 0.2)
# How many stations are unique?
start.station <- bikes$start.station.name
length(unique(start.station)) # 574 stations
end.station <- bikes$end.station.name
length(unique(end.station)) # 582 stations
names(bikes)
#  [1] "tripduration"            "starttime"               "stoptime"
#  [4] "start.station.id"        "start.station.name"      "start.station.latitude"
#  [7] "start.station.longitude" "end.station.id"          "end.station.name"
# [10] "end.station.latitude"    "end.station.longitude"   "bikeid"
# [13] "usertype"                "birth.year"              "gender"
I assume I only need two columns: the start and end station names.
# keep only two columns: start and end station names
only.stations <- bikes %>% as_tibble() %>%
  select(start.station.name, end.station.name)
only.stations # A tibble: 1,557,663 × 2, so we have 1,557,663 rides
#    start.station.name             end.station.name
#    <fctr>                         <fctr>
#  1 Avenue D & E 3 St              E 3 St & 1 Ave
#  2 Broadway & E 14 St             E 7 St & Avenue A
#  3 Metropolitan Ave & Bedford Ave Union Ave & N 12 St
#  4 E 10 St & 5 Ave                E 10 St & 5 Ave
#  5 LaGuardia Pl & W 3 St          E 3 St & 1 Ave
#  6 Grand St & Havemeyer St        Graham Ave & Conselyea St
#  7 N 12 St & Bedford Ave          Bedford Ave & Nassau Ave
#  8 9 Ave & W 18 St                Pershing Square North
#  9 E 2 St & 2 Ave                 E 2 St & Avenue C
# 10 MacDougal St & Washington Sq   E 10 St & Avenue A
# ... with 1,557,653 more rows
# unique(only.stations) # A tibble: 129,839 × 2 - so, do we have 129,839
unique(only.stations)
View(only.stations)
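My question: how do I group and summarize those 129,839 rows to find out how often each route is used? I believe this is done with dplyr's group_by() and summarize(), but I tried several options and nothing worked. :(
Regards,
Oleksiy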
Your question appears to be about counting the frequency of each unique row in only.stations. The keyword you are missing is n(), used inside dplyr's summarise function. Try:
only.stations %>%
group_by(start.station.name, end.station.name) %>%
summarise(frequency = n())
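As a rough sketch of the same idea, dplyr's count() wraps the group_by() plus summarise(n()) pattern in one call and can sort the result so the most used routes come first. The toy tibble below is made up for illustration and only stands in for only.stations; the station names "A", "B" and "C" are hypothetical.

```r
library(dplyr)

# Toy stand-in for only.stations (hypothetical rows, not the real data)
toy <- tibble(
  start.station.name = c("A", "A", "B", "A"),
  end.station.name   = c("B", "B", "A", "C")
)

# One row per unique (start, end) pair, with the number of rides in `frequency`,
# sorted so the most frequent routes appear first
route.counts <- toy %>%
  count(start.station.name, end.station.name, name = "frequency", sort = TRUE)

route.counts
```

On the real data, replacing toy with only.stations should give one row per unique route, and the frequency column should sum to the total number of rides.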