Separating Dataframe columns into more columns by various delimiters
I have a dataset; a sample produced with the dput command is given below. The problem I am running into is separating the data by delimiters.
> dput(head(team_data))
structure(list(X1 = 2:6,
names2 = c("Andre Callender Seton Hall Preparatory School (West Orange, NJ)", "Gosder Cherilus Somerville (Somerville, MA)", "Justin Bell Mount Vernon (Alexandria, VA)", "Tom Anevski Elder (Cincinnati, OH)", "Brad Mueller Mars Area (Mars, PA)"),
pos2 = c("RB 5-10 185", "OT 6-7 270", "TE 6-3 250", "OT 6-5 265", "CB 6-0 170"), rating2 = c("0.8667 194 18 8", "0.8667 262 20 1", "0.8333 306 14 7", "0.8333 377 25 13", "0.8333 496 36 16"),
status2 = c("Enrolled 6/30/2003", "Enrolled 6/30/2003", "Enrolled 6/30/2003", "Enrolled 6/30/2003", "Enrolled 6/30/2003"), team = c("Boston-College", "Boston-College", "Boston-College", "Boston-College", "Boston-College"), year = c(2003L, 2003L, 2003L, 2003L, 2003L)),
.Names = c("X1", "names2", "pos2", "rating2", "status2", "team", "year"), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
Below is the code I am trying to run on the dataset above. As far as I can tell, the following two calls work as expected.
library(rvest)
library(stringr)
library(tidyr)
library(readxl)
df2<-separate(data=team_data,col=pos2,into= c("Position","Height","Weight"),sep=" ")
df3<-separate(data=df2,col=rating2,into= c("Rating","National","Position","State Rank"),sep=" ")
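However, I am having a lot of trouble separating the data frame's columns any further. I have tried various approaches (examples below), but every snippet below produces the same error: "Error: Data source must be a dictionary".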
df4<-separate(data=df3,col=names2,into= c("Name","Geo"),sep="(")
df4<-separate(data=df3,col=names2,into= c("Name","Geo"),sep='\\(|\\)')
df4<-separate(data=df3,col=status2,into= c("Date_Enrollment","Enroll_Status"),sep=" ")
df4<-separate(data=df3,col=status2,into= c("Date_Enrollment","Enroll_Status"),sep=" ")
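The end goal is to split the "names2" column at "(" and "," and drop the ")", which would give me 3 columns of data. For the other column ("status2"), the goal is to separate "Enrolled" from the enrollment date.
From what I have read, the error I am getting indicates that I am duplicating a column name, but I cannot figure out where that is happening.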
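You are using Position as a column name twice, once when creating df2 and once when creating df3, so df3 ends up with two columns named Position. Renaming the second one worked for me: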
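team_data %>%
  separate(col = pos2, into = c("Position", "Height", "Weight"), sep = " ") %>%
  # "Position2" avoids reusing the "Position" name created in the previous step
  separate(col = rating2, into = c("Rating", "National", "Position2", "State Rank"), sep = " ") %>%
  # "(" is a regex metacharacter, so it is escaped as "\\(" inside an R string
  separate(col = names2, into = c("Name", "Geo"), sep = "\\(") %>%
  separate(col = status2, into = c("Date_Enrollment", "Enroll_Status"), sep = " ")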
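The question also asks for "names2" to end up as three columns with the ")" removed. One possible way to get there, offered as a sketch rather than the only approach, is tidyr's extract(), which takes a regular expression with one capture group per output column. The column names City and State are my own assumption, and the sample uses just two rows copied from the dput output. Note that the player name and the school name share no delimiter in this data, so they stay together in Name.

```r
library(tidyr)

# Two rows copied from the dput() sample in the question
team_data <- data.frame(
  names2 = c("Andre Callender Seton Hall Preparatory School (West Orange, NJ)",
             "Brad Mueller Mars Area (Mars, PA)"),
  status2 = c("Enrolled 6/30/2003", "Enrolled 6/30/2003"),
  stringsAsFactors = FALSE
)

result <- team_data %>%
  # One capture group per new column. The literal " (", ", " and ")"
  # sit outside the groups, so they are dropped from the output.
  # "City" and "State" are assumed names, not taken from the question.
  extract(col = names2, into = c("Name", "City", "State"),
          regex = "^(.*) \\((.*), (.*)\\)$") %>%
  # The first piece is the status word, the second piece is the date.
  separate(col = status2, into = c("Enroll_Status", "Date_Enrollment"), sep = " ")

print(result)

# 0 means every column name is unique, i.e. the duplicate-name
# situation behind the original error does not occur here.
anyDuplicated(names(result))
```

Rows that do not match the pattern come back as NA in the new columns, so it is worth checking for NA values before relying on the result for the full dataset.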