How can I assign unique IDs based on a record identifier in R?
My task: compile budget and revenue figures from movie data.
I'm reading the data from a text file that basically has the following format:
MV,Movie 1 Name
BT,Budget for Movie 1
GR,Gross Revenue Movie 1
But a movie may or may not have BT or GR records, and sometimes has several of them, for example:
MV,Movie1
BT,1000000
GR,500000 (week1)
GR,500000 (week2)
GR,500000 (week3)
GR,500000 (week1)
MV,Movie2
BT,10000
GR,50000 (week1)
GR,500000 (week2)
MV,Movie3
MV,Movie4
BT,1000000
The data frame I want to create looks like this:
mID recType recData
1 MV Movie1
1 BT 1000000
1 GR 500000 (week1)
1 GR 500000 (week2)
1 GR 500000 (week3)
1 GR 500000 (week1)
2 MV Movie2
2 BT 10000
2 GR 50000 (week1)
2 GR 500000 (week2)
3 MV Movie3
4 MV Movie4
4 BT 1000000
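My programmer says to just write a data-cleaning application in Java or .NET that scrubs the data before it is brought into R, but I wanted to see whether the internet could help me.
Writing a loop for this over more than 90,000 movies takes an obnoxiously long time to process.
Note: the end goal is to use this data as the primary source for classifying movie profitability, and to cross-reference it with external files of genre, actor, and other data.
(IMDB needs a better data setup)
Thanks!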
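Try
# Works for the sample data, where every title starts with "Movie":
# each matching row bumps the running count, which becomes the movie ID
df1$mID <- cumsum(grepl('^Movie', df1$recData))
# More general alternative: count the rows whose record type is MV
#df1$mID <- cumsum(df1$recType=='MV')
# Show mID as the first column
df1[,c(3,1:2)]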
# mID recType recData
#1 1 MV Movie1
#2 1 BT 1000000
#3 1 GR 500000 (week1)
#4 1 GR 500000 (week2)
#5 1 GR 500000 (week3)
#6 1 GR 500000 (week1)
#7 2 MV Movie2
#8 2 BT 10000
#9 2 GR 50000 (week1)
#10 2 GR 500000 (week2)
#11 3 MV Movie3
#12 4 MV Movie4
#13 4 BT 1000000
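The snippet above assumes `df1` already has `recType` and `recData` columns. A minimal sketch of building it from the raw text, assuming every line is a record type, a comma, then the data. The `raw` vector here is a stand-in for `readLines()` on the actual file, whose name the question doesn't give:

```r
# Stand-in for readLines(<your file>): the sample records from the question
raw <- c("MV,Movie1", "BT,1000000", "GR,500000 (week1)", "GR,500000 (week2)",
         "GR,500000 (week3)", "GR,500000 (week1)", "MV,Movie2", "BT,10000",
         "GR,50000 (week1)", "GR,500000 (week2)", "MV,Movie3", "MV,Movie4",
         "BT,1000000")

# Split at the first comma only, so titles that contain commas stay intact
df1 <- data.frame(
  recType = sub(",.*$", "", raw),     # everything before the first comma
  recData = sub("^[^,]*,", "", raw),  # everything after the first comma
  stringsAsFactors = FALSE
)

# Every MV row opens a new movie; the running count of MV rows is the movie ID
df1$mID <- cumsum(df1$recType == "MV")
```

Keying on `recType` rather than on the title text means the IDs don't depend on what the movie names look like.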
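Or use data.table
(it will be faster)
library(data.table)
# setDT() converts df1 to a data.table by reference, := adds mID in place,
# and the trailing [] prints the result
setDT(df1)[, mID:= cumsum(recType=='MV')][]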
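Once every row carries an `mID`, per-movie budget and gross totals follow from a grouped sum. A rough base R sketch, assuming the amount is the leading run of digits in `recData` and that multiple GR rows should simply be added together (the question doesn't say whether they are weekly or cumulative figures). The names `money`, `amount`, and `totals` are made up for this example:

```r
# Sample records from the question, with mID assigned as above
recType <- c("MV", "BT", "GR", "GR", "GR", "GR", "MV", "BT", "GR", "GR",
             "MV", "MV", "BT")
recData <- c("Movie1", "1000000", "500000 (week1)", "500000 (week2)",
             "500000 (week3)", "500000 (week1)", "Movie2", "10000",
             "50000 (week1)", "500000 (week2)", "Movie3", "Movie4", "1000000")
df1 <- data.frame(recType, recData, stringsAsFactors = FALSE)
df1$mID <- cumsum(df1$recType == "MV")

# Keep only the money rows and pull out the leading digits as a number
money <- df1[df1$recType %in% c("BT", "GR"), ]
money$amount <- as.numeric(sub("^([0-9]+).*$", "\\1", money$recData))

# Sum amounts per movie and record type; movies without BT/GR rows drop out
totals <- aggregate(amount ~ mID + recType, data = money, FUN = sum)
```

With the sample data, movie 1 ends up with a GR total of 2000000 and movie 3 is absent from `totals` because it has neither a BT nor a GR row.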