Cleaning data by replacing a large set of values with a reduced set of values in R
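I am working with a dataset in which one field has many possible values, and I want to clean these up into a reduced set of values.
For example, an application is either approved or denied,
but the outcome is recorded in the database under many different text strings.
How can I clean this up to get tidy output?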
the_status <- c('2: approved (newer)',
'5: approved (extended)',
'3: denied (not appealed)',
'14: denied (not appealed/withdrawn)',
'20: approved',
'21: denied',
'24: not approved within 21 days',
'28: not approved in 21 days')
data.frame(candidate_id = 1:8,
status = the_status)
What I want:
data.frame(candidate_id = 1:8,
status = c('approved', 'approved', 'denied',
'denied', 'approved', 'denied',
'denied', 'denied'))
Note: the real dataset has roughly 100,000 rows, and the status field holds about 30 distinct strings, which I want to reduce to about 4 values.
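We can change 'not approved' to 'denied' and then use sub to extract the first word after the colon.
# df1 is the data frame from the question; backslashes must be doubled inside R strings
df1$status <- sub('[^:]+:\\s*(\\S+).*', '\\1',
sub('not approved', 'denied', df1$status))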
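With about 30 distinct strings and about 4 target values, a single substitution may not cover every case. Below is a minimal sketch of an ordered rule table in base R. The helper name reduce_status and the patterns are hypothetical, chosen only to fit the sample data. The first matching rule wins, so 'not approved' has to be tested before 'approved'.

```r
# Sample data from the question.
the_status <- c('2: approved (newer)',
                '5: approved (extended)',
                '3: denied (not appealed)',
                '14: denied (not appealed/withdrawn)',
                '20: approved',
                '21: denied',
                '24: not approved within 21 days',
                '28: not approved in 21 days')

# Hypothetical rule table: names are regular expressions, values are the
# reduced labels. Order matters because the first match is used.
rules <- c('not approved' = 'denied',
           'denied'       = 'denied',
           'approved'     = 'approved')

# reduce_status() is an illustrative helper, not part of any package.
reduce_status <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  for (i in seq_along(rules)) {
    # Only fill entries that no earlier rule has claimed.
    hit <- is.na(out) & grepl(names(rules)[i], x)
    out[hit] <- rules[[i]]
  }
  out  # strings that match no rule stay NA so they can be inspected
}

cleaned <- reduce_status(the_status, rules)
table(cleaned, useNA = 'ifany')
```

Leaving unmatched strings as NA makes it easy to spot values in the real data that still need a rule.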
You can use merge():
d <- data.frame(candidate_id = 1:8, status = the_status)
red.tab <- data.frame(candidate_id = 1:8,
status = c('approved', 'approved', 'denied',
'denied', 'approved', 'denied',
'denied', 'denied'))
merge(d, red.tab, by="candidate_id")
I would do it like this:
- Determine the list of unique statuses that occur
unique(the_status)
- Code them manually:
code <- data.frame(orig_status=unique(the_status),
new_status=c("approved","denied",...))
# You have to do this step manually
- Merge the datasets
Example:
set.seed(50)
raw_data <- data.frame(orig_status=sample(the_status,replace=TRUE,100),
id=1:100)
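# Check the order of unique(raw_data$orig_status) first: new_status below must
# follow that order, and it can differ across R versions for the same seed.
code <- data.frame(orig_status=unique(raw_data$orig_status),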
new_status=c('denied','denied',
'approved','denied',
'approved','approved',
'denied','denied'))
code
clean_data <- merge(raw_data,code)
Coding 30 unique values by hand may well be much faster than searching for a programmatic approach.
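Note that merge() sorts its result on the key column by default, so the rows of clean_data no longer follow the original order. Here is a sketch of an alternative that keeps row order, using a named character vector as the lookup. The names lookup and unmapped are illustrative, and the mapping is written out by hand for the sample strings.

```r
# Sample data from the question.
the_status <- c('2: approved (newer)',
                '5: approved (extended)',
                '3: denied (not appealed)',
                '14: denied (not appealed/withdrawn)',
                '20: approved',
                '21: denied',
                '24: not approved within 21 days',
                '28: not approved in 21 days')

# Hand-coded mapping: one reduced label per original string, in the same order.
lookup <- setNames(c('approved', 'approved', 'denied', 'denied',
                     'approved', 'denied', 'denied', 'denied'),
                   the_status)

set.seed(50)
raw_data <- data.frame(orig_status = sample(the_status, 100, replace = TRUE),
                       id = 1:100,
                       stringsAsFactors = FALSE)

# Index the lookup by name; unname() drops the names carried over from lookup.
raw_data$new_status <- unname(lookup[raw_data$orig_status])

# Any string missing from the lookup would show up here.
unmapped <- unique(raw_data$orig_status[is.na(raw_data$new_status)])
```

Because the mapping is keyed by the original string rather than by position, it does not depend on the order in which unique() returns the values.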
Here is my solution.
the_status <- c('2: approved (newer)',
'5: approved (extended)',
'3: denied (not appealed)',
'14: denied (not appealed/withdrawn)',
'20: approved',
'21: denied',
'24: not approved within 21 days',
'28: not approved in 21 days')
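Use sapply, strsplit, and unlist to split the entries one at a time.
# Split off the numeric prefix at ": ", then cut the remainder at " ("
x = sapply(the_status, function(t){ a = unlist(strsplit(t, ": "));
b = unlist(strsplit(a[2], " \\("));
c(a[1],b[1]) })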
This returns a matrix; here it is transposed.
> t(x)
                                    [,1] [,2]
2: approved (newer)                 "2"  "approved"
5: approved (extended)              "5"  "approved"
3: denied (not appealed)            "3"  "denied"
14: denied (not appealed/withdrawn) "14" "denied"
20: approved                        "20" "approved"
21: denied                          "21" "denied"
24: not approved within 21 days     "24" "not approved within 21 days"
28: not approved in 21 days         "28" "not approved in 21 days"
Convert it to a data.frame and set the names.
df = data.frame(t(x))
rownames(df) = NULL
colnames(df) = c("candidate_id", "status")
Here is the result.
df
  candidate_id                      status
1            2                    approved
2            5                    approved
3            3                      denied
4           14                      denied
5           20                    approved
6           21                      denied
7           24 not approved within 21 days
8           28     not approved in 21 days
If you do not want the original IDs, you can simply replace them as follows:
df$candidate_id = 1:nrow(df)
or
df$candidate_id = rownames(df)
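The result above still contains the two 'not approved ...' strings, whereas the desired output in the question counts them as 'denied'. Here is a sketch of one way to finish. It assumes that every status beginning with 'not approved' should become 'denied'. The data frame is rebuilt by hand so the block runs on its own.

```r
# The data frame as printed above, rebuilt by hand.
df <- data.frame(candidate_id = c('2', '5', '3', '14', '20', '21', '24', '28'),
                 status = c('approved', 'approved', 'denied', 'denied',
                            'approved', 'denied',
                            'not approved within 21 days',
                            'not approved in 21 days'),
                 stringsAsFactors = FALSE)

# Assumption: anything starting with 'not approved' counts as 'denied'.
df$status[grepl('^not approved', df$status)] <- 'denied'

# Replace the original IDs with a running integer index.
df$candidate_id <- seq_len(nrow(df))
df
```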