How to deal with messy data - with variables not marked as missing
My data is messy, but most of it looks like this, with repeating sequences of 9 categories:
9 TI: Prize Structure and Information in Tournaments: Experimental Evidence
10 AU: Freeman, Richard B.; Gelber, Alexander M.
11 AF: NBER; NBER
12 SO: American Economic Journal: Applied Economics, 2(1), January 2010, pp. 149-64
13 IS: 1945-7782
14 AV: http://www.aeaweb.org/aej-applied/
15 DT: Journal Article
16 PY: 2010
17 AN: 1075725
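However, some sequences are missing certain variables/variable values, and these are not marked as NA. One example is the first sequence, which is missing the title (TI:) variable: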
1 AU: Duflo, Esther
2 AF: MIT
3 SO: American Economic Journal: Applied Economics, 2(2), April 2010, pp.
4 IS: 1945-7782
5 AV: http://www.aeaweb.org/aej-applied/
6 DT: Journal Article
7 PY: 2010
8 AN: 1094392
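Every sequence (or so I gather) will end with Article Number (AN). Apart from this, variables/variable values seem to be missing pretty much at random.
If I use the ordinary approach (assigning ID numbers by group, then the spread function), the order in my data changes, so the title of the second sequence ends up in the first sequence, and so on. I would like the data to come out in this format, with 'NA' for every missing variable value in each sequence: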
Sequence  TI:   AU:    AF:   SO:  IS:   AV:  DT:   PY:   AN:
1         'NA'  Duflo  MIT   AEJ  1945  aea  jour  2010  1094392
2         Priz  Freem  NBER  AEJ  1945  aea  jour  2010  1075725
Is there any way to do this with ordinary R functions?
I don't know if this helps (this data is very regular), but I'll attach the first 27 observations:
structure(list(category = c("AU:", "AF:", "SO:", "IS:", "AV:",
"DT:", "PY:", "AN:", "TI:", "AU:", "AF:", "SO:", "IS:", "AV:",
"DT:", "PY:", "AN:", "TI:", "AU:", "AF:", "SO:", "IS:", "AV:",
"DT:", "PY:", "AN:", "TI:"), value = c("Duflo, Esther", "MIT",
"American Economic Journal: Applied Economics, 2(2), April 2010, pp.",
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article",
"2010", "1094392", "Prize Structure and Information in Tournaments: Experimental Evidence",
"Freeman, Richard B.; Gelber, Alexander M.", "NBER; NBER", "American Economic Journal: Applied Economics, 2(1), January 2010, pp. 149-64",
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article",
"2010", "1075725", "Why Have College Completion Rates Declined? An Analysis of Changing Student Preparation and Collegiate Resources",
"Bound, John; Lovenheim, Michael F.; Turner, Sarah", "U MI; Cornell U; U VA",
"American Economic Journal: Applied Economics, 2(3), July 2010, pp. 129-57",
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article",
"2010", "1105792", "An Empirical Analysis of the Gender Gap in Mathematics"
)), row.names = c(NA, 27L), class = "data.frame")
I'll also add a link to the file I'm currently working on:
https://www.dropbox.com/s/wwaimr21eld2jg6/relevant.csv?dl=0
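Note that one of the values of "category" appears as "AN: Perspectives from a Cluster Analysis[...]", which is incorrect. I haven't managed to grep/delete this entry yet, and I'm too tired right now, sorry about that.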
Assuming you call the dataset dt:
library(tidyverse)

dt %>%
  # A new sequence starts at the first row and after every "AN:" row
  mutate(Sequence = cumsum(row_number() == 1 | lag(category) == "AN:")) %>%
  group_by(Sequence) %>%
  mutate(seq_ind = row_number(), n = n()) %>%
  ungroup() %>%
  # Put the sequences with the most fields first
  arrange(desc(n), seq_ind) %>%
  # Column order follows the field order of the largest sequence
  mutate(category = as_factor(category)) %>%
  select(-seq_ind, -n) %>%
  spread(key = category, value = value)
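Since the question asks about ordinary R functions, the same grouping idea (a new sequence starts after each "AN:" row) can also be sketched in base R. This is only a sketch under assumptions: the helper name reshape_records, the fixed field order in fields, and the shortened toy data are made up for illustration, and rows whose category is not in fields are silently dropped.

```r
# Hypothetical helper: reshape long category/value records into one row per sequence.
reshape_records <- function(d,
                            fields = c("TI:", "AU:", "AF:", "SO:", "IS:",
                                       "AV:", "DT:", "PY:", "AN:")) {
  category <- as.character(d$category)
  value <- as.character(d$value)

  # A sequence starts at row 1 and right after every "AN:" row
  prev <- c(NA, head(category, -1))
  seq_id <- cumsum(is.na(prev) | prev == "AN:")

  # Start from an all-NA matrix so missing fields stay NA
  out <- matrix(NA_character_, nrow = max(seq_id), ncol = length(fields),
                dimnames = list(NULL, fields))

  # Fill each (sequence, field) cell; unknown categories are skipped
  keep <- category %in% fields
  out[cbind(seq_id[keep], match(category[keep], fields))] <- value[keep]

  data.frame(Sequence = seq_len(nrow(out)), out,
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Toy data (assumption): a shortened version of the first two sequences
toy <- data.frame(
  category = c("AU:", "AF:", "AN:", "TI:", "AU:", "AN:"),
  value = c("Duflo, Esther", "MIT", "1094392",
            "Prize Structure and Information in Tournaments: Experimental Evidence",
            "Freeman, Richard B.; Gelber, Alexander M.", "1075725"),
  stringsAsFactors = FALSE
)
res <- reshape_records(toy)
res[, c("Sequence", "TI:", "AU:", "AN:")]
```

Because the result starts as an all-NA matrix, any field a sequence lacks stays NA, and the column order is fixed by fields rather than by whichever sequence happens to come first.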
Based on "Every sequence (or so I gather) will end with Article Number (AN). Apart from this, variables/variable values seem to be missing pretty much at random", I am assuming that "AN:" is never missing:
df %>%
  mutate(
    ID = cumsum(category == "AN:") - (category == "AN:") + 1
  ) %>%
  spread(category, value)
# A tibble: 4 x 10
     ID `AF:`     `AN:`  `AU:`            `AV:`        `DT:`    `IS:` `PY:` `SO:`                 `TI:`
  <dbl> <chr>     <chr>  <chr>            <chr>        <chr>    <chr> <chr> <chr>                 <chr>
1     1 MIT       10943~ Duflo, Esther    http://www.~ Journal~ 1945~ 2010  American Economic Jo~ <NA>
2     2 NBER; NB~ 10757~ Freeman, Richar~ http://www.~ Journal~ 1945~ 2010  American Economic Jo~ Prize Structure and Informat~
3     3 U MI; Co~ 11057~ Bound, John; Lo~ http://www.~ Journal~ 1945~ 2010  American Economic Jo~ Why Have College Completion ~
4     4 <NA>      <NA>   <NA>             <NA>         <NA>     <NA>  <NA>  <NA>                  An Empirical Analysis of the~
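In newer versions of tidyr, spread() is superseded by pivot_wider(). A minimal sketch of the same reshape, assuming tidyr >= 1.0.0 and dplyr >= 1.0.0. The toy tibble and the fields vector (used only to put the columns in the order asked for in the question) are assumptions, not part of the original answer:

```r
library(dplyr)
library(tidyr)

# Toy data (assumption): a shortened version of the first two sequences
df <- tibble(
  category = c("AU:", "AF:", "AN:", "TI:", "AU:", "AN:"),
  value = c("Duflo, Esther", "MIT", "1094392",
            "Prize Structure and Information in Tournaments: Experimental Evidence",
            "Freeman, Richard B.; Gelber, Alexander M.", "1075725")
)

# Desired column order, taken from the table in the question
fields <- c("TI:", "AU:", "AF:", "SO:", "IS:", "AV:", "DT:", "PY:", "AN:")

wide <- df %>%
  # Same ID logic as above: the "AN:" row closes the current sequence
  mutate(ID = cumsum(category == "AN:") - (category == "AN:") + 1) %>%
  pivot_wider(id_cols = ID, names_from = category, values_from = value) %>%
  # any_of() tolerates fields that never occur in the data
  select(ID, any_of(fields))

wide
```

If a category were repeated within one sequence, pivot_wider() would warn and return list-columns, which is itself a useful signal that the "AN:" boundary assumption is broken somewhere.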
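Note:
Here is a way to test whether "AN:" is a good indicator of the end of a sequence. The code below counts the number of rows between consecutive "AN:" entries. If all of these gaps are <= 9, it returns FALSE, which means "AN:" is a good end-of-sequence indicator. If not, I suggest hosting the actual file somewhere so that people can look at the whole dataset.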
df %>%
  group_split(ID = cumsum(category == "AN:") - (category == "AN:") + 1) %>%
  sapply(nrow) %>%
  {any(. > 9)}
[1] FALSE
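The same check can be done without group_split(), and extended to list category values that are not one of the nine expected tags (such as the stray "AN: Perspectives from a Cluster Analysis[...]" entry mentioned in the question). A rough sketch; check_records, fields and the toy data are hypothetical:

```r
# The nine tags seen in the question, in their usual order
fields <- c("TI:", "AU:", "AF:", "SO:", "IS:", "AV:", "DT:", "PY:", "AN:")

# Hypothetical helper: report suspicious rows before reshaping
check_records <- function(d, fields) {
  category <- as.character(d$category)

  # Number of rows in each sequence, i.e. the gap between consecutive "AN:" rows
  an_rows <- which(category == "AN:")
  seq_lengths <- diff(c(0, an_rows))

  list(
    unexpected = which(!(category %in% fields)),      # row numbers with an unknown tag
    too_long = which(seq_lengths > length(fields))   # sequences with more rows than tags
  )
}

# Toy data (assumption), including one malformed category value
toy <- data.frame(
  category = c("AU:", "AN:", "TI:",
               "AN: Perspectives from a Cluster Analysis[...]", "AU:", "AN:"),
  value = c("Duflo, Esther", "1094392", "Some title",
            "stray text", "Some author", "1075725"),
  stringsAsFactors = FALSE
)

chk <- check_records(toy, fields)
chk$unexpected  # 4
chk$too_long    # integer(0)
```

Rows listed in unexpected could then be inspected by hand and either repaired or dropped with toy[-chk$unexpected, ] before reshaping.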