How to deal with messy data - with variables not marked as missing

My data is messy, but most of it looks like this, as a repeating sequence of 9 categories:

9   TI: Prize Structure and Information in Tournaments: Experimental Evidence
10  AU: Freeman, Richard B.; Gelber, Alexander M.
11  AF: NBER; NBER
12  SO: American Economic Journal: Applied Economics, 2(1), January 2010, pp. 149-64
13  IS: 1945-7782
14  AV: http://www.aeaweb.org/aej-applied/
15  DT: Journal Article
16  PY: 2010
17  AN: 1075725

However, some sequences are missing some variables/variable values, and these are not marked as NA. One example is the first sequence, which is missing the title (TI:) variable:

1   AU: Duflo, Esther
2   AF: MIT
3   SO: American Economic Journal: Applied Economics, 2(2), April 2010, pp.
4   IS: 1945-7782
5   AV: http://www.aeaweb.org/aej-applied/
6   DT: Journal Article
7   PY: 2010
8   AN: 1094392

Every sequence (or so I gather) will end with Article Number (AN). Apart from this, variables/variable values seem to be missing pretty much at random.

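If I use the usual approach (assigning ID numbers by group, then applying the spread function), the order in my data shifts, so that the title of the second sequence ends up in the first sequence, and so on. I would like the data to come out in this format, with 'NA' for every missing variable value in each sequence: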

Sequence    TI:   AU:   AF:   SO:   IS:   AV:   DT:   PY:    AN:       
    1      'NA'  Duflo MIT   AEJ   1945  aea   jour  2010   1094392
    2       Priz Freem NBER  AEJ   1945  aea   jour  2010   1075725

Is there any way to do this with ordinary R functions?

I don't know if this helps (the data is very regular), but I'll attach the first 27 observations:

structure(list(category = c("AU:", "AF:", "SO:", "IS:", "AV:", 
"DT:", "PY:", "AN:", "TI:", "AU:", "AF:", "SO:", "IS:", "AV:", 
"DT:", "PY:", "AN:", "TI:", "AU:", "AF:", "SO:", "IS:", "AV:", 
"DT:", "PY:", "AN:", "TI:"), value = c("Duflo, Esther", "MIT", 
"American Economic Journal: Applied Economics, 2(2), April 2010, pp.", 
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article", 
"2010", "1094392", "Prize Structure and Information in Tournaments: Experimental Evidence", 
"Freeman, Richard B.; Gelber, Alexander M.", "NBER; NBER", "American Economic Journal: Applied Economics, 2(1), January 2010, pp. 149-64", 
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article", 
"2010", "1075725", "Why Have College Completion Rates Declined? An Analysis of Changing Student Preparation and Collegiate Resources", 
"Bound, John; Lovenheim, Michael F.; Turner, Sarah", "U MI; Cornell U; U VA", 
"American Economic Journal: Applied Economics, 2(3), July 2010, pp. 129-57", 
"1945-7782", "http://www.aeaweb.org/aej-applied/", "Journal Article", 
"2010", "1105792", "An Empirical Analysis of the Gender Gap in Mathematics"
)), row.names = c(NA, 27L), class = "data.frame")

I'll also add a link to the file I am currently working with:

https://www.dropbox.com/s/wwaimr21eld2jg6/relevant.csv?dl=0

Note that one of the values of "category" shows up as "AN: Perspectives from a Cluster Analysis[...]", which is incorrect. I haven't managed to grep/delete this entry yet; I'm too tired right now, sorry about that.

Assuming you call the dataset dt:

library(tidyverse)
dt %>% mutate(Sequence = cumsum(row_number() == 1 | lag(category) == 'AN:')) %>% 
    group_by(Sequence) %>% 
    mutate(seq_ind = row_number(), n = n()) %>% 
    ungroup %>% 
    arrange(desc(n), seq_ind) %>% # put the sequences with the most fields first
    mutate(category = as_factor(category)) %>% # column order follows the field order of the largest sequence
    select(-seq_ind, -n) %>% 
    spread(key = category, value = value)

Based on "Every sequence (or so I gather) will end with Article Number (AN). Apart from this, variables/variable values seem to be missing pretty much at random", I assume that "AN:" is never missing. With the data in df:

df %>% 
  mutate(
    ID = cumsum(category == "AN:") - (category == "AN:") + 1
  ) %>% 
  spread(category, value)

# A tibble: 4 x 10
     ID `AF:`     `AN:`  `AU:`            `AV:`        `DT:`    `IS:` `PY:` `SO:`                 `TI:`                        
  <dbl> <chr>     <chr>  <chr>            <chr>        <chr>    <chr> <chr> <chr>                 <chr>                        
1     1 MIT       10943~ Duflo, Esther    http://www.~ Journal~ 1945~ 2010  American Economic Jo~ <NA>                         
2     2 NBER; NB~ 10757~ Freeman, Richar~ http://www.~ Journal~ 1945~ 2010  American Economic Jo~ Prize Structure and Informat~
3     3 U MI; Co~ 11057~ Bound, John; Lo~ http://www.~ Journal~ 1945~ 2010  American Economic Jo~ Why Have College Completion ~
4     4 <NA>      <NA>   <NA>             <NA>         <NA>     <NA>  <NA>  <NA>                  An Empirical Analysis of the~
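
Since the question asks about ordinary R functions, here is a minimal base R sketch of the same idea (number the records by counting "AN:" rows, then reshape to wide). The small data frame is a made-up stand-in with shortened values, not the real file, and the assumption is still that every record ends with an "AN:" row.

```r
# Made-up sample in the same long format as the question's data
# (shortened values; the first record has no "TI:" row).
df <- data.frame(
  category = c("AU:", "AF:", "AN:", "TI:", "AU:", "AF:", "AN:"),
  value    = c("Duflo, Esther", "MIT", "1094392",
               "Prize Structure", "Freeman, Richard B.", "NBER", "1075725"),
  stringsAsFactors = FALSE
)

# A record ends at each "AN:" row, so a row's ID is the number of
# "AN:" rows strictly before it, plus one.
is_end <- df$category == "AN:"
df$ID  <- cumsum(is_end) - is_end + 1

# One row per record; categories absent from a record are filled with NA.
wide <- reshape(df, idvar = "ID", timevar = "category", direction = "wide")

# reshape() names the new columns "value.<category>"; drop that prefix.
names(wide) <- sub("^value\\.", "", names(wide))

print(wide)
```

In this sketch the columns appear in the order in which the categories first occur in the data, so "TI:" ends up last here; reorder the columns afterwards if a fixed layout is needed.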

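Note:

Here is a way to test whether "AN:" is a good indicator of the end of a sequence. The code below counts the number of rows between consecutive "AN:" entries. If all of these gaps are <= 9, it returns FALSE, which means "AN:" is a good end-of-sequence indicator. If not, I suggest hosting the actual file somewhere so that people can look at the whole dataset.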

df %>% 
  group_split(ID = cumsum(category == "AN:") - (category == "AN:") + 1) %>% 
  sapply(nrow) %>% 
  {any(. > 9)}

[1] FALSE
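
The question also mentions a malformed "category" value that starts with "AN:" but carries extra text. A row like that would not be counted as a record end by the code above. One possible way to drop such rows before computing the ID is sketched below, under the assumption (taken from the nine categories shown in the question) that every valid category is exactly two uppercase letters followed by a colon. The sample data and the stray value in it are invented for illustration.

```r
# Made-up sample; the second row imitates a category with stray text in it.
df <- data.frame(
  category = c("AU:", "AN: some stray text", "AN:", "TI:"),
  value    = c("Duflo, Esther", "x", "1094392", "Prize Structure"),
  stringsAsFactors = FALSE
)

# Assumption: a valid category is two uppercase letters plus a colon.
is_valid <- grepl("^[A-Z]{2}:$", df$category)

# Look at what would be dropped before actually dropping it.
print(df[!is_valid, ])

clean <- df[is_valid, ]
```

Whether dropping is the right call depends on what the stray row really is in the full file, so it is worth inspecting the rejected rows first.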