R: reading old 13F txt files from the SEC EDGAR database using the R edgar package

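Hi, I am trying to read 13F filings from the SEC EDGAR database using the R edgar package.

The challenge is that the filings I am looking at are old (around 2000): https://www.sec.gov/edgar/browse/?CIK=1087699

They are in a crappy txt format, unlike today's 13F filings, and cannot be read with the readtxt function.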



  cik.no = "0001087699",
  form.type = "13F-HR",

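I tried this, and R just tells me it is busy and keeps downloading forever, even though it is not a large txt file, so something is wrong. Then, when it finally finishes, it says that no filing information was found for the given CIK and form type, yet I am clearly looking at the filings. If the edgar package is not designed to handle these, what should I do?


Is there any scraping approach available? I highlighted the lines using Inspect in Chrome, but they look weird to me (sorry, I am not good at scraping at all).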


> install.packages("httr")
# follow instructions etc

Then, in the R shell (you may need to restart it):

> httr::GET("https://www.sec.gov/Archives/edgar/data/1087699/000108769999000001/0001087699-99-000001.txt")

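This successfully downloads the file, but I am not fluent enough in R to parse the text. It looks simple, though: split the text on <TABLE>, split the table into rows on line breaks, and split each row on whitespace to get the columns.

I parsed the file you provided as an example. I first copied the data from the file into a txt file. The file copied.txt needs to be in the current working directory. This should give you an idea of how to proceed.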
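The steps just described can be sketched in base R roughly as follows. This is only an illustration under stated assumptions, not a tested parser for the real filing: `fetch_filing` and `parse_13f_table` are hypothetical helper names, the contact string is a placeholder, and the sample text is a made-up two-row layout (using values shown elsewhere in this thread) rather than the actual file contents. The sketch splits on runs of two or more spaces instead of on every space, assuming that is how columns are separated, so that issuer names containing single spaces stay intact. SEC asks automated clients to declare a User-Agent, so the download helper sets one.

```r
# Hypothetical helper: download one filing as text (requires the httr package).
# Assumption: `contact` is a placeholder; put your own name and email there.
fetch_filing <- function(url, contact = "Your Name your.email@example.com") {
  resp <- httr::GET(url, httr::user_agent(contact))
  httr::stop_for_status(resp)  # fail loudly on HTTP errors such as 403
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Hypothetical helper: turn the text between <TABLE> and </TABLE> into a
# data frame of character columns.
parse_13f_table <- function(raw) {
  # 1. Keep only the text between <TABLE> and </TABLE>.
  body <- sub("(?s).*<TABLE>\\s*(.*?)\\s*</TABLE>.*", "\\1", raw, perl = TRUE)
  # 2. Split the block into lines, trim them, and drop empty ones.
  lines <- trimws(strsplit(body, "\r?\n")[[1]])
  lines <- lines[nzchar(lines)]
  # 3. Split each line on runs of two or more spaces. Rows with differing
  #    field counts would need extra handling that is not shown here.
  fields <- strsplit(lines, " {2,}")
  as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
}

# Made-up sample mimicking an old text filing; real spacing may differ.
raw <- paste(
  "<TABLE>",
  "America Online     COM    02364J104    2,940,000    20,000",
  "Bay View Capital   COM    07262L101   14,958,437   792,500",
  "</TABLE>",
  sep = "\n"
)

holdings <- parse_13f_table(raw)
names(holdings) <- c("issuer", "class", "cusip", "value", "shares")

# Strip thousands separators before converting to numbers.
holdings$value  <- as.numeric(gsub(",", "", holdings$value))
holdings$shares <- as.numeric(gsub(",", "", holdings$shares))
```

For a real filing, the same parser could be fed the result of `fetch_filing(url)`, although header rows and any extra columns in the actual table would likely need to be removed first.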


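library(tidyverse)  # provides read_file, the str_* functions, as_tibble, select and set_names

df <- read_file("copied.txt") %>%
  # trying to extract data only from the table
  (function(x) {
    tbl_beg <- str_locate(x, "Managers Sole")[2] + 1
    tbl_end <- str_locate(x, "\r\n</TABLE>")[1]
    str_sub(x, tbl_beg, tbl_end)
    }) %>%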
  # removing some unwanted characters from the beginning and the end of the extracted string
  str_sub(start = 4, end = -3) %>%
  # splitting for individual lines
  str_split('\"\r\n\"') %>% unlist() %>%
  # removing broken line break
  str_remove("\r\n") %>%
  # collapsing the space-separated text "Sole   Managers Sole" into one underscore-joined token,
  # because the rows are split into columns on spaces further down
  str_replace_all("Sole   Managers Sole", " Managers_Sole") %>%
  # removing extra spaces
  str_squish() %>%
  # reversing each line so that it can be split from the right: the company name contains spaces,
  # so it has to end up as the last field, where the remaining spaces are harmless
  stringi::stri_reverse() %>%
  str_split(pattern = " ", n = 6, simplify = TRUE) %>%
  # restoring the original character order in every field
  apply(MARGIN = 2, FUN = stringi::stri_reverse) %>%
  as_tibble() %>%
  select(c(6:1)) %>%
  set_names(nm = c("name_of_issuer", "title_of_cl", "cusip_number", "fair_market_value", "shares",  "shares_of_princip_mngrs"))

# A tibble: 47 x 6
   name_of_issuer   title_of_cl cusip_number fair_market_value shares  shares_of_princip_mngrs
   <chr>            <chr>       <chr>        <chr>             <chr>   <chr>                  
 1 America Online   COM         02364J104    2,940,000         20,000  Managers_Sole          
 2 Anheuser Busch   COM         35229103     3,045,000         40,000  Managers_Sole          
 3 At Home          COM         45919107     787,500           5,000   Managers_Sole          
 4 AT&T             COM         1957109      5,985,937         75,000  Managers_Sole          
 5 Bank Toyko       COM         65379109     700,000           50,000  Managers_Sole          
 6 Bay View Capital COM         07262L101    14,958,437        792,500 Managers_Sole          
 7 Broadcast.com    COM         111310108    2,954,687         25,000  Managers_Sole          
 8 Chase Manhattan  COM         16161A108    10,578,750        130,000 Managers_Sole          
 9 Chase Manhattan  4/85C       16161A9DQ    59,375            500     Managers_Sole          
10 Cisco Systems    COM         17275R102    4,930,312         45,000  Managers_Sole