R: reading old 13F txt files from SEC Edgar database using R edgar package
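Hello, I am trying to read 13F filings from the SEC Edgar database using the R edgar package.
The challenge is that the filings I am looking at are old (around 2000):
https://www.sec.gov/edgar/browse/?CIK=1087699
They are in a clunky txt format, unlike today's 13F filings, and cannot be read with the readtxt function.
A sample file is here: https://www.sec.gov/Archives/edgar/data/1087699/000108769999000001/0001087699-99-000001.txt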
library(edgar)

# Request the 13F-HR filings for this CIK for Q1-Q3 of 1999
F13 <- getFilings(
  cik.no = "0001087699",
  form.type = "13F-HR",
  filing.year = 1999,
  quarter = c(1, 2, 3),
  useragent = "myname@gmail.com"
)
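I tried this, and R just tells me it is busy and keeps downloading forever, even though it is not a large txt file, so something is wrong. When it finally finishes, it says no filing information was found for the given CIK and form type, even though I am clearly looking at the file. If the edgar package is not designed to handle this, what can I do?
My end goal is to get the file into a nice data frame, with columns for ticker and price and rows for the stock data. Please help.
Is there a scraping approach I could use? I highlighted the lines with Inspect in Chrome, but they look weird to me (sorry, I am not good at scraping at all).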
You can use the httr package to request the page:
> install.packages("httr")
# follow instructions etc
Then, in the R shell (you may need to restart it):
> httr::GET("https://www.sec.gov/Archives/edgar/data/1087699/000108769999000001/0001087699-99-000001.txt")
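This downloads the file successfully. I am not fluent enough in R to parse the text, but it looks straightforward: split the text on <TABLE>, split the result into rows on line breaks, and split each row on whitespace to get the columns.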
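The steps described above (split on <TABLE>, then on line breaks, then on whitespace) could be sketched in base R roughly as follows. The names fetch_filing_text, parse_13f_table and sample_txt are made up for illustration, and the sample text is a simplified, hypothetical layout that reuses three rows from this filing; the real file may need a different row pattern. The helper sends a User-Agent header on the assumption that SEC expects one, and it is defined here but not called.

```r
# Hypothetical helper: download a filing as one string (needs the httr package).
# Defined for illustration only; it is not called in this sketch.
fetch_filing_text <- function(url, contact) {
  resp <- httr::GET(url, httr::user_agent(contact))
  httr::stop_for_status(resp)
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Hypothetical parser: pull the rows of the first <TABLE> block into a data frame.
parse_13f_table <- function(txt) {
  # Step 1: split on the opening/closing TABLE tags; element 2 is the first table body
  parts <- strsplit(txt, "(?i)</?TABLE>", perl = TRUE)[[1]]
  if (length(parts) < 2) stop("no <TABLE> block found")
  # Step 2: split the table body into trimmed lines
  lines <- trimws(strsplit(parts[2], "\r?\n", perl = TRUE)[[1]])
  # Step 3: split each line from the right, because issuer names contain spaces.
  # Assumed layout: name, class, CUSIP, value, shares, one trailing token.
  row_pattern <- "^(.*\\S)\\s+(\\S+)\\s+(\\S+)\\s+([0-9,]+)\\s+([0-9,]+)\\s+(\\S+)$"
  hits <- regmatches(lines, regexec(row_pattern, lines, perl = TRUE))
  # Header and blank lines do not match the pattern and are dropped here
  hits <- Filter(function(m) length(m) == 7, hits)
  if (length(hits) == 0) stop("no data rows matched the assumed layout")
  mat <- do.call(rbind, hits)
  data.frame(
    name_of_issuer    = mat[, 2],
    title_of_class    = mat[, 3],
    cusip             = mat[, 4],
    fair_market_value = mat[, 5],
    shares            = mat[, 6],
    discretion        = mat[, 7],
    stringsAsFactors  = FALSE
  )
}

# Simplified stand-in for the downloaded text (layout is an assumption)
sample_txt <- paste(
  "FORM 13F INFORMATION TABLE",
  "<TABLE>",
  "Name of Issuer    Title of Cl  CUSIP Number  Fair Market Value  Shares  Discretion",
  "America Online    COM          02364J104     2,940,000          20,000  Sole",
  "AT&T              COM          1957109       5,985,937          75,000  Sole",
  "Chase Manhattan   4/85C        16161A9DQ     59,375             500     Sole",
  "</TABLE>",
  sep = "\r\n"
)

sample_holdings <- parse_13f_table(sample_txt)
print(sample_holdings)
```

Matching each row against a single pattern anchored at the right-hand side avoids having to guess how many words the issuer name contains. All columns stay as character strings at this stage.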
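As an example, I parsed the file you provided. I first copied the data from the file into a txt file; that file, copied.txt, needs to be in the current working directory. This should give you an idea of how to proceed.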
library(tidyverse)

df <- read_file("copied.txt") %>%
  # extract only the data rows of the table
  (function(x) {
    tbl_beg <- str_locate(x, "Managers Sole")[2] + 1
    tbl_end <- str_locate(x, "\r\n</TABLE>")[1]
    str_sub(x, tbl_beg, tbl_end)
  }) %>%
  # remove unwanted characters from the beginning and the end of the extracted string
  str_sub(start = 4, end = -3) %>%
  # split into individual lines
  str_split('\"\r\n\"') %>%
  unlist() %>%
  # remove a broken line break
  str_remove("\r\n") %>%
  # replace the space-separated text with an underscore version,
  # because the rows are later split into columns on spaces
  str_replace_all("Sole Managers Sole", " Managers_Sole") %>%
  # collapse extra spaces
  str_squish() %>%
  # reverse each line so it can be split from the right: the company name
  # contains spaces, and as the last piece it may keep them
  stringi::stri_reverse() %>%
  str_split(pattern = " ", n = 6, simplify = TRUE) %>%
  # reverse each cell back to its original reading direction
  apply(MARGIN = 2, FUN = stringi::stri_reverse) %>%
  as_tibble() %>%
  # restore the original column order
  select(c(6:1)) %>%
  set_names(nm = c("name_of_issuer", "title_of_cl", "cusip_number", "fair_market_value", "shares", "shares_of_princip_mngrs"))
# A tibble: 47 x 6
   name_of_issuer   title_of_cl cusip_number fair_market_value shares  shares_of_princip_mngrs
   <chr>            <chr>       <chr>        <chr>             <chr>   <chr>
 1 America Online   COM         02364J104    2,940,000         20,000  Managers_Sole
 2 Anheuser Busch   COM         35229103     3,045,000         40,000  Managers_Sole
 3 At Home          COM         45919107     787,500           5,000   Managers_Sole
 4 AT&T             COM         1957109      5,985,937         75,000  Managers_Sole
 5 Bank Toyko       COM         65379109     700,000           50,000  Managers_Sole
 6 Bay View Capital COM         07262L101    14,958,437        792,500 Managers_Sole
 7 Broadcast.com    COM         111310108    2,954,687         25,000  Managers_Sole
 8 Chase Manhattan  COM         16161A108    10,578,750        130,000 Managers_Sole
 9 Chase Manhattan  4/85C       16161A9DQ    59,375            500     Managers_Sole
10 Cisco Systems    COM         17275R102    4,930,312         45,000  Managers_Sole
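Every column in the tibble above is still character, so the values need converting before any arithmetic. A minimal base R sketch of that step follows, using the first three rows shown above. The names holdings, to_number and value_per_share are hypothetical, and the per-share figure assumes the value column is in whole dollars, which you should check against the filing.

```r
# First three rows from the parsed output, kept as character strings
holdings <- data.frame(
  name_of_issuer    = c("America Online", "Anheuser Busch", "At Home"),
  fair_market_value = c("2,940,000", "3,045,000", "787,500"),
  shares            = c("20,000", "40,000", "5,000"),
  stringsAsFactors  = FALSE
)

# Strip the thousands separators, then convert to numeric
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

holdings$fair_market_value <- to_number(holdings$fair_market_value)
holdings$shares            <- to_number(holdings$shares)

# Implied value per share (assumes fair_market_value is in whole dollars)
holdings$value_per_share <- holdings$fair_market_value / holdings$shares

print(holdings)
```

The CUSIP column is best left as character, since converting it to a number would drop letters and any leading zeros.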