Extracting data from a pattern in R
I'd like to convert the data in this text file into a format that I can load into a MySQL Workbench database.
https://sbir.nasa.gov/SBIR/abstracts/17-1.html
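I want to run some R code that gives me the firm name following every line headed
"SMALL BUSINESS CONCERN: (Firm Name, Mail Address, City/State/ZIP, Phone)"
For example, I'm looking for output like this:
Transition45 Technologies, Inc.
ATSP Innovations
and so on, which I can then load into a database column.
I hope this makes sense; I'm fairly new to this. Thanks.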
Your question is not entirely clear.
If I understand correctly, you want to extract the address details written after each "SMALL BUSINESS CONCERN: (Firm Name, Mail Address, City/State/ZIP, Phone)" line, right? If so, then:
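# Read the page's raw HTML, one element per line
url <- "https://sbir.nasa.gov/SBIR/abstracts/17-1.html"
abstracts_page <- readLines(url)
# Strip HTML tags and tab characters
abstracts_page <- gsub("<.*?>", "", abstracts_page)
abstracts_page <- gsub("\t+", "", abstracts_page)
# Locate each "SMALL BUSINESS CONCERN:" header line
address_header_index <- grep("SMALL BUSINESS CONCERN:", abstracts_page)
# The five detail lines (name, street, city/state, ZIP, phone) start two lines below each header
address_list <- lapply(address_header_index, function(i) {
  return(abstracts_page[(i + 2):(i + 6)])
})
# Bind the per-firm vectors into one data frame, one row per firm
address_list <- data.frame(do.call("rbind", address_list))
head(address_list)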
#   X1                                        X2                                    X3
# 1 Transition45 Technologies, Inc.           1739 North Case Street                Orange, CA
# 2 ATSP Innovations                          60 Hazelwood Drive                    Champaign, IL
# 3 Cornerstone Research Group, Inc.          2750 Indian Ripple Road               Dayton, OH
# 4 Interdisciplinary Consulting Corporation  5745 Southwest 75th Street, #364      Gainesville, FL
# 5 CFD Research Corporation                  701 McMillian Way Northwest, Suite D  Huntsville, AL
# 6 LaunchPoint Technologies, Inc.            5735 Hollister Avenue, Suite B        Goleta, CA
#   X4          X5
# 1 92865-4211  (714) 283-2118
# 2 61820-7460  (217) 417-2374
# 3 45440-3638  (937) 320-1877
# 4 32608-5504  (352) 283-8110
# 5 35806-2923  (256) 726-4800
# 6 93117-6410  (805) 683-9659
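If only the firm names are needed, as the question asks, the same idea can be narrowed to the single line two below each header. The following is a minimal sketch, not a tested solution for the live page: it uses a small hypothetical in-memory sample in place of the downloaded page, assumes the same layout the code above relies on (header, one blank line, then the firm name), and the column name `firm_name` and the temporary CSV path are illustrative choices. A CSV file is one format that MySQL Workbench's table data import wizard should be able to read.

```r
# Hypothetical stand-in for the cleaned page text; in practice this would be
# the abstracts_page vector produced by readLines() and gsub() above.
abstracts_page <- c(
  "SMALL BUSINESS CONCERN: (Firm Name, Mail Address, City/State/ZIP, Phone)",
  "",
  "  Transition45 Technologies, Inc.",
  "1739 North Case Street",
  "Orange, CA",
  "92865-4211",
  "(714) 283-2118",
  "(other page text)",
  "SMALL BUSINESS CONCERN: (Firm Name, Mail Address, City/State/ZIP, Phone)",
  "",
  "ATSP Innovations",
  "60 Hazelwood Drive",
  "Champaign, IL",
  "61820-7460",
  "(217) 417-2374"
)

# Find the header lines; fixed = TRUE treats the pattern as a literal string.
header_index <- grep("SMALL BUSINESS CONCERN:", abstracts_page, fixed = TRUE)

# Assumption: the firm name always sits two lines below the header.
firm_names <- trimws(abstracts_page[header_index + 2])

# One-column data frame, written to a CSV file for import into the database.
firms <- data.frame(firm_name = firm_names, stringsAsFactors = FALSE)
csv_path <- tempfile(fileext = ".csv")  # illustrative output location
write.csv(firms, csv_path, row.names = FALSE)

print(firms)
```

If the page layout varies between entries (for example, a missing blank line), the fixed `+ 2` offset would pick up the wrong line, so it is worth spot-checking a few rows against the page before importing.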