Extracting data from a pattern in R

I am interested in converting the data in this text file into a format that can be loaded into a MySQL Workbench database.

https://sbir.nasa.gov/SBIR/abstracts/17-1.html

I would like to run some R code that gives me the company name after each line headed

"SMALL BUSINESS CONCERN: (Firm Name, Mail Address, City/State/ZIP, Phone)"

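For example, I am looking for output like the following:

Transition45 Technologies, Inc.
ATSP Innovations

and so on, which I can then load into a database column.

I hope this makes sense; I am relatively new to this. Thanks.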

Your question is not entirely clear.

If I understand correctly, you want to extract the address details written after each "SMALL BUSINESS CONCERN: (Firm Name, Mail Address, City/State/ZIP, Phone)" line, right? If so, then:

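url <- "https://sbir.nasa.gov/SBIR/abstracts/17-1.html"

# Read the page line by line, then strip HTML tags and tab characters
abstracts_page <- readLines(url)
abstracts_page <- gsub("<.*?>", "", abstracts_page)
abstracts_page <- gsub("\t+", "", abstracts_page)

# Find the line number of every address header
address_header_index <- grep("SMALL BUSINESS CONCERN:", abstracts_page)

# The five address fields sit on lines i + 2 through i + 6 after each header
address_list <- lapply(address_header_index, function(i) {
  abstracts_page[(i + 2):(i + 6)]
})

# Bind the per-firm vectors into one data frame, one row per firm
address_list <- data.frame(do.call("rbind", address_list))

head(address_list)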

#                                          X1                                   X2                   X3
# 1          Transition45 Technologies, Inc.                1739 North Case Street      Orange,&nbsp;CA
# 2                         ATSP Innovations                    60 Hazelwood Drive   Champaign,&nbsp;IL
# 3         Cornerstone Research Group, Inc.               2750 Indian Ripple Road      Dayton,&nbsp;OH
# 4 Interdisciplinary Consulting Corporation      5745 Southwest 75th Street, #364 Gainesville,&nbsp;FL
# 5                 CFD Research Corporation  701 McMillian Way Northwest, Suite D  Huntsville,&nbsp;AL
# 6           LaunchPoint Technologies, Inc.        5735 Hollister Avenue, Suite B      Goleta,&nbsp;CA

#            X4             X5
# 1 92865-4211  (714) 283-2118
# 2 61820-7460  (217) 417-2374
# 3 45440-3638  (937) 320-1877
# 4 32608-5504  (352) 283-8110
# 5 35806-2923  (256) 726-4800
# 6 93117-6410  (805) 683-9659
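
If only the firm names are needed, as the question asks, the same offset idea can be narrowed down. The following is a minimal sketch, not a tested scraper for the live page: `sample_page` is a hypothetical stand-in for `abstracts_page` after tag stripping, the blank line after each header is an assumption inferred from the `i + 2` offset above, and the column names and CSV path are made up for illustration. It also replaces the leftover `&nbsp;` entity that is visible in the X3 column.

```r
# Hypothetical stand-in for `abstracts_page` after the tags have been stripped.
# Assumption: one blank line separates each header from the firm name.
sample_page <- c(
  "(other page content)",
  "SMALL BUSINESS CONCERN: (Firm Name, Mail Address, City/State/ZIP, Phone)",
  "",
  "Transition45 Technologies, Inc.",
  "1739 North Case Street",
  "Orange,&nbsp;CA",
  "92865-4211",
  "(714) 283-2118",
  "SMALL BUSINESS CONCERN: (Firm Name, Mail Address, City/State/ZIP, Phone)",
  "",
  "ATSP Innovations",
  "60 Hazelwood Drive",
  "Champaign,&nbsp;IL",
  "61820-7460",
  "(217) 417-2374"
)

# Locate each header line; fixed = TRUE matches the text literally
header_index <- grep("SMALL BUSINESS CONCERN:", sample_page, fixed = TRUE)

# The firm name is assumed to sit two lines below each header
firm_names <- trimws(sample_page[header_index + 2])

# City/state sits four lines below; swap the HTML entity for a plain space
city_state <- gsub("&nbsp;", " ", sample_page[header_index + 4], fixed = TRUE)

# One row per firm, with illustrative column names
firms <- data.frame(
  firm_name = firm_names,
  city_state = city_state,
  stringsAsFactors = FALSE
)

# Write a CSV file (temporary path here) that a database import tool can read
csv_path <- tempfile(fileext = ".csv")
write.csv(firms, csv_path, row.names = FALSE)

print(firms$firm_name)
```

A CSV file like this should be loadable through the table import feature in MySQL Workbench, although the exact import steps depend on the Workbench version. If the page layout differs from the assumed offsets, the `+ 2` and `+ 4` values would need adjusting.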