Values are not getting entered in dataframe from web scraping
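My main goal is to extract content from a website and save it locally. When the website content is updated, the local data should reflect the change as well.
I am able to read data from the web page used in the code below, and now I want to save the result into a data frame so that I can export it. I want the values of x6 to go into the data frame df so that I can export the data frame to a text file or an Excel file. Alternatively, you may suggest any other way to extract the data from the web page used in the code (web scraping). I think my for loop is not working, so any help is appreciated.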
library(rvest)
library(dplyr)
library(qdapRegex) # install.packages("qdapRegex")
google <- read_html("https://bidplus.gem.gov.in/bidresultlists")
(x <- google %>%
html_nodes(".block") %>%
html_text())
class(x)
(x1 <- gsub(" ", "", x))
(x2 <- gsub(" ", "", x1))
(x3 <- gsub(" ", "", x2))
(x4 <- gsub(" ", "", x3))
(x5 <- gsub(" ", "", x4))
(x6 <- gsub("\n", "", x5))
class(x6)
length(x6[i])
typeof(x6)
for (i in x6) {
BIDNO <- rm_between(x6[i], "BID NO:", "Status", extract = TRUE)
Status <- rm_between(x6[i], "Status:", "Quantity Required", extract = TRUE)
Quantity_Required <- rm_between(x6[i], "Quantity Required:", "Department Name And Address", extract = TRUE)
Department_Name_And_Address <- rm_between(x6[i], "Department Name And Address:", "Start Date", extract = TRUE)
Start_Date <- rm_between(x6[i], "Start Date:", "End Date", extract = TRUE)
# End_Date <- rm_between(x6[i], "End Date: ", "Technical Evaluation", extract=TRUE)
df <- data.frame("BID_NO", "Status", "Quantity_Required", "Department_Name_Address", "Start_Date")
}
df
View(df)
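The problem seems to be that you are building the data frame from a bunch of quoted strings such as "BID_NO". If you want to save the values into a data frame, you need to pass the names of the variables that hold the values, without quotes.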
df <- data.frame(BIDNO, Status, Quantity_Required, Department_Name_And_Address, Start_Date)
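Assuming all the code above that creates each field is correct and the values are stored in those variables, you will get a one-row data frame: because it is created inside the for loop, each iteration overwrites the previous version.
If you want to keep multiple rows, create final_df before the loop. Then
final_df <- data.frame(rbind(final_df, df))
will bind the row of data to the empty frame on the first pass and add a new row on every pass after that.
But any data frame created inside the loop is re-created and overwritten on every pass... and store the values held in the variables, without the ' ' quotes...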
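The accumulate-in-a-loop pattern described above can be sketched as follows. This is only a minimal illustration: the records in raw are made up and stand in for the scraped text, and the column names are hypothetical.

```r
# Made-up records standing in for the scraped text (illustrative only)
raw <- c("A1|Open", "B2|Closed", "C3|Open")

# Create the empty frame once, before the loop
final_df <- data.frame()

for (rec in raw) {
  parts <- strsplit(rec, "|", fixed = TRUE)[[1]]
  # One-row data frame built from variables, not from quoted names
  df <- data.frame(bid_no = parts[1], status = parts[2],
                   stringsAsFactors = FALSE)
  # Re-assign so each row is kept rather than overwritten
  final_df <- rbind(final_df, df)
}

nrow(final_df)
```

Growing a data frame row by row is fine for a handful of records, but it copies the frame on every pass, so it tends to get slow for large inputs.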
Using XPath to target the elements you want is probably a path with less frustration and fewer errors:
library(rvest)
library(dplyr)
pg <- read_html("https://bidplus.gem.gov.in/bidresultlists")
Get all the bid blocks:
blocks <- html_nodes(pg, ".block")
Target the items and quantity div:
items_and_quantity <- html_nodes(blocks, xpath=".//div[@class='col-block' and contains(., 'Item(s)')]")
Pull out the items and quantity:
items <- html_nodes(items_and_quantity, xpath=".//strong[contains(., 'Item(s)')]/following-sibling::span") %>% html_text(trim=TRUE)
quantity <- html_nodes(items_and_quantity, xpath=".//strong[contains(., 'Quantity')]/following-sibling::span") %>% html_text(trim=TRUE) %>% as.numeric()
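To see how the following-sibling axis picks out the value that sits next to a label, here is a minimal sketch run against a small inline HTML string. The markup is hypothetical and only mimics the structure the XPath above assumes; the real page may be laid out differently.

```r
library(rvest)

# Hypothetical markup that mimics the structure the XPath expects
html <- '<div class="block">
  <div class="col-block">
    <p><strong>Item(s): </strong><span> Storage System </span></p>
    <p><strong>Quantity Required: </strong><span> 2 </span></p>
  </div>
</div>'

pg_demo <- read_html(html)
blocks_demo <- html_nodes(pg_demo, ".block")

# Find the <strong> label, then step sideways to the <span> holding the value
items_demo <- html_nodes(
  blocks_demo,
  xpath = ".//strong[contains(., 'Item(s)')]/following-sibling::span"
) %>% html_text(trim = TRUE)

quantity_demo <- html_nodes(
  blocks_demo,
  xpath = ".//strong[contains(., 'Quantity')]/following-sibling::span"
) %>% html_text(trim = TRUE) %>% as.numeric()
```

Because the XPath anchors on the label text rather than on position, it should keep working if the order of the fields changes, as long as the labels themselves stay the same.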
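Get the department name and address. Modify it so the three lines are separated by a pipe (|). That will make it possible to separate them later. The pipe symbol is a pain in regular expressions because it has to be escaped, but it is extremely unlikely to appear in the text, whereas tabs often cause confusion later.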
department_name_and_address <- html_nodes(blocks, xpath=".//div[@class='col-block' and contains(., 'Department Name And Address')]") %>%
  html_text(trim=TRUE) %>%
  gsub("\n", "|", .) %>%
  # a literal pipe must be written as "\\|" inside an R string
  gsub("[[:space:]]*\\||\\|[[:space:]]*", "|", .)
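The escaping point can be tried in isolation. The sketch below applies the same two substitutions to a made-up string (raw_address is illustrative, not real page content) and then splits on the pipe with fixed = TRUE, which is one way to avoid escaping when separating the parts later.

```r
# Made-up text standing in for one scraped address cell (illustrative only)
raw_address <- "Department Name And Address:\n\nMinistry Of Steel \nNa Kirandul Complex"

# Turn line breaks into pipes
piped <- gsub("\n", "|", raw_address)

# "\\|" in an R string is the regex \|, i.e. a literal pipe;
# an unescaped | would mean alternation instead
piped <- gsub("[[:space:]]*\\||\\|[[:space:]]*", "|", piped)

# fixed = TRUE treats "|" literally, so no escaping is needed when splitting
parts <- strsplit(piped, "|", fixed = TRUE)[[1]]
```

The empty second element of parts comes from the blank line after the label, which matches the double pipe visible in the scraped values.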
Target the block header, which holds the bid # and the status:
block_header <- html_nodes(blocks, "div.block_header")
Bid # (see the note at the end of the answer):
html_nodes(block_header, xpath=".//p[contains(@class, 'bid_no')]") %>%
html_text(trim=TRUE) %>%
gsub("^.*: ", "", .) -> bid_no
Pull out the status:
html_nodes(block_header, xpath=".//p/b[contains(., 'Status')]/following-sibling::span") %>%
html_text(trim=TRUE) -> status
Target and pull out the start and end dates:
html_nodes(blocks, xpath=".//strong[contains(., 'Start Date')]/following-sibling::span") %>%
html_text(trim=TRUE) -> start_date
html_nodes(blocks, xpath=".//strong[contains(., 'End Date')]/following-sibling::span") %>%
html_text(trim=TRUE) -> end_date
Make a data frame:
data.frame(
bid_no,
status,
start_date,
end_date,
items,
quantity,
department_name_and_address,
stringsAsFactors=FALSE
) -> xdf
Some bids are "RA", so we can also create a column that tells us which ones they are:
xdf$is_ra <- grepl("/RA/", bid_no)
The resulting data frame:
str(xdf)
## 'data.frame': 10 obs. of 8 variables:
## $ bid_no : chr "GEM/2018/B/93066" "GEM/2018/B/93082" "GEM/2018/B/93105" "GEM/2018/B/93999" ...
## $ status : chr "Not Evaluated" "Not Evaluated" "Not Evaluated" "Not Evaluated" ...
## $ start_date : chr "25-09-2018 03:53:pm" "27-09-2018 09:16:am" "25-09-2018 05:08:pm" "26-09-2018 05:21:pm" ...
## $ end_date : chr "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" ...
## $ items : chr "automotive chassis fitted with engine" "automotive chassis fitted with engine" "automotive chassis fitted with engine" "Storage System" ...
## $ quantity : num 1 1 1 2 90 1 981 6 4 376
## $ department_name_and_address: chr "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Maharashtra Energy Department Maharashtra Bhusawal Tps N/a" ...
## $ is_ra : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
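Since the question asks about exporting the result, here is a minimal sketch of writing a data frame like this to a text (CSV) file with base R. The small xdf_demo frame reuses a few values from the output above, and the file name and temp-directory location are assumptions chosen so the sketch has no lasting side effects.

```r
# Small stand-in for xdf, using values shown in the str() output
xdf_demo <- data.frame(
  bid_no = c("GEM/2018/B/93066", "GEM/2018/B/93082"),
  status = c("Not Evaluated", "Not Evaluated"),
  quantity = c(1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical output location; replace with a real path as needed
out_file <- file.path(tempdir(), "bid_results.csv")

# row.names = FALSE avoids an extra column of row numbers
write.csv(xdf_demo, out_file, row.names = FALSE)

# Read the file back to confirm the round trip
check <- read.csv(out_file, stringsAsFactors = FALSE)
```

Re-running the scrape and overwriting the same file is one simple way to keep the local copy in step with the site. Writing an Excel file would need an additional package, which is not shown here.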
I'll leave it to you to turn the dates into POSIXct elements.
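One possible way to do that conversion is sketched below, using two date strings taken from the output above. The time zone is an assumption (the site is an Indian government portal, so "Asia/Kolkata" seems plausible), and parsing of the am/pm marker with %p depends on the locale, so treat this as a starting point rather than a definitive recipe.

```r
# Two values copied from the start_date column shown above
start_date_demo <- c("25-09-2018 03:53:pm", "27-09-2018 09:16:am")

# %I is the 12-hour clock and %p the am/pm marker;
# the colon before am/pm is matched literally by the format string
start_time <- as.POSIXct(
  start_date_demo,
  format = "%d-%m-%Y %I:%M:%p",
  tz = "Asia/Kolkata"  # assumed time zone, not stated on the page
)

# Show the parsed values on a 24-hour clock
format(start_time, "%Y-%m-%d %H:%M")
```

If any element comes back as NA, the format string did not match that value and the raw text is worth inspecting.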
The contiguous code, without the explanations, is here.
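Also, this isn't Java. for loops are seldom the solution to a problem in R. And you should read up on regular expressions, since counting out the spaces to replace is also a path fraught with peril and frustration.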
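To illustrate the point about loops, the sketch below pulls fields out of a character vector without a for loop, relying on the fact that sub() works on every element at once. The input strings are made up to resemble flattened block text, and the patterns are assumptions about that layout rather than something verified against the live page.

```r
# Made-up, already-flattened block text (illustrative only)
x_demo <- c(
  "BID NO: GEM/2018/B/93066 Status: Not Evaluated Quantity Required: 1",
  "BID NO: GEM/2018/B/93082 Status: Not Evaluated Quantity Required: 4"
)

# sub() is vectorised: each call processes every element of x_demo
bid_no_demo <- sub(".*BID NO:[[:space:]]*([^[:space:]]+).*", "\\1", x_demo)
status_demo <- trimws(sub(".*Status:(.*)Quantity Required:.*", "\\1", x_demo))
quantity_demo <- as.numeric(
  sub(".*Quantity Required:[[:space:]]*([0-9]+).*", "\\1", x_demo)
)

# Whole columns go in at once, so nothing is overwritten per iteration
df_demo <- data.frame(
  bid_no = bid_no_demo,
  status = status_demo,
  quantity = quantity_demo,
  stringsAsFactors = FALSE
)
```

Using [[:space:]]* in the patterns also sidesteps the need to count exactly how many spaces separate the fields.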