Extracting underlying data via RSelenium with embedded leaflet svg, and more
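I want to extract information about each ad at this link. So far I have reached the stage where I can automatically click See Ad Details, but there is a lot of underlying data that is not easy to wrangle into a tidy data frame.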
library(RSelenium)
rs <- rsDriver()
remote <- rs$client
remote$navigate(
  paste0(
    "https://www.facebook.com/ads/library/?",
    "active_status=all&ad_type=political_and_issue_ads&country=US&",
    "impression_search_field=has_impressions_lifetime&",
    "q=actblue&view_all_page_id=38471053686"
  )
)
test <- remote$findElement(using = "xpath", "//*[@class=\"_7kfh\"]")
test$clickElement()
## Manually figured out element
test <- remote$findElement(using = "xpath", "//*[@class=\"_7lq0\"]")
test$getElementText()
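The output text itself is messy, but I am sure that with some time and effort it can be wrangled into something useful. The problem is wrangling the underlying data in
- the graph, which appears to be just an image,
- the leaflet svg, which displays data when the cursor hovers over it.
I have no idea how to systematically extract this image, and especially the leaflet svg. In this case, how would I go through each ad and then extract the full data available in the details?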
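This is not a complete answer, but hopefully it helps.
I had a go at scraping/parsing, but could not make sense of the graph data, because it seems to sit in complicated locations across many files accessed through the 'network' tab of the browser's developer tools (I found patches of the data by using command+f in the network tab and searching for words that appear in the graph, e.g. 'Women', 'Unknown', etc.).
Someone familiar with ReactJS may have better luck!
What might work
You could try an entirely different approach using Optical Character Recognition (OCR).
That is: take a screenshot (i.e. remote$screenshot()), convert the base64 to an image, read it in, extract the relevant region (i.e. the location of the specific data you are after), and use the method described here to convert the region containing the data you need into text! (I will update if I get a chance to try this, but that looks unlikely; keen to hear how you go.)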
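The "base64 to image" step above can be sketched with the Python standard library alone. This assumes the screenshot arrives as a base64-encoded PNG string, which is how WebDriver screenshots are normally returned; the function names decode_screenshot, png_size and save_screenshot are made up for this illustration. Cropping and OCR need third-party packages and are not shown.

```python
import base64
import struct

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_screenshot(b64_png):
    """Decode a base64 screenshot string and check that it is a PNG."""
    data = base64.b64decode(b64_png)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    return data


def png_size(data):
    """Read (width, height) from the IHDR chunk of PNG bytes."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type,
    # then width and height as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def save_screenshot(b64_png, path):
    """Write the decoded PNG to disk and return its pixel size."""
    data = decode_screenshot(b64_png)
    with open(path, "wb") as fh:
        fh.write(data)
    return png_size(data)
```

Knowing the pixel size up front helps when choosing the crop box for the region that holds the numbers.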
The Age and Gender graphs are canvas elements. To get them as images, you can take a screenshot of the element. Python example:
# find_element_by_tag_name was removed in Selenium 4; "tag name" is the value of By.TAG_NAME
driver.find_element("tag name", "canvas").screenshot("age_and_gender.png")
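Where This Ad Was Shown is an SVG, and you can save it as an image the same way. The result will not be very accurate, because the visible part of the SVG differs from its actual extent, but you can crop the image afterwards. Python example: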
# find_element_by_tag_name was removed in Selenium 4; "tag name" is the value of By.TAG_NAME
driver.find_element("tag name", "svg").screenshot("where_this_ad_was_shown.png")
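You cannot extract the complete data from these elements with Selenium. The way to get the data is to configure a proxy server, capture the API requests, and take the data in JSON format. Yes, it is possible.
The easy way is to skip Selenium and fetch the ads and their details with a few plain requests. Working Python example: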
import json
import requests

# Query-string parameters for the ad search endpoint.
params = (
    ('q', 'actblue'),
    ('count', '1000'),  # default is 30; for 38471053686 it will return about 300 results
    ('active_status', 'all'),
    ('ad_type', 'political_and_issue_ads'),
    ('countries[0]', 'US'),
    ('impression_search_field', 'has_impressions_lifetime'),
    ('view_all_page_id', '38471053686'),
)
data = {'__a': '1'}

with requests.session() as s:
    response = s.post('https://www.facebook.com/ads/library/async/search_ads/', params=params, data=data)
    # The body is prefixed with 'for (;;);', which must be removed before parsing.
    ads = json.loads(response.text.replace('for (;;);', ''))['payload']['results']
    for ad in ads:
        ad_details_params = (
            ('ad_archive_id', ad[0]['adArchiveID']),
            ('country', 'US'),
        )
        # One request per ad returns the insights (age/gender, location, spend).
        response = s.post('https://www.facebook.com/ads/library/async/insights/', params=ad_details_params, data=data)
        print('parse json from response')
Note: Facebook does not allow automated data collection without written permission: https://www.facebook.com/apps/site_scraping_tos_terms.php
But as we all know, Facebook does not refuse to collect our data.
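The example above removes the for (;;); guard with str.replace, which would also touch the same characters if they ever appeared inside the payload. A slightly more defensive sketch strips it only when it leads the body. The helper name parse_fb_json and the sample string are made up for illustration, and the prefix is assumed to be exactly the one used in the example.

```python
import json

# Guard string that the async endpoints appear to prepend to their JSON
# (taken from the replace() call in the example above).
PREFIX = "for (;;);"


def parse_fb_json(text):
    """Strip a leading 'for (;;);' guard, if present, and decode the JSON."""
    text = text.lstrip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return json.loads(text)


# Hypothetical response body, shaped like the ones described in this answer.
sample = 'for (;;);{"__ar": 1, "payload": {"spend": "0 - 9"}}'
print(parse_fb_json(sample)["payload"]["spend"])
```

Inside the loop this would be called as parse_fb_json(response.text)["payload"] in place of the print placeholder.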
The response for each ad's details looks like this:
{
"__ar": 1,
"payload": {
"ageGenderData": [
{
"age_range": "18-24",
"female": 0.03,
"male": 0.05,
"unknown": 0
},
{
"age_range": "25-34",
"female": 0.12,
"male": 0.12,
"unknown": 0.01
},
{
"age_range": "35-44",
"female": 0.16,
"male": 0.09,
"unknown": 0
},
{
"age_range": "45-54",
"female": 0.11,
"male": 0.05,
"unknown": 0
},
{
"age_range": "55-64",
"female": 0.09,
"male": 0.04,
"unknown": 0
},
{
"age_range": "65+",
"female": 0.09,
"male": 0.03,
"unknown": 0
}
],
"currency": "USD",
"currencyMatched": true,
"impressions": "35\u00a0B - 40\u00a0B",
"locationData": [
{
"reach": 0,
"region": "Alabama"
},
{
"reach": 0,
"region": "Utah"
},
{
"reach": 0,
"region": "Maine"
},
{
"reach": 0,
"region": "Louisiana"
},
{
"reach": 0,
"region": "Kentucky"
},
{
"reach": 0,
"region": "Kansas"
},
{
"reach": 0,
"region": "Idaho"
},
{
"reach": 0,
"region": "Delaware"
},
{
"reach": 0,
"region": "Connecticut"
},
{
"reach": 0,
"region": "Arkansas"
},
{
"reach": 0,
"region": "Hawaii"
},
{
"reach": 0,
"region": "Alaska"
},
{
"reach": 0,
"region": "Montana"
},
{
"reach": 0,
"region": "West Virginia"
},
{
"reach": 0,
"region": "Vermont"
},
{
"reach": 0,
"region": "Mississippi"
},
{
"reach": 0,
"region": "Wyoming"
},
{
"reach": 0,
"region": "Oklahoma"
},
{
"reach": 0,
"region": "North Dakota"
},
{
"reach": 0,
"region": "New Mexico"
},
{
"reach": 0,
"region": "New Hampshire"
},
{
"reach": 0,
"region": "Nebraska"
},
{
"reach": 0,
"region": "Rhode Island"
},
{
"reach": 0,
"region": "South Dakota"
},
{
"reach": 0.01,
"region": "Wisconsin"
},
{
"reach": 0.01,
"region": "Missouri"
},
{
"reach": 0.01,
"region": "Oregon"
},
{
"reach": 0.01,
"region": "Minnesota"
},
{
"reach": 0.01,
"region": "Maryland"
},
{
"reach": 0.01,
"region": "New Jersey"
},
{
"reach": 0.01,
"region": "Tennessee"
},
{
"reach": 0.01,
"region": "Washington, District of Columbia"
},
{
"reach": 0.01,
"region": "Indiana"
},
{
"reach": 0.02,
"region": "Michigan"
},
{
"reach": 0.02,
"region": "Iowa"
},
{
"reach": 0.02,
"region": "North Carolina"
},
{
"reach": 0.02,
"region": "Georgia"
},
{
"reach": 0.02,
"region": "Colorado"
},
{
"reach": 0.02,
"region": "Ohio"
},
{
"reach": 0.02,
"region": "Arizona"
},
{
"reach": 0.02,
"region": "Pennsylvania"
},
{
"reach": 0.02,
"region": "Virginia"
},
{
"reach": 0.03,
"region": "Washington"
},
{
"reach": 0.03,
"region": "Massachusetts"
},
{
"reach": 0.04,
"region": "Illinois"
},
{
"reach": 0.04,
"region": "Florida"
},
{
"reach": 0.06,
"region": "New York"
},
{
"reach": 0.13,
"region": "California"
},
{
"reach": 0.19,
"region": "Texas"
}
],
"singleCountry": "US",
"spend": "0 - 9",
"pageSpend": {
"currentWeek": null,
"isPoliticalPage": true,
"weeklyByDisclaimer": {
"WARREN FOR PRESIDENT, INC.": 270970
},
"lifetimeByDisclaimer": {
"Elizabeth for MA": 781272,
"Warren for President": 3396973,
"": 13584,
"WARREN FOR PRESIDENT, INC.": 4081618,
"the Elizabeth Warren Presidential Exploratory Committee": 219471
},
"hasPoliticalSpendInAnyCountry": true
},
"pageBlurb": "United States Senator from Massachusetts, former teacher, and candidate for President of the United States. (official campaign account)"
},
"bootloadable": {},
"ixData": {},
"bxData": {},
"gkxData": {},
"qexData": {},
"lid": "6796246259692811543"
}
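Since the question asks for a tidy data frame, here is a minimal sketch of flattening one such payload into long-format rows, one table for age and gender and one for location. The field names come from the response above; the function names tidy_insights and to_csv, the ad id "123", and the trimmed sample payload are placeholders for illustration.

```python
import csv
import io


def tidy_insights(ad_archive_id, payload):
    """Return (age_gender_rows, location_rows) as lists of flat dicts."""
    age_gender = [
        {
            "ad_archive_id": ad_archive_id,
            "age_range": group["age_range"],
            "gender": gender,
            "share": group[gender],
        }
        for group in payload.get("ageGenderData", [])
        for gender in ("female", "male", "unknown")
    ]
    locations = [
        {
            "ad_archive_id": ad_archive_id,
            "region": loc["region"],
            "reach": loc["reach"],
        }
        for loc in payload.get("locationData", [])
    ]
    return age_gender, locations


def to_csv(rows):
    """Serialise a list of flat dicts to CSV text (header from the first row)."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Trimmed copy of the payload shown above; "123" is a placeholder ad id.
payload = {
    "ageGenderData": [
        {"age_range": "18-24", "female": 0.03, "male": 0.05, "unknown": 0}
    ],
    "locationData": [{"reach": 0.19, "region": "Texas"}],
}
age_rows, loc_rows = tidy_insights("123", payload)
print(to_csv(age_rows))
print(to_csv(loc_rows))
```

Written to files, the CSV text can be loaded in R with read.csv, and the rows of all ads can be stacked because each one carries its ad id.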
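Finally, run this Python code from R using reticulate, by simply running the whole Python script as one string. Note that if the Python script does not contain any " characters, it is very convenient to drop it straight into R, like so: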
library(reticulate)
py_run_string("import json
import requests
rest of script etc
etc
etc")
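You will also need the Python libraries the script uses. The json module is part of the Python standard library, so it needs no separate installation. The requests library can be installed by opening a terminal on a Mac and typing pip install requests.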