Removing backslash characters from a JSON file after editing with str.replace
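I have done a lot of coding to extract HTML items from a page with Beautiful Soup and convert them to JSON. One problem remains: when I open the final JSON files, every quotation mark is preceded by a backslash. I know this is because I had to convert the HTML to a string and then do all the formatting with str.replace. I am looking for a short piece of code to add that will remove the backslashes from the final result.
Here is my code.
Note: the HTML file is saved under the authorID as its file name, so GVcmmoEAAAAJ.html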
from bs4 import BeautifulSoup
import json
import os
authorID = "GVcmmoEAAAAJ"
cur_dir = os.getcwd()
new_dir = authorID
path = os.path.join(cur_dir,new_dir)
if not os.path.exists(path):
    os.mkdir(path)
html_file2 = open((authorID + ".html"), "rb")
soup = BeautifulSoup(html_file2.read(), 'lxml')
gs_results = soup.find_all('tr', class_= 'gsc_a_tr')
gs_strings = []
for i in gs_results:
    item = i
    gs_strings.append(str(item))
gs_data = []
for x in range(0, len(gs_strings)):
    round1 = gs_strings[x].replace("<tr class=\"gsc_a_tr\"><td class=\"gsc_a_t\"><a class=\"gsc_a_at\" data-href=\"", "IDHASH = {\"DirectURL\":\"https://scholar.google.com")
    round2 = round1.replace("\" href=\"javascript:void(0)\">*", "\"")
    round3 = round2.replace("\" href=\"javascript:void(0)\">", "\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"")
    round4 = round3.replace("</a><div class=\"gs_gray\">", "\", \"Authors\":\"")
    round5 = round4.replace("</div><div class=\"gs_gray\">", "\", \"Source\":\"")
    round6 = round5.replace("</div></td><td class=\"gsc_a_c\"><a class=\"gsc_a_ac gs_ibl\" href=\"", "\", \"CitedBy\":\"")
    round7 = round6.replace("<span class=\"gs_oph\">, ", "\", \"SourceYear\":\"")
    round8 = round7.replace("</span></td></tr>", "\"}")
    round9 = round8.replace("</a></td><td class=\"gsc_a_y\"><span class=\"gsc_a_h gsc_a_hc gs_ibl\">", "\", \"PageYear\":\"")
    round10 = round9.replace("</a><span class=\"gsc_a_m\"><a class=\"gsc_a_am\" data-eid=\"", "\", \"DataID\":\"")
    round11 = round10.replace("</span>", "")
    round12 = round11.replace("<span>", "")
    round13 = round12.replace("\"</a></td><td class=\"gsc_a_y\"><span class=\"gsc_a_h gsc_a_hc gs_ibl", "<span class=\"gsc_a_h gsc_a_hc gs_ibl")
    round14 = round13.replace("<span class=\"gsc_a_h gsc_a_hc gs_ibl\">", "\", \"PageYear\":\"")
    round15 = round14.replace("\">", "\", \"Citations\":\"")
    round16 = round15.replace("&amp;", "&")
    gs_data.append(round16)
    tempdata = gs_data[x]
    with open((new_dir + "/" + authorID + "-" + str(x) + ".json"), "w") as new_file:
        json.dump(tempdata,new_file)
        new_file.close()
html_file2.close()
Here are 2 examples of the entries being opened:
> <tr class="gsc_a_tr"><td class="gsc_a_t"><a class="gsc_a_at"
> data-href="/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C"
> href="javascript:void(0)">Audience response made easy: using personal
> digital assistants as a classroom polling tool</a><div
> class="gs_gray">AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T
> Grappone</div><div class="gs_gray">Journal of the American Medical
> Informatics Association 11 (3), 217-220<span class="gs_oph">,
> 2004</span></div></td><td class="gsc_a_c"><a class="gsc_a_ac gs_ibl"
> href="https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441">75</a></td><td
> class="gsc_a_y"><span class="gsc_a_h gsc_a_hc
> gs_ibl">2004</span></td></tr>
>
> <tr class="gsc_a_tr"><td class="gsc_a_t"><a class="gsc_a_at"
> data-href="/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC"
> href="javascript:void(0)">The UCLA Libraries Affordable Course
> Materials Initiative: Expanding Access, Use, and Affordability of
> Course Materials</a><div class="gs_gray">SE Farb, T Grappone</div><div
> class="gs_gray">Against the Grain 26 (5), 14<span class="gs_oph">,
> 2014</span></div></td><td class="gsc_a_c"><a class="gsc_a_ac gs_ibl"
> href="https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717">1</a></td><td
> class="gsc_a_y"><span class="gsc_a_h gsc_a_hc
> gs_ibl">2014</span></td></tr>
Printed to the screen, the result looks like this:
IDHASH = {"DirectURL":"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C", "PopupURL": "POPUPURLHERE", "Title":"Audience response made easy: using personal digital assistants as a classroom polling tool", "Authors":"AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone", "Source":"Journal of the American Medical Informatics Association 11 (3), 217-220", "SourceYear":"2004", "CitedBy":"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441", "Citations":"75", "PageYear":"2004"}
IDHASH = {"DirectURL":"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC", "PopupURL": "POPUPURLHERE", "Title":"The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials", "Authors":"SE Farb, T Grappone", "Source":"Against the Grain 26 (5), 14", "SourceYear":"2014", "CitedBy":"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717", "Citations":"1", "PageYear":"2014"}
That looks fine, but when I open the JSON files, this is what I get:
"IDHASH = {\"DirectURL\":\"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"Audience response made easy: using personal digital assistants as a classroom polling tool\", \"Authors\":\"AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone\", \"Source\":\"Journal of the American Medical Informatics Association 11 (3), 217-220\", \"SourceYear\":\"2004\", \"CitedBy\":\"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441\", \"Citations\":\"75\", \"PageYear\":\"2004\"}"
"IDHASH = {\"DirectURL\":\"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials\", \"Authors\":\"SE Farb, T Grappone\", \"Source\":\"Against the Grain 26 (5), 14\", \"SourceYear\":\"2014\", \"CitedBy\":\"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717\", \"Citations\":\"1\", \"PageYear\":\"2014\"}"
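I need to remove the \ in front of each ", so that only the " remains throughout.
I converted the original Beautiful Soup results to strings because I could not think of any other way to modify them, and I need to keep the HTML encoding in some places, so I don't just want the screen-displayable results.
I did look at some related questions, but the answers seemed to be aimed at classes, which is not what I am doing, and I could not follow them.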
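For context, the backslashes most likely come from what is handed to json.dump rather than from str.replace itself: when json.dump receives a str, it writes a JSON string, so it wraps the text in quotes and escapes every inner quote. The sketch below illustrates this with json.dumps; the two sample values are made-up stand-ins, not the real scraped records.

```python
import json

# Stand-in for the text produced by the str.replace chain (assumed shape)
as_text = '{"Title": "Against the Grain", "Citations": "1"}'

# A str is serialised as a JSON *string*: quoted, with inner quotes escaped
dumped_text = json.dumps(as_text)

# A dict is serialised as a JSON *object*: its quotes are structural, not escaped
as_dict = {"Title": "Against the Grain", "Citations": "1"}
dumped_dict = json.dumps(as_dict)

print(dumped_text)
print(dumped_dict)
```

Under that reading, stripping the backslashes afterwards treats the symptom; passing a dict to json.dump avoids them in the first place.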
OK, I revised the code again and this works. I had to drop "SourceYear" entirely and merge it into the "Source" field, but that's fine.
html_file2 = open((authorID + ".html"), "r")
soup = BeautifulSoup(html_file2, 'lxml')
gs_results = soup.find_all('tr', class_= 'gsc_a_tr')
gs_lists = []
x = 0
for i in gs_results:
    item = i
    list_keys = ["DirectURL","Title","Authors","Source","CitedBy","Citations","PageYear"]
    initial_link = i.a['data-href']
    prefaceURL = "https://scholar.google.com"
    gs_lists.append((
        prefaceURL + i.a['data-href'],
        i.a.text,
        i.select_one('.gs_gray').text,
        i.select('.gs_gray')[-1].text,
        i.select_one('.gsc_a_ac')['href'],
        i.select_one('.gsc_a_ac').text,
        i.select_one('.gsc_a_y').text
    ))
    with open((new_dir + "/" + authorID + "-" + str(x) + ".json"), "w") as new_file:
        new_entry = dict(zip(list_keys,gs_lists[x]))
        json.dump(new_entry,new_file)
        new_file.close()
    x = x+1
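If "SourceYear" is still wanted as a separate field, one possible approach is to split the text of the last .gs_gray element at its final comma. This is only a sketch: the sample string is an assumed rendering of that element's text for the second sample row, and it relies on the year always following the last comma.

```python
# Assumed text of the last ".gs_gray" div for the second sample row
source_text = "Against the Grain 26 (5), 14,\n2014"

# Split once, from the right, so commas inside the source itself are kept
source, source_year = source_text.rsplit(",", 1)
source_year = source_year.strip()  # drop the newline/space before the year

print(source)
print(source_year)
```

Rows without a trailing year would need a guard, since rsplit then returns a single element and the unpacking fails.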
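1. You inserted an incorrect HTML structure that does not match the original. I cleaned it up so that I could work with it. Please copy and paste the HTML code exactly as it appears on the website or in the file; otherwise you make it hard for others to help you.
2. Please try to understand the library you are using: see the bs4 documentation.
3. You really don't need the long series of rounds in which you replace strings and clean them up!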
from bs4 import BeautifulSoup
from pprint import pp
html = """<tr class="gsc_a_tr">
<td class="gsc_a_t"><a class="gsc_a_at" data-href="/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C" href="javascript:void(0)">Audience response made easy: using personal digital assistants as a classroom polling tool</a>
<div class="gs_gray">AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone</div>
<div class="gs_gray">Journal of the American Medical Informatics Association 11 (3), 217-220<span class="gs_oph">,
2004</span></div>
</td>
<td class="gsc_a_c"><a class="gsc_a_ac gs_ibl" href="https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441">75</a></td>
<td class="gsc_a_y"><span class="gsc_a_h gsc_a_hc
gs_ibl">2004</span></td>
</tr>
<tr class="gsc_a_tr">
<td class="gsc_a_t"><a class="gsc_a_at" data-href="/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC" href="javascript:void(0)">The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials</a>
<div class="gs_gray">SE Farb, T Grappone</div>
<div class="gs_gray">Against the Grain 26 (5), 14<span class="gs_oph">,
2014</span></div>
</td>
<td class="gsc_a_c"><a class="gsc_a_ac gs_ibl" href="https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717">1</a></td>
<td class="gsc_a_y"><span class="gsc_a_h gsc_a_hc gs_ibl">2014</span></td>
</tr>"""
soup = BeautifulSoup(html, 'lxml')
goal = [
    (
        x.a['data-href'],
        x.a.text,
        x.select_one('.gs_gray').text,
        x.select('.gs_gray')[-1].text.rsplit(',', 1)[0],
        x.select('.gs_gray')[-1].text.rsplit(',', 1)[1].strip(),
        x.select_one('.gsc_a_ac')['href'],
        x.select_one('.gsc_a_ac').text,
        x.select_one('.gsc_a_y').text
    )
    for x in soup.select('tr.gsc_a_tr')
]
pp(goal, indent=2)
Ask yourself why the bs4 parser was created.
Output:
[ ( '/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C',
    'Audience response made easy: using personal digital assistants as a '
    'classroom polling tool',
    'AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone',
    'Journal of the American Medical Informatics Association 11 (3), 217-220',
    '2004',
    'https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441',
    '75',
    '2004'),
  ( '/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC',
    'The UCLA Libraries Affordable Course Materials Initiative: Expanding '
    'Access, Use, and Affordability of Course Materials',
    'SE Farb, T Grappone',
    'Against the Grain 26 (5), 14',
    '2014',
    'https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717',
    '1',
    '2014')]
Now you have a list of tuples! Feel free to assign keys and convert it to a dictionary.
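As a rough sketch of that last step, each tuple can be zipped with a list of keys and the resulting dict serialised with the json module. The key names are borrowed from the question and their order is an assumption matching the eight tuple positions; a single tuple copied from the output stands in for the whole list.

```python
import json

# Key names taken from the question; the order is an assumption that
# matches the eight positions of each tuple.
keys = ["DirectURL", "Title", "Authors", "Source",
        "SourceYear", "CitedBy", "Citations", "PageYear"]

# One tuple copied from the output, standing in for the full `goal` list
goal = [
    (
        "/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC",
        "The UCLA Libraries Affordable Course Materials Initiative: Expanding "
        "Access, Use, and Affordability of Course Materials",
        "SE Farb, T Grappone",
        "Against the Grain 26 (5), 14",
        "2014",
        "https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717",
        "1",
        "2014",
    )
]

# Pair every tuple with the keys to get one dict per publication
records = [dict(zip(keys, row)) for row in goal]

# Serialising a dict gives a JSON object, so no quotes are backslash-escaped
json_text = json.dumps(records[0], indent=2)
print(json_text)
```

To write one file per record, as in the question, json.dump(record, new_file) inside a with open(...) block should behave the same way, and the with statement closes the file on its own.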