Removing backslash character from json file after editing with string.replace

Question

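I have done a lot of coding to extract HTML items from a page with Beautiful Soup and translate them into JSON. However, I still have one problem: when I open the final JSON files, every quotation mark has a backslash in front of it. I know this is because I had to convert the HTML to strings and then do all the formatting with str.replace. I am looking for a short piece of code to add that will remove the backslashes from the final result.

Here is my code.

Note: the HTML file is saved with the authorID as its name, so GVcmmoEAAAAJ.html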

from bs4 import BeautifulSoup
import json
import os

authorID = "GVcmmoEAAAAJ"  

cur_dir = os.getcwd()
new_dir = authorID
path = os.path.join(cur_dir,new_dir)
if not os.path.exists(path):
    os.mkdir(path)

html_file2 = open((authorID + ".html"), "rb")
soup = BeautifulSoup(html_file2.read(), 'lxml')

gs_results = soup.find_all('tr', class_= 'gsc_a_tr')

gs_strings = []
for i in gs_results:
    item = i
    gs_strings.append(str(item))

gs_data = []
for x in range(0, len(gs_strings)):
    round1 = gs_strings[x].replace("<tr class=\"gsc_a_tr\"><td class=\"gsc_a_t\"><a class=\"gsc_a_at\" data-href=\"", "IDHASH = {\"DirectURL\":\"https://scholar.google.com")
    round2 = round1.replace("\" href=\"javascript:void(0)\">*", "\"")
    round3 = round2.replace("\" href=\"javascript:void(0)\">", "\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"")
    round4 = round3.replace("</a><div class=\"gs_gray\">", "\", \"Authors\":\"")
    round5 = round4.replace("</div><div class=\"gs_gray\">", "\", \"Source\":\"")
    round6 = round5.replace("</div></td><td class=\"gsc_a_c\"><a class=\"gsc_a_ac gs_ibl\" href=\"", "\", \"CitedBy\":\"")
    round7 = round6.replace("<span class=\"gs_oph\">, ", "\", \"SourceYear\":\"")
    round8 = round7.replace("</span></td></tr>", "\"}")
    round9 = round8.replace("</a></td><td class=\"gsc_a_y\"><span class=\"gsc_a_h gsc_a_hc gs_ibl\">", "\", \"PageYear\":\"")
    round10 = round9.replace("</a><span class=\"gsc_a_m\"><a class=\"gsc_a_am\" data-eid=\"", "\", \"DataID\":\"")
    round11 = round10.replace("</span>", "")
    round12 = round11.replace("<span>", "")
    round13 = round12.replace("\"</a></td><td class=\"gsc_a_y\"><span class=\"gsc_a_h gsc_a_hc gs_ibl", "<span class=\"gsc_a_h gsc_a_hc gs_ibl")
    round14 = round13.replace("<span class=\"gsc_a_h gsc_a_hc gs_ibl\">", "\", \"PageYear\":\"")
    round15 = round14.replace("\">", "\", \"Citations\":\"")
    round16 = round15.replace("&amp;", "&")
    
    gs_data.append(round16)
    tempdata = gs_data[x]
    
    with open((new_dir + "/" + authorID + "-" + str(x) + ".json"), "w") as new_file:
        json.dump(tempdata,new_file) 
        
    
    new_file.close()
    
html_file2.close()

Here are two samples of the input HTML:

> <tr class="gsc_a_tr"><td class="gsc_a_t"><a class="gsc_a_at"
> data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=GVcmmoEAAAAJ&amp;citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C"
> href="javascript:void(0)">Audience response made easy: using personal
> digital assistants as a classroom polling tool</a><div
> class="gs_gray">AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T
> Grappone</div><div class="gs_gray">Journal of the American Medical
> Informatics Association 11 (3), 217-220<span class="gs_oph">,
> 2004</span></div></td><td class="gsc_a_c"><a class="gsc_a_ac gs_ibl"
> href="https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;oe=ASCII&amp;cites=8886823218645962441">75</a></td><td
> class="gsc_a_y"><span class="gsc_a_h gsc_a_hc
> gs_ibl">2004</span></td></tr>
> 
> <tr class="gsc_a_tr"><td class="gsc_a_t"><a class="gsc_a_at"
> data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=GVcmmoEAAAAJ&amp;citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC"
> href="javascript:void(0)">The UCLA Libraries Affordable Course
> Materials Initiative: Expanding Access, Use, and Affordability of
> Course Materials</a><div class="gs_gray">SE Farb, T Grappone</div><div
> class="gs_gray">Against the Grain 26 (5), 14<span class="gs_oph">,
> 2014</span></div></td><td class="gsc_a_c"><a class="gsc_a_ac gs_ibl"
> href="https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;oe=ASCII&amp;cites=3591317356459154717">1</a></td><td
> class="gsc_a_y"><span class="gsc_a_h gsc_a_hc
> gs_ibl">2014</span></td></tr>

This is what is printed to the screen:

IDHASH = {"DirectURL":"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C", "PopupURL": "POPUPURLHERE", "Title":"Audience response made easy: using personal digital assistants as a classroom polling tool", "Authors":"AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone", "Source":"Journal of the American Medical Informatics Association 11 (3), 217-220", "SourceYear":"2004", "CitedBy":"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441", "Citations":"75", "PageYear":"2004"}

IDHASH = {"DirectURL":"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC", "PopupURL": "POPUPURLHERE", "Title":"The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials", "Authors":"SE Farb, T Grappone", "Source":"Against the Grain 26 (5), 14", "SourceYear":"2014", "CitedBy":"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717", "Citations":"1", "PageYear":"2014"}

That looks good, but when I open the JSON files, this is what I get:

"IDHASH = {\"DirectURL\":\"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"Audience response made easy: using personal digital assistants as a classroom polling tool\", \"Authors\":\"AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone\", \"Source\":\"Journal of the American Medical Informatics Association 11 (3), 217-220\", \"SourceYear\":\"2004\", \"CitedBy\":\"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441\", \"Citations\":\"75\", \"PageYear\":\"2004\"}"

"IDHASH = {\"DirectURL\":\"https://scholar.google.com/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC\", \"PopupURL\": \"POPUPURLHERE\", \"Title\":\"The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials\", \"Authors\":\"SE Farb, T Grappone\", \"Source\":\"Against the Grain 26 (5), 14\", \"SourceYear\":\"2014\", \"CitedBy\":\"https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717\", \"Citations\":\"1\", \"PageYear\":\"2014\"}"

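I need to remove the \ mark before each ", leaving only the " throughout.

I converted the original Beautiful Soup results to strings because I couldn't think of any other way to modify them, and I need to keep the HTML encoding in some places, so I don't just want the results as they are displayed on screen.

I did look at some related questions, but the answers seemed to be aimed at classes, which is not what I am doing. I couldn't understand them.

OK, I modified the code again and this works. I had to remove "SourceYear" entirely and merge it into the "Source" field, but that's fine.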

html_file2 = open((authorID + ".html"), "r")
soup = BeautifulSoup(html_file2, 'lxml')

gs_results = soup.find_all('tr', class_= 'gsc_a_tr')

gs_lists = []
x = 0
for i in gs_results:
    item = i
    list_keys = ["DirectURL","Title","Authors","Source","CitedBy","Citations","PageYear"]
    initial_link = i.a['data-href']
    prefaceURL = "https://scholar.google.com"
    gs_lists.append((
        prefaceURL + i.a['data-href'],
        i.a.text,
        i.select_one('.gs_gray').text,
        i.select('.gs_gray')[-1].text,
        i.select_one('.gsc_a_ac')['href'],
        i.select_one('.gsc_a_ac').text,
        i.select_one('.gsc_a_y').text
    ))
    
    with open((new_dir + "/" + authorID + "-" + str(x) + ".json"), "w") as new_file:
        new_entry = dict(zip(list_keys,gs_lists[x]))
        json.dump(new_entry,new_file)
        
    new_file.close()
    x = x+1
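
For anyone comparing the two versions, here is a minimal sketch of why the first one likely produced backslashes and the second does not. It uses only the standard json module, and the record is a shortened stand-in rather than the real data. json.dump serializes whatever Python object it is given: a str becomes a single JSON string, so every inner quote has to be escaped, whereas a dict becomes a JSON object with plain quotes.

```python
import json

# Shortened stand-in for one record (illustrative values, not the full data).
as_text = '{"Source": "Against the Grain 26 (5), 14", "PageYear": "2014"}'
as_dict = {"Source": "Against the Grain 26 (5), 14", "PageYear": "2014"}

# A str is encoded as one JSON string, so the inner quotes are escaped.
dumped_text = json.dumps(as_text)

# A dict is encoded as a JSON object, so its quotes are written as-is.
dumped_dict = json.dumps(as_dict)

print(dumped_text)
print(dumped_dict)

# If the text is already valid JSON, parse it first rather than
# stripping backslashes from the output afterwards.
reparsed = json.dumps(json.loads(as_text))
print(reparsed)
```

Note that the strings built in the first version start with `IDHASH = `, so they are not valid JSON as they stand and json.loads would reject them; building a dict directly, as in the second version, avoids the issue altogether.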

Answer 1

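You posted an incorrect HTML structure that does not match the original. I cleaned it up so that I could work with it.

1. Please copy and paste the HTML exactly as it appears on the website or in the file; otherwise you make it hard for others to help you.

2. Please take the time to understand the library you are using: see the bs4 documentation.

3. You really don't need the long rounds of string replacement and cleanup you are doing!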

from bs4 import BeautifulSoup
from pprint import pp

html = """<tr class="gsc_a_tr">
    <td class="gsc_a_t"><a class="gsc_a_at" data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=GVcmmoEAAAAJ&amp;citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C" href="javascript:void(0)">Audience response made easy: using personal digital assistants as a classroom polling tool</a>
        <div class="gs_gray">AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone</div>
        <div class="gs_gray">Journal of the American Medical Informatics Association 11 (3), 217-220<span class="gs_oph">,
        2004</span></div>
    </td>
    <td class="gsc_a_c"><a class="gsc_a_ac gs_ibl" href="https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;oe=ASCII&amp;cites=8886823218645962441">75</a></td>
    <td class="gsc_a_y"><span class="gsc_a_h gsc_a_hc gs_ibl">2004</span></td>
</tr>
<tr class="gsc_a_tr">
    <td class="gsc_a_t"><a class="gsc_a_at" data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=GVcmmoEAAAAJ&amp;citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC" href="javascript:void(0)">The UCLA Libraries Affordable Course Materials Initiative: Expanding Access, Use, and Affordability of Course Materials</a>
        <div class="gs_gray">SE Farb, T Grappone</div>
        <div class="gs_gray">Against the Grain 26 (5), 14<span class="gs_oph">,
        2014</span></div>
    </td>
    <td class="gsc_a_c"><a class="gsc_a_ac gs_ibl" href="https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;oe=ASCII&amp;cites=3591317356459154717">1</a></td>
    <td class="gsc_a_y"><span class="gsc_a_h gsc_a_hc gs_ibl">2014</span></td>
</tr>"""


soup = BeautifulSoup(html, 'lxml')
goal = [
    (
        x.a['data-href'],
        x.a.text,
        x.select_one('.gs_gray').text,
        x.select('.gs_gray')[-1].text.rsplit(',', 1)[0],
        x.select('.gs_gray')[-1].text.rsplit(',', 1)[1].strip(),
        x.select_one('.gsc_a_ac')['href'],
        x.select_one('.gsc_a_ac').text,
        x.select_one('.gsc_a_y').text
    )
    for x in soup.select('tr.gsc_a_tr')
]
pp(goal, indent=2)

Ask yourself why the bs4 parser was created in the first place.

Output:

[ ( '/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:u5HHmVD_uO8C',
    'Audience response made easy: using personal digital assistants as a '
    'classroom polling tool',
    'AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone',
    'Journal of the American Medical Informatics Association 11 (3), 217-220',
    '2004',
    'https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=8886823218645962441',
    '75',
    '2004'),
  ( '/citations?view_op=view_citation&hl=en&oe=ASCII&user=GVcmmoEAAAAJ&citation_for_view=GVcmmoEAAAAJ:WF5omc3nYNoC',
    'The UCLA Libraries Affordable Course Materials Initiative: Expanding '
    'Access, Use, and Affordability of Course Materials',
    'SE Farb, T Grappone',
    'Against the Grain 26 (5), 14',
    '2014',
    'https://scholar.google.com/scholar?oi=bibs&hl=en&oe=ASCII&cites=3591317356459154717',
    '1',
    '2014')]

Now you have a list of tuples! Feel free to assign keys and convert them to dictionaries.
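
As a rough sketch of that last step, the tuples can be paired with key names using zip and written out with json.dump. The key names below are borrowed from the question, the two tuples are shortened stand-ins for `goal` (the URLs are abbreviated for illustration), and the output file name is hypothetical.

```python
import json
import os
import tempfile

# Key names taken from the question; one per tuple position.
keys = ["DirectURL", "Title", "Authors", "Source", "SourceYear",
        "CitedBy", "Citations", "PageYear"]

# Shortened stand-ins for the tuples in `goal` (URLs abbreviated, not real).
goal = [
    ("/citations?view_op=view_citation",
     "Audience response made easy: using personal digital assistants "
     "as a classroom polling tool",
     "AS Menon, S Moffett, M Enriquez, MM Martinez, P Dev, T Grappone",
     "Journal of the American Medical Informatics Association 11 (3), 217-220",
     "2004", "https://scholar.google.com/scholar?oi=bibs", "75", "2004"),
    ("/citations?view_op=view_citation",
     "The UCLA Libraries Affordable Course Materials Initiative",
     "SE Farb, T Grappone",
     "Against the Grain 26 (5), 14",
     "2014", "https://scholar.google.com/scholar?oi=bibs", "1", "2014"),
]

# Pair each tuple with the key names to get one dict per table row.
records = [dict(zip(keys, row)) for row in goal]

# Write all records to one file (hypothetical name) in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "records.json")
with open(out_path, "w", encoding="utf-8") as fh:
    json.dump(records, fh, indent=2)

# Read the raw file contents back to inspect what was actually written.
with open(out_path, encoding="utf-8") as fh:
    raw = fh.read()
print(raw)
```

Because json.dump receives dictionaries rather than a pre-formatted string, the quotes in the file are written without backslashes, and json.load can read the records straight back.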
