Web crawling problem: Can't delete \n characters
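I'm currently scraping data from a website with Python. Things were going well until I found that I couldn't merge all the processed lines into a single line.
Here is my faulty code (I'm using Scrapy for the scraping):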
rep = response.xpath('/html/body/div[1]/div[2]/div[3]/div[{:d}]'.format(i)).get()
rep = rep.replace('<div class="d-flex justify-content-between search-result-line py-3 px-3">','')
rep = rep.replace('<div class="font-weight-bold">','')
rep = rep.replace('<span>','')
rep = rep.replace('</span>','')
rep = rep.replace('</div></div>',',')
rep = rep.replace('</div>','":')
rep = rep.replace('<div>','"')
rep.join(rep.split('\n'))
The raw input to this code:
<div class="search-result py-4 px-0 col-12 col-md-6 col-lg-5 mx-auto mt-4"><div class="font-weight-bold mb-3 px-3">Candidate number : <span class="student-id text-dc3545">33000001</span></div><div class="d-flex justify-content-between search-result-line py-3 px-3"><div>Math</div><div class="font-weight-bold">6.40</div></div><div class="d-flex justify-content-between search-result-line py-3 px-3"><div>Literature</div><div class="font-weight-bold">4.50</div></div><div class="d-flex justify-content-between search-result-line py-3 px-3"><div>History</div><div class="font-weight-bold">6.50</div></div><div class="d-flex justify-content-between search-result-line py-3 px-3"><div>Geography</div><div class="font-weight-bold">7.50</div></div><div class="d-flex justify-content-between search-result-line py-3 px-3"><div>Foreign language (<span>N1</span>)</div><div class="font-weight-bold">3</div></div><div class="d-flex justify-content-between search-result-line py-3 px-3"><div>Civic Education</div><div class="font-weight-bold">7.75</div></div></div>
What I expected after this code:
"Math":6.40,"Literature":4.50, etc.
But this is what I actually get:
"Math":6.40,
"Literature":4.50,
etc.
Did I mess something up?
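Most likely the last line is the problem: `str.join` returns a new string instead of modifying `rep`, and the result is never assigned. It also uses `rep` itself as the separator, where an empty string is what you want. A minimal sketch of the fix, assuming a shortened, made-up stand-in (`sample`) for the string left in `rep` after the `replace()` calls:

```python
# Hypothetical stand-in for `rep` after the replace() calls in the question.
sample = '"Math":6.40,\n"Literature":4.50,\n'

# Strings are immutable, so the result must be assigned to a name.
# Join with an empty separator; the original joined with `rep` itself.
joined = ''.join(sample.split('\n'))

# Equivalent and simpler: replace the newline character directly.
replaced = sample.replace('\n', '')

print(joined)
```

In the question's code this would amount to `rep = rep.replace('\n', '')` as the final step.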
scrapy shell
In [1]: courses = response.xpath('//div[contains(@class, "d-flex justify-content-between search-result-line py-3 px-3")]')
In [2]: for course in courses:
   ...:     data = course.xpath('.//text()').getall()
   ...:     data_str = ' '.join(data)
   ...:     print(data_str)
   ...:
Math 6.40
Literature 4.50
History 6.50
Geography 7.50
Foreign language ( N1 ) 3
Civic Education 7.75
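To get from those text nodes to the `"Math":6.40` format the question asks for, one option is to join the fragments yourself. This is only a sketch: it assumes the score is always the last text node in a row, as in the HTML shown in the question. `format_rows` and `rows` are illustrative names; in the shell, `rows` could be built with `[c.xpath('.//text()').getall() for c in courses]`.

```python
def format_rows(rows):
    """Turn per-row text fragments into '"label":score' pairs joined by commas."""
    pairs = []
    for fragments in rows:
        # Drop whitespace-only nodes, such as newlines between tags.
        parts = [f.strip() for f in fragments if f.strip()]
        if len(parts) < 2:
            continue  # not a label/score row
        # Assumption: the last fragment is the score, the rest form the label.
        *label_parts, score = parts
        label = ''.join(label_parts)
        pairs.append('"{}":{}'.format(label, score))
    return ','.join(pairs)


# Hand-written sample of what .getall() returns for two of the rows.
rows = [
    ['Math', '6.40'],
    ['Foreign language (', 'N1', ')', '3'],
]
print(format_rows(rows))
```

Working on the text nodes avoids the chain of `replace()` calls on raw HTML, which breaks as soon as a class name or the nesting changes.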