How to scrape data properly in Python and BS4?
This is the desired output: a CSV file with 2 rows:
1639, 06/05/17, 08,09,16,26,37,50
1639, 06/05/17, 13,28,32,33,37,38
Today I only have this, after cleaning/organizing the data with VBA Excel code:
08,09,16,26,37,50
13,28,32,33,37,38
[screenshot]
The first row's '1639, 06/05/17' comes from Resultado <span>Concurso 1639 (06/05/2017)</span>, and '08,09,16,26,37,50' comes from the tag below:
<ul class="numbers dupla-sena">
<h6>1º sorteio</h6>
<li>08</li><li>09</li><li>16</li><li>26</li><li>37</li><li>50</li>
</ul>
For the second row we can probably copy '1639, 06/05/17' from row 1, while '13,28,32,33,37,38' comes from another tag:
<ul class="numbers dupla-sena">
<h6>2º sorteio</h6>
<li>13</li><li>28</li><li>32</li><li>33</li><li>37</li><li>38</li>
</ul>
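As a minimal sketch (the variable names are my own) of pulling the numbers out of one of these ul blocks, using the HTML fragment above:

```python
from bs4 import BeautifulSoup

html = '''
<ul class="numbers dupla-sena">
<h6>2º sorteio</h6>
<li>13</li><li>28</li><li>32</li><li>33</li><li>37</li><li>38</li>
</ul>
'''

soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no extra install
ul = soup.find("ul", {"class": "numbers dupla-sena"})
# each <li> holds one drawn number; join them with commas
numbers = ",".join(li.text for li in ul.find_all("li"))
print(numbers)  # 13,28,32,33,37,38
```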
Below is my code:
import requests
from bs4 import BeautifulSoup as soup
url = 'http://loterias.caixa.gov.br/wps/portal/loterias/landing/duplasena/'
r = requests.get(url)
ltr = soup(r.text, "xml")
ltr.findAll("div",{"class":"content-section section-text with-box no-margin-bottom"})
filename = "ds_1640.csv"
f=open(filename,"w")
With the command below I think I can get everything I want, but I don't know how to extract the data the way I need it:
ltr.findAll("div",{"class":"content-section section-text with-box no-margin-bottom"})
So I tried another approach to capture the values of the "1º sorteio da dupla-sena":
print('-----------------dupla-sena 1º sorteio-----------------------------')
d1 = ltr.findAll("ul",{"class":"numbers dupla-sena"})[0].text.strip()
print(ltr.findAll("ul",{"class":"numbers dupla-sena"})[0].text.strip())
Output 1
1º sorteio
080916263750
Splitting into two-digit numbers:
d1 = '0'+ d1 if len(d1)%2 else d1
gi = [iter(d1)]*2
r = [''.join(dz1) for dz1 in zip(*gi)]
d3=",".join(r)
Result:
08,09,16,26,37,50
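For reference, the pairing trick works because both entries of gi point at the same iterator, so zip consumes the string two characters at a time. A self-contained sketch, pure stdlib:

```python
d1 = "080916263750"
# pad with a leading zero if the length is odd, so the pairs line up
d1 = "0" + d1 if len(d1) % 2 else d1
gi = [iter(d1)] * 2                      # two references to the SAME iterator
pairs = ["".join(p) for p in zip(*gi)]   # zip pulls alternately: '0'+'8', '0'+'9', ...
print(",".join(pairs))  # 08,09,16,26,37,50
```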
The same goes for the second extraction:
print('-----------------dupla-sena 2º sorteio-----------------------------')
dd1 = ltr.findAll("ul",{"class":"numbers dupla-sena"})[1].text.strip()
print(ltr.findAll("ul",{"class":"numbers dupla-sena"})[1].text.strip())
Output 2
2º sorteio
132832333738
Splitting into two-digit numbers:
dd1 = '0'+ dd1 if len(dd1)%2 else dd1
gi = [iter(dd1)]*2
r1 = [''.join(ddz1) for ddz1 in zip(*gi)]
dd3=",".join(r1)
Then we have:
13,28,32,33,37,38
Saving the data to a CSV file:
f.write(d3 + ',' + dd3 +'\n')
f.close()
Output: a CSV file in the current directory:
01,º ,so,rt,ei,o
,08,09,16,26,37,50,02,º ,so,rt,ei,o
,13,28,32,33,37,38
I could use the method/output above, but then I'd have to clean up that messy data with VBA in Excel, and I'm trying to avoid VBA code. Actually, I'm more interested in learning Python and using this powerful tool more and more.
With this solution I've only achieved part of what I want, namely:
08,09,16,26,37,50
13,28,32,33,37,38
But, as we know, the desired output is:
1639, 06/05/17, 08,09,16,26,37,50
1639, 06/05/17, 13,28,32,33,37,38
I'm using Python 3.6.1 (v3.6.1:) on Mac OS X Yosemite (10.10.5), in a Jupyter notebook.
How can I achieve this? I don't know how to extract the "1639, 06/05/17" and put it into the CSV file. And is there a better way to extract the six numbers (08,09,16,26,37,50 and 13,28,32,33,37,38), without using the code below and without using VBA?
Splitting into two-digit numbers:
d1 = '0'+ d1 if len(d1)%2 else d1
gi = [iter(d1)]*2
r = [''.join(dz1) for dz1 in zip(*gi)]
Question update
import requests
from bs4 import BeautifulSoup
import re
import csv
url = 'http://loterias.caixa.gov.br/wps/portal/loterias/landing/duplasena/'
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml") ## "lxml" to avoid the warning
pat = re.compile(r'(?i)(?<=concurso)\s*(?P<concurso>\d+)\s*\((?P<data>.+?)(?=\))')
concurso_e_data = soup.find(id='resultados').h2.span.text
match = pat.search(concurso_e_data)
# first I would do the above part differently seeing as how you want the end data to look
if match:
concurso, data = match.groups()
nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
num_headers = (','.join(['numero%d']*6) % tuple(range(1,7))).split(',')
# unpack numheaders into field names
field_names = ['sena', 'data', *num_headers]
# above gives you this
# field_names = [
# 'sena', ## I've changed "seria"for "sena"
# 'data',
# 'numero1',
# 'numero2',
# 'numero3',
# 'numero4',
# 'numero5',
# 'numero6',
# ]
rows = []
# then add the numbers
# nums is all the `ul` list elements contains the drawing numbers
for group in nums:
# start each row with the shared concurso, data elements
row = [concurso, data]
# for each `ul` get all the `li` elements containing the individual number
for num in group.findAll('li'):
# add each number
row.append(int(num.text))
# get [('sena', '1234'), ('data', '12/13/2017'), ...]
row_title_value_pairs = zip(field_names, row)
# turn into dict {'sena': '1234', 'data': '12/13/2017', ...}
row_dict = dict(row_title_value_pairs)
rows.append(row_dict)
# so now rows looks like: [{
# 'sena': '1234',
# 'data': '12/13/2017',
# 'numero1': 1,
# 'numero2': 2,
# 'numero3': 3,
# 'numero4': 4,
# 'numero5': 5,
# 'numero6': 6
# }, ...]
with open('file_v5.csv', 'w', encoding='utf-8') as csvfile:
csv_writer = csv.DictWriter(
csvfile,
fieldnames=field_names,
dialect='excel',
extrasaction='ignore', # drop extra fields if not in field_names not necessary but just in case
quoting=csv.QUOTE_NONNUMERIC # quote anything thats not a number, again just in case
)
csv_writer.writeheader()
for row in rows:
csv_writer.writerow(row_dict)
Output
# "sena","data","numero1","numero2","numero3","numero4","numero5","numero6"
# "1641","11/05/2017",1,5,15,28,30,43
# "1641","11/05/2017",1,5,15,28,30,43 #This comes from 1. drawing and not
from the corcect one (2.)
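As an aside, the num_headers line in the code above (join, %-format, then split) can be replaced by a plain list comprehension that builds the same list directly; a sketch:

```python
# equivalent to (','.join(['numero%d']*6) % tuple(range(1,7))).split(',')
num_headers = ['numero%d' % i for i in range(1, 7)]
field_names = ['sena', 'data', *num_headers]
print(field_names)
```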
Question update 2
import requests
from bs4 import BeautifulSoup
import re
import csv
url = 'http://loterias.caixa.gov.br/wps/portal/loterias/landing/duplasena/'
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml") ## "lxml" to avoid the warning
pat = re.compile(r'(?i)(?<=concurso)\s*(?P<concurso>\d+)\s*\((?P<data>.+?)(?=\))')
concurso_e_data = soup.find(id='resultados').h2.span.text
match = pat.search(concurso_e_data)
# everything should be indented under this block since
# if there is no match then none of the below code should run
if match:
concurso, data = match.groups()
nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
num_headers = (','.join(['numero%d']*6) % tuple(range(1,7))).split(',')
field_names = ['sena', 'data', *num_headers]
# PROBLEM 1
# all this should be indented into the `if match:` block above
# none of this should run if there is no match
# you cannot build the rows without the match for sena and data
# Let's add some print statements to see whats going on
rows = []
for group in nums:
# here each group is a full `sena` row from the site
print('Pulling sena: ', group.text)
row = [concurso, data]
print('Adding concurso + data to row: ', row)
for num in group.findAll('li'):
row.append(int(num.text))
print('Adding {} to row.'.format(num))
print('Row complete: ', row)
row_title_value_pairs = zip(field_names, row)
print('Transform row to header, value pairs: ', row_title_value_pairs)
row_dict = dict(row_title_value_pairs)
print('Row dictionary: ', row_dict)
rows.append(row_dict)
print('Rows: ', rows)
# PROBLEM 2
# It would seem that you've confused this section when switching
# out the original list comprehension with the more explicit
# for loop in building the rows.
# The below block should be indented to this level.
# Still under the `if match:`, but out of the the
# `for group in nums:` above
# the below block loops over rows, but you are still building
# the rows in the for loop
# you are effectively double looping over the values in `row`
with open('ds_v4_copy5.csv', 'w', encoding='utf-8') as csvfile:
csv_writer = csv.DictWriter(
csvfile,
fieldnames=field_names,
dialect='excel',
extrasaction='ignore', # drop extra fields if not in field_names not necessary but just in case
quoting=csv.QUOTE_NONNUMERIC # quote anything thats not a number, again just in case
)
csv_writer.writeheader()
# this is where you are looping extra because this block is in the `for` loop mentioned in my above notes
#for row in rows: ### I tried here to avoid the looping extra
#print('Adding row to CSV: ', row)
csv_writer.writerow(row_dict)
I think I followed your instructions, but so far we get this:
"sena","data","numero1","numero2","numero3","numero4","numero5","numero6"
"1643","16/05/2017",3,4,9,19,21,26 #which is "1º sorteio"
The "2º sorteio" is still missing. I know I'm doing something wrong, because both the "1º sorteio" and the "2º sorteio" are in:
print(rows[0]) --> {'sena': '1643', 'data': '16/05/2017', 'numero1': 1, 'numero2': 21, 'numero3': 22, 'numero4': 43, 'numero5': 47, 'numero6': 50}
print(rows[1]) --> {'sena': '1643', 'data': '16/05/2017', 'numero1': 3, 'numero2': 4, 'numero3': 9, 'numero4': 19, 'numero5': 21, 'numero6': 26}
However, when I try to store the contents of row_dict in the CSV, only one of the rows ends up in row_dict. I'm trying to figure out how to include the missing one. Maybe I'm wrong, but I thought both the "1º sorteio" and the "2º sorteio" should be in row_dict; the code doesn't confirm that (this is a guess), as we can see here:
print(row_dict)
{'sena': '1643', 'data': '16/05/2017', 'numero1': 3, 'numero2': 4, 'numero3': 9, 'numero4': 19, 'numero5': 21, 'numero6': 26}
I can't see what I'm doing wrong. I know this answer is costing you a lot of time, but I'm learning a lot with you in the process, and I'm already using several of the tools I've learned with you (re, comprehensions, dicts, zip).
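The duplicated/missing-row symptom can be reproduced in isolation: row_dict is rebound on every pass of the loop, so after the loop it only holds the last row, and calling writerow(row_dict) instead of looping over rows writes only that row. A small sketch with made-up values:

```python
import csv
import io

rows = [
    {'sena': '1643', 'numero1': 1},   # 1º sorteio (made-up values)
    {'sena': '1643', 'numero1': 3},   # 2º sorteio (made-up values)
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['sena', 'numero1'])
writer.writeheader()
for row in rows:          # loop over rows, not the leftover row_dict
    writer.writerow(row)

print(buf.getvalue())     # header plus one distinct line per drawing
```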
Disclaimer: I'm not very familiar with Beautiful Soup, I usually use lxml. That said...
soup = BeautifulSoup(response.text) # <-- edit showing how i assigned soup
pat = re.compile(r'(?i)(?<=concurso)\s*(?P<concurso>\d+)\s*\((?P<data>.+?)(?=\))')
concurso_e_data = soup.find(id='resultados').h2.span.text
match = pat.search(concurso_e_data)
if match:
concurso, data = match.groups()
nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
numeros = []
for i in nums:
numeros.append(','.join(j.text for j in i.findAll('li')))
rows = []
for n in numeros:
rows.append(','.join([concurso, data, n]))
print(rows)
['1639,06/05/2017,08,09,16,26,37,50', '1639,06/05/2017,13,28,32,33,37,38']
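The regex from this answer can be checked on its own against a sample header string (the literal below mimics the site's span text):

```python
import re

pat = re.compile(r'(?i)(?<=concurso)\s*(?P<concurso>\d+)\s*\((?P<data>.+?)(?=\))')
match = pat.search('Concurso 1639 (06/05/2017)')
if match:
    concurso, data = match.groups()
    print(concurso, data)  # 1639 06/05/2017
```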
Although this is the format you asked for, using commas (the column delimiter) inside the number groups is not a good idea. You should separate the numbers with another character, or with spaces.
Update 1:
Writing in the comments section isn't the best approach, so... Let's say the format you really want is rows of 8 fields like this (seria, data, num1, num2, ... num6), where seria and data are strings and the numbers are ints:
# first I would do the above part differently seeing as how you want the end data to look
...
if match:
concurso, data = match.groups()
nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
num_headers = (','.join(['numero%d']*6) % tuple(range(1,7))).split(',')
# unpack numheaders into field names
field_names = ['seria', 'data', *num_headers]
# above gives you this
# field_names = [
# 'seria',
# 'data',
# 'numero1',
# 'numero2',
# 'numero3',
# 'numero4',
# 'numero5',
# 'numero6',
# ]
rows = [
dict(zip(
field_names,
[concurso, data, *[int(num.text) for num in group.findAll('li')]]
))
for group in nums]
# so now rows looks like: [{
# 'seria': '1234',
# 'data': '12/13/2017',
# 'numero1': 1,
# 'numero2': 2,
# 'numero3': 3,
# 'numero4': 4,
# 'numero5': 5,
# 'numero6': 6
# }, ...]
with open('file.csv', 'a', encoding='utf-8') as csvfile:
csv_writer = csv.DictWriter(
csvfile,
fieldnames=field_names,
dialect='excel',
extrasaction='ignore', # drop extra fields if not in field_names not necessary but just in case
quoting=csv.QUOTE_NONNUMERIC # quote anything thats not a number, again just in case
)
csv_writer.writeheader()
for row in rows:
csv_writer.writerow(row_dict)
This part is a bit messy:
rows = [
dict(zip(
field_names,
[concurso, data, *[int(num.text) for num in group.findAll('li')]]
))
for group in nums]
So let me write it another way:
rows = []
# then add the numbers
# nums is all the `ul` list elements contains the drawing numbers
for group in nums:
# start each row with the shared concurso, data elements
row = [concurso, data]
# for each `ul` get all the `li` elements containing the individual number
for num in group.findAll('li'):
# add each number
row.append(int(num.text))
# get [('seria', '1234'), ('data', '12/13/2017'), ...]
row_title_value_pairs = zip(field_names, row)
# turn into dict {'seria': '1234', 'data': '12/13/2017', ...}
row_dict = dict(row_title_value_pairs)
rows.append(row_dict)
# or just write the csv here instead of appending to rows and re-looping over the values
...
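The zip/dict step at the end of that loop can also be seen in isolation: zip pairs the header list with the value list, and dict turns the pairs into a row dictionary:

```python
field_names = ['seria', 'data', 'numero1']
row = ['1234', '12/13/2017', 1]

row_dict = dict(zip(field_names, row))
print(row_dict)  # {'seria': '1234', 'data': '12/13/2017', 'numero1': 1}
```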
Update 2:
One thing I hope you take away from this is to use print statements while you're learning, so you can understand what the code is doing. I won't make the corrections, but I'll point them out and add a print statement at every place where something significant changes...
match = pat.search(concurso_e_data)
# everything should be indented under this block since
# if there is no match then none of the below code should run
if match:
concurso, data = match.groups()
nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
num_headers = (','.join(['numero%d']*6) % tuple(range(1,7))).split(',')
field_names = ['sena', 'data', *num_headers]
# PROBLEM 1
# all this should be indented into the `if match:` block above
# none of this should run if there is no match
# you cannot build the rows without the match for sena and data
# Let's add some print statements to see whats going on
rows = []
for group in nums:
# here each group is a full `sena` row from the site
print('Pulling sena: ', group.text)
row = [concurso, data]
print('Adding concurso + data to row: ', row)
for num in group.findAll('li'):
row.append(int(num.text))
print('Adding {} to row.'.format(num))
print('Row complete: ', row)
row_title_value_pairs = zip(field_names, row)
print('Transform row to header, value pairs: ', row_title_value_pairs)
row_dict = dict(row_title_value_pairs)
print('Row dictionary: ', row_dict)
rows.append(row_dict)
print('Rows: ', rows)
# PROBLEM 2
# It would seem that you've confused this section when switching
# out the original list comprehension with the more explicit
# for loop in building the rows.
# The below block should be indented to this level.
# Still under the `if match:`, but out of the the
# `for group in nums:` above
# the below block loops over rows, but you are still building
# the rows in the for loop
# you are effectively double looping over the values in `row`
with open('file_v5.csv', 'w', encoding='utf-8') as csvfile:
csv_writer = csv.DictWriter(
csvfile,
fieldnames=field_names,
dialect='excel',
extrasaction='ignore', # drop extra fields if not in field_names not necessary but just in case
quoting=csv.QUOTE_NONNUMERIC # quote anything thats not a number, again just in case
)
csv_writer.writeheader()
# this is where you are looping extra because this block is in the `for` loop mentioned in my above notes
for row in rows:
print('Adding row to CSV: ', row)
csv_writer.writerow(row_dict)
Run it and see what the print statements show you. But also read the comments, because some things will cause errors if there is no sena, data match.
Hint: fix the indentation, then add else: print('No sena, data match!') under the final if match: block... but run it first and check what it prints.
(Posted on behalf of the OP.)
With @Verbal_Kint's help, we made it! The output is exactly the way I need it! I've changed the output to:
sena;data;numero1;numero2;numero3;numero4;numero5;numero6
1644;18/05/2017;4;6;31;39;47;49
1644;18/05/2017;20;37;44;45;46;50
So, after the trouble with "," vs ";" in Excel, I decided to change "," to ";" so that the columns open in Excel without any problem.
import requests
from bs4 import BeautifulSoup
import re
import csv
url = 'http://loterias.caixa.gov.br/wps/portal/loterias/landing/duplasena/'
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml") ## "lxml" to avoid the warning
pat = re.compile(r'(?i)(?<=concurso)\s*(?P<concurso>\d+)\s*\((?P<data>.+?)(?=\))')
concurso_e_data = soup.find(id='resultados').h2.span.text
match = pat.search(concurso_e_data)
if match:
concurso, data = match.groups()
nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
num_headers = (','.join(['numero%d']*6) % tuple(range(1,7))).split(',')
field_names = ['sena', 'data', *num_headers]
rows = []
for group in nums:
row = [concurso, data]
for num in group.findAll('li'):
row.append(int(num.text))
row_title_value_pairs = zip(field_names, row)
row_dict = dict(row_title_value_pairs)
rows.append(row_dict)
with open('ds_v10.csv', 'w', encoding='utf-8') as csvfile:
csv_writer = csv.DictWriter(
csvfile,
fieldnames=field_names,
dialect='excel',
delimiter = ';', #to handle column issue in excel!
)
csv_writer.writeheader()
csv_writer.writerow(rows[0])
csv_writer.writerow(rows[1])
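One small, optional simplification: writing rows[0] and rows[1] by index will break if the page ever yields a different number of drawings; DictWriter.writerows handles however many rows were collected. A sketch with made-up rows:

```python
import csv
import io

rows = [
    {'sena': '1644', 'data': '18/05/2017'},  # made-up example rows
    {'sena': '1644', 'data': '18/05/2017'},
]

buf = io.StringIO()
csv_writer = csv.DictWriter(buf, fieldnames=['sena', 'data'], delimiter=';')
csv_writer.writeheader()
csv_writer.writerows(rows)   # replaces the two writerow(rows[i]) calls

print(buf.getvalue())
```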