Find string patterns using regex and append the results to a list

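I'm a beginner using the re library in Python. I'm doing some web scraping, and I want to match certain string patterns and append the values to lists. For example: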

parking = []
rooms  = []
toilets = []


attribute = soup.find('ul',{'class':'specs-list'}).find_all('li')
for a in attribute:
    print(a.text)

Output of iterating over a for the item at index 0:

Metters
50 m²

Rooms
2

Toilets
1

Output of iterating over a for the item at index 1:

   Metters
   50 m²
   
   parking 
   1
   
   spends
   340 

 

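For example, I want to match on the heading name: if it is present in the value of a, I want to append the result to the corresponding list.

Pseudocode: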

for a in attribute:
  if a contains "Rooms":
     rooms.append(a)
  if a contains "Parking":
     parking.append(a)
  if a contains "Toilets":
     toilets.append(a)

  if a not contains strings above:
     rooms.append(nan)
     parking.append(nan)
     toilets.append(nan)

I'm using BeautifulSoup for the scraping, and the resulting attribute value looks like this:

Output of the attribute variable for index 0:

[<li class="specs-item">
<strong>Metters</strong>
<span>50 m²</span>
</li>,<li class="specs-item">
<strong>Rooms</strong>
<span>2</span>
</li>,<li class="specs-item">
<strong>Toilets</strong>
<span>1</span>
</li>,<li class="specs-item">
<strong>Spends</strong>
<span>340</span></li>]

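The attribute variable has a length of 5 values. Each one has markup similar to the above, but with different headings and values: some contain parking, rooms, and toilets, others only toilets and rooms, and so on.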

This should help:

from bs4 import BeautifulSoup
import requests 

parking = []
rooms  = []
toilets = []

html = requests.get('website url').text

soup = BeautifulSoup(html,'html.parser')

attribute = soup.find_all('li',{'class':'specs-item'})

for a in attribute:
    
    heading = a.strong.text
    span = a.span.text
    
    if heading == "Parking":
        parking.append(span)
    elif heading == "Rooms":
        rooms.append(span)
    elif heading == "Toilets":
        toilets.append(span)
    
print("Parking =" , parking)
print("Rooms =", rooms)
print("Toilets =", toilets)

Output for the li values you provided:

Parking = []
Rooms = ['2']
Toilets = ['1']
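
Since the question asks about re specifically, and the headings in the sample output are not capitalized consistently ("Parking" vs "parking"), a case-insensitive regex over each item's text is another option. The following is only a sketch: item_texts is a hypothetical stand-in for [a.text for a in attribute], filled with values copied from the question, and it assumes each item's text is a heading on one line followed by its value on the next.

```python
import re

# Hypothetical stand-in for [a.text for a in attribute]; values taken from the question.
item_texts = ["\nMetters\n50 m²\n", "\nRooms\n2\n", "\nToilets\n1\n", "\nSpends\n340"]

# Heading on one line, value on the next. IGNORECASE lets "parking" and "Parking" both match.
pattern = re.compile(r"^\s*(parking|rooms|toilets)\s*\n\s*(\S.*?)\s*$", re.IGNORECASE)

found = {"parking": [], "rooms": [], "toilets": []}
for text in item_texts:
    match = pattern.search(text)
    if match:
        # Lower-case the heading so it lines up with the dictionary keys.
        found[match.group(1).lower()].append(match.group(2))

print(found)
```

Items whose heading is not one of the three names (Metters, Spends) simply do not match and are skipped.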

Edit:

Although this works, I feel that keeping so many lists is not a good approach. You can use a dictionary instead. Here is how to achieve the same output with a dictionary:

details_dict = {'Parking':[],
                'Rooms':[],
                'Toilets':[]}
for a in attribute:
    
    heading = a.strong.text
    span = a.span.text
    
    if heading in details_dict:
        details_dict[heading].append(span)

print(details_dict)

Output:

{'Parking': [], 'Rooms': ['2'], 'Toilets': ['1']}
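
The pseudocode in the question also appends nan when a listing lacks one of the headings, which the code above does not do. A minimal sketch of that idea follows, assuming the specs have already been grouped per listing. The listings variable is a hypothetical stand-in: each inner list would hold the (a.strong.text, a.span.text) pairs of one listing's li items, and the values here are copied from the two sample outputs in the question.

```python
import math

# Hypothetical stand-in for the scraped data: one inner list per listing, holding
# (heading, value) pairs such as (a.strong.text, a.span.text) for each <li>.
listings = [
    [("Metters", "50 m²"), ("Rooms", "2"), ("Toilets", "1")],
    [("Metters", "50 m²"), ("parking", "1"), ("spends", "340")],
]

wanted = ["Parking", "Rooms", "Toilets"]
details_dict = {name: [] for name in wanted}

for pairs in listings:
    # Normalise heading case so "parking" and "Parking" land under the same key.
    specs = {heading.strip().capitalize(): value.strip() for heading, value in pairs}
    for name in wanted:
        # Each listing contributes exactly one entry per key; NaN marks a missing spec.
        details_dict[name].append(specs.get(name, math.nan))

print(details_dict)
```

With this layout every list has one entry per listing, so position i in each list refers to the same listing, which is convenient if the result is later loaded into a table.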

I think this is the better approach, but it is entirely up to you. Pick whichever suits your task best.