Python

Question

I'm trying to fetch some data. Here is my code:

import requests
from bs4 import BeautifulSoup

url = 'http://www.privredni-imenik.com/firma/68225-a_expo'
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

g_data = soup.find_all("div", {"class":"podaci"})
for i in g_data:
    some = i.text.encode('utf-8', 'ignore')
    print (some)

It works, but the result looks like this:

b'A & L EXPO PREDUZE\xc4\x86E ZA PROIZVODNJU

where \xc4\x86 should be displayed as the letter Ć.

How can I make this work?

Answer 1

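b'\xc4\x86' is a bytes object, not a string (you can tell by the 'b' in front of the quotes). When you print a bytes object, any byte outside the printable ASCII range is shown as a hexadecimal escape. To print the UTF-8 characters you want to see, you need to decode the bytes object into a string object (or, looking at your code, not encode it into a bytes object in the first place).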

For example, try:

>>> b'\xc4\x86'.decode()
'Ć'

For more on bytes and strings, read: http://www.diveintopython3.net/strings.html
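To make the bytes/str distinction concrete, here is a minimal offline sketch of the round trip. The variable names and the sample word are illustrative only (the word is taken from the output in the question); no network request is made.

```python
# Illustrative sample text; stands in for the scraped i.text value.
text = "PREDUZEĆE"

# Encoding turns the str into bytes; Ć becomes the two bytes 0xC4 0x86.
raw = text.encode("utf-8")
print(raw)       # b'PREDUZE\xc4\x86E'

# Decoding with the same codec turns the bytes back into a str.
restored = raw.decode("utf-8")
print(restored)  # PREDUZEĆE
```

Printing the str shows Ć, while printing the bytes shows the escaped form seen in the question.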

Answer 2

You already have a string, so just print the text:

In [18]: g_data = soup.find_all("div", {"class":"podaci"})

In [19]: for i in g_data:
   ....:         some = i.text
   ....:         print (some)
   ....:     
A & L EXPO PREDUZEĆE ZA PROIZVODNJU, TRGOVINU I USLUGE DOO 11070 BEOGRAD VLADIMIRA POPOVTelefaksMatični broj: 17461460  Informacije o delatnostima koje obavlja ova firma:  » Organizovanje sastanaka i sajmova 

In [20]:  print(type(some))
<class 'str'>    
In [21]: print(type(some.encode('utf-8', 'ignore')))
<class 'bytes'>

With i.text.encode('utf-8', 'ignore') you are encoding the string to bytes. You don't need to do anything at all except print the text.
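As a rough illustration of what the encode call in the question does, the sketch below uses a hard-coded string as a stand-in for i.text (an assumption, so the example runs without the website). It also shows that the 'ignore' argument only has an effect when a character cannot be represented in the target encoding, which is not the case for Ć in UTF-8.

```python
# Stand-in for i.text; BeautifulSoup already returns a str here.
name = "A & L EXPO PREDUZEĆE"

# Encoding to UTF-8 produces bytes, which print with \x escapes.
encoded = name.encode("utf-8", "ignore")
print(type(name).__name__, name)
print(type(encoded).__name__, encoded)

# With a narrower codec such as ASCII, 'ignore' silently drops Ć.
ascii_only = name.encode("ascii", "ignore")
print(ascii_only)  # b'A & L EXPO PREDUZEE'
```

In both cases the result is a bytes object, so printing the original str is the simpler route.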

Python - Cant make it to encode string properly

encoding

beautifulsoup

character-encoding

python-3.x