How to get rid of some characters from a string? .replace() doesn't work

Question

I need to remove Polish characters from a string that I get from an XML file. I use .replace(), but it doesn't work in this case. Why? The code:

# -*- coding: utf-8
from prestapyt import PrestaShopWebService
from xml.etree import ElementTree

prestashop = PrestaShopWebService('http://localhost/prestashop/api', 'key')
prestashop.debug = True

name = ElementTree.tostring(
    prestashop.search('products',
                      options={'display': '[name]', 'filter[id]': '[2]'}),
    encoding='cp852', method='text')

print name
print name.replace('ł', 'l')

Output:

Naturalne mydło odświeżające
Naturalne mydło odświeżające

But when I try to replace a non-Polish character, it works fine.

print name
print name.replace('a', 'o')

Result:

Naturalne mydło odświeżające
Noturolne mydło odświeżojące

This also works fine:

name = "Naturalne mydło odświeżające"
print name.replace('ł', 'l')

Any suggestions?

Answer 1

If I understand your question correctly, you can use the third-party unidecode package:

>>> from unidecode import unidecode
>>> unidecode("Naturalne mydło odświeżające")
'Naturalne mydlo odswiezajace'

You may need to decode your cp852-encoded byte string to Unicode first, with name.decode('cp852').
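If installing a third-party package is not an option, a standard-library sketch along the following lines may be enough for Polish text. It assumes Python 3 and that the incoming bytes really are cp852; POLISH_TO_ASCII and strip_polish are hypothetical names introduced here, not part of unidecode. An explicit table is used because 'ł' has no Unicode decomposition, so unicodedata.normalize alone would leave it untouched.

```python
# Hypothetical stdlib-only helper (Python 3); not part of unidecode.
# Maps each Polish letter to its closest ASCII letter.
POLISH_TO_ASCII = str.maketrans(
    "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ",
    "acelnoszzACELNOSZZ",
)


def strip_polish(text):
    """Return text with Polish diacritic letters replaced by ASCII ones."""
    return text.translate(POLISH_TO_ASCII)


# The cp852 bytes shown in the question's repr() output, assumed as input.
raw = b'Naturalne myd\x88o od\x98wie\xbeaj\xa5ce'
name = raw.decode('cp852')   # decode to Unicode before touching characters
print(strip_polish(name))    # Naturalne mydlo odswiezajace
```

Unlike unidecode, this table covers only the Polish alphabet, so any other non-ASCII character passes through unchanged.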

Answer 2

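You are mixing encodings in byte strings. Here is a short working example that reproduces the problem. I am assuming you are running in a Windows console whose default encoding is cp852: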

#!python2
# coding: utf-8
from xml.etree import ElementTree as et
name_element = et.Element('data')
name_element.text = u'Naturalne mydło odświeżające'
name = et.tostring(name_element,encoding='cp852', method='text')
print name
print name.replace('ł', 'l')

Output (no replacement):

Naturalne mydło odświeżające
Naturalne mydło odświeżające

The reason is that the name string is encoded in cp852, whereas the byte string constant 'ł' is encoded in the source file's encoding, UTF-8.

print repr(name)
print repr('ł')

Output:

'Naturalne myd\x88o od\x98wie\xbeaj\xa5ce'
'\xc5\x82'
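The same mismatch can be reproduced in Python 3, where the two encodings have to be spelled out explicitly. This is only an illustrative sketch; the variable names are made up for the example.

```python
# Python 3 sketch: one character, two different byte representations.
polish_l = 'ł'
as_cp852 = polish_l.encode('cp852')   # b'\x88'
as_utf8 = polish_l.encode('utf-8')    # b'\xc5\x82'

# The name as cp852 bytes, like the value returned in the question.
name_bytes = 'Naturalne mydło odświeżające'.encode('cp852')

# Searching for the UTF-8 bytes finds nothing, so nothing is replaced.
unchanged = name_bytes.replace(as_utf8, b'l')
print(unchanged == name_bytes)        # True

# Searching for the cp852 byte does match.
changed = name_bytes.replace(as_cp852, b'l')
print(changed.decode('cp852'))        # Naturalne mydlo odświeżające
```

Replacing at the byte level works only when both sides use the same encoding, and it is fragile for multi-byte encodings, which is why decoding to Unicode first is usually the safer route.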

The best solution is to use Unicode strings:

#!python2
# coding: utf-8
from xml.etree import ElementTree as et
name_element = et.Element('data')
name_element.text = u'Naturalne mydło odświeżające'
name = et.tostring(name_element,encoding='cp852', method='text').decode('cp852')
print name
print name.replace(u'ł', u'l')
print repr(name)
print repr(u'ł')

Output (replacement made):

Naturalne mydło odświeżające
Naturalne mydlo odświeżające
u'Naturalne myd\u0142o od\u015bwie\u017caj\u0105ce'
u'\u0142'
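The decode-first pattern carries over to Python 3 when tostring is asked for an encoded result, because it then returns bytes. The sketch below assumes Python 3; text_of is a hypothetical helper name, not an ElementTree function.

```python
# Python 3 sketch: decode at the boundary, then work on str only.
from xml.etree import ElementTree as et


def text_of(element, encoding='cp852'):
    """Hypothetical helper: serialize an element's text and decode it to str."""
    raw = et.tostring(element, encoding=encoding, method='text')  # bytes
    return raw.decode(encoding)


name_element = et.Element('data')
name_element.text = 'Naturalne mydło odświeżające'

name = text_of(name_element)
print(name.replace('ł', 'l'))   # Naturalne mydlo odświeżające
```

Keeping the decode step inside one function means the rest of the program never sees encoded bytes, so literals such as 'ł' compare as characters rather than as encoding-dependent byte sequences.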

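Note that in Python 3, et.tostring has a Unicode option (encoding='unicode'), and string constants are Unicode by default. The repr() of a string is also more readable, while ascii() gives the old escaped behavior. You will also find that Python 3.6 prints Polish even on consoles that do not use a Polish code page, so you may not need to replace the characters at all.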

#!python3
# coding: utf-8
from xml.etree import ElementTree as et
name_element = et.Element('data')
name_element.text = 'Naturalne mydło odświeżające'
name = et.tostring(name_element,encoding='unicode', method='text')
print(name)
print(name.replace('ł','l'))
print(repr(name),repr('ł'))
print(ascii(name),ascii('ł'))

Output:

Naturalne mydło odświeżające
Naturalne mydlo odświeżające
'Naturalne mydło odświeżające' 'ł'
'Naturalne myd\u0142o od\u015bwie\u017caj\u0105ce' '\u0142'
