Text stripping issue

Question

Apologies in advance if this is a PEBKAC issue, but I can't see what I'm doing wrong.

Python 3.5.1 (FWIW)

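I'm pulling data from an online source. Each line of the page has \r\n etc. removed with .strip() and is converted to a utf-8 string. The lines I'm looking for are then reduced further, as shown below.

I want to take two strings, join them, and strip out all non-alphanumeric characters.

> x = "ABC"
> y = "Some-text as an example."
> z = x+y.lower()

> type(z)
<class 'str'>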

So here's the problem.

> z = z.strip("'-. ")
> print(z)

Why is the result:

ABCsome-text as an example

rather than, as I'd like:

ABCsometextasanexample

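I can get it working with four .replace() calls, but strip really doesn't want to work here. I've also tried separate strip calls:

> y = y.strip("-")
> print(y)
Some-text as an example.

whereas

> y = y.replace("-", '')
> print(y)
Sometext as an example.

What might I be doing wrong with .strip()?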

Answer 1

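strip() doesn't remove every occurrence of the given characters; it only removes them from the beginning and end of the string.

From the official documentation: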

Return a copy of the string with the leading and trailing characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped
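
To see the quoted behavior in practice, here is a small sketch (the padded sample string is made up for illustration). strip() treats its argument as a set of characters and trims them from both ends only, stopping at the first character that is not in the set.

```python
# chars is treated as a set; matching characters are removed from both ends only.
padded = "--. some-text as an example. --"
trimmed = padded.strip("'-. ")
print(repr(trimmed))  # 'some-text as an example'

# Applied to the string from the question: nothing at the start matches,
# and only the trailing "." is removed. The inner "-" and spaces stay.
z = "ABC" + "Some-text as an example.".lower()
stripped_z = z.strip("'-. ")
print(repr(stripped_z))  # 'ABCsome-text as an example'
```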

Answer 2

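As others have pointed out, the problem with strip() is that it only operates on characters at the beginning and end of the string, so calling replace() several times is the way to reach your goal if you only want to use string methods.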
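
A minimal sketch of the string-methods-only route, assuming the same four unwanted characters as in the question. The loop is just the four replace() calls written compactly; str.translate() with a deletion table from str.maketrans() is shown as another built-in alternative, not something the question asked for.

```python
x = "ABC"
y = "Some-text as an example."
z = x + y.lower()

chars = "'-. "  # characters to delete

# Option 1: one replace() call per unwanted character.
cleaned = z
for ch in chars:
    cleaned = cleaned.replace(ch, "")
print(cleaned)  # ABCsometextasanexample

# Option 2: a single translate() call with a deletion table.
table = str.maketrans("", "", chars)
translated = z.translate(table)
print(translated)  # ABCsometextasanexample
```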

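Although it's not what you asked, here's how to do it with a single call to the re.sub() function from the re regular expression module. The characters to be replaced are defined by the contents of the string variable named chars.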

import re

x = "ABC"
y = "Some-text as an example."
z = x + y.lower()

print('before: {!r}'.format(z))  # -> before: 'ABCsome-text as an example.'

chars = "'-. "  # Characters to be replaced.
z = re.sub('(' + '|'.join(re.escape(ch) for ch in chars) + ')', '', z)

print('after: {!r}'.format(z))  # -> after: 'ABCsometextasanexample'

Answer 3

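Since you want to remove all non-alphanumeric characters, let's make it more general (note that \W treats the underscore as a word character, so underscores are kept):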

import re

x = "ABC"
y = "Some-text as an example."
z = x+y.lower()

z = re.sub(r'\W+', '', z)
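
As a quick check of what \W does and does not remove, here is a small sketch with a made-up sample string. The explicit character class is an assumption about what "alphanumeric" should mean (ASCII letters and digits only); adjust it if non-ASCII letters should be kept.

```python
import re

sample = "ABCsome-text as_an example. 42"

# \W matches anything that is not a "word" character; the underscore
# counts as a word character, so it survives.
word_only = re.sub(r"\W+", "", sample)
print(word_only)  # ABCsometextas_anexample42

# An explicit ASCII character class drops the underscore as well.
ascii_alnum = re.sub(r"[^A-Za-z0-9]+", "", sample)
print(ascii_alnum)  # ABCsometextasanexample42
```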

Answer 4

Another solution is to use Python's filter():

x = "ABC"
y = "Some-text as an example."
z = x+y.lower()

# In Python 3, filter() returns an iterator, so join the kept characters back into a str.
z = ''.join(filter(str.isalnum, z))
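
The same filtering can also be written as a generator expression. A minimal sketch, where keep_alnum is a hypothetical helper name and the second sample string is made up for illustration:

```python
def keep_alnum(text):
    """Return text with every non-alphanumeric character removed."""
    # str.isalnum() is Unicode-aware, so accented letters are kept too.
    return "".join(c for c in text if c.isalnum())

first = keep_alnum("ABCsome-text as an example.")
second = keep_alnum("café_au-lait 2")
print(first)   # ABCsometextasanexample
print(second)  # caféaulait2
```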
