Using a function as argument to re.sub in Python?
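I am writing a program to split the words contained in a hashtag.
For example, I want to split the hashtags:
    #Whatthehello #goback
into:
    What the hello go back
I am having trouble using re.sub with a function as the replacement argument.
The code I have written is: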
    import re,pdb

    def func_replace(each_func):
        i=0
        wordsineach_func=[]
        while len(each_func) >0:
            i=i+1
            word_found=longest_word(each_func)
            if len(word_found)>0:
                wordsineach_func.append(word_found)
                each_func=each_func.replace(word_found,"")
        return ' '.join(wordsineach_func)

    def longest_word(phrase):
        phrase_length=len(phrase)
        words_found=[];index=0
        outerstring=""
        while index < phrase_length:
            outerstring=outerstring+phrase[index]
            index=index+1
            if outerstring in words or outerstring.lower() in words:
                words_found.append(outerstring)
        if len(words_found) ==0:
            words_found.append(phrase)
        return max(words_found, key=len)

    words=[]
    # The file corncob_lowercase.txt contains a list of dictionary words
    with open('corncob_lowercase.txt') as f:
        read_words=f.readlines()

    for read_word in read_words:
        words.append(read_word.replace("\n","").replace("\r",""))
For example, when the functions are used like this:
    s="#Whatthehello #goback"
    #checking if the function is able to segment words
    hashtags=re.findall(r"#(\w+)", s)
    print func_replace(hashtags[0])
    # using the function for re.sub
    print re.sub(r"#(\w+)", lambda m: func_replace(m.group()), s)
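the output I get is:
    What the hello
    #Whatthehello #goback
which is not the output I expected:
    What the hello
    What the hello go back
Why is this happening? In particular, I followed the suggestion from this answer, but I don't understand what goes wrong in this code.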
Note that m.group() returns the entire matched string, whether or not it is part of a capture group:
    In [19]: m = re.search(r"#(\w+)", s)
    In [20]: m.group()
    Out[20]: '#Whatthehello'
m.group(0) also returns the entire match:
    In [23]: m.group(0)
    Out[23]: '#Whatthehello'
In contrast, m.groups() returns all the capture groups:
    In [21]: m.groups()
    Out[21]: ('Whatthehello',)
and m.group(1) returns the first capture group:
    In [22]: m.group(1)
    Out[22]: 'Whatthehello'
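For reference, here is a small self-contained sketch that puts the three accessors side by side on the same match. It is written in Python 3 syntax, unlike the Python 2 sessions in this thread, and uses only the sample string from the question.

```python
import re

s = "#Whatthehello #goback"
m = re.search(r"#(\w+)", s)

whole = m.group()        # same as m.group(0): the entire match, '#' included
first = m.group(1)       # only the text captured by the first (...) group
all_groups = m.groups()  # tuple holding every capture group

print(whole)
print(first)
print(all_groups)
```

Only `m.group(1)` (or `m.groups()[0]`) drops the leading `#`, because the `#` sits outside the parentheses in the pattern.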
So the problem in your code originates in the use of m.group() in
    re.sub(r"#(\w+)", lambda m: func_replace(m.group()), s)
since
    In [7]: re.search(r"#(\w+)", s).group()
    Out[7]: '#Whatthehello'
whereas if you had used .group(1), you would have gotten
    In [24]: re.search(r"#(\w+)", s).group(1)
    Out[24]: 'Whatthehello'
and the preceding # makes all the difference:
    In [25]: func_replace('#Whatthehello')
    Out[25]: '#Whatthehello'
    In [26]: func_replace('Whatthehello')
    Out[26]: 'What the hello'
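The reason is visible in longest_word: every prefix it tests starts with '#', so none of them is in the word list, and the function falls back to returning the whole phrase unchanged. A minimal sketch of that prefix check follows. Both `demo_words` and `matching_prefixes` are hypothetical names introduced for illustration, the tiny word set stands in for the dictionary file, and the lookup is simplified to a lower-case comparison.

```python
# Hypothetical stand-in for the dictionary file; not the real word list.
demo_words = {"what", "the", "hello"}

def matching_prefixes(phrase, vocabulary):
    """Return every prefix of `phrase` found in `vocabulary` (case-insensitive)."""
    return [phrase[:end] for end in range(1, len(phrase) + 1)
            if phrase[:end].lower() in vocabulary]

with_hash = matching_prefixes("#Whatthehello", demo_words)
without_hash = matching_prefixes("Whatthehello", demo_words)

print(with_hash)     # no prefix beginning with '#' is a word
print(without_hash)  # 'What' is found once the '#' is gone
```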
Thus, changing m.group() to m.group(1), and substituting /usr/share/dict/words for corncob_lowercase.txt,
    import re

    def func_replace(each_func):
        i = 0
        wordsineach_func = []
        while len(each_func) > 0:
            i = i + 1
            word_found = longest_word(each_func)
            if len(word_found) > 0:
                wordsineach_func.append(word_found)
                each_func = each_func.replace(word_found, "")
        return ' '.join(wordsineach_func)

    def longest_word(phrase):
        phrase_length = len(phrase)
        words_found = []
        index = 0
        outerstring = ""
        while index < phrase_length:
            outerstring = outerstring + phrase[index]
            index = index + 1
            if outerstring in words or outerstring.lower() in words:
                words_found.append(outerstring)
        if len(words_found) == 0:
            words_found.append(phrase)
        return max(words_found, key=len)

    words = []
    # /usr/share/dict/words stands in for corncob_lowercase.txt as the list of dictionary words
    with open('/usr/share/dict/words', 'rb') as f:
        for read_word in f:
            words.append(read_word.strip())

    s = "#Whatthehello #goback"
    hashtags = re.findall(r"#(\w+)", s)
    print func_replace(hashtags[0])
    print re.sub(r"#(\w+)", lambda m: func_replace(m.group(1)), s)
prints
    What the hello
    What the hello gob a c k
since, alas, 'gob' is longer than 'go'.
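Greedy longest-prefix matching cannot recover once it has committed to 'gob'. One possible way around this, which is not part of the original answer, is to backtrack: try each dictionary prefix, longest first, and keep it only if the remainder can also be segmented. The sketch below is in Python 3 and uses a small hypothetical word set `demo_words` and a hypothetical function `segment`; with a real dictionary the result depends on which words it contains.

```python
from functools import lru_cache

# Hypothetical word set for illustration; a real dictionary file is much larger.
demo_words = frozenset({"what", "the", "hello", "go", "gob", "back", "a"})

def segment(text, vocabulary=demo_words):
    """Split `text` into vocabulary words, or return None if that is impossible."""

    @lru_cache(maxsize=None)
    def solve(start):
        # Reaching the end means everything before it was segmented.
        if start == len(text):
            return ()
        # Try longer prefixes first, but fall back to shorter ones.
        for end in range(len(text), start, -1):
            piece = text[start:end]
            if piece.lower() in vocabulary:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    return solve(0)

result_goback = segment("goback")
result_hello = segment("Whatthehello")
print(result_goback)
print(result_hello)
```

With this word set, 'gob' is tried first but rejected because 'ack' cannot be segmented, so the search falls back to 'go' followed by 'back'.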
One way you could debug this is to replace the lambda function with a regular function and then add print statements:
    def foo(m):
        result = func_replace(m.group())
        print(m.group(), result)
        return result

    In [35]: re.sub(r"#(\w+)", foo, s)
    ('#Whatthehello', '#Whatthehello')  <-- This shows you what `m.group()` and `func_replace(m.group())` return
    ('#goback', '#goback')
    Out[35]: '#Whatthehello #goback'
That would focus your attention on
    In [25]: func_replace('#Whatthehello')
    Out[25]: '#Whatthehello'
which you could then compare with
    In [26]: func_replace(hashtags[0])
    Out[26]: 'What the hello'
    In [27]: func_replace('Whatthehello')
    Out[27]: 'What the hello'
That would lead you to ask: if m.group() returns '#Whatthehello', what method do I need to return 'Whatthehello'? Diving into the docs would then solve the problem.
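The same debugging idea can be sketched in Python 3 syntax as a self-contained example. Here `shout` is a hypothetical stand-in for func_replace so that the snippet runs without a dictionary file, and `debug_callback` and `seen` are likewise illustrative names. The callback records what re.sub hands it, which makes the difference between `m.group()` and `m.group(1)` visible at a glance.

```python
import re

s = "#Whatthehello #goback"
seen = []  # what the callback received, kept for inspection afterwards

def shout(text):
    # Hypothetical stand-in for func_replace: just upper-cases its input.
    return text.upper()

def debug_callback(m):
    # re.sub calls this once per match and inserts the returned string.
    seen.append((m.group(), m.group(1)))
    print("group():", m.group(), "| group(1):", m.group(1))
    return shout(m.group(1))

result = re.sub(r"#(\w+)", debug_callback, s)
print(result)
```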