Word boundary regex does not match the whole word for Devanagari script
import re
articles = ['a','an','the']
regex = r"\b(?:{})\b".format("|".join(articles))
sent = 'Davis is theta'
re.split(regex,sent)
>> ['Davis is theta']
This snippet works for English (the "the" inside "theta" is not matched), but with Devanagari script it also matches partial words.
stopwords = ['कम','र','छ']
regex = r"\b(?:{})\b".format("|".join(stopwords))
sent = "रामको कम्पनी छ"
re.split(regex,sent)
>> ['', 'ामको ', '्पनी छ']
Expected output:
['रामको', 'कम्पनी']
I am using Python 3. Is this a bug, or am I missing something?
I suspect \b only treats [a-zA-Z0-9] as word characters, and I am using Unicode text. Is there an alternative approach for this task?
You may want to use findall instead of split, with this code:
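import re
stopwords = ['कम','र','छ']
# Match each whitespace-delimited token unless the whole token is a stopword.
# (?<!\S) anchors at a token start so a standalone stopword is never partially matched.
reg = re.compile(r'(?<!\S)(?!(?:{})(?!\S))\S+'.format("|".join(stopwords)))
sent = 'रामको कम्पनी छ'
print(reg.findall(sent))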
This regex avoids word boundaries, which do not work well with Unicode text such as Devanagari.
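To see why \b misfires here, it can help to check which code points Python's built-in re module treats as word characters. The sketch below assumes CPython 3 with str patterns; the helper name describe is only illustrative. On that setup the Devanagari consonants should report as matching \w, while the virama and the dependent vowel signs (combining marks, categories Mn and Mc) should not, so \b finds a "boundary" in the middle of a word.

```python
import re
import unicodedata

def describe(text):
    """Return (char, Unicode category, matches-word-class) per non-space char."""
    return [
        (ch, unicodedata.category(ch), re.match(r"\w", ch) is not None)
        for ch in text
        if not ch.isspace()
    ]

# Print one line per code point of a word from the question.
for ch, category, is_word in describe("कम्पनी"):
    print(repr(ch), category, is_word)
```

Any code point reported as not matching \w creates a \b position next to an adjacent consonant, which is where the partial matches in the question come from.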
Check: Python unicode regular expression matching failing with some unicode characters -bug or mistake?
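If the words are always separated by whitespace, a regex may not be needed at all. The following is a minimal sketch under that assumption; the function name remove_stopwords is hypothetical and not part of any library.

```python
def remove_stopwords(sentence, stopwords):
    """Drop whitespace-delimited tokens that exactly equal a stopword."""
    stopword_set = set(stopwords)  # set membership checks are fast
    return [token for token in sentence.split() if token not in stopword_set]

stopwords = ['कम', 'र', 'छ']
print(remove_stopwords('रामको कम्पनी छ', stopwords))
```

Because the comparison is on whole tokens, a stopword that is only a prefix of a longer word is left alone. Punctuation attached to a token (for example a trailing danda) would need to be stripped separately.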