Error when creating a simple custom dynamic tokenizer in Python

Question

I am trying to create a dynamic tokenizer, but it does not work as intended.

Below is my code:

import re

def tokenize(sent):

  splitter = re.findall("\W",sent)
  splitter = list(set(splitter))

  for i in sent:
    if i in splitter:
      sent.replace(i, "<SPLIT>"+i+"<SPLIT>")

  sent.split('<SPLIT>')
  return sent


sent = "Who's kid are you? my ph. is +1-6466461022.Bye!"

tokens = tokenize(sent)

print(tokens)

This does not work!

I expected it to return the following list:

["Who", "'s", "kid", "are", "you","?", "my" ,"ph",".", "is", "+","1","-",6466461022,".","Bye","!"]

Answer 1

You can use

[x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x]

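See the regex demo. The pattern matches:

( - Group 1 (these matches appear in the resulting list because the text is captured into a group):
- [^'\w\s] - any character other than ', a word character, or a whitespace character
- | - or
- '(?![^\W\d_]) - a ' not immediately followed by a letter ([^\W\d_] matches any Unicode letter)
- | - or
- (?<![^\W\d_])' - a ' not immediately preceded by a letter
) - end of the group
| - or
(?='(?<=[^\W\d_]')(?=[^\W\d_])) - the position right before a ' that is enclosed with letters
| - or
\s+ - one or more whitespace characters.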
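The letter class [^\W\d_] is the least obvious building block here, so a minimal sketch of what it accepts may help. The sample string and the name `letter` are made up for illustration; with Python 3 str patterns, \w is Unicode-aware by default.

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor an underscore",
# which leaves only letters (Unicode letters included for str patterns).
letter = re.compile(r"[^\W\d_]")

letters_found = letter.findall("a1_é-Z")
print(letters_found)  # ['a', 'é', 'Z']

# The lookahead from the pattern: an apostrophe NOT followed by a letter.
trailing = re.findall(r"'(?![^\W\d_])", "Who's 'Bye!'")
print(trailing)  # ["'"]  (only the closing quote qualifies)
```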

See the Python demo:

import re

sents = ["Who's kid are you? my ph. is +1-6466461022.Bye!", "Who's kid are you? my ph. is +1-6466461022.'Bye!'"]
for sent in sents:
    print( [x for x in re.split(r"([^'\w\s]|'(?![^\W\d_])|(?<![^\W\d_])')|(?='(?<=[^\W\d_]')(?=[^\W\d_]))|\s+", sent) if x] )

# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', 'Bye', '!']
# => ['Who', "'s", 'kid', 'are', 'you', '?', 'my', 'ph', '.', 'is', '+', '1', '-', '6466461022', '.', "'", 'Bye', '!', "'"]
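As a side note on why the code in the question prints the sentence unchanged: Python strings are immutable, so str.replace and str.split return new objects rather than modifying sent, and the question's code discards both results. A minimal sketch with a shortened, made-up input:

```python
sent = "a,b"

# str.replace returns a NEW string; without an assignment the result is lost.
sent.replace(",", "<SPLIT>,<SPLIT>")
print(sent)  # a,b

# Binding the results to names is what makes the calls take effect.
marked = sent.replace(",", "<SPLIT>,<SPLIT>")
pieces = marked.split("<SPLIT>")
print(pieces)  # ['a', ',', 'b']
```

Even with the assignments fixed, the loop in the question would run the replacement once per occurrence of a repeated character, so a single re.split call as above is the simpler route.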

Answer 2

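If it weren't for the special handling of ', this would be trivial. I assume you are doing NLP, so you want to take into account which "side" the ' belongs to. For example, "tryin'" should not be split, and neither should "'tis" (it is).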

import re


def tokenize(sent):
    split_pattern = r"(\w+')(?:\W+|$)|('\w+)|(?:\s+)|(\W)"
    return [word for word in re.split(split_pattern, sent) if word]

sent = (
    "Who's kid are you? my ph. is +1-6466461022.Bye!",
    "Tryin' to show how the single quote can belong to either side",
    "'tis but a regex thing + don't forget EOL testin'",
    "You've got to love regex"
)

for item in sent:
    print(tokenize(item))

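Python's re library tries the alternatives of a pattern containing | from left to right, and alternation is not greedy: it stops at the first alternative that matches, even if that is not the longest possible match.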
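A tiny illustration of that ordering, using throwaway patterns that are not part of the tokenizer:

```python
import re

# Alternatives are tried left to right; the first one that matches wins.
first = re.match(r"a|ab", "ab").group()
print(first)   # a   ("ab" is never attempted)

# Putting the longer alternative first changes the result.
second = re.match(r"ab|a", "ab").group()
print(second)  # ab
```

This is why the order of the alternatives in split_pattern matters: the apostrophe cases have to come before the catch-all (\W).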

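In addition, a feature of re.split() is that you can use capturing groups to keep the patterns/matches you split on (otherwise the string is split and the text matched at each split point is discarded).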
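A short sketch of that behaviour, with illustrative inputs only. It also shows why the result is filtered: when a pattern has several groups, the groups that did not take part in a match show up as None, and adjacent delimiters produce empty strings.

```python
import re

# Without a capturing group the delimiter is dropped.
plain = re.split(r"-", "1-646")
print(plain)  # ['1', '646']

# With a capturing group the delimiter is kept in the result.
kept = re.split(r"(-)", "1-646")
print(kept)   # ['1', '-', '646']

# Several groups: non-participating groups yield None, hence the `if x` filter.
multi = re.split(r"(-)|(\+)", "+1-646")
print(multi)  # ['', None, '+', '1', '-', None, '646']
cleaned = [x for x in multi if x]
print(cleaned)  # ['+', '1', '-', '646']
```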

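Pattern breakdown:

(\w+')(?:\W+|$) - a word followed by ' with no word character immediately after it. E.g. "tryin'", "testin'". The non-word characters that follow are not captured.
('\w+) - a ' followed by at least one word character. Matches "'t" and "'ve" in "don't" and "they've", respectively.
(?:\s+) - split on any whitespace, but discard the whitespace itself
(\W) - split on all non-word characters (no need to bother finding the subset that occurs in the string itself)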
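To see the four alternatives at work, here is the same tokenize function run on a few short inputs. The inputs are made up for illustration, and the function is repeated only so the snippet runs on its own.

```python
import re


def tokenize(sent):
    # Same pattern as above; order of alternatives matters.
    split_pattern = r"(\w+')(?:\W+|$)|('\w+)|(?:\s+)|(\W)"
    # re.split yields None / '' for unused groups and adjacent splits; drop them.
    return [word for word in re.split(split_pattern, sent) if word]


contraction = tokenize("You've got to love regex")
print(contraction)  # ['You', "'ve", 'got', 'to', 'love', 'regex']

trailing_quote = tokenize("Tryin' to")
print(trailing_quote)  # ["Tryin'", 'to']

leading_quote = tokenize("'tis but")
print(leading_quote)  # ["'tis", 'but']

# Caveat: the first alternative consumes the \W+ after the apostrophe
# without capturing it, so punctuation right after it is dropped.
dropped = tokenize("goin'!")
print(dropped)  # ["goin'"]
```

If punctuation directly after a trailing apostrophe matters for your data, that last case is worth checking before relying on the pattern.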

