How to split a string using word length as a token

Question

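I am using Python 3.

I am preparing strings that contain document titles, to be used as search terms on the US patent website.

1) Keeping long phrases is beneficial, but

2) searches perform poorly when they include many words of 3 characters or fewer, so I need to eliminate those words.

I have tried to split on one- to three-letter words with the regex "\b\w[1:3}\b *", with and without the trailing space, but without success. Then again, I am not a regex expert.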

for pubtitle in df_tpdownloads['PublicationTitleSplit']:
    pubtitle = pubtitle.lower() # make lower case
    pubtitle = re.split("[?:.,;\"\'\-()]+", pubtitle) # tokenize and remove punctuation
    #print(pubtitle)

    for subArray in pubtitle:
        print(subArray)
        subArray = subArray.strip()
        subArray = re.split("(\b\w{1:3}\b) *", subArray) # split on words that are < 4 letters
        print(subArray)

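The code above iterates over a pandas Series and strips out the punctuation, but it fails to split on word length.

I would like to see something like the examples below.

Examples:

So,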

" and training requirements for selected salt applications"

becomes

['training requirements', 'selected salt applications'].

And,

"december 31"

becomes

['december'].

And,

"experimental system for salt in an emergence research and applications in process heat"

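becomes

['experimental system', 'salt', 'emergence research', 'applications', 'process heat'].

But the split does not catch the short words, and I cannot tell whether the problem is in the regex, in the re.split call, or in both.

I could probably use a brute-force approach, but I would like an elegant solution. Any help would be appreciated.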

Answer 1

You can use

list(filter(None, re.split(r'\s*\b\w{1,3}\b\s*|[^\w\s]+', pubtitle.strip().lower())))

to get the result you want. See the regex demo.

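The r'\s*\b\w{1,3}\b\s*|[^\w\s]+' regex splits the lowercased string (via .lower()), with leading and trailing whitespace removed (via .strip()), into tokens that contain no punctuation ([^\w\s]+ takes care of that) and no words of 1 to 3 characters (\s*\b\w{1,3}\b\s* takes care of that). Adjacent matches leave empty strings in the result of re.split, which filter(None, ...) removes.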

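Pattern details

\s* - 0 or more whitespace characters
\b - a word boundary
\w{1,3} - 1, 2, or 3 word characters (if you do not want to match _, use [^\W_]{1,3} instead)
\b - a word boundary
\s* - 0 or more whitespace characters
| - or
[^\w\s]+ - 1 or more characters other than word and whitespace characters.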
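For comparison, here is a minimal sketch of why the pattern from the question's code did not split anything. It is an illustration added here, not part of the original demo, and the sample text is made up. Three things appear to be involved: in a non-raw string "\b" is the backspace character rather than a word boundary, {1:3} is not a valid quantifier so Python's re matches it literally, and a capturing group makes re.split return the matched words as well.

```python
import re

text = "system for salt in an emergence"  # hypothetical sample input

# Without the r prefix, "\b" is a backspace character, not a word boundary.
print("\b" == "\x08")   # True
print(r"\b" == "\\b")   # True: the raw string keeps the backslash

# "{1:3}" is not a quantifier, so it is treated as literal text and nothing splits.
broken = re.split(r"\b\w{1:3}\b *", text)
print(broken)

# With "{1,3}" the short words are removed, but the kept pieces retain a
# trailing space because the pattern does not consume leading whitespace.
fixed = re.split(r"\b\w{1,3}\b *", text)
print(fixed)

# A capturing group puts the matched short words back into the result.
captured = re.split(r"(\b\w{1,3}\b) *", text)
print(captured)
```

The leftover trailing spaces and empty strings in the second result are the reason the answer's pattern wraps the short word in \s* on both sides and passes the result through filter(None, ...).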

See the Python demo:

import re

df_tpdownloads = [" and training requirements for selected salt applications",
                  "december 31",
                  "experimental system for salt in an emergence research and applications in process heat"]

#for pubtitle in df_tpdownloads['PublicationTitleSplit']:
for pubtitle in df_tpdownloads:
    result = list(filter(None, re.split(r'\s*\b\w{1,3}\b\s*|[^\w\s]+', pubtitle.strip().lower())))
    print(result)

Output:

['training requirements', 'selected salt applications']
['december']
['experimental system', 'salt', 'emergence research', 'applications', 'process heat']
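If the length threshold needs to vary, the same idea could be wrapped in a small helper. This is only a sketch under assumptions: the function name split_on_short_words and the max_len parameter are hypothetical and do not come from the answer above. It also strips each piece, because a punctuation match followed by a space would otherwise leave a leading space on the next token.

```python
import re

def split_on_short_words(title, max_len=3):
    """Split a title into phrases, dropping punctuation and words of at most max_len characters.

    Hypothetical helper; max_len=3 reproduces the pattern used in the answer.
    """
    # Doubled braces produce literal { and } inside the f-string, e.g. \w{1,3}.
    pattern = rf"\s*\b\w{{1,{max_len}}}\b\s*|[^\w\s]+"
    pieces = re.split(pattern, title.strip().lower())
    # Drop empty pieces and trim whitespace left next to punctuation.
    return [piece.strip() for piece in pieces if piece.strip()]

titles = [" and training requirements for selected salt applications",
          "december 31",
          "experimental system for salt in an emergence research and applications in process heat"]

for title in titles:
    print(split_on_short_words(title))
```

With a pandas Series such as the one in the question, something like df_tpdownloads['PublicationTitleSplit'].apply(split_on_short_words) should apply the helper to every title, assuming every value in that column is a string.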

python

regex

split

string-length