Regex: find the part of a sentence written entirely in capital letters

Question

I need your help.

I am currently working with this piece of code:

    altbaslik = []
    for line in sentenceIndex:
        finded = re.match(r"\w*[A-Z]\w*[A-Z]\w*|[Ö|Ç|Ş|Ü|Ğ|İ]", line)
        if finded != None:
          finded2 = finded.group()
          altbaslik.append(finded2)


    print(altbaslik)

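sentenceIndex is a list. It contains the tokenized sentences of a paragraph. For example:

Example paragraph:

VODOFONE ARENA CHANCE But the most important point is that Murat Çetinkaya was elected with the alliance of President Erdoğan and Prime Minister Davutoğlu. I will describe the process in detail. I will even talk about the criteria that led the President and the Prime Minister to agree on the same name. But there is one thing I cannot convey. The fate of the central bank governor was decided on the journey between Dolmabahçe and Vodafone Arena.

sentenceIndex: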

['VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta Murat Çetinkaya, Cumhurbaşkanı Erdoğan ve Başbakan Davutoğlu’nun ittifakıyla seçildi.','.................','...... ..']

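I need a regex that finds the words in a sentence that are written entirely in capital letters.

"VODOFONE ARENA ŞANSI" is the part I need to find and extract. The regex I am currently using does not work, and I need help with it.

Note: [Ö|Ç|Ş|Ü|Ğ|İ] I am working with Turkish text, which is why I also need to account for these letters.

Thanks to everyone who takes the time to help me with this :)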

Answer 1

You can use re.findall with the following pattern:

r'\b[A-ZÖÇŞÜĞİ]+(?:\W+[A-ZÖÇŞÜĞİ]+)*\b'

With the PyPI regex library (install it with pip install regex), you can use:

r'\b\p{Lu}+(?:\W+\p{Lu}+)*\b'

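See the regex demo.

Details

\b - word boundary
[A-ZÖÇŞÜĞİ]+ - one or more uppercase letters (Basic Latin and Turkish); \p{Lu} matches any Unicode uppercase letter
(?:\W+[A-ZÖÇŞÜĞİ]+)* - zero or more repetitions of:
- \W+ - one or more non-word characters
- [A-ZÖÇŞÜĞİ]+ - one or more uppercase letters (Basic Latin and Turkish); \p{Lu} matches any Unicode uppercase letter
\b - word boundary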
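
As a rough illustration, the breakdown above can be written out with re.VERBOSE so that each piece of the pattern carries its own comment. This is only a sketch using the standard re module; the names UPPER_RUN and sample are made up for this example and do not come from the question.

```python
import re

# UPPER_RUN is a hypothetical name; the pattern is the one described above,
# spread over several lines with re.VERBOSE.
UPPER_RUN = re.compile(
    r"""
    \b                  # word boundary
    [A-ZÖÇŞÜĞİ]+        # one or more uppercase letters (Basic Latin + Turkish)
    (?:                 # non-capturing group, repeated zero or more times:
        \W+             #   one or more non-word characters
        [A-ZÖÇŞÜĞİ]+    #   one or more uppercase letters
    )*
    \b                  # word boundary
    """,
    re.VERBOSE,
)

sample = "VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta"
print(UPPER_RUN.findall(sample))  # ['VODOFONE ARENA ŞANSI']
```

The trailing \b is what keeps the capital "A" of "Ama" out of the result: a lowercase letter follows it, so there is no word boundary and the engine backtracks to the end of "ŞANSI".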

See the Python demo:

import re

altbaslik=[]
sentenceIndex = ['VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta Murat Çetinkaya, Cumhurbaşkanı Erdoğan ve Başbakan Davutoğlu’nun ittifakıyla seçildi.','...................','.................']
for line in sentenceIndex:
    found = re.findall(r"\b[A-ZÖÇŞÜĞİ]+(?:\W+[A-ZÖÇŞÜĞİ]+)*\b", line)
    if len(found):
        altbaslik.extend(found)

print(altbaslik) # => ['VODOFONE ARENA ŞANSI']

Alternatively, with the PyPI regex module:

import regex

altbaslik=[]
sentenceIndex = ['VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta Murat Çetinkaya, Cumhurbaşkanı Erdoğan ve Başbakan Davutoğlu’nun ittifakıyla seçildi.','...................','.................']
for line in sentenceIndex:
    found = regex.findall(r'\b\p{Lu}+(?:\W+\p{Lu}+)*\b', line)
    if len(found):
        altbaslik.extend(found)

print(altbaslik) # => ['VODOFONE ARENA ŞANSI']
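
For context on why the pattern in the question falls short, here is a small sketch using only the standard re module. The variable names line, question_pattern and first are chosen for this example. It shows two things: re.match only tries at the start of the string and \w* cannot cross a space, so at most the first word comes back; and inside a character class the | is a literal pipe rather than alternation.

```python
import re

line = "VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta"

# The pattern from the question, unchanged.
question_pattern = r"\w*[A-Z]\w*[A-Z]\w*|[Ö|Ç|Ş|Ü|Ğ|İ]"

# re.match anchors at position 0 and returns a single match object;
# \w* stops at the first space, so only the first word is captured.
first = re.match(question_pattern, line)
print(first.group())  # VODOFONE

# Inside [...] the | is matched literally, so [Ö|Ç] also matches "|".
print(re.findall(r"[Ö|Ç]", "Ö|Ç"))  # ['Ö', '|', 'Ç']
```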

Answer 2

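To avoid having to list every variant of an uppercase character, install and use the newer regex module. It is very similar to the (for now) default re module but has much better support for Unicode properties.

For example, to find any uppercase character you can use the Unicode property \p{Lu}: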

import regex

text = 'VODOFONE ARENA ŞANSI Ama asıl önemli olan nokta Murat Çetinkaya, ' \
       'ΘΑΥΜΑΣΙΟΣ Cumhurbaşkanı Erdoğan ve Başbakan Davutoğlu’nun ittifakıyla seçildi.'

found = regex.findall(r'\b\p{Lu}+(?: \p{Lu}+)*\b', text)
print(found)  # => ['VODOFONE ARENA ŞANSI', 'ΘΑΥΜΑΣΙΟΣ']
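
One difference between the two answers is the separator allowed between uppercase words: Answer 1 uses \W+ (any run of non-word characters), while Answer 2 uses a single space. The sketch below compares the two. It assumes the explicit Turkish character class in place of \p{Lu} so that it runs with the standard re module alone, and the names UPPER, any_separator and single_space are invented for this example.

```python
import re

# Explicit uppercase class, swapped in for \p{Lu} so that stdlib re suffices.
UPPER = "[A-ZÖÇŞÜĞİ]+"

any_separator = re.compile(rf"\b{UPPER}(?:\W+{UPPER})*\b")  # separator as in Answer 1
single_space = re.compile(rf"\b{UPPER}(?: {UPPER})*\b")     # separator as in Answer 2

# The comma after ARENA is added here purely to show the difference.
text = "VODOFONE ARENA, ŞANSI Ama asıl önemli olan nokta"

print(any_separator.findall(text))  # ['VODOFONE ARENA, ŞANSI']
print(single_space.findall(text))   # ['VODOFONE ARENA', 'ŞANSI']
```

Which one fits better depends on whether punctuation inside a heading should split it into separate matches.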

Tags: python, regex, text-extraction, nltk, stringtokenizer