Find capitalized words in a text

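How can I find the words that start with a capital letter, together with each word's position in the text? If no word with that property is found in the text, the output should be None. Words at the beginning of a sentence should not be considered. Numbers should not be considered, and if a word ends with a semicolon, the semicolon should be omitted.

As in the example below:

Input: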

The University of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929.

Output:

2:University
4:Edinburgh
11:Edinburgh
12:Scotland
14:University
16:Texas
21:Association
23:American
24:Universities

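Try this code.

You just need to call the .istitle() method on each string to check whether it starts with an uppercase letter and the rest is lowercase.

With a regular expression you can then extract the word without any trailing symbol (assuming, as you mentioned, that you want to drop symbols such as a semicolon at the end of a word).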

import re

inp = 'The University; of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929'

def capitalized_words_in_a_text(inp):
    words = inp.split()
    res = []
    # Start at the second word; the first word always begins a sentence.
    for i, word in enumerate(words[1:], start=2):
        # Skip a word that begins a sentence (the previous word ends with ., ! or ?).
        if words[i - 2][-1] in '.!?':
            continue
        if word.istitle():
            # Keep only the leading letters, dropping trailing punctuation.
            res.append(f"{i}: {re.match(r'[A-Za-z]+', word).group()}")

    if len(res) == 0:
        return None
    return '\n'.join(res)

print(capitalized_words_in_a_text(inp))
print(capitalized_words_in_a_text(inp.lower()))

Output:

2: University
4: Edinburgh
11: Edinburgh
12: Scotland
14: University
16: Texas
21: Association
23: American
24: Universities
None # this comes from the inp.lower() call, as there are no capital letters in that string

Let me know if it doesn't work...
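A quick check of what .istitle() accepts may be useful here, because it is stricter than "starts with a capital letter". The sample words below are made up for illustration, and the helper name starts_with_capital is hypothetical, not something from the answer above.

```python
# Illustrative words only; they are not taken from the question's input.
samples = ["University", "university", "USA", "McDonald", "Edinburgh,", "1929"]

def starts_with_capital(word):
    """Hypothetical helper: True if the first character is an uppercase letter."""
    return word[:1].isupper()

# Print each word with both checks side by side.
for word in samples:
    print(word, word.istitle(), starts_with_capital(word))
```

Words such as "USA" and "McDonald" fail .istitle() because an uppercase letter follows another letter, so if acronyms or mixed-case names should count, checking only the first character may be the better fit.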

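Here is the code. You can add any other characters to strip, and they will be removed from the end of the word. You can also change the final print to whatever you want.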

s1 = "The University of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929."

words = s1.split()
n = []

for index, word in enumerate(words):
    if word[0].isupper():
        # skip the first word of the text and any word whose previous word ends in "."
        if index == 0 or words[index - 1][-1] == ".":
            continue
        print(f"{index + 1}:{word.strip(',.;:')}")  # enumerate is zero-based, so add one to get the numbers you requested
        n.append(word)  # this is just to be able to print something if no words have capital letters
if len(n) == 0:
    print("None")
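To see what strip does with a set of characters: it removes any of them from both ends of the word, not only the end, and leaves the inside untouched. This is a minimal sketch; using string.punctuation as the character set is my assumption (the code above strips only ",.;:"), and the helper name clean is hypothetical.

```python
import string

def clean(word, chars=string.punctuation):
    """Hypothetical helper: drop the given characters from both ends of a word."""
    return word.strip(chars)

print(clean("Edinburgh,"))    # Edinburgh
print(clean("University;"))   # University
print(clean("(Scotland)."))   # Scotland
print(clean("well-known"))    # well-known (the inner hyphen is kept)
```

If only trailing characters should go, rstrip takes the same argument.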

The words at the beginning of the sentence should not be considered

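This makes the task harder, because you first have to decide how sentences are separated. A sentence can end with a punctuation mark such as ., ! or ?. But the last sentence in your example does not end with a period, so your corpus has to be preprocessed first!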
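One way that preprocessing could look, as a rough sketch: add a closing period when the text has no terminal punctuation, then cut after every ., ! or ?. The function name split_sentences is made up for this example, and the rule is naive (an abbreviation such as "Dr." would be treated as a sentence end too).

```python
import re

def split_sentences(text):
    """Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace."""
    text = text.strip()
    if text and text[-1] not in ".!?":
        text += "."  # assumption: a missing terminator is treated as a period
    # Split on the whitespace that follows a sentence-ending mark.
    return re.split(r"(?<=[.!?])\s+", text)

for sentence in split_sentences("The University of Texas is big. Is it old? Yes"):
    print(sentence)
```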


Setting that problem aside, assume the following scenario:

import re

inp = "The University of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929! The last Sentence."

# Grab each sentence up to and including its closing ., ! or ?
sentences = re.findall(r"[\w\s,]*[.!?]", inp)
counter = 0  # number of words seen in earlier sentences
for sentence in sentences:
    # Replace punctuation with spaces, then split into words.
    sentence = re.sub(r"\W", " ", sentence)
    words = sentence.split()
    for i, word in enumerate(words):
        # Skip the first word of each sentence; report words that start with a capital letter.
        if i != 0 and re.match(r"[A-Z]", word):
            print("%d:%s" % (counter + i + 1, word))
    counter += len(words)

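This code does exactly what you want. It is not best practice, but it is compact and simple. Note that every sentence in the input must first end with a punctuation mark!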


Output:

2:University
4:Edinburgh
11:Edinburgh
12:Scotland
14:University
16:Texas
21:Association
23:American
24:Universities
29:Sentence
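For comparison, here is one possible way to pull the ideas from this thread into a single pass that returns the matches instead of printing them, which makes the None case easy to handle. It is only a sketch: it assumes sentences end in ., ! or ? and that ",.;:!?" are the only characters to strip, and the function name find_capitalized is hypothetical.

```python
def find_capitalized(text):
    """Return (position, word) pairs for capitalized words that do not start a sentence."""
    found = []
    sentence_start = True  # the very first word starts a sentence
    for position, token in enumerate(text.split(), start=1):
        word = token.strip(",.;:!?")
        if word[:1].isupper() and not sentence_start:
            found.append((position, word))
        # The next token starts a sentence if this one ends with a terminator.
        sentence_start = token[-1] in ".!?"
    return found

text = ("The University of Edinburgh is a public research university in "
        "Edinburgh, Scotland. The University of Texas was included in the "
        "Association of American Universities in 1929.")

matches = find_capitalized(text)
if matches:
    for position, word in matches:
        print(f"{position}:{word}")
else:
    print("None")
```

Numbers drop out on their own because a digit is never uppercase, and a word like "University;" is reported without its semicolon.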