Find capitalized words in a text
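How can I find the words that start with a capital letter, together with each word's position in the text? If no such word is found, the output should be None. Words at the beginning of a sentence should not be considered. Numbers should not be considered, and if a word ends with a semicolon, the semicolon should be omitted.
For example:
Input: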
The University of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929.
Output:
2:University
4:Edinburgh
11:Edinburgh
12:Scotland
14:University
16:Texas
21:Association
23:American
24:Universities
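Try this code.
You just need to call the .istitle() method on each string to check whether it starts with a capital letter and the rest is lowercase.
With a regular expression you can then extract the word without its trailing symbol (assuming, as you mentioned, that symbols should not be included, so that a semicolon at the end of a word is ignored).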
import re

inp = 'The University; of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929'

def capitalized_words_in_a_text(inp):
    # Skip the first word of the text, so positions start at 2.
    lst = inp.split(' ')[1:]
    # re.search keeps only the letters, dropping trailing symbols such as ';' or ','.
    res = [f"{i}: {re.search(r'[A-Za-z]+', j).group()}" for i, j in enumerate(lst, start=2) if j.istitle()]
    if len(res) == 0:
        return None
    return '\n'.join(res)

print(capitalized_words_in_a_text(inp))
print(capitalized_words_in_a_text(inp.lower()))
Output:
2: University
4: Edinburgh
11: Edinburgh
12: Scotland
13: The
14: University
16: Texas
21: Association
23: American
24: Universities
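None  # from the inp.lower() call, since that string contains no capital letters
Note that "13: The" is a sentence-initial word: this version only skips the first word of the whole text. Let me know if it doesn't work...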
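One caveat with the approach above: str.istitle() is stricter than "starts with a capital letter", because it also requires the remaining cased characters to be lowercase. A minimal sketch of the difference follows; the helper name starts_with_capital and the sample tokens are made up for this illustration and are not part of the answer.

```python
def starts_with_capital(token: str) -> bool:
    """Return True if the first character of token is an uppercase letter."""
    return bool(token) and token[0].isupper()

# Hypothetical sample tokens chosen to show where the two checks disagree.
samples = ["University;", "McDonald", "USA", "1929", "university"]

# Pair each sample with the verdict of both checks: (istitle, first letter upper).
comparison = {s: (s.istitle(), starts_with_capital(s)) for s in samples}

for token, (title, first_upper) in comparison.items():
    print(f"{token!r}: istitle={title}, first letter upper={first_upper}")
```

Mixed-case names and acronyms pass the first-letter check but fail istitle(), so which test is appropriate depends on whether such words should be reported.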
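Here is the code. You can add any other characters to strip, and they will be removed from the end of the word. You can also change the final print to whatever you like.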
s1 = "The University of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929."
words = s1.split()
n = []
for index, word in enumerate(words):
    if word[0].isupper():
        # Skip the first word of the text and any word whose predecessor ends in a ".".
        if index == 0 or words[index - 1].endswith("."):
            continue
        print(f"{index + 1}:{word.strip(',.;:')}")  # enumerate is 0-based, so add one to get the numbers you requested
        n.append(word)  # this is just to be able to print something if no words have capital letters
if len(n) == 0:
    print("None")
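A small detail about the strip call used above: str.strip removes the listed characters from both ends of the word, while str.rstrip touches only the end. A minimal sketch, using a made-up token and the constant name PUNCTUATION purely for illustration:

```python
PUNCTUATION = ",.;:"  # same character set as in the answer above

word = ";Scotland."  # hypothetical token with punctuation on both sides

both_ends = word.strip(PUNCTUATION)   # trims leading and trailing characters
end_only = word.rstrip(PUNCTUATION)   # trims trailing characters only
inner_kept = "U.S.".strip(PUNCTUATION)  # characters inside the word are never removed

print(both_ends)
print(end_only)
print(inner_kept)
```

For the example text either method gives the same result, since the punctuation only ever appears at the end of a word.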
The words at the beginning of the sentence should not be considered
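This makes the task harder, because you first have to determine how sentences are separated. A sentence can end with a punctuation mark such as ., ! or ?. But the last sentence in your example does not end with a period, so your corpus has to be preprocessed first!
Setting that issue aside, assume the following scenario: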
import re

inp = "The University of Edinburgh is a public research university in Edinburgh, Scotland. The University of Texas was included in the Association of American Universities in 1929! The last Sentence."
# Split the text into sentences; each one must end with ., ! or ?
sentences = re.findall(r"[\w\s,]*[.!?]", inp)
counter = 0  # number of words seen in earlier sentences
for sentence in sentences:
    sentence = re.sub(r"\W", " ", sentence)  # replace punctuation with spaces
    sentence = re.sub(r"\s+", " ", sentence)
    words = re.split(r"\s", sentence)
    words = [w for w in words if w != ""]
    for i, word in enumerate(words):
        # i == 0 is the first word of the sentence, which is skipped
        if i != 0:
            if re.match(r"[A-Z]", word):  # the word starts with a capital letter
                print("%d:%s" % (counter + i + 1, word))
    counter += len(words)
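This code does what you want. It is not best practice, but it is compact and simple. Note that every sentence in the input must end with a punctuation mark!
Output: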
2:University
4:Edinburgh
11:Edinburgh
12:Scotland
14:University
16:Texas
21:Association
23:American
24:Universities
29:Sentence
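The ideas from the answers above can be combined into one function that also covers the "print None" requirement and does not depend on the last sentence ending with punctuation. This is only a sketch under stated assumptions: the names capitalized_words and report are invented here, a sentence is assumed to end wherever a token ends with ., ! or ?, and abbreviations such as "Dr." would therefore be misread as sentence ends.

```python
def capitalized_words(text: str) -> list[tuple[int, str]]:
    """Return (position, word) pairs for capitalized words that do not start a sentence."""
    results = []
    sentence_start = True  # the very first token opens a sentence
    for position, token in enumerate(text.split(), start=1):
        word = token.strip(",.;:!?")  # drop surrounding punctuation
        if word and word[0].isupper() and not sentence_start:
            results.append((position, word))
        # Assumption: the next token starts a sentence if this one ends with ., ! or ?
        sentence_start = token.endswith((".", "!", "?"))
    return results


def report(text: str) -> str:
    """Format the pairs as 'position:word' lines, or 'None' when there are none."""
    pairs = capitalized_words(text)
    return "\n".join(f"{p}:{w}" for p, w in pairs) if pairs else "None"


text = (
    "The University of Edinburgh is a public research university in Edinburgh, Scotland. "
    "The University of Texas was included in the Association of American Universities in 1929."
)
print(report(text))
print(report(text.lower()))
```

Returning the pairs from one function and formatting them in another keeps the detection logic testable on its own; numbers such as 1929 are skipped automatically because a digit is never uppercase.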