Python

Question

I want to edit my text like this:

arr = [] 
# arr is full of tokenized words from my text

For example:

"Abraham Lincoln Hotel is very beautiful place and i want to go there with
 Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."

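Edit: basically, I want to detect proper names and group them by using istitle() and isalpha() in a for statement, like this:

for word in arr:
    if word.istitle() and word.isalpha():
        ...  # group this word with the neighbouring capitalized words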

In this example, words keep being appended until the next word's first letter is not uppercase.

arr[0:3] = [" ".join(arr[0:3])]   # merge arr[0], arr[1] and arr[2] into a single entry
# 'Abraham Lincoln Hotel'

This is the new arr I want:

['Abraham Lincoln Hotel'] is very beautiful place and i want to go there with ['Barbara Palvin']. ['Also'] there are stores like ['Adidas'], ['Nike'], ['Reebok'].

"Also" is not a problem for me; it will be useful when I try to match against my dataset.

Answer 1

Is this what you are asking for?

sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."

chars = ".!?,"                                   # Characters you want to remove from the words in the array

table = str.maketrans(chars, " " * len(chars))   # Create a table for replacing characters
sentence = sentence.translate(table)             # Replace characters with spaces

arr = sentence.split()                           # Split the string into an array wherever whitespace occurs

print(arr)

The output is:

['Abraham',
 'Lincoln',
 'Hotel',
 'is',
 'very',
 'beautiful',
 'place',
 'and',
 'i',
 'want',
 'to',
 'go',
 'there',
 'with',
 'Barbara',
 'Palvin',
 'Also',
 'there',
 'are',
 'stores',
 'like',
 'Adidas',
 'Nike',
 'Reebok']

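A note about this code: any character in the chars variable is removed from the strings in the array (each one is replaced with a space before splitting). The code comments explain each step.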
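As a quick check of the translation step, here is a minimal sketch on a shorter string. The sample text is made up for illustration. `str.maketrans` maps each character in `chars` to a space, so punctuation attached to a word disappears once the string is split.

```python
chars = ".!?,"
table = str.maketrans(chars, " " * len(chars))  # map each punctuation mark to a space

sample = "Hello, world! Is this Nike?"          # hypothetical sample text
tokens = sample.translate(table).split()        # split() collapses the extra spaces

print(tokens)  # ['Hello', 'world', 'Is', 'this', 'Nike']
```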

To remove the words that are not names, just do the following:

import string
new_arr = []

for i in arr:
    if i[0] in string.ascii_uppercase:
        new_arr.append(i)

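This code keeps every word that starts with an uppercase letter, including words that are capitalized only because they begin a sentence, such as "Also".

To fix this, change chars to: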

chars = ","

and change the code above to:

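import string
new_arr = []
end = ".!?"

for idx, word in enumerate(arr):
    # A word starts a sentence when the previous word ends with ., ! or ?
    starts_sentence = idx > 0 and arr[idx - 1][-1] in end
    if word[0] in string.ascii_uppercase and not starts_sentence:
        new_arr.append(word)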

This will output:

['Abraham', 
'Lincoln', 
'Hotel', 
'Barbara', 
'Palvin.', 
'Adidas', 
'Nike',
'Reebok.']
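The list above still holds the names as separate words, some with punctuation attached. To merge neighbouring capitalized words into one entry, as the question asks, one possible approach is to keep punctuation marks as their own tokens and group runs with `itertools.groupby`. This is only a sketch: the regular expression and the variable names `tokens` and `grouped` are assumptions made here, not part of the answer above.

```python
import re
from itertools import groupby

sentence = ("Abraham Lincoln Hotel is very beautiful place and i want to go there with "
            "Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok.")

# Words and punctuation marks become separate tokens, so '.' and ',' break a run of names
tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", sentence)

# groupby yields runs of adjacent tokens that share the same key (title-case or not)
grouped = [
    " ".join(run)
    for is_name, run in groupby(tokens, key=str.istitle)
    if is_name
]

print(grouped)
# ['Abraham Lincoln Hotel', 'Barbara Palvin', 'Also', 'Adidas', 'Nike', 'Reebok']
```

Because the punctuation tokens are never title-case, they end a group on their own, which keeps "Also" apart from "Barbara Palvin" and the brand names apart from each other.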

Answer 2

You can do it like this:

sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas, Nike, Reebok."
all_words = sentence.split()
last_word_index = -100
proper_nouns = []
for idx, word in enumerate(all_words):
    if word.istitle() and word.isalpha():
        if last_word_index == idx - 1:
            proper_nouns[-1] = proper_nouns[-1] + " " + word
        else:
            proper_nouns.append(word)
        last_word_index = idx
print(proper_nouns)

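This code will:

Split all the words into a list
Iterate over all the words and
- if the previous word was also a capitalized word, append the current word to the last entry in the list
- otherwise store the word as a new entry in the list
- record the index at which the last capitalized word was found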
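One caveat with the steps above: words such as 'Palvin.' or 'Adidas,' fail the isalpha() test because of the attached punctuation, so they are skipped. A possible variant, sketched below under that assumption, strips punctuation before testing each word and treats punctuation as the end of a group. The names `raw` and `run_open` are introduced here for illustration only.

```python
import string

sentence = ("Abraham Lincoln Hotel is very beautiful place and i want to go there with "
            "Barbara Palvin. Also there are stores like Adidas, Nike, Reebok.")

proper_nouns = []
run_open = False  # True while the next capitalized word may join the last entry

for raw in sentence.split():
    word = raw.strip(string.punctuation)          # 'Palvin.' -> 'Palvin'
    if word.istitle() and word.isalpha():
        if run_open and raw[0] not in string.punctuation:
            proper_nouns[-1] += " " + word        # extend the current group
        else:
            proper_nouns.append(word)             # start a new group
        # Trailing punctuation such as ',' or '.' closes the group
        run_open = raw[-1] not in string.punctuation
    else:
        run_open = False

print(proper_nouns)
# ['Abraham Lincoln Hotel', 'Barbara Palvin', 'Also', 'Adidas', 'Nike', 'Reebok']
```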

Python - Group Sequential Array Members

nlp

nltk

stanford-nlp

opennlp