Finding names in a list of bigrams?

Question

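I have a text file that I am processing, and I want to tokenize every word but keep names together, for example 'John Smith'.

I would like to use nltk.bigrams for this. If I use it and get a list of bigrams, how would I search that list for bigrams in which both words start with a capital letter?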

bigrams = list(nltk.bigrams(text))

Answer 1

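list(filter(lambda L: L[0][0].isupper() and L[1][0].isupper(), nltk.bigrams(text)))

Edit: As an explanation, list(filter(lambda x: f(x), my_list)) keeps the elements of my_list for which f(x) is true. Here I filter the bigrams produced by nltk.bigrams(text), keeping only those in which both words start with a capital letter.

(Since each element L of list(nltk.bigrams(text)) is a tuple of two words, I check whether the first character of L[0] and of L[1] is uppercase.)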
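
For anyone who wants to try the filter without installing NLTK, here is a minimal sketch. It assumes the text has already been split into a list of word tokens, and it uses zip(tokens, tokens[1:]) as a stand-in for nltk.bigrams(tokens). The function name capitalized_bigrams is made up for this example.

```python
def capitalized_bigrams(tokens):
    """Return the adjacent word pairs in which both words start with an uppercase letter."""
    # zip(tokens, tokens[1:]) yields the same adjacent pairs as nltk.bigrams(tokens)
    pairs = zip(tokens, tokens[1:])
    # w[:1] avoids an IndexError on empty strings, since "".isupper() is False
    return [(a, b) for a, b in pairs if a[:1].isupper() and b[:1].isupper()]


tokens = "I met John Smith and Jane Doe in Paris".split()
print(capitalized_bigrams(tokens))
# [('John', 'Smith'), ('Jane', 'Doe')]
```

A list comprehension is used here instead of filter with a lambda; the two are equivalent, and which one reads better is a matter of taste.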

Answer 2

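IIUC, you want to split the sentence into words but keep names (two consecutive words that each start with a capital letter) together?

You can use a small regex: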

import re

text = 'sentence where John Smith and Jane Doe are mentioned, here a Capital word alone'
re.findall(r'[A-Z]\w+\s[A-Z]\w+|\w+', text)

Output:

['sentence',
 'where',
 'John Smith',
 'and',
 'Jane Doe',
 'are',
 'mentioned',
 'here',
 'a',
 'Capital',
 'word',
 'alone']

Applying bigrams:

[list(nltk.bigrams(x)) for x in re.findall(r'[A-Z]\w+\s[A-Z]\w+|\w+', text)]

Output:

[[('s', 'e'),
  ('e', 'n'),
  ('n', 't'),
  ('t', 'e'),
  ('e', 'n'),
  ('n', 'c'),
  ('c', 'e')],
 [('w', 'h'), ('h', 'e'), ('e', 'r'), ('r', 'e')],
 [('J', 'o'),
  ('o', 'h'),
  ('h', 'n'),
  ('n', ' '),
  (' ', 'S'),
  ('S', 'm'),
  ('m', 'i'),
  ('i', 't'),
  ('t', 'h')],
 [('a', 'n'), ('n', 'd')],
 [('J', 'a'),
  ('a', 'n'),
  ('n', 'e'),
  ('e', ' '),
  (' ', 'D'),
  ('D', 'o'),
  ('o', 'e')],
 [('a', 'r'), ('r', 'e')],
 [('m', 'e'),
  ('e', 'n'),
  ('n', 't'),
  ('t', 'i'),
  ('i', 'o'),
  ('o', 'n'),
  ('n', 'e'),
  ('e', 'd')],
 [('h', 'e'), ('e', 'r'), ('r', 'e')],
 [],
 [('C', 'a'), ('a', 'p'), ('p', 'i'), ('i', 't'), ('t', 'a'), ('a', 'l')],
 [('w', 'o'), ('o', 'r'), ('r', 'd')],
 [('a', 'l'), ('l', 'o'), ('o', 'n'), ('n', 'e')]]
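
Because each token is a string, nltk.bigrams(x) pairs up its characters, which is why the output above consists of letter pairs. If word-level pairs are wanted instead, one possible approach is to build the bigrams over the token list itself. The sketch below assumes that is the goal and uses zip in place of nltk.bigrams so that it runs without NLTK; word_bigrams is a name chosen for this example.

```python
import re

text = 'sentence where John Smith and Jane Doe are mentioned, here a Capital word alone'

# Same regex as above: two adjacent capitalized words stay together as one token
tokens = re.findall(r'[A-Z]\w+\s[A-Z]\w+|\w+', text)

# Pair each token with its successor (equivalent to list(nltk.bigrams(tokens)))
word_bigrams = list(zip(tokens, tokens[1:]))

print(word_bigrams[:4])
# [('sentence', 'where'), ('where', 'John Smith'), ('John Smith', 'and'), ('and', 'Jane Doe')]
```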

Answer 3

If this is what you want:

import nltk
nltk.download('punkt')

text = "My name is Shaida Muhammad and I'm not an extremist"
text = nltk.word_tokenize(text)
bigrams = list(nltk.bigrams(text)) 


for first_word, second_word in bigrams:
    if first_word.istitle() and second_word.istitle():
        print(first_word, second_word)    # It will output Shaida Muhammad
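
The same istitle() check could also be used to produce the token list the question asks for, with names kept together. This is only a sketch: merge_names is a hypothetical helper, it joins at most two consecutive title-cased tokens, and the token list is written out by hand (as nltk.word_tokenize is assumed to split the sentence) so that the example runs without NLTK.

```python
def merge_names(tokens):
    """Join each pair of consecutive title-cased tokens into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Look one token ahead; merge only when both tokens are title-cased
        if i + 1 < len(tokens) and tokens[i].istitle() and tokens[i + 1].istitle():
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2  # skip the token that was just merged
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hand-written tokens for "My name is Shaida Muhammad and I'm not an extremist"
tokens = ["My", "name", "is", "Shaida", "Muhammad", "and", "I", "'m", "not", "an", "extremist"]
print(merge_names(tokens))
# ['My', 'name', 'is', 'Shaida Muhammad', 'and', 'I', "'m", 'not', 'an', 'extremist']
```

Keep in mind that this heuristic also merges any capitalized sentence-initial word with a following capitalized word, so it can produce false positives such as 'Yesterday John'.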
