How to remove specific words that start with "[" from an array?

Question

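I have an array containing many sentences. I split these sentences into words and built another array from them. I want to remove from that array the ID tokens that start with "[" and end with "]".

For example: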

import numpy as np
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])
z = np.array(sentences)

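sentence: [42] On 20 January 1987, he also turned out as substitute for Imran Khan's side in an exhibition game at Brabourne Stadium in Bombay, to mark the golden jubilee of Cricket Club of India.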

words = z[0].split()
words= list(words)
print(words)

After splitting into words: ['[42]', 'On', '20', 'January', '1987,', 'he', 'also', 'turned', 'out', 'as', 'substitute', 'for', 'Imran', "Khan's", 'side', 'in', 'an', 'exhibition', 'game', 'at', 'Brabourne', 'Stadium', 'in', 'Bombay,', 'to', 'mark', 'the', 'golden', 'jubilee', 'of', 'Cricket', 'Club', 'of', 'India.']

Now I want to remove [42] from the array and then join the words back into a sentence. How can I do this? I tried the following, but it does not work: it removes the entire array and prints None.

for i in words:
  if i[0]=="[":
    b=words.remove(i)
    print(b)
  else:
    print("")

Answer 1

You could use a list comprehension, as follows:

sentence = "[42] On 20 January 1987, he also turned out as substitute for Imran Khan's side in an exhibition game at Brabourne Stadium in Bombay, to mark the golden jubilee of Cricket Club of India."
words = sentence.split()
# Drop only the words that both start with '[' and end with ']'
words = [w for w in words if not (w[0] == '[' and w[-1] == ']')]
filtered = ' '.join(words)
print(filtered)
"On 20 January 1987, he also turned out as substitute for Imran Khan's side in an exhibition game at Brabourne Stadium in Bombay, to mark the golden jubilee of Cricket Club of India."

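As background on why the loop in the question prints None: list.remove() modifies the list in place and returns None, so assigning its result to a variable never yields the cleaned list. Removing items from a list while iterating over it can also skip elements. A minimal sketch of both points, where strip_bracketed is a hypothetical helper name used only for illustration:

```python
def strip_bracketed(words):
    """Return a new list without tokens such as '[42]' (hypothetical helper)."""
    return [w for w in words if not (w.startswith("[") and w.endswith("]"))]


words = ["[42]", "On", "20", "January", "1987,"]

# list.remove() changes the list in place and returns None,
# so the value stored in `result` is always None.
demo = list(words)
result = demo.remove("[42]")
print(result)

# Building a new list avoids mutating `words` during iteration.
cleaned = strip_bracketed(words)
print(" ".join(cleaned))
```

Building a new list instead of mutating the one being iterated keeps the loop predictable and leaves the original words untouched.
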
Answer 2

Using a regular expression (no need to split the sentence):

import re
sentence = "[42] On 20 January 1987, he also turned out as substitute for Imran Khan's side in an exhibition game at Brabourne Stadium in Bombay, to mark the golden jubilee of Cricket Club of India."
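# Non-greedy match removes each bracketed marker, plus the whitespace after it
filtered = re.sub(r'\[.+?\]\s*', '', sentence)
print(filtered)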
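
One caveat with a greedy pattern such as \[.+\]: if a sentence contains more than one bracketed marker, the match runs from the first "[" to the last "]" and swallows the text in between. The sketch below compares it with a non-greedy variant; the sample sentence is made up for illustration and is not from the question.

```python
import re

# Hypothetical sentence with two bracketed markers, for illustration only.
sentence = "[42] On 20 January 1987 [note] he also turned out as substitute."

# Greedy: one match spanning from the first "[" to the last "]".
greedy = re.sub(r"\[.+\]", "", sentence)

# Non-greedy: each marker is matched separately, along with trailing whitespace.
non_greedy = re.sub(r"\[.+?\]\s*", "", sentence)

print(repr(greedy))
print(repr(non_greedy))
```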

Answer 3

I suggest implementing a matching function that identifies the words you want to filter out:

def check_word(word: str) -> bool:
    """Returns True iff the word starts with [ and ends with ]."""

    return word.startswith('[') and word.endswith(']')

You can then combine it with itertools.filterfalse(), which leaves you with an iterator object.

from itertools import filterfalse


def check_word(word: str) -> bool:
    """Returns True iff the word starts with [ and ends with ]."""

    return word.startswith('[') and word.endswith(']')


filtered_words = filterfalse(check_word, words)

If you need to iterate over them more than once, you can convert the iterator to a sequence, such as a list:

from itertools import filterfalse


def check_word(word: str) -> bool:
    """Returns True iff the word starts with [ and ends with ]."""

    return word.startswith('[') and word.endswith(']')


filtered_words = list(filterfalse(check_word, words))
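
To get back to a sentence, as the question asks, the filtered words can be passed straight to str.join(), which accepts any iterable of strings. A short end-to-end sketch, using a shortened version of the sample sentence as an assumption:

```python
from itertools import filterfalse


def check_word(word: str) -> bool:
    """Returns True iff the word starts with [ and ends with ]."""
    return word.startswith("[") and word.endswith("]")


# Shortened sample sentence, for illustration only.
sentence = "[42] On 20 January 1987, he also turned out as substitute."
words = sentence.split()

# filterfalse keeps the words for which check_word returns False.
filtered_sentence = " ".join(filterfalse(check_word, words))
print(filtered_sentence)
```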

python

text-processing