How to remove a custom word pattern from a text using NLTK with Python

Question

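I am currently working on a project that analyzes the quality of examination paper questions. I am using Python 3.4 and NLTK.
As a first step, I want to pull each question out of the text separately. The question paper format is shown below.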

 (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........

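Now I want to extract the questions one by one, without the question numbers (the question number format is always the same as shown above). So my result should look like this: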

 What is web 3.0?
 Explain about blogs.
 What is mean by semantic web?

How can I do this with Python 3.4 and NLTK?
Thanks

Answer 1

If every sentence starts with this pattern, what you are asking for is easy to parse. You can use split to remove the prefix:

sentences = ["(Q1). What is web 3.0?",
             "(Q2). Explain about blogs.",
             "(Q3). What is mean by semantic web?"]
for sen in sentences:
    # Split once on the first "). " and keep what follows the label
    print(sen.split('). ', 1)[1])

This will print:

What is web 3.0?
Explain about blogs.
What is mean by semantic web?
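
Note that `sen.split('). ', 1)[1]` raises an IndexError on any line that does not contain the `"). "` separator, such as a blank line or the trailing "and so on" line in the question. A minimal sketch of a guarded variant follows; the helper name `strip_label` and its fallback behavior (return the line unchanged apart from surrounding whitespace) are assumptions made for illustration, not part of the answer above.

```python
def strip_label(line, sep="). "):
    """Return the text after the first `sep`, or the stripped line if `sep` is absent."""
    parts = line.split(sep, 1)
    # parts has two items only when the separator was found
    return parts[1] if len(parts) == 2 else line.strip()


sentences = ["(Q1). What is web 3.0?",
             "(Q2). Explain about blogs.",
             "(Q3). What is mean by semantic web?"]
stripped = [strip_label(sen) for sen in sentences]
for sen in stripped:
    print(sen)
```

Whether an unlabeled line should be kept, skipped, or treated as an error depends on the input; the fallback here only avoids the exception.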

Answer 2

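You probably need to detect the lines that contain a question, then extract the question and drop the question number. The regular expression that detects the question label is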

qnum_pattern = r"^\s*\(Q\d+\)\.\s+"

You can use it to extract the questions like this:

import re

# Keep only the lines that carry a question label, with the label removed
questions = [re.sub(qnum_pattern, "", line) for line in text
             if re.search(qnum_pattern, line)]

Obviously, text must be a list of lines or a file opened for reading.
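
As a rough end-to-end sketch of the regex approach, the snippet below runs the same pattern over the sample paper from the question, held in a string. The names `QNUM_PATTERN`, `extract_questions`, and `paper` are made up for this example, and precompiling the pattern with `re.compile` is just one reasonable choice.

```python
import re

# Same pattern as above, compiled once: optional leading whitespace,
# a label such as "(Q12).", then at least one whitespace character.
QNUM_PATTERN = re.compile(r"^\s*\(Q\d+\)\.\s+")


def extract_questions(lines):
    """Return the question text of every line that starts with a (Qn). label."""
    questions = []
    for line in lines:
        if QNUM_PATTERN.search(line):
            # Remove the label, then drop any trailing newline or spaces
            questions.append(QNUM_PATTERN.sub("", line).rstrip())
    return questions


paper = """ (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........
"""

questions = extract_questions(paper.splitlines())
for question in questions:
    print(question)
```

Lines without a label, like the final "and so on" line, are skipped. Because the function accepts any iterable of lines, an open file object should work in place of `paper.splitlines()`; the `rstrip()` call is there because lines read from a file keep their trailing newline.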

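That said, if you had no idea how to approach this problem, you have your work cut out for you with the rest of the project. I recommend spending some time with the Python tutorial or other introductory material.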

Answer 3

If the (QX) label is always separated from the question text by a space, you can do this:

>>> text = """(Q1). What is web 3.0?
...  (Q2). Explain about blogs.
...  (Q3). What is mean by semantic web?"""
>>> for line in text.split('\n'):
...     print(line.strip().partition(' ')[2])
... 
What is web 3.0?
Explain about blogs.
What is mean by semantic web?
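
One caveat with `partition(' ')` is that it removes the first word of every line, whether or not that word is a question label. A line such as "and so on ........" would come out as "so on ........". The sketch below adds a simple check on the shape of the first token; the function name `question_text` and the choice to return None for unlabeled lines are assumptions for illustration.

```python
def question_text(line):
    """Return the text after a leading "(Q...)." token, or None if there is no such token."""
    label, _, rest = line.strip().partition(" ")
    # Accept the line only when its first token looks like "(Q1)."
    if label.startswith("(Q") and label.endswith(")."):
        return rest.strip()
    return None


text = """(Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........"""

results = [q for q in map(question_text, text.split("\n")) if q is not None]
for q in results:
    print(q)
```

This check is looser than the regular expression in Answer 2 (it does not verify that digits follow the Q), so it is only suitable when the input is known to be well formed.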

python

regex

nlp

tokenize

nltk