Write filtered ngrams into outfile - list of lists

I have extracted trigrams matching a specific pattern from a bunch of HTML files. When I print them, I get a list of lists (where each line is one trigram). I want to write them to an outfile for further text analysis, but when I try, only the first trigram gets written. How can I get all the trigrams into the output file (the list of trigram lists)? Ideally, I would like all the trigrams merged into one list, rather than many lists each holding a single trigram. Any help is much appreciated.

My code so far looks like this:

from nltk import sent_tokenize, word_tokenize
from nltk import ngrams
from bs4 import BeautifulSoup
from string import punctuation
import glob
import sys
punctuation_set = set(punctuation) 

# Open and read file
text = glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*')   
for filename in text:
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f:
        mytext = f.read()

# Extract text from HTML using BeautifulSoup
soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()
extracted_text = extracted_text.replace('\n', '')

# Split the text in sentences (using the NLTK sentence splitter) 
sentences = sent_tokenize(extracted_text)

# Create a flat list of tokens (after pre-processing: punctuation removal, tokenization)
all_tokens = []

for sent in sentences:
    sent = "".join([char for char in sent if not char in punctuation_set]) # remove punctuation from sentence (optional; comment out if necessary)
    tokenized_sent = word_tokenize(sent) # split sentence into tokens (using NLTK word tokenization)
    all_tokens.extend(tokenized_sent) # add this sentence's tokens to the flat list

n=3
threegrams = ngrams(all_tokens, n)


# Find ngrams with specific pattern
for (first, second, third) in threegrams: 
    if first == "a":
        if second.endswith("bb") and second.startswith("leg"):
            print(first, second, third)

First, the punctuation removal could have been simpler; see Removing a list of characters in string:

>>> from string import punctuation
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> text.translate(str.maketrans('', '', punctuation))
'The lazy birds flew over the rainbow Well not have known'

But removing punctuation before tokenization is not really the right thing to do; you can see that We'll -> Well, which is presumably undesirable.

This is probably a better way:

>>> from nltk import sent_tokenize, word_tokenize
>>> [[word for word in word_tokenize(sent) if word not in punctuation] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]

But note that the idiom above does not handle multi-character punctuation tokens.

For example, we see that word_tokenize() changes " -> `` (and the closing quote to ''), and the idiom above does not remove them:

>>> sent = 'He said, "There is no room for room"'
>>> word_tokenize(sent)
['He', 'said', ',', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
>>> [word for word in word_tokenize(sent) if word not in punctuation]
['He', 'said', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]

To work around this, explicitly turn punctuation into a list and append the multi-character punctuation tokens to it:

>>> sent = 'He said, "There is no room for room"'
>>> punctuation
'!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'
>>> list(punctuation)
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> list(punctuation) + ['...', '``', "''"]
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\', ']', '^', '_', '`', '{', '|', '}', '~', '...', '``', "''"]
>>> p = list(punctuation) + ['...', '``', "''"]
>>> [word for word in word_tokenize(sent) if word not in p]
['He', 'said', 'There', 'is', 'no', 'room', 'for', 'room']

As for getting the document stream (what you call all_tokens), here is a neat way to get it:

>>> from nltk import sent_tokenize, word_tokenize
>>> from string import punctuation
>>> p = list(punctuation) + ['...', '``', "''"]
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
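
If you then want those per-sentence lists flattened into a single token stream (your all_tokens, merged into one list as you asked), a minimal sketch using itertools.chain.from_iterable:

>>> from itertools import chain
>>> tokenized_sents = [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)]
>>> list(chain.from_iterable(tokenized_sents))
['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow', 'We', "'ll", 'not', 'have', 'known']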

Now, on to the actual part of your question.

Rather than checking strings inside each ngram, what you really want is regular-expression pattern matching.

You want to find the pattern \ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b; see https://regex101.com/r/zBVgp4/4

>>> import re
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This is a legobatmanbb cave hahaha")
['a legobatmanbb cave']
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This isa legobatmanbb cave hahaha")
[]
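
If you still want the three tokens separately, like the (first, second, third) tuples your loop produced, here is a sketch of the same pattern with capture groups added (the groups are my addition, not part of the pattern linked above):

>>> re.findall(r"\b(a)\b\s\b(leg[\w]+bb)\b\s\b([\w]+)\b", "This is a legobatmanbb cave hahaha")
[('a', 'legobatmanbb', 'cave')]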

Now, to write strings to a file, you can use this idiom (see https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function):

with open('filename.txt', 'w') as fout:
    print('Hello World', end='\n', file=fout)
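
Equivalently, and this is just a plain-stdlib alternative rather than anything your code requires, you can call the file object's write() method yourself; unlike print(), it does not add a newline for you:

with open('filename.txt', 'w') as fout:
    fout.write('Hello World\n')  # write() needs an explicit newline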

In fact, if you are only interested in the ngrams and not in the tokens themselves, there is no need to filter or tokenize the text at all ;P

You can simplify your code to this:

import re

soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()
pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"

with open('filename.txt', 'w') as fout:
    for interesting_ngram in re.findall(pattern, extracted_text):
        print(interesting_ngram, end='\n', file=fout)
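
Putting it together with your glob loop, a sketch that collects the matches from all files into one list and writes them to a single output file (the directory path and filename.txt are just the placeholders from your own code):

import glob
import re

from bs4 import BeautifulSoup

pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"
all_matches = []  # interesting ngrams from all files, merged into one list

for filename in glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*'):
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f:
        soup = BeautifulSoup(f.read(), "lxml")
    all_matches.extend(re.findall(pattern, soup.getText()))

with open('filename.txt', 'w') as fout:
    for interesting_ngram in all_matches:
        print(interesting_ngram, file=fout)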