How to create windows/chunks from a list of sentences?
I have a list of sentences and I want to create skipgrams (window size = 3), but I do not want the windows to cross sentence boundaries, since the sentences are unrelated.
So, given the following sentences:
[["my name is John"] , ["This PC is black"]]
the triplets should be:
[my name is]
[name is john]
[this PC is]
[PC is black]
What is the best way to do this?
Here is a simple function:
def skipgram(corpus, window_size=3):
    sg = []
    for sent in corpus:
        # each corpus item is a one-element list holding the sentence string
        tokens = sent[0].split()
        if len(tokens) <= window_size:
            # sentence no longer than the window: keep it whole
            sg.append(tokens)
        else:
            # slide a window of length window_size over the tokens
            for i in range(0, len(tokens) - window_size + 1):
                sg.append(tokens[i: i + window_size])
    return sg

corpus = [["my name is John"], ["This PC is black"]]
skipgram(corpus)
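The index-based loop above can also be written with `zip` over shifted slices, which avoids the index arithmetic entirely. A sketch of an equivalent generator (the name `sentence_windows` is just illustrative):

```python
def sentence_windows(corpus, window_size=3):
    """Yield window_size-token windows per sentence, never crossing sentences."""
    for sent in corpus:
        tokens = sent[0].split()
        if len(tokens) <= window_size:
            # short sentence: emit it whole, as a single tuple
            yield tuple(tokens)
        else:
            # zipping window_size shifted views of the token list
            # produces every contiguous window of that length
            yield from zip(*(tokens[i:] for i in range(window_size)))

corpus = [["my name is John"], ["This PC is black"]]
print(list(sentence_windows(corpus)))
# [('my', 'name', 'is'), ('name', 'is', 'John'), ('This', 'PC', 'is'), ('PC', 'is', 'black')]
```

Being a generator, it also lets you iterate over windows lazily instead of building the full list up front.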
Try this!
from nltk import ngrams

def generate_ngrams(sentences, window_size=3):
    for sentence in sentences:
        # produce ngrams per sentence, so windows never cross sentence boundaries
        yield from ngrams(sentence[0].split(), window_size)

sentences = [["my name is John"], ["This PC is black"]]
for c in generate_ngrams(sentences, 3):
    print(c)
#output:
('my', 'name', 'is')
('name', 'is', 'John')
('This', 'PC', 'is')
('PC', 'is', 'black')
You don't really want a skipgram per se; you want chunks of a given size. Try this:
from lazyme import per_chunk
tokens = "my name is John".split()
list(per_chunk(tokens, 2))
[Output]:
[('my', 'name'), ('is', 'John')]
If you want a rolling window, i.e. ngrams:
from lazyme import per_window
tokens = "my name is John".split()
list(per_window(tokens, 2))
[Output]:
[('my', 'name'), ('name', 'is'), ('is', 'John')]
Similarly, for ngrams in NLTK:
from nltk import ngrams
tokens = "my name is John".split()
list(ngrams(tokens, 2))
[Output]:
[('my', 'name'), ('name', 'is'), ('is', 'John')]
If you want actual skipgrams:
from nltk import skipgrams
tokens = "my name is John".split()
list(skipgrams(tokens, n=2, k=1))
[Output]:
[('my', 'name'),
('my', 'is'),
('name', 'is'),
('name', 'John'),
('is', 'John')]
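To keep skipgrams from crossing sentence boundaries, as the question asks, the same per-sentence pattern from the earlier answers applies: call `skipgrams` on each sentence separately. A sketch (the wrapper name `per_sentence_skipgrams` is just illustrative):

```python
from nltk.util import skipgrams

def per_sentence_skipgrams(corpus, n=2, k=1):
    # run skipgrams over each sentence on its own,
    # so no pair ever spans two sentences
    for sent in corpus:
        yield from skipgrams(sent[0].split(), n, k)

corpus = [["my name is John"], ["This PC is black"]]
print(list(per_sentence_skipgrams(corpus)))
```

Each sentence contributes the same five pairs shown above for "my name is John", and no pair mixes tokens from different sentences.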