Generating n-grams from a string
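For each integer n from 1 to M, I need to build a list of all n-grams of that length, starting from the beginning of the string, and then return a tuple containing the M lists.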
def letter_n_gram_tuple(s, M):
    s = list(s)
    output = []
    for i in range(0, M+1):
        output.append(s[i:])
    return(tuple(output))
The output of letter_n_gram_tuple("abcd", 3) should be:
(['a', 'b', 'c', 'd'], ['ab', 'bc', 'cd'], ['abc', 'bcd'])
However, my output is:
(['a', 'b', 'c', 'd'], ['b', 'c', 'd'], ['c', 'd'], ['d'])
Should I use string slicing and then save the slices to a list?
You can use nested loops: the outer loop iterates over the n-gram size, and the inner loop slices the string.
def letter_n_gram_tuple(s, M):
    output = []
    for i in range(1, M + 1):
        gram = []
        for j in range(0, len(s)-i+1):
            gram.append(s[j:j+i])
        output.append(gram)
    return tuple(output)
Or as a one-line list comprehension:
output = [[s[j:j+i] for j in range(0, len(s)-i+1)] for i in range(1, M + 1)]
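Or use windowed from more_itertools, joining each window back into a string because windowed yields tuples of characters:
import more_itertools
# Assumes M <= len(s): windowed pads a too-short input with None, which join rejects.
output = [[''.join(w) for w in more_itertools.windowed(s, i)] for i in range(1, M + 1)]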
Test and output:
print(letter_n_gram_tuple("abcd", 3))
(['a', 'b', 'c', 'd'], ['ab', 'bc', 'cd'], ['abc', 'bcd'])
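As a small sketch of why the question's attempt produced suffixes rather than n-grams: s[i:] has only a start index, so it runs to the end of the string, whereas an n-gram needs both a start and a stop, s[j:j+n]. The helper name n_grams_of_size below is hypothetical, introduced only for illustration.

```python
def n_grams_of_size(s, n):
    """Return every contiguous substring of s with length n (hypothetical helper)."""
    # The last valid start index is len(s) - n; range() is empty if n > len(s).
    return [s[j:j + n] for j in range(len(s) - n + 1)]


s = "abcd"
suffixes = [s[i:] for i in range(3)]   # open-ended slices: suffixes of s
bigrams = n_grams_of_size(s, 2)        # fixed-width windows: n-grams of s
print(suffixes)  # ['abcd', 'bcd', 'cd']
print(bigrams)   # ['ab', 'bc', 'cd']
```

The nested-loop answer above is this helper called once for each size from 1 to M.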
You also need another for loop to iterate over the letters of the str:
def letter_n_gram_tuple(s, M):
    output = []
    for i in range(0, M):
        vals = [s[j:j+i+1] for j in range(len(s)) if len(s[j:j+i+1]) == i+1]
        output.append(vals)
    return tuple(output)
print(letter_n_gram_tuple("abcd", 3))
Output:
(['a', 'b', 'c', 'd'], ['ab', 'bc', 'cd'], ['abc', 'bcd'])
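The length check in that comprehension matters because a Python slice never raises when its stop index runs past the end of the string; it silently returns a shorter piece. A minimal illustration, with variable names chosen only for this sketch:

```python
s = "abcd"
n = 3

# Slicing from every start index: the last windows are truncated, not rejected.
unfiltered = [s[j:j + n] for j in range(len(s))]
# Keeping only full-width slices leaves the real n-grams.
filtered = [g for g in unfiltered if len(g) == n]

print(unfiltered)  # ['abc', 'bcd', 'cd', 'd']
print(filtered)    # ['abc', 'bcd']
```

Bounding the start index with range(len(s) - n + 1) instead avoids building the truncated slices in the first place.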
Use the following function:
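def letter_n_gram_tuple(s, M):
    # Start with the 1-grams, then build each level from the previous one.
    output = [list(s)]
    for _ in range(M - 1):
        prev = output[-1]
        # Extend each gram with the last character of its right-hand neighbour.
        # (Merging via set() would drop repeated letters, e.g. in "aab".)
        output.append([a + b[-1] for a, b in zip(prev, prev[1:])])
    return tuple(output)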
Now:
print(letter_n_gram_tuple('abcd',3))
Returns:
(['a', 'b', 'c', 'd'], ['ab', 'bc', 'cd'], ['abc', 'bcd'])
def n_grams(word, max_size):
    i = 1
    output = []
    while i <= max_size:
        index = 0
        innerArray = []
        while index < len(word) - i + 1:
            innerArray.append(word[index:index + i])
            index += 1
        i += 1
        output.append(innerArray)
    return tuple(output)
print(n_grams("abcd",3))
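For comparison, here is a standard-library sketch that reuses the suffix slices from the question: zipping n staggered copies of the string yields the n-grams, because zip stops at the shortest input. The function name letter_n_grams_zip is hypothetical and not part of any answer above.

```python
def letter_n_grams_zip(s, M):
    """Return a tuple of M lists; list n-1 holds all n-grams of s (hypothetical name)."""
    grams = []
    for n in range(1, M + 1):
        # s[0:], s[1:], ..., s[n-1:] are n copies of s, each shifted by one more.
        shifted = [s[k:] for k in range(n)]
        # zip stops at the shortest copy, so every tuple is a full-width window.
        grams.append([''.join(chars) for chars in zip(*shifted)])
    return tuple(grams)


print(letter_n_grams_zip("abcd", 3))
# (['a', 'b', 'c', 'd'], ['ab', 'bc', 'cd'], ['abc', 'bcd'])
```

When n exceeds len(s), one of the shifted copies is empty, so the list for that size is simply empty.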