Splitting a list into sublists by a given separator in Python
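I'm trying to build n-grams that don't cross a period. split() only works on strings, and list[index] only works with indices. Is there a way to access/split/divide a list by giving it a string/an element? Here is a snippet of my current function: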
text = ["split","this","stuff",".","my","dear"]

def generate_ngram(rawlist, ngram_order):
    """
    Input: List of words or characters, ngram-order ["this", "is", "an", "example"], 2
    Output: Set of tuples or words or characters {("this", "is"),("is","an"),...}
    """
    list_of_tuples = []
    for i in range(0, len(rawlist) - ngram_order + 1):
        ngram_order_index = i + ngram_order
        generated_ngram = rawlist[i : ngram_order_index]
        #if "." in generated_ngram:
            #generated_ngram . . .
        generated_tuple = tuple(generated_ngram)
        list_of_tuples.append(generated_tuple)
    return set(list_of_tuples)

generate_ngram(text,3)
This currently returns:
{('.', 'my', 'dear'),
('stuff', '.', 'my'),
('split', 'this', 'stuff'),
('this', 'stuff', '.')}
But ideally it should return:
{('split', 'this', 'stuff'),
('this', 'stuff', '.')}
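Any idea how to achieve this? Thanks for your help!
I'm not sure this is exactly what you need, but this function generates n-grams that may contain a stop word (here, the period) only at the end: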
STOPWORDS = {"."}

def generate_ngram(rawlist, ngram_order):
    # All ngrams
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    # Generate only those ngrams that do not contain stop words before the end
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))

text = ["split", "this", "stuff", ".", "my", "dear"]
print(*generate_ngram(text, 3), sep="\n")
# ('split', 'this', 'stuff')
# ('this', 'stuff', '.')
print(*generate_ngram(text, 2), sep="\n")
# ('split', 'this')
# ('this', 'stuff')
# ('stuff', '.')
# ('my', 'dear')
Note that this function returns a generator. If needed, you can turn it into a list by wrapping it in list(...), or simply iterate over it directly.
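To illustrate the note about generators, here is a small sketch that materializes the result as a list and as a set (the question's original function returned a set). The function is repeated so the snippet runs on its own; the variable names ngram_list, ngram_set, first_pass and second_pass are only illustrative.

```python
STOPWORDS = {"."}

def generate_ngram(rawlist, ngram_order):
    # Same function as above, repeated so this snippet is self-contained
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))

text = ["split", "this", "stuff", ".", "my", "dear"]

# Wrap the generator in list(...) to keep the n-grams in order
ngram_list = list(generate_ngram(text, 3))

# Wrap it in set(...) to get a set, as the question's function did
ngram_set = set(generate_ngram(text, 3))

# A generator can only be consumed once: a second pass yields nothing
gen = generate_ngram(text, 3)
first_pass = list(gen)
second_pass = list(gen)
```

If you need to go over the n-grams more than once, store the list or set rather than the generator itself.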
Edit: You may find the equivalent syntax below more readable.
def generate_ngram(rawlist, ngram_order):
    # Iterate over all ngrams
    for ngram in zip(*(rawlist[i:] for i in range(ngram_order))):
        # Yield only those not containing stop words before the end
        if not any(w in STOPWORDS for w in ngram[:-1]):
            yield ngram
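For completeness, the title's literal request (splitting the list into sublists at a separator) can also be sketched directly. This is a different approach from the filtering above, not a rewrite of it: it first cuts the list after every separator and then builds n-grams inside each piece. The names split_by_separator and ngrams_per_segment are made up for this example, and the choice to keep the separator at the end of its sublist is an assumption made to match the desired output in the question.

```python
def split_by_separator(items, separator="."):
    """Split items into sublists, keeping each separator at the end of its sublist."""
    sublists, current = [], []
    for item in items:
        current.append(item)
        if item == separator:
            # Close the current sublist right after the separator
            sublists.append(current)
            current = []
    if current:
        # Trailing elements that were not followed by a separator
        sublists.append(current)
    return sublists

def ngrams_per_segment(items, ngram_order, separator="."):
    """Yield n-grams that never extend past a separator."""
    for segment in split_by_separator(items, separator):
        # zip stops at the shortest slice, so segments shorter than
        # ngram_order simply contribute no n-grams
        yield from zip(*(segment[i:] for i in range(ngram_order)))

text = ["split", "this", "stuff", ".", "my", "dear"]
segments = split_by_separator(text)
trigrams = list(ngrams_per_segment(text, 3))
bigrams = list(ngrams_per_segment(text, 2))
```

For the example list this gives the sublists ['split', 'this', 'stuff', '.'] and ['my', 'dear'], and the same trigrams and bigrams as the filtering version above.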