How to remove inception words in a list?

Given a list of words that includes "inception" words, how do I remove the inception words? How do I find the larger words they are part of?

Let's define an inception word as a word that appears as part of a larger word in the same list.

Task

To make it very clear: if a list contains ['a', 'b', 'a b c'], remove 'a' and 'b', because the list has a larger element that contains both 'a' and 'b'.

Example 1, [in]:

[u'dose rate', u'object', u'dose', u'rate', u'computation']

[out]:

[u'dose rate', u'object', u'computation']

Example 2, [in]:

[u'shift', u'magnetic', u'system', u'magnetic sensor', u'phase shift', u'phase', u'output', u'sensor', u'sensing', u'sensor system']

Because 'magnetic', 'sensor', 'system', 'magnetic sensor' and 'sensor system' are all present, either of the following is acceptable:

Desired output, [out]:

[u'system', u'magnetic sensor', u'phase shift', u'output', u'sensing']

or [out]:

[u'magnetic', u'phase shift', u'output', u'sensing', u'sensor system']

I have tried the following, but it does not give the desired output:

>>> ls = [u'dose rate', u'object', u'dose', u'rate', u'computation']
>>> set([i for i in ls for j in ls if i!=j or i not in j])
set([u'dose rate', u'object', u'rate', u'computation', u'dose'])
>>> set([j for i in ls for j in ls if i!=j or i not in j])
set([u'rate', u'object', u'dose rate', u'computation', u'dose'])
>>> set([j for j in ls for i in ls if i!=j or i not in j])
set([u'dose rate', u'object', u'rate', u'computation', u'dose'])

Given a list of words

>>> words = [u'dose rate', u'object', u'dose', u'rate', u'computation']

and a definition of an inception word:

>>> inception = lambda x: any(x in w for w in words if len(x) < len(w))

we can construct a list of 'non inception words' like this:

>>> [w for w in words if not inception(w)]
[u'dose rate', u'object', u'computation']
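
One thing to be aware of: `x in w` is a substring test, so 'rate' would also count as part of 'crate'. If that is not wanted, a minimal sketch that compares whole space-separated tokens could look like the following. The name `non_inception_words` and the token-subset rule are assumptions made for illustration, not something the question specifies, and the rule ignores token order. It is written for Python 3, so results print without the u prefix.

```python
def non_inception_words(words):
    """Keep entries whose tokens are not a proper subset of another entry's tokens."""
    token_sets = [set(w.split()) for w in words]

    def is_inception(i):
        # proper-subset test on whole tokens, so 'rate' is not "inside" 'crate'
        return any(token_sets[i] < other
                   for j, other in enumerate(token_sets) if j != i)

    return [w for i, w in enumerate(words) if not is_inception(i)]


print(non_inception_words(['dose rate', 'object', 'dose', 'rate', 'computation']))
# ['dose rate', 'object', 'computation']
print(non_inception_words(['rate', 'crate']))
# ['rate', 'crate']
```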

So to satisfy the first example, you can do something like this,

>>> words = [u'dose rate', u'object', u'dose', u'rate', u'computation']
>>> [w1 for w1 in words if not any(w1 in w2 for w2 in words if w2 != w1)]
[u'dose rate', u'object', u'computation']

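But your second example shows that the requirement is a bit more complicated: the same small word cannot be used more than once to make up a larger string.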
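
To see this on the second example, here is the same filter applied to that list (run under Python 3, so the output has no u prefixes). It keeps both 'magnetic sensor' and 'sensor system', so 'sensor' is used twice, and the result matches neither of the two accepted outputs:

```python
words = ['shift', 'magnetic', 'system', 'magnetic sensor', 'phase shift',
         'phase', 'output', 'sensor', 'sensing', 'sensor system']

# same substring filter as used for the first example
result = [w1 for w1 in words if not any(w1 in w2 for w2 in words if w2 != w1)]
print(result)
# ['magnetic sensor', 'phase shift', 'output', 'sensing', 'sensor system']
```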

Unfortunately, a one-liner is not possible. Try something like this,

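def remove_comprising(words):
    # small words that have already been used to build a larger entry
    seen = set()
    result_words = []
    for word in words:
        for small_word in words:
            if small_word in word and small_word != word:
                if small_word in seen:
                    # already used by an earlier entry: strip it from this one
                    word = word.replace(small_word, '')
                else:
                    # first use: this entry gets to keep the small word
                    seen.add(small_word)
        result_words.append(word)
    # drop the used small words; strip whitespace left behind by replace()
    return [word.strip() for word in result_words if word not in seen]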

Then we get the correct result for example 1,

>>> words = [u'dose rate', u'object', u'dose', u'rate', u'computation']
>>> remove_comprising(words)
[u'dose rate', u'object', u'computation']

and for example 2,

>>> words = [u'shift', u'magnetic', u'system', u'magnetic sensor', u'phase shift', u'phase', u'output', u'sensor', u'sensing', u'sensor system']
>>> remove_comprising(words)
[u'magnetic sensor', u'phase shift', u'output', u'sensing', u'system']
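
Note that the outcome depends on the order of the list: whichever larger entry comes first gets to keep a shared small word. As a quick check, moving 'sensor system' ahead of 'magnetic sensor' gives the other accepted output. This is a sketch for Python 3 that repeats the function so it runs on its own; the reordered list is made up for illustration.

```python
def remove_comprising(words):
    seen = set()
    result_words = []
    for word in words:
        for small_word in words:
            if small_word in word and small_word != word:
                if small_word in seen:
                    word = word.replace(small_word, '')
                else:
                    seen.add(small_word)
        result_words.append(word)
    return [word.strip() for word in result_words if word not in seen]


# same entries as example 2, with the two 'sensor' phrases swapped
reordered = ['shift', 'magnetic', 'system', 'sensor system', 'phase shift',
             'phase', 'output', 'sensor', 'sensing', 'magnetic sensor']
print(remove_comprising(reordered))
# ['sensor system', 'phase shift', 'output', 'sensing', 'magnetic']
```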

This is a bit complicated to read and not very pythonic in its implementation, but it should do the job.

The basic idea: evaluate each word in the list and flag whether it should be included. Then use those flags to actually output the words.

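The catch is that you want to handle words that can be part of 2 different larger words, which makes the flagging more fine-grained: not just keep or reject, but keep, keep-continued, and reject.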

import copy
def inception(wordlist):

    # don't want to mutate the original list
    new_wordlist = copy.deepcopy(wordlist)

    # find length of wordlist to know when original length is traversed
    word_count = len(new_wordlist)
    output_set = set()
    output_list = []  # flags: -1 = evaluation postponed, 0 = exclude, >0 = include (>1 = join with previous word)
    eval_list = []

    # iterate through list
    for idx, word in enumerate(new_wordlist):
        inner_words = word.split()

        # if it's only 1 word, postpone its evaluation to the end of the list
        # Can be made smarter to reject earlier
        if len(inner_words) == 1 and idx < word_count:
            output_list.append(-1)
            eval_list.append(word)
            new_wordlist.append(word)
            continue        

        # Flag existence of inner words if they haven't been found
        existence = 0
        for in_wrd in inner_words:
            if in_wrd in output_set:
                output_list.append(0)       
            else:
                # keep continued 
                existence += 1
                output_set.add(in_wrd)
                output_list.append(existence)
            eval_list.append(in_wrd)

    # now evaluate by position of flags
    final_set = set()
    for idx, word in enumerate(eval_list):
        if output_list[idx] > 0:

            # combine if words are in order
            if output_list[idx] > 1:
                final_set.remove(eval_list[idx-1])
                word = ' '.join([eval_list[idx-1], eval_list[idx]])
            final_set.add(word) 
    return list(final_set)

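I have only tested this with the 2 sets you provided. If you have sets where it fails, please add them in the comments and I will of course correct it.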
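
One case worth trying beyond those two sets is a phrase with three or more words, such as ['a', 'b', 'a b c'] from the question. A minimal sketch that handles phrases of any length might look like this, assuming entries are space-separated tokens and that larger entries earlier in the list take precedence. It is a different, greedy technique rather than the flag-based function above, and `remove_inception` is a hypothetical name (Python 3):

```python
def remove_inception(words):
    """Greedy sketch: multi-word entries claim their tokens in list order."""
    claimed = set()   # tokens already used by some multi-word entry
    kept = {}         # original index -> text to keep

    # first pass: multi-word entries keep whichever tokens are still free
    for i, w in enumerate(words):
        tokens = w.split()
        if len(tokens) > 1:
            free = [t for t in tokens if t not in claimed]
            claimed.update(tokens)
            if free:
                kept[i] = ' '.join(free)

    # second pass: single words survive only if no larger entry claimed them
    for i, w in enumerate(words):
        tokens = w.split()
        if len(tokens) == 1 and tokens[0] not in claimed:
            kept[i] = tokens[0]

    # return survivors in their original list order
    return [kept[i] for i in sorted(kept)]


print(remove_inception(['a', 'b', 'a b c']))
# ['a b c']
print(remove_inception(['dose rate', 'object', 'dose', 'rate', 'computation']))
# ['dose rate', 'object', 'computation']
```

Unlike the set-based version, this keeps the original list order in the result.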