Removing word-interchanged elements of a list in Python

I have a list that contains duplicate values whose words have been interchanged. For example:

dataList=["john is student", "student is john", "john student is", "john is student", "alica is student", "good weather", "weather good"]

I want to get rid of all such interchanged duplicates, like this:

Expected output:

dataList=["john is student","john is student", "john is student","john is student","alica is student", "good weather", "good weather"]

The code I tried is:

for i in dataList:
    first = (i.split()[0] + i.split()[1] + i.split()[2]) in studentList
    ........

I'm stuck on the logic. How can I get the desired result?

If you consider the first occurrence to be the correct one in the final list, you can try the following:

dataList= ["john is student", 
           "student is john", 
           "john student is", 
           "alica is student", 
           "good weather", 
           "weather good",
          ]

data = {}
for words in dataList:
    data.setdefault(frozenset(words.split()), words)

dataList = list(data.values())  # this is the list you need
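
One caveat with a frozenset key: it ignores how many times a word occurs, so two sentences with the same words but different repetitions are treated as equal. If that matters for your data, a sorted tuple of words could be used as the key instead. A minimal sketch (the helper name multiset_key is just for illustration, it is not part of the answer above):

```python
# A frozenset key drops word counts, so these two sentences collide.
a = "good good weather"
b = "weather weather good"
same_as_sets = frozenset(a.split()) == frozenset(b.split())

def multiset_key(sentence):
    """Hypothetical helper: order-insensitive key that keeps word counts."""
    return tuple(sorted(sentence.split()))

# The sorted-tuple key tells them apart but still ignores word order.
same_as_multisets = multiset_key(a) == multiset_key(b)
still_order_insensitive = (
    multiset_key("john is student") == multiset_key("student is john")
)

print(same_as_sets, same_as_multisets, still_order_insensitive)
```

For the sample list in the question both keys give the same grouping, so this only matters when words can repeat within a sentence.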

Edit

Since I first answered, the question has been updated with a requirement to keep the duplicated values.

[Answer]

dataList= ["john is student", 
           "student is john", 
           "john student is",
           "alica is student",
           "good weather", 
           "weather good",
          ]

class WordFrequence:
    def __init__(self, word, frequence=1):
        self.word = word
        self.frequence = frequence

    def as_list(self):
        return [self.word] * self.frequence

    def __repr__(self):
        return "{}({}, {})".format(self.__class__.__name__, self.word, self.frequence)    

counter = {} 
for words in dataList:
    key = frozenset(words.split())
    if key in counter:
        counter[key].frequence += 1
    else:
        counter[key] = WordFrequence(words)

dataList = [] # this is what you need
for wf in counter.values():
    dataList.extend(wf.as_list())

For a long input dataList, you can improve my code by replacing WordFrequence with recordclass.
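
If you would rather stay within the standard library, the same counting could also be sketched with collections.Counter plus a dict of first occurrences. This is an alternative to the class above, not a drop-in for recordclass; the names counts, first_seen and result are my own:

```python
from collections import Counter

dataList = ["john is student",
            "student is john",
            "john student is",
            "alica is student",
            "good weather",
            "weather good"]

counts = Counter()   # how many sentences share each word set
first_seen = {}      # first sentence seen for each word set

for words in dataList:
    key = frozenset(words.split())
    counts[key] += 1
    first_seen.setdefault(key, words)

# Repeat each first occurrence as many times as its word set appeared.
result = []
for key, sentence in first_seen.items():
    result.extend([sentence] * counts[key])

print(result)
```

Note that, like the class-based version, this groups equal sentences together, so the output order follows the first appearance of each word set rather than the original positions.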

Considering the first occurrence to be the correct one:

dataList= ["john is student", 
           "student is john", 
           "john student is", 
           "alica is student", 
           "good weather", 
           "weather good",
          ]

filteredData = {}
for statement in dataList:
    # Sort the words (not the characters) so that word order does not matter
    filteredData.setdefault(' '.join(sorted(statement.split())), statement)

dataList = list(filteredData.values())
print(dataList)

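You could also wrap the iteration with a grammar-checking library so that only the grammatically correct English form is accepted.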

@Grijesh has already given a very clean solution; I'm just reiterating his code:

dataList=["john is student", "student is john", "john student is", 
          "alica is student", "good weather", "weather good"]

final_data = {} 
for i in dataList:
    final_data[" ".join(sorted(set(i.split())))] = i

Output

>>> list(final_data.values())
   ['john student is', 'alica is student', 'weather good']

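Above, we split each sentence to get its words, then build a set of the unique words and sort it, so that the result is the same no matter how the words are ordered (or repeated) within the sentence.

We then use that as a dictionary key. A dictionary can only hold unique keys, so it keeps just one entry per word set (joining the sorted words gives us a string to use as the key). Because each assignment overwrites the previous value, the last occurrence of each group is the one that is kept.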
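
To make the first-versus-last difference concrete, here is a small sketch that builds the same key and fills two dictionaries, one by plain assignment and one with setdefault. The helper word_key and the dictionary names are illustrative only:

```python
def word_key(sentence):
    """Order-insensitive key: the sentence's unique words, sorted and joined."""
    return " ".join(sorted(set(sentence.split())))

dataList = ["john is student", "student is john", "john student is",
            "alica is student", "good weather", "weather good"]

last_seen = {}
first_seen = {}
for sentence in dataList:
    key = word_key(sentence)
    last_seen[key] = sentence             # overwrites: keeps the last occurrence
    first_seen.setdefault(key, sentence)  # only sets once: keeps the first

print(list(last_seen.values()))
# ['john student is', 'alica is student', 'weather good']
print(list(first_seen.values()))
# ['john is student', 'alica is student', 'good weather']
```

So if you want the first occurrence, as in the other answers, setdefault is the one to use.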

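You can create a dictionary seen that maps each frozenset of words to the element in which those words first appeared. For each element, look it up in the seen dict with {}.setdefault(), which either stores the new value or returns the one already stored.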

dataList= ["john is student", 
           "student is john", 
           "john student is",
           "alica is student",
           "good weather", 
           "weather good",
          ]

seen = {}
data = []
for words in dataList:
    key = frozenset(words.split())
    words = seen.setdefault(key, words)
    data.append(words)

Output:

>>> data
['john is student',
 'john is student',
 'john is student',
 'alica is student',
 'good weather',
 'good weather']
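
As a quick check against the list from the question itself (which also contains an exact duplicate), the same idea could be wrapped in a small function. The function name replace_with_first_occurrence is my own, not something from the question:

```python
def replace_with_first_occurrence(sentences):
    """Replace each sentence by the first sentence seen with the same words."""
    seen = {}  # frozenset of words -> first sentence with those words
    return [seen.setdefault(frozenset(s.split()), s) for s in sentences]

dataList = ["john is student", "student is john", "john student is",
            "john is student", "alica is student", "good weather",
            "weather good"]

result = replace_with_first_occurrence(dataList)
print(result)
# ['john is student', 'john is student', 'john is student', 'john is student',
#  'alica is student', 'good weather', 'good weather']
```

This keeps the original length and positions of the list, which matches the expected output in the question.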