How do I remove duplicate words from a list in python without using sets?
I have the following python code that almost works for me (I'm SO close!). I am opening a text file from a Shakespeare play:

The original text file:

"But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

The result of the code I wrote is this:

['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and',
'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill',
'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the',
'through', 'what', 'window', 'with', 'yonder']

This is almost what I want: it's already in a list sorted the way I want it, but how do I remove the duplicate words? I am trying to create a new ResultList and append the words to it, but it gives me the above result without removing the duplicates. If I "print ResultList" it just spits out a big jumble of words. The way I have it now is close, but I want to get rid of the extra "and"s, "is"s, "sun"s, and "the"s... I want to keep it simple and use append(), but I'm not sure how to get it to work. I don't want to do anything crazy with the code. What simple thing am I missing from my code to remove the duplicate words?
fname = raw_input("Enter file name: ")
fhand = open(fname)
NewList = list()    #create new list
ResultList = list() #create new results list I want to append words to
for line in fhand:
    line.rstrip()         #strip white space
    words = line.split()  #split lines of words and make list
    NewList.extend(words) #make the list from 4 lists to 1 list
    for word in line.split():         #for each word in line.split()
        if words not in line.split(): #if a word isn't in line.split
            NewList.sort()            #sort it
            ResultList.append(words)  #append it, but this doesn't work.
print NewList
#print ResultList (doesn't work the way I want it to)
mylist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist = sorted(set(mylist), key=lambda x:mylist.index(x))
print(newlist)
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist is a list containing the unique values from mylist, sorted by the index of each item in mylist.
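If sets are off the table entirely, a plain dict can play the same role on Python 3.7+, where dicts preserve insertion order. This is a sketch of that alternative (the sample list here is shortened for illustration):

```python
mylist = ['the', 'sun', 'the', 'moon', 'sun', 'Arise']

# dict.fromkeys keeps only the first occurrence of each key,
# so converting back to a list drops duplicates in original order.
newlist = list(dict.fromkeys(mylist))
print(newlist)  # ['the', 'sun', 'moon', 'Arise']
```

Unlike the index-based sort above, this is linear time, since there is no repeated `mylist.index(x)` scan.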
Using a dictionary as an alternative to set is a good choice. The collections module contains a class called Counter, a dictionary specialized for counting the number of times each key has been seen. Using it, you can do something like this:
from collections import Counter

wordlist = ['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and',
            'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is',
            'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun',
            'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
newlist = sorted(Counter(wordlist),
                 key=lambda w: w.lower())  # case insensitive sort
print(newlist)
Output:
['already', 'and', 'Arise', 'breaks', 'But', 'east', 'envious', 'fair',
'grief', 'is', 'It', 'Juliet', 'kill', 'light', 'moon', 'pale', 'sick',
'soft', 'sun', 'the', 'through', 'what', 'Who', 'window', 'with', 'yonder']
Using a plain old list. Almost certainly less efficient than Counter.
fname = raw_input("Enter file name: ")

Words = []
with open(fname) as fhand:
    for line in fhand:
        line = line.strip()
        # lines probably not needed
        #if line.startswith('"'):
        #    line = line[1:]
        #if line.endswith('"'):
        #    line = line[:-1]
        Words.extend(line.split())

UniqueWords = []
for word in Words:
    if word.lower() not in UniqueWords:
        UniqueWords.append(word.lower())

print Words
UniqueWords.sort()
print UniqueWords
This always checks the lowercase version of words, to ensure that the same word in a different case configuration isn't counted as 2 different words. I added the check to remove double quotes at the beginning and end of the file, but if they aren't present in the actual file, those lines can be ignored.
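A small demonstration of the lowercase check (the word list here is made up): without `.lower()`, 'Sun' and 'sun' would be kept as two separate entries.

```python
words = ['But', 'but', 'Sun', 'sun', 'moon']

UniqueWords = []
for word in words:
    # compare and store lowercase so 'Sun' and 'sun' count as one word
    if word.lower() not in UniqueWords:
        UniqueWords.append(word.lower())

print(UniqueWords)  # ['but', 'sun', 'moon']
```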
There is a problem with your code. I think you meant:

for word in line.split():      #for each word in line.split()
    if word not in ResultList: #if a word isn't in ResultList
Your code does have some logic errors. I fixed it up; hope it helps you.

fname = "stuff.txt"
fhand = open(fname)
AllWords = list()   #create new list
ResultList = list() #create new results list I want to append words to
for line in fhand:
    line.rstrip()          #strip white space
    words = line.split()   #split lines of words and make list
    AllWords.extend(words) #make the list from 4 lists to 1 list
AllWords.sort() #sort list
for word in AllWords:           #for each word in AllWords
    if word not in ResultList:  #if the word isn't already in ResultList
        ResultList.append(word) #append it.
print(ResultList)
Tested on Python 3.4, no imports needed.
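The same corrected logic can be checked without a file by feeding lines directly; the two lines below stand in for the contents of the text file.

```python
# Stand-in for the lines read from the file.
lines = ["It is the east and Juliet is the sun",
         "Arise fair sun and kill the envious moon"]

AllWords = []
for line in lines:
    AllWords.extend(line.split())  # flatten all lines into one word list
AllWords.sort()                    # sort BEFORE deduplicating

ResultList = []
for word in AllWords:
    if word not in ResultList:     # skip words already collected
        ResultList.append(word)

print(ResultList)
```

The key fix over the original is that the membership test runs against ResultList (the output being built), not against line.split().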
This should work; it iterates over the list and adds an element to the new list if it differs from the last element added to the new list.

def unique(lst):
    """ Assumes lst is already sorted """
    unique_list = []
    for el in lst:
        # the "not unique_list" check avoids an IndexError on the first element
        if not unique_list or el != unique_list[-1]:
            unique_list.append(el)
    return unique_list
You can also use itertools.groupby, which works similarly:

from itertools import groupby

# lst must already be sorted
unique_list = [key for key, _ in groupby(lst)]
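A quick check of the groupby approach (note the import comes from itertools, not collections): each run of equal elements in the sorted list collapses to a single key.

```python
from itertools import groupby

lst = ['and', 'and', 'and', 'is', 'is', 'sun', 'sun', 'the']  # already sorted

# groupby yields one (key, group) pair per run of equal adjacent elements
unique_list = [key for key, _ in groupby(lst)]
print(unique_list)  # ['and', 'is', 'sun', 'the']
```

Like the unique() function above, this only removes adjacent duplicates, which is why the input must be sorted first.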
The following function may help.

def remove_duplicate_from_list(temp_list):
    if temp_list:
        my_list_temp = []
        for word in temp_list:
            if word not in my_list_temp:
                my_list_temp.append(word)
        return my_list_temp
    else:
        return []
This should do the job:

fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
lst.sort()
print(lst)