Selecting unique line problems

Question

My head feels like it is about to explode, and I can't figure out what is going wrong here.

config = open('s1','r').read().splitlines()
new = open('s2','r').read().splitlines()

for clean1 in config:
    x = clean1.split(" ")
    for clean2 in new:
        x2 = clean2.split(" ")
        if x[0] in x2[0]:
            print x[0] + " already exists."
            break
        if x[0] not in x2[0]:
            print x[0] + " is new."
            break

Let me explain:

File s1 contains:

192.168.1.1 test test
192.168.1.2 test test

File s2 contains:

192.168.1.1 test test
192.168.1.2 test test
192.168.1.3 test test

With this condition:

    if x[0] in x2[0]:
        print x[0] + " already exists."
        break
    if x[0] not in x2[0]:
        print x[0] + " is new."
        break

the result should be:

 192.168.1.1 already exists.
 192.168.1.2 already exists.
 192.168.1.3 is new.

but the actual result is:

 192.168.1.1 already exists.
 192.168.1.2 is new.

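I would really appreciate a way to solve this problem.

Important:

Please don't give me a solution that uses set() or any kind of library to find the unique records. I want a classic solution.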

Answer 1

If you want to compare the unique keys in file 1 and file 2, you can use a Python dictionary.

# Assumption: s1 and s2 hold the lines of the two files,
# e.g. s1 = open('s1').read().splitlines()
m = {}
for line in s1:
    key = line.strip().split(' ')[0]
    if key not in m:
        m[key] = ''

for line in s2:
    key = line.strip().split(' ')[0]
    if key in m:
        # Found key
        print(key + "  Already exists")
    else:
        print(key + "  is new")

Another simple way is to use set(). It is also the Pythonic approach, since it takes advantage of the set logic built into Python.

s1_set = set(line.strip().split(' ')[0] for line in s1)
s2_set = set(line.strip().split(' ')[0] for line in s2)

# Keys present in both files
for key in s1_set.intersection(s2_set):
    print(key + "  Already exists")

# Keys present only in s2 are the new ones
for key in s2_set - s1_set:
    print(key + "  is new")

Answer 2

>>> s1 = open('s1', 'r').readlines()
>>> s2 = open('s2', 'r').readlines()

>>> s1Codes = [x.split()[0] for x in s1]
>>> s2Codes = [x.split()[0] for x in s2]

>>> newCodes = [code for code in s2Codes if code not in s1Codes]
>>> print(newCodes)
['192.168.1.3']

Or, if you want to stick with something closer to your own solution:

>>> s1 = open('s1', 'r').readlines()
>>> s2 = open('s2', 'r').readlines()

>>> s1Codes = [x.split()[0] for x in s1]
>>> s2Codes = [x.split()[0] for x in s2]

>>> for code in s2Codes:
...     if code in s1Codes:
...         print(code + " already exists")
...     else:
...         print(code + " is a new code")

192.168.1.1 already exists
192.168.1.2 already exists
192.168.1.3 is a new code

However, as others have said, using set() would be ideal here.
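For readers who want to stay close to the asker's nested loops and avoid set(): the original version decides after comparing against only the first line of the other file, because both branches break. Below is a minimal sketch, not a definitive implementation. It assumes the lines are already in memory (the two sample lists stand in for the files s1 and s2) and uses a made-up function name, classify. It loops over the new file on the outside and uses for/else, so "is new" is reported only after every old line has been checked. It is written for Python 3, but the single-argument print() also runs on Python 2.7.

```python
def classify(old_lines, new_lines):
    """Return one message per line of new_lines, comparing first fields only."""
    messages = []
    for new_line in new_lines:
        new_key = new_line.split()[0]
        for old_line in old_lines:
            if old_line.split()[0] == new_key:
                messages.append(new_key + " already exists.")
                break
        else:
            # Reached only when the inner loop ended without a break,
            # i.e. no old line carries this key.
            messages.append(new_key + " is new.")
    return messages


# Sample data copied from the question; stands in for files s1 and s2.
config = ["192.168.1.1 test test", "192.168.1.2 test test"]
new = ["192.168.1.1 test test", "192.168.1.2 test test", "192.168.1.3 test test"]

for message in classify(config, new):
    print(message)
```

The sketch compares keys with == rather than in. On two strings, in is a substring test, so 192.168.1.1 would also match inside 192.168.1.10.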

Answer 3

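Looking things up in a dictionary is the best approach. set() looks like the obvious solution, but it is slower than dict(), because dict() stores its entries using hashing. So, depending on your needs: if you are not going to run the algorithm on large amounts of data (as the sample files suggest), use a list comprehension as shown above; otherwise, use a dictionary. I would use dict.has_key() rather than the in operator, but that is just my style. There should be no difference in speed.

Sets shouldn't really be used with strings, but people do it all the time. :D

Now, a few additions: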

Correct! set() also uses a hash table.
set() is implemented as a dictionary without values, using only keys.
That is nearly exactly what we would do if we used dict() for duplicate detection.
As set() doesn't even support indexing (element order changes according to the hash table),
its natural use would be for things such as our question.
Yes, set() should be faster, but it is not.
I can prove it. Try this:
# python -m timeit -s "s = set(range(10**7))" "5*10**6 in s"
2.7: 1000000 loops, best of 3: 0.161 usec per loop
2.5: 1000000 loops, best of 3: 0.163 usec per loop

# python -m timeit -s "d = dict.fromkeys(range(10**7))" "5*10**6 in d"
2.7: 10000000 loops, best of 3: 0.144 usec per loop
2.5: 10000000 loops, best of 3: 0.133 usec per loop 

Here we measure how much time the "in" operator needs per loop in a nearly worst case.
The labels before the results stand for Python 2.7 on Cygwin and native Python 2.5. That's my configuration.
I have seen more drastic results on other computers and/or systems, where "in dict()" takes 0.0xxx usec while "in set()" is still over 0.15xx usec.
I don't know the reason for this difference in speed.
When set() was first added to Python, it was almost a copy-paste of the dict() code. It even used dummy values internally.
Not to mention Set() from the sets module (Python 2.3 through 2.6, deprecated), which actually USES dictionaries.
Now, set() takes somewhat less memory than dict() with dummy values (as we would use it), but, evidently, its search is slower.

But for the original question this discussion is really unnecessary.
As far as I can tell, Brian is comparing two /etc/hosts-like files, and lists are more than enough for that.
I ran into the speed dilemma myself and just wanted to record my discovery on Stack Overflow for future reference.
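The command-line measurements above can also be reproduced from a script. This is a rough sketch using the standard timeit module. The helper name time_membership and the smaller container size (10**6 instead of 10**7) are assumptions chosen to keep the run short. Absolute numbers depend on the interpreter and the machine, so no expected output is shown.

```python
import timeit


def time_membership(setup, stmt="5 * 10**5 in c", number=100000, repeat=3):
    """Return the best time per lookup, in microseconds."""
    # timeit.repeat runs the setup once per repetition and returns
    # the total time of `number` executions for each repetition.
    best = min(timeit.repeat(stmt, setup=setup, number=number, repeat=repeat))
    return best / number * 1e6


# Hypothetical container name `c`; the size is smaller than in the
# original commands so the script finishes quickly.
set_usec = time_membership("c = set(range(10**6))")
dict_usec = time_membership("c = dict.fromkeys(range(10**6))")

print("in set():  %.3f usec per lookup" % set_usec)
print("in dict(): %.3f usec per lookup" % dict_usec)
```

Running it on several interpreters is the simplest way to check whether the gap reported above still shows up.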

This is a trick I found here for solving the problem of duplicates and can easily be modified to solve Brian's problem:
...
l1 = f.readlines()
...
l2 = f.readlines()
...
found = {} # Duplicate entry checker dict
# Take method pointer out to speed up getting to function:
setdef = found.setdefault
# You can construct new list containing old and new entries with no duplicates,
# while keeping order as much as possible, as this:
no_duplicates = [setdef(x, x) for x in l1+l2 if (x not in found)]
del setdef, found
# Get only new-ones (order of l2 is kept):
old = dict.fromkeys(l1)
setdef = old.setdefault
del l1 # If you do not need it any longer and it's really big :D
newcomers = [setdef(x, x) for x in l2 if (x not in old)]
del setdef, old
# Old-ones can be found by reversing places of l1 and l2. (obviously)
# To understand trick with setdefault(), see help(dict.setdefault)

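If you really need to print as you go, it is easy to switch from the list comprehension to a real loop. I use this algorithm on books with thousands of pages to filter out repeated lines (headers and footers). The speed is incredible.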
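As a hedged illustration of that loop form: the sketch below reports new lines while it walks the second list, using a dict purely as the membership record. The function name report_new and the sample lists are hypothetical. Whole lines are compared, as in the list comprehension version.

```python
def report_new(old_lines, new_lines):
    """Print and collect the lines of new_lines not seen in old_lines, in order."""
    seen = dict.fromkeys(old_lines)   # dict used only as a lookup table
    newcomers = []
    for line in new_lines:
        if line not in seen:
            seen[line] = None         # remember it so later repeats are skipped
            newcomers.append(line)
            print(line + " is new")
    return newcomers


# Made-up sample data: repeated header/footer lines plus page text.
l1 = ["header", "page one text", "footer"]
l2 = ["header", "page two text", "footer", "page two text"]
result = report_new(l1, l2)
```

To compare only the first field of each line, as in the question, the same loop can key the dict on line.split()[0] instead of the whole line.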

Why not strings in set()? Well, the name brings the mathematical set to mind, and hashing numbers is easier and faster than hashing strings.
Well dict.has_key() --> :D

I'm a Python 2.5 and 2.7 freak, and I don't like Python 3 one bit. So please forgive me for liking it. But, as you can see, I also use the "in" operator. :D

P.S. Don't ask me why formatting is as it is. Just correct it if you know how.
