Check if MD5 value exists in an index file
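I am trying to find a way for my code to check whether the MD5 hash of a URL string already exists in an index file and, if it does, skip the scan.
My code is below.
Each URL that is formed gets converted to an MD5 string and is stored in an idx file once its scan has completed; the goal is that future scans should not pick up the same URL again. The problem I see is that the branch under if str(md5url) in line is never executed, probably because I am not appending '\n' as a suffix when adding the hash to the file. But I tried that and it still does not work.
Any ideas?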
import hashlib

def computeMD5hash(string_for_hash):
    m = hashlib.md5()
    m.update(string_for_hash.encode('utf-8'))
    return m.hexdigest()

def writefilehash(formation_URL):
    fn="urlindex.idx"
    try:
        afile = open(fn, 'a')
        afile.write(computeMD5hash(formation_URL))
        afile.close()
    except IOError:
        print("Error writing to the index file")

fn="urlindex.idx"
try:
    afile = open(fn, 'r')
except IOError:
    afile = open(fn, 'w')

for f in files:
    formation=repouri + "/" + f
    #print(computeMD5hash(formation))
    md5url=computeMD5hash(formation)
    hashlist = afile.readlines()
    for line in hashlist:
        if str(md5url) in line:
            print ("Skipping " + formation + " because its already scanned and indexed as " + line)
        else:
            if downloadengine(formation):
                print ("Download completed " + formation)
                print ("Starting to write to database..")
                #writetodatabase()
                print ("Writing hash value ..")
                writefilehash(formation)

print("Closing..")
afile.close()
You are testing inside a loop. For every line that does not match, you download:
line1
    if hash in line:
        print something
    else
        download
line2
    if hash in line:
        print something
    else
        download
line3
    if hash in line:
        print something
    else
        download
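If the hash is on line 1, you still download, because the hash is not on line 2 or line 3. You should not decide to download until you have tested all lines.
The best way to do this is to read all the hashes into a set object in one go (containment tests against a set are faster), stripping the line separators as you read: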
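To make the difference concrete, here is a minimal sketch contrasting the per-line decision with a single decision taken after every line has been checked. The function names `downloads_per_line` and `should_download` and the three sample lines are hypothetical stand-ins, not part of the question's code.

```python
def downloads_per_line(target, lines):
    """Mimic the question's loop: count how often the else branch would download."""
    count = 0
    for line in lines:
        if target in line:
            pass  # the "Skipping ..." message would be printed here
        else:
            count += 1  # a download is triggered for this non-matching line
    return count


def should_download(target, lines):
    """Decide once, only after every line has been tested."""
    return not any(target in line for line in lines)


# Made-up stand-ins for stored MD5 hex digests
lines = ["aaa\n", "bbb\n", "ccc\n"]

print(downloads_per_line("aaa", lines))  # 2: downloaded twice although "aaa" is indexed
print(should_download("aaa", lines))     # False: already indexed, skip
print(should_download("zzz", lines))     # True: not indexed yet
```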
try:
    with open(fn) as hashfile:
        hashes = {line.strip() for line in hashfile}
except IOError:
    # no file yet, just use an empty set
    hashes = set()
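Reading the file once, up front, may also sidestep a second issue in the question's code: there, `afile.readlines()` is called inside the `for f in files` loop on the same open file object, so every call after the first starts at the end of the file and returns an empty list. The sketch below illustrates that behaviour with an in-memory `io.StringIO` standing in for the index file; the sample contents are made up.

```python
import io

# In-memory stand-in for the index file, with two made-up entries
index = io.StringIO("aaa\nbbb\n")

first = index.readlines()   # reads everything and leaves the position at the end
second = index.readlines()  # nothing left to read

print(first)   # ['aaa\n', 'bbb\n']
print(second)  # []
```

With an empty list, the inner `for line in hashlist` loop body never runs, so neither the skip branch nor the download branch is reached for that URL.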
Then, when testing a new hash, use:
urlhash = computeMD5hash(formation)
if urlhash not in hashes:
    # not seen before, download
    # record the hash
    hashes.add(urlhash)
    with open(fn, 'a') as hashfile:
        hashfile.write(urlhash + '\n')
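Putting the two pieces together, a self-contained sketch of the whole round trip might look like the following. It is only an illustration under stated assumptions: `compute_md5_hash`, `load_hashes` and `scan` are hypothetical names, the `download` callable stands in for the question's `downloadengine`, the example.com URLs are placeholders, and a temporary directory is used so the sketch does not touch a real `urlindex.idx`.

```python
import hashlib
import os
import tempfile


def compute_md5_hash(text):
    """Return the hex MD5 digest of a string (same idea as computeMD5hash)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def load_hashes(path):
    """Read every stored hash into a set; a missing file yields an empty set."""
    try:
        with open(path) as hashfile:
            return {line.strip() for line in hashfile}
    except IOError:
        return set()


def scan(urls, path, download):
    """Download each URL not yet indexed, record its hash, return the new URLs."""
    hashes = load_hashes(path)
    scanned = []
    for url in urls:
        urlhash = compute_md5_hash(url)
        if urlhash in hashes:
            continue  # already indexed, skip
        if download(url):
            hashes.add(urlhash)
            # One hash per line, so the next run can strip and compare exactly
            with open(path, "a") as hashfile:
                hashfile.write(urlhash + "\n")
            scanned.append(url)
    return scanned


with tempfile.TemporaryDirectory() as tmpdir:
    index_path = os.path.join(tmpdir, "urlindex.idx")
    urls = ["http://example.com/a", "http://example.com/b"]  # placeholder URLs
    # The lambda pretends every download succeeds
    first_run = scan(urls, index_path, download=lambda url: True)
    second_run = scan(urls, index_path, download=lambda url: True)
    print(first_run)   # both URLs are scanned on the first run
    print(second_run)  # [] because both hashes are now in the index
```

Writing the trailing '\n' matters here: without it, successive digests would run together on one line, and the stripped line would no longer equal any single hash in the set.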