python

Question

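This is my first question. I'm still new to Python, so I may have missed an existing answer on Stack Overflow simply because I didn't know how to phrase the question correctly!

What I want: to automatically check websites for changes. I'd like it to send me a notification every time something changes and tell me what changed.

So far I have 2 separate pieces of code that work:

An API call that returns a list of results in JSON format. (There are always 30 results in the list.)
A diff tool that checks whether two JSON files are the same and, if they are not, finds the differences.

If I run the API call on its own, it works fine and saves the JSON results to a file.

If I diff the files one pair at a time, the diff code works and spits out the changes.

I want to make them work together - the end result being that I can set up a cron job + notification and get on with my life, saving the time I would otherwise spend checking these sites when nothing has changed.

My idea is to keep comparing the most recent pull against the previous pull, so I store the results in a folder.

To get the separate pieces working, I split the old results and the new results into separate folders, then realized I'm not sure how to tell the code which old file goes with which new file.

I want to walk the folders, find matching pairs of old and new files, load each file as a JSON object, and then compare the two.

What I've tried partly works, but I'm stuck on how to pair the old and new files together.

Here's what I'm working with: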

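import json
import os

new_files = []
old_files = []
docs = for_docs[0]  # for_docs and uid_list come from the API call (not shown)

# save the latest results, one file per uid
for uid in uid_list:
    with open('%s_my_results' % uid, 'w+') as outfile:
        json.dump(docs, outfile)

# os.walk yields (dirpath, dirnames, filenames) tuples; keep only the file names
for dirpath, dirnames, filenames in os.walk('FILEPATH/new_files'):
    new_files.extend(filenames)

unpack_newFiles = sorted(new_files)

os.chdir('FILEPATH/old_files')

for dirpath, dirnames, filenames in os.walk('FILEPATH/old_files'):
    old_files.extend(filenames)

unpack_oldFiles = sorted(old_files)

for fname in unpack_oldFiles:
    if fname.endswith('.json'):
        with open(fname, mode='rb') as oldFile:
            try:
                # use a separate name so the list being iterated is not overwritten
                old_docs = json.load(oldFile)
            except json.decoder.JSONDecodeError:
                continue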

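This works - but I think the unpacked JSON objects are still just an unsorted list of JSON objects. So I'm definitely lost here and would like to get unstuck.

The reason I used sorted was the hope that I could force the files to match up in order, since they are always downloaded in the same order. I think I've figured out that sorted isn't the right tool, but I'm genuinely stumped on a solution.

Here's the code that diffs my JSON files: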

    import json
    from datetime import datetime

    from deepdiff import DeepDiff

    with open('FILEPATH/old_file.json') as f:
        old_docs = json.load(f)

    with open('FILEPATH/new_file.json') as fc:
        new_docs = json.load(fc)

    # compare the two objects

    unchanged = (old_docs == new_docs)

    # log time and result; print(..., file=log) writes to the log
    # without rebinding (and then closing) sys.stdout

    date = datetime.now().isoformat()  # assumed: any timestamp string works here

    with open('logfile.txt', 'a') as log:
        if unchanged:
            print(f'{date} No Change', file=log)
        else:
            print(f'{date} this item was added:  ', file=log)
            print(DeepDiff(old_docs, new_docs), file=log)

What I know I want is:

#for file in list: 
    # if uid in file name matches:
        # decode each file to json 
        # diff the two files 
        # spit out the result

To that end, I started writing variations of the following, but I'm definitely missing something. I found fnmatch, but I'm not sure how to use it.

for fname in folder1, folder2:  # pseudocode: walk both folders
    if UID-in-filename matches:  # I do not know how to set this up
        thing = (oldfile == newfile)
        with open('logfile.txt', 'a+') as log:
            if thing is not True:
                print(f'{date} {UID} this item was added:', file=log)
                print(DeepDiff(oldfile, newfile), file=log)
            else:
                print(f'{date} {UID} no change', file=log)

I hope my first question was a fair one. Thanks, everyone!

Answer 1

So, if I understand you correctly, your directory structure looks something like:

data_files/
├── new_data
│   ├── data_file_1970_01_01_e24520c7-94ef-41c6-94b3-a16049b0d882.json
│   ├── data_file_1970_01_03_827a591b-8d10-4f8e-b55d-5a36bdaa96d7.json
│   └── data_file_1971_01_02_18bfab97-aeb9-476e-9332-94f4bb30157b.json
└── old_data
    ├── old_name_1970_01_01_e24520c7-94ef-41c6-94b3-a16049b0d882.json
    ├── old_name_1970_01_03_827a591b-8d10-4f8e-b55d-5a36bdaa96d7.json
    └── old_name_1971_01_02_18bfab97-aeb9-476e-9332-94f4bb30157b.json

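You have folders full of different JSON files that don't share an exact name but do share a uuid somewhere in the name, and you need to read in the files that have the same uuid and then run your diff program on them. Here's how I would do it: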

import json
import os
import re

from pprint import pprint

from deepdiff import DeepDiff  # third-party package, used in the final loop


uuid_regex = re.compile(r'[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}')


def parse_directory(uuid_dict, filelist, key):
    for file in filelist:

        uuid_matcher = uuid_regex.findall(file)

        # check that a uuid was found in the input filename
        if (uuid_matcher):

            uuid = uuid_matcher[0]

            # dict.get returns existing sub dictionary if found, or defaults to a new dictionary
            per_uuid_subdict = uuid_dict.get(uuid, dict())
            per_uuid_subdict[key] = file

            uuid_dict[uuid] = per_uuid_subdict


old_files = [os.path.abspath(os.path.join('data_files/old_data', i)) for i in os.listdir('data_files/old_data') if i.endswith('.json')]
new_files = [os.path.abspath(os.path.join('data_files/new_data', i)) for i in os.listdir('data_files/new_data') if i.endswith('.json')]

uuid_dict = dict()
parse_directory(uuid_dict, new_files, 'new')
parse_directory(uuid_dict, old_files, 'old')

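This uses a regular expression to pull the uuid out of each file name and add it to a dictionary, building a mapping from each uuid to the files that contain it. An example of what that looks like...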

pprint(uuid_dict)
# prints:
# {'18bfab97-aeb9-476e-9332-94f4bb30157b': {'new': '/path/to/file/data_files/new_data/data_file_1971_01_02_18bfab97-aeb9-476e-9332-94f4bb30157b.json',
#                                           'old': '/path/to/file/data_files/old_data/old_name_1971_01_02_18bfab97-aeb9-476e-9332-94f4bb30157b.json'},
#  '827a591b-8d10-4f8e-b55d-5a36bdaa96d7': {'new': '/path/to/file/data_files/new_data/data_file_1970_01_03_827a591b-8d10-4f8e-b55d-5a36bdaa96d7.json',
#                                           'old': '/path/to/file/data_files/old_data/old_name_1970_01_03_827a591b-8d10-4f8e-b55d-5a36bdaa96d7.json'},
#  'e24520c7-94ef-41c6-94b3-a16049b0d882': {'new': '/path/to/file/data_files/new_data/data_file_1970_01_01_e24520c7-94ef-41c6-94b3-a16049b0d882.json',
#                                           'old': '/path/to/file/data_files/old_data/old_name_1970_01_01_e24520c7-94ef-41c6-94b3-a16049b0d882.json'}}
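To make the pairing step concrete, here is a minimal sketch of how the same pattern pulls the key out of two differently named files. The helper name `extract_uuid` and the `pairs` dictionary are hypothetical, introduced only for illustration, and the sketch uses `search` plus `dict.setdefault` rather than the `findall` and `dict.get` calls above; the effect should be the same for file names that contain a single uuid.

```python
import re

# Same pattern as above: 8-4-4-4-12 hexadecimal digits.
uuid_regex = re.compile(
    r'[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}'
)


def extract_uuid(filename):
    """Return the first uuid found in filename, or None if there is none."""
    match = uuid_regex.search(filename)
    return match.group(0) if match else None


# Example names taken from the directory listing above.
old_name = 'old_name_1970_01_01_e24520c7-94ef-41c6-94b3-a16049b0d882.json'
new_name = 'data_file_1970_01_01_e24520c7-94ef-41c6-94b3-a16049b0d882.json'

# Different names, same key, so both files land in the same dictionary entry.
pairs = {}
pairs.setdefault(extract_uuid(old_name), {})['old'] = old_name
pairs.setdefault(extract_uuid(new_name), {})['new'] = new_name
```

Because the uuid is the dictionary key, the order in which the files are listed or downloaded no longer matters, which is why sorting is not needed for the match.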

From there, just iterate over the results.

for uuid, filelist in uuid_dict.items():

    # a complete pair has exactly one 'old' and one 'new' entry
    if len(filelist) != 2:

        print('Missing old or new file for uuid: {}'.format(uuid))
        continue

    try:
        with open(filelist['new'], 'r') as file_handler:
            new_file = json.load(file_handler)
    except json.decoder.JSONDecodeError:
        # skip pairs where the new file is not valid JSON
        continue

    try:
        with open(filelist['old'], 'r') as file_handler:
            old_file = json.load(file_handler)
    except json.decoder.JSONDecodeError:
        # skip pairs where the old file is not valid JSON
        continue

    # DeepDiff handling logic around hereish
    diff = DeepDiff(old_file, new_file)
    if diff:
        print(uuid, diff)
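As one possible shape for the handling logic, the sketch below logs a timestamped line per uuid. It is a standard-library stand-in, not DeepDiff: it assumes each file holds a JSON list of results (as the question describes) and only reports items that were added or removed, ignoring order. The function names `summarize_changes` and `log_result` are hypothetical.

```python
import json
from datetime import datetime


def summarize_changes(old_docs, new_docs):
    """Return (added, removed) items between two lists of JSON results.

    Items are compared by their canonical JSON text, so dicts (which are
    unhashable) can still be looked up in a set.
    """
    def canon(item):
        return json.dumps(item, sort_keys=True)

    old_keys = {canon(item) for item in old_docs}
    new_keys = {canon(item) for item in new_docs}
    added = [item for item in new_docs if canon(item) not in old_keys]
    removed = [item for item in old_docs if canon(item) not in new_keys]
    return added, removed


def log_result(uuid, old_docs, new_docs, logfile='logfile.txt'):
    """Append one timestamped line; return True if items were added or removed."""
    added, removed = summarize_changes(old_docs, new_docs)
    changed = bool(added or removed)
    stamp = datetime.now().isoformat(timespec='seconds')
    # Writing with print(..., file=log) leaves sys.stdout untouched.
    with open(logfile, 'a') as log:
        if changed:
            print(f'{stamp} {uuid} added: {added} removed: {removed}', file=log)
        else:
            print(f'{stamp} {uuid} no change', file=log)
    return changed
```

Inside the loop above, a call such as `log_result(uuid, old_file, new_file)` could sit where the DeepDiff placeholder is, and the returned flag could decide whether the cron job sends a notification. If the nested detail DeepDiff provides is needed, its result could be written to the log in place of the added/removed lists.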

python - how to match files by something in filename

diff

json