Encountering "NoneType Error" when using Dask to read a nested JSON file

Question

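I am trying to use the dask package to read a large nested JSON file, flatten it into a Dask dataframe, and then save it as CSV. However, the flattening step fails with a NoneType error: "TypeError: 'NoneType' object is not subscriptable". Here is my code: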

    import json

    import dask.bag as db

    # Each line of the file is parsed as one JSON record.
    b = db.read_text('2019-12-16-latest-level.json').map(json.loads)

    def flatten(record):
        # Pull the nested fields of interest up into a flat dict.
        return {
            'uuid': record['uuid'],
            'level_id': record['level_instance_json']['meta']['level_id'],
            'previous_attempts': record['level_instance_json']['meta']['previous_attempts'],
            'early_termination': record['level_instance_json']['meta']['early_termination'],
            'platform': record['level_instance_json']['meta']['platform'],
            'app_version': record['level_instance_json']['meta']['app_version']
        }

The data looks like this:

    {'uuid': 'bef72f2d-f0af-447b-a173-9f04979cc35f',
      'level_instance_json': {'meta': {'user_id': 0,
        'level_id': 13,
        'previous_attempts': 1,
        'early_termination': False,
        'platform': 'ANDROID',
        'app_version': '1.2.2',}}}

This is the error:

    <ipython-input-71-3c5cd8597f07> in flatten(record)
    ----> 6         'level_id': record['level_instance_json']['meta']['level_id'],
          7         'previous_attempts': record['level_instance_json']['meta']['previous_attempts'],
          8         'early_termination': record['level_instance_json']['meta']['early_termination'],

    TypeError: 'NoneType' object is not subscriptable

When loading the data, I would like to skip the users whose records have None for level_id, so that the error does not occur. Any suggestions?

Answer 1

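I had to change your JSON to the format shown below. db.read_text yields the file one line at a time and json.loads is applied to each line, so I think every record has to be strict JSON on a single line. [It is worth reading up on the JSON syntax standard; I do not know it in depth.]

    {"uuid": "bef72f2d-f0af-447b-a173-9f04979cc35f","level_instance_json": {"meta": {"user_id":0,"level_id":13,"previous_attempts": 1,"early_termination": false,"platform": "ANDROID","app_version": "1.2.2"}}}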

Then I called your flatten function like this:

    flatten(b.take(1)[0])

It ran fine.

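Basically, I removed all the newlines from your JSON file and used double quotes instead of single quotes (False also became false, and the trailing comma is gone).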
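The difference can be checked with the standard library alone. This is only a minimal sketch, assuming the file really contains Python-style dict text like the sample in the question (that sample may simply be how the record was printed); the helper name to_json_line is hypothetical.

```python
import ast
import json

# Python-repr style text as printed in the question:
# single quotes, False, and a trailing comma.
python_style = (
    "{'uuid': 'bef72f2d-f0af-447b-a173-9f04979cc35f', "
    "'level_instance_json': {'meta': {'user_id': 0, 'level_id': 13, "
    "'previous_attempts': 1, 'early_termination': False, "
    "'platform': 'ANDROID', 'app_version': '1.2.2',}}}"
)

def to_json_line(text):
    """Hypothetical helper: turn a Python-literal dict into one line of valid JSON."""
    return json.dumps(ast.literal_eval(text))

# json.loads rejects the Python-style text ...
try:
    json.loads(python_style)
    parsed_directly = True
except json.JSONDecodeError:
    parsed_directly = False

# ... but accepts the converted single-line JSON.
json_line = to_json_line(python_style)
record = json.loads(json_line)
```

If the file is already valid line-delimited JSON, no conversion is needed and the problem lies in the records themselves.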

Answer 2

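If the original JSON does not contain the required fields, Dask will raise an error at compute time. One way to avoid this is to use .get instead of indexing the key directly, for example: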

    # 'level_id': record['level_instance_json']['meta']['level_id']
    'level_id': record.get('level_instance_json', {'meta': {'level_id': 'n/a'}})['meta']['level_id']

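The second argument of .get is the value Python uses when the key is not available, so put something there that will not trigger further errors (for example, a dict with the expected keys set to a "missing / not available" value). Note that the default is used only when the key is absent: if the key is present but its value is null (None in Python), .get returns None and the next subscript still fails, which matches the 'NoneType' error in the question.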
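To drop such records rather than fill in placeholders, one option is to filter before flattening. The following is a sketch in plain Python, not a tested Dask pipeline: the helper names get_meta and has_level_id, the two incomplete sample records, and the output file name in the closing comment are all assumptions made for illustration.

```python
def get_meta(record):
    """Return the nested 'meta' dict, or None if any level is missing or null."""
    level = record.get('level_instance_json') or {}
    meta = level.get('meta') or {}
    return meta or None

def has_level_id(record):
    """True only for records that carry a usable level_id."""
    meta = get_meta(record)
    return meta is not None and meta.get('level_id') is not None

def flatten(record):
    """Flatten one record; assumes has_level_id(record) is True."""
    meta = get_meta(record)
    return {
        'uuid': record.get('uuid'),
        'level_id': meta.get('level_id'),
        'previous_attempts': meta.get('previous_attempts'),
        'early_termination': meta.get('early_termination'),
        'platform': meta.get('platform'),
        'app_version': meta.get('app_version'),
    }

records = [
    {'uuid': 'bef72f2d-f0af-447b-a173-9f04979cc35f',
     'level_instance_json': {'meta': {'user_id': 0, 'level_id': 13,
                                      'previous_attempts': 1,
                                      'early_termination': False,
                                      'platform': 'ANDROID',
                                      'app_version': '1.2.2'}}},
    {'uuid': 'hypothetical-null-record', 'level_instance_json': None},
    {'uuid': 'hypothetical-missing-key'},
]

# Keep only the records that can be flattened safely.
rows = [flatten(r) for r in records if has_level_id(r)]

# With the Dask bag from the question, the same functions would plug in as:
#   b.filter(has_level_id).map(flatten).to_dataframe().to_csv('out-*.csv')
```

Filtering first keeps flatten simple, and the skipped records never reach the dataframe.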

python

bigdata

nonetype

dask

dask-dataframe