Converting a text document into JSONL (JSON Lines) format

I want to convert a text file to JSON Lines format using Python. It needs to work for text files of any length (in characters or words).

For example, I want to convert the following text:

A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so. 

These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics.

into this:

{"text": "A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so."}
{"text": "These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics."}

I tried this:

text = ""
with open("text.txt", encoding="utf8") as f:
    for line in f:
        text = {"text": line}

but had no luck.

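A hacky way to do this is to paste the text file into a CSV. Make sure to write text in the first cell of the CSV (so that it becomes the column header), then use this code: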

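import pandas as pd

# Placeholder paths (assumed; the original snippet did not define these variables)
knowledge = "knowledge.csv"
knowledge_jsonl = "knowledge.jsonl"

df = pd.read_csv(knowledge)
df.to_json(knowledge_jsonl,
           orient="records",
           lines=True)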

Not ideal, but it works.

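The basic idea of your for loop is right, but the line text = {"text": line} just overwrites the previous value on every iteration, whereas what you want is to build up a list of lines.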

Try the following:

import json

# Generate a list of dictionaries
lines = []
with open("text.txt", encoding="utf8") as f:
    for line in f.read().splitlines():
        if line:
            lines.append({"text": line})

# Convert to a list of JSON strings
json_lines = [json.dumps(l) for l in lines]

# Join lines and save to .jsonl file
json_data = '\n'.join(json_lines)
with open('my_file.jsonl', 'w') as f:
    f.write(json_data)

splitlines removes the \n characters, and if line: skips empty lines.
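
Since the question asks for something that works on files of any length, here is a minimal sketch of a streaming variant that writes each JSON line as soon as it is read, instead of holding the whole file in memory. The function name text_to_jsonl is hypothetical. It also makes two assumptions that differ from the code above: it strips leading and trailing whitespace from each line (the first sample line in the question ends with a space that does not appear in the desired output), and it therefore also skips whitespace-only lines.

```python
import json


def text_to_jsonl(src_path, dst_path):
    """Write one {"text": ...} JSON object per non-blank line of src_path.

    Hypothetical helper: returns the number of JSON lines written.
    """
    count = 0
    with open(src_path, encoding="utf8") as src, \
            open(dst_path, "w", encoding="utf8") as dst:
        # Iterating over the file object reads one line at a time
        for line in src:
            text = line.strip()  # assumption: surrounding whitespace is unwanted
            if text:  # skip empty and whitespace-only lines
                # ensure_ascii=False keeps non-ASCII characters readable in the output
                dst.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
                count += 1
    return count
```

Calling text_to_jsonl("text.txt", "my_file.jsonl") should produce the same kind of file as the code above, with a trailing newline after the last record.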