Find piped to awk redirected to new files

Question

I'm trying to find a set of files:

> find . -type f -iregex .*geojson$
  ./dir1/london.geojson
  ./manchester.geojson

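Then, for each file found (there are 30 to 40 of them across many nested folders), I want to wrap the original content in my own JSON structure, adding the filename and an extracted ID. For example, given this file: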

> cat manchester.geojson
  {"properties": { "id": 11.0, "borough": "Didsbury" }, "geometry": {"removed": 0} }
  {"properties": { "id": 22.0, "borough": "Chorlton" }, "geometry": {"removed": 0} }

I want the following result:

{"_id": 11.0, "filename": "manchester.geojson", "document": {"properties": { "id": 11.0, "borough": "Didsbury" }, "geometry": {"removed": 0} }}
{"_id": 22.0, "filename": "manchester.geojson", "document": {"properties": { "id": 22.0, "borough": "Chorlton" }, "geometry": {"removed": 0} }}

The closest I've got is piping into xargs and awk like this:

> find . -type f -iregex .*geojson$ | xargs -d '\n' awk -F'[{:,]' '{print "{ \"_id\":" $7 ", \"file\": \"" FILENAME "\", \"doc\": " $0 " }"}'

  }"_id": 11.0, "file": "./manchester.geojson", "doc": { "type": "Feature", "properties": { "id": 11.0, "borough": "Didsbury" }, "geometry": {"removed": 0} }}
  }"_id": 22.0, "file": "./manchester.geojson", "doc": { "type": "Feature", "properties": { "id": 22.0, "borough": "Chorlton" }, "geometry": {"removed": 0} }}

I don't understand what is going wrong with the opening curly brace.

I can get at all the variables I want, as this example shows:

> find . -type f -iregex .*geojson$ | xargs -d '\n' awk -F'[{:,]' '{print $7 " " FILENAME " " $0}'

  11.0 ./manchester.geojson { "type": "Feature", "properties": { "id": 11.0, "borough": "Didsbury" }, "geometry": {"removed": 0} }}
  22.0 ./manchester.geojson { "type": "Feature", "properties": { "id": 22.0, "borough": "Chorlton" }, "geometry": {"removed": 0} }}

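The last problem is sending each file's output to a new file with the same name but a new extension. I can send the combined output for all the files into one big file with a simple > redirect, but that's not what I need. Any ideas would be appreciated.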

Answer 1

Use a JSON parser to process JSON data. jq is a good one.

jqbody='{_id: .properties.id, filename: input_filename, document: .}'
find . -type f -name '*geojson' -print0 | while IFS= read -r -d '' filename; do
    jq -c "$jqbody" "$filename" ## > ./tmpfile && mv ./tmpfile "$filename"
done

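If the output looks right, remove the ## comment marker so that each file is overwritten with the result.

I don't see an "edit in place" option for jq, so I use a shell while loop to get hold of each filename rather than xargs.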

Output:

{"_id":11,"filename":"./manchester.geojson","document":{"properties":{"id":11,"borough":"Didsbury"},"geometry":{"removed":0}}}
{"_id":22,"filename":"./manchester.geojson","document":{"properties":{"id":22,"borough":"Chorlton"},"geometry":{"removed":0}}}

I see that the id numbers got "integerized". To avoid that, your original JSON should quote the id values so that they are treated verbatim as strings.
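The question asked for each result to land in a new file with the same name but a different extension, rather than overwriting the original. A minimal sketch of that variant follows. It swaps the while loop for find -exec sh -c so that it runs in any POSIX shell; the function name wrap_with_jq and the extension .wrapped.json are made up for illustration, and jq 1.5 or later is assumed for input_filename.

```shell
# wrap_with_jq DIR: for every *.geojson file under DIR, write the wrapped
# records to a sibling file ending in .wrapped.json (hypothetical extension).
wrap_with_jq() {
    find "$1" -type f -name '*.geojson' -exec sh -c '
        for filename do
            # One compact output object per input object; the original is untouched.
            jq -c "{_id: .properties.id, filename: input_filename, document: .}" "$filename" \
                > "${filename%.geojson}.wrapped.json"
        done
    ' sh {} +
}
```

Calling wrap_with_jq . from the top of the tree should leave ./manchester.wrapped.json next to ./manchester.geojson. Ids that are quoted in the input, such as "id": "11.0", pass through as strings unchanged.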

Answer 2

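Thanks to @EdMorton and @glenjackman for pointing me in the right direction. In the end, what I had in the question was almost there. Once the dodgy line endings were cleaned up, the following one-liner does the job:

> find . -type f -name \*geojson | xargs -d '\n' awk -i inplace -F'[:,]' '{print "{ \"_id\":" $5 ", \"file\": \"" FILENAME "\", \"doc\": " $0 "}"}'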

The missing piece was -i inplace (a GNU awk extension), which modifies the files in place. It was an option I hadn't considered at first.
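For anyone who wants separate output files instead of in-place edits, or who does not have GNU awk, here is a rough sketch using only POSIX awk features. It strips a trailing carriage return inside awk (the kind of line ending that makes the closing " }" overwrite the start of the line on a terminal) and redirects each input file's records to a sibling file. The function name wrap_geojson and the extension .wrapped.json are invented for this example, and $5 assumes the record layout shown in the question's output, with "type": "Feature" first.

```shell
# wrap_geojson DIR: for every *.geojson file under DIR, write a wrapped copy
# next to it with the (hypothetical) extension .wrapped.json.
wrap_geojson() {
    find "$1" -type f -name '*.geojson' -exec awk -F'[:,]' '
        FNR == 1 {                        # first line of each new input file
            if (out != "") close(out)     # release the previous output file
            out = FILENAME
            sub(/\.geojson$/, ".wrapped.json", out)
        }
        {
            sub(/\r$/, "")                # drop a trailing CR from CRLF input
            # $5 is the id only for the layout assumed above
            printf("{\"_id\":%s, \"file\": \"%s\", \"doc\": %s}\n", $5, FILENAME, $0) > out
        }
    ' {} +
}
```

Run as wrap_geojson . from the top of the tree. Picking the id by field position is as fragile here as in the one-liner above, so a JSON parser remains the safer choice if the layout varies.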
