Reg Expression to reduce huge file size?
I've exported a series of gigantic (40-80 MB) Google Location History JSON files, and my task is to analyze select activity data from them. Unfortunately Google offers no parameter or option on their download site to choose anything other than "one giant JSON containing forever". (The KML option is twice as big.)
The obvious choices, such as JSON-Converter (the laexcel-test incarnation of VBA-JSON), parsing line-by-line with VBA, even Notepad++, all crash and burn. I'm thinking RegEx may be the answer.
This Python script can extract the timestamps and locations from a 40 MB file in two seconds (with RegEx?). How is Python so fast? (Would it be as fast in VBA?)
If I had a magic chunk of RegEx, I could extract everything I need bit by bit, maybe with logic like this:

Delete everything EXCEPT:
When timestampMs and WALKING appear between the *same* set of [square brackets]:
- I need the 13-digit number following timestampMs, and
- the 1-to-3 digit number following WALKING.

If it's simpler to include more data, such as "all timestamps" or "all activities", I can easily sift through it afterwards. The goal is to get the file small enough that I can manipulate it without needing to rent a supercomputer, lol.
I've tried adapting existing RegEx's, but I have a serious issue with both RegEx and musical instruments: no matter how hard I try, I just can't understand them. So yes, this really is a "please write code for me" question, but only in a manner of speaking, since I'll pay it forward today by writing code for someone else! Thanks... ☺
}, {
"timestampMs" : "1515564666086", ◁― (Don't need this but it won't hurt)
"latitudeE7" : -6857630899,
"longitudeE7" : -1779694452999,
"activity" : [ {
"timestampMs" : "1515564665992", ◁― EXAMPLE: I want only this, and...
"activity" : [ {
"type" : "STILL",
"confidence" : 65
}, { ↓
"type" : "TILTING",
"confidence" : 4
}, {
"type" : "IN_RAIL_VEHICLE",
"confidence" : 20 ↓
}, {
"type" : "IN_ROAD_VEHICLE",
"confidence" : 5
}, {
"type" : "ON_FOOT", ↓
"confidence" : 3
}, {
"type" : "UNKNOWN",
"confidence" : 3
}, {
"type" : "WALKING", ◁―┬━━ ...AND, I also want this.
"confidence" : 3 ◁―┘
} ]
} ]
}, {
"timestampMs" : "1515564662594", ◁― (Don't need this but it won't hurt)
"latitudeE7" : -6857630899,
"longitudeE7" : -1779694452999,
"altitude" : 42
}, {
EDIT:
For testing purposes I made a sample file that is representative of the original (except for size). The raw JSON can be loaded directly from this Pastebin link, or downloaded as a local copy with this TinyUpload link, or copied as "one long line" below:
{"locations" : [ {"timestampMs" : "1515565441334","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 2299}, {"timestampMs" : "1515565288606","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 12,"velocity" : 0,"heading" : 350,"altitude" : 42,"activity" : [ {"timestampMs" : "1515565288515","activity" : [ {"type" : "STILL","confidence" : 98}, {"type" : "ON_FOOT","confidence" : 1}, {"type" : "UNKNOWN","confidence" : 1}, {"type" : "WALKING","confidence" : 1} ]} ]}, {"timestampMs" : "1515565285131","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 12,"velocity" : 0,"heading" : 350,"altitude" : 42}, {"timestampMs" : "1513511490011","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 25,"altitude" : -9,"verticalAccuracy" : 2}, {"timestampMs" : "1513511369962","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 25,"altitude" : -9,"verticalAccuracy" : 2}, {"timestampMs" : "1513511179720","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513511059677","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513510928842","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513510942911","activity" : [ {"type" : "STILL","confidence" : 100} ]} ]}, {"timestampMs" : "1513510913776","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 15,"altitude" : -11,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513507320258","activity" : [ {"type" : "TILTING","confidence" : 100} ]} ]}, {"timestampMs" : "1513510898735","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513510874140","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 19,"altitude" : -12,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513510874245","activity" : [ {"type" : "STILL","confidence" : 100} ]} ]} ]}
The file tests as valid with JSONLint and FreeFormatter.
You can try this:
(?s)^.*?\"longitude[^\[]*?\"activity[^\[]*\[[^\]]*?timestampMs\"[^\"\]]*\"(\d+)\"[^\]]*WALKING[^\]]*?confidence\"\s*:\s*(\b\d{1,3}\b)[^\]]*?\].*$
Regex Demo, where I navigate toward and capture the target values (the timestamp value and the walking value) via the keywords "longitude", "activity", "[", "timestampMs", "]", "WALKING" and "confidence".
Python script
ss=""" copy & paste the file contents' strings (above sample text) in this area """
regx= re.compile(r"(?s)^.*?\"longitude[^\[]*?\"activity[^\[]*\[[^\]]*?timestampMs\"[^\"\]]*\"(\d+)\"[^\]]*WALKING[^\]]*?confidence\"\s*:\s*(\b\d{1,3}\b)[^\]]*?\].*$")
matching= regx.match(ss) # method 1 : using match() function's capturing group
timestamp= matching.group(1)
walkingval= matching.group(2)
print("\ntimestamp is %s\nwalking value is %s" %(timestamp,walkingval))
print("\n"+regx.sub(r' ',ss)) # another method by using sub() function
The output is
timestamp is 1515564665992
walking value is 3
1515564665992 3
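If you want every matching block rather than just the first, here is a minimal sketch (assuming the structure of the sample above; the file name is only a placeholder) that uses a narrower per-block pattern with re.findall:

import re

# Placeholder file name - point this at your actual export.
with open("Location History.json", encoding="utf-8") as f:
    text = f.read()

# One match per activity block: the block's 13-digit timestampMs, then the
# confidence of the WALKING entry inside the same [ ... ] array.
pair_re = re.compile(
    r'"timestampMs"\s*:\s*"(\d{13})"\s*,\s*"activity"\s*:\s*\['
    r'[^\]]*?"WALKING"\s*,\s*"confidence"\s*:\s*(\d{1,3})'
)

for ts, conf in pair_re.findall(text):
    print(ts, conf)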
It's a pity that you seem to have chosen a language that lacks a performant JSON parser.
With Python you could do this:
#!/usr/bin/env python3
import time
import json

def get_history(filename):
    with open(filename) as history_file:
        return json.load(history_file)

def walking_confidence(history):
    for location in history["locations"]:
        if "activity" not in location:
            continue
        for outer_activity in location["activity"]:
            confidence = extract_walking_confidence(outer_activity)
            if confidence:
                timestampMs = int(outer_activity["timestampMs"])
                yield (timestampMs, confidence)

def extract_walking_confidence(outer_activity):
    for inner_activity in outer_activity["activity"]:
        if inner_activity["type"] == "WALKING":
            return inner_activity["confidence"]

if __name__ == "__main__":
    start = time.clock()
    history = get_history("history.json")
    middle = time.clock()
    wcs = list(walking_confidence(history))
    end = time.clock()
    print("load json: " + str(middle - start) + "s")
    print("loop json: " + str(end - middle) + "s")
On my 98 MB JSON history this prints:
load json: 3.10292s
loop json: 0.338841s
Not terribly performant, but certainly not bad.
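To actually produce the much smaller file the question is after, you could then dump wcs to disk. A minimal sketch reusing the script above (the output file name is just an assumption):

import csv

# Write the (timestampMs, confidence) pairs collected above to a small CSV.
# "walking.csv" is an assumed output name.
with open("walking.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["timestampMs", "walking_confidence"])
    writer.writerows(wcs)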
Obvious choices ...
The obvious choice here is a JSON-aware tool that can handle large files quickly. Below I'll use jq, which can easily and quickly handle gigabyte-size files so long as there is enough RAM to hold the file in memory, and which can also handle very large files even when there is not enough RAM to hold the JSON in memory.
First, let's assume the file consists of an array of JSON objects of the form shown, and that the goal is to extract the two values for each admissible sub-object.
Here is a jq program that would do the job:
.[].activity[]
| .timestampMs as $ts
| .activity[]
| select(.type == "WALKING")
| [$ts, .confidence]
For the given input, this would produce:
["1515564665992",3]
More specifically, assuming the above program is in a file named program.jq and that the input file is input.json, a suitable invocation of jq would be:
jq -cf program.jq input.json
It should be easy to modify the jq program given above to handle other cases, e.g. if the JSON schema is more complex than assumed above. For example, if there are some irregularities in the schema, try sprinkling in some postfix ?s, e.g.:
.[].activity[]?
| .timestampMs as $ts
| .activity[]?
| select(.type? == "WALKING")
| [$ts, .confidence]
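Finally, if you'd rather drive this from Python than from the shell, here is a minimal sketch that simply invokes the jq command shown above via subprocess (it assumes jq is installed and on your PATH, and that program.jq and input.json exist as described):

import subprocess

# Run the jq filter over the big file and capture its compact output.
# Assumes jq is on PATH and program.jq / input.json exist as described above.
result = subprocess.run(
    ["jq", "-cf", "program.jq", "input.json"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.splitlines():
    print(line)   # e.g. ["1515564665992",3]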