Python - How to stream a large (11 GB) JSON file to be broken up
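I have a very large JSON file (11 GB) that is too big to read into memory.
I would like to break it up into smaller files to analyze the data. I am currently using Python and Pandas for the analysis, and I am wondering whether there is some way to access chunks of the file so that it can be read into memory without crashing the program. Ideally, I would like to break the years' worth of data into smaller, manageable files that each span about a week. There is no constant data size, however, and it is less important that the files cover set time intervals.
Here is the data format: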
{
"actor" :
{
"classification" : [ "suggested" ],
"displayName" : "myself",
"followersCount" : 0,
"followingCount" : 0,
"followingStocksCount" : 0,
"id" : "person:stocktwits:183087",
"image" : "http://avatars.stocktwits.com/production/183087/thumb-1350332393.png",
"link" : "http://stocktwits.com/myselfbtc",
"links" :
[
{
"href" : null,
"rel" : "me"
}
],
"objectType" : "person",
"preferredUsername" : "myselfbtc",
"statusesCount" : 2,
"summary" : null,
"tradingStrategy" :
{
"approach" : "Technical",
"assetsFrequentlyTraded" : [ "Forex" ],
"experience" : "Novice",
"holdingPeriod" : "Day Trader"
}
},
"body" : "$BCOIN and macd is going down ..... http://stks.co/iDEB",
"entities" :
{
"chart" :
{
"fullImage" :
{
"link" : "http://charts.stocktwits.com/production/original_10047145.png"
},
"image" :
{
"link" : "http://charts.stocktwits.com/production/small_10047145.png"
},
"link" : "http://stks.co/iDEB",
"objectType" : "image"
},
"sentiment" :
{
"basic" : "Bearish"
},
"stocks" :
[
{
"displayName" : "Bitcoin",
"exchange" : "PRIVATE",
"industry" : null,
"sector" : null,
"stocktwits_id" : 9659,
"symbol" : "BCOIN"
}
],
"video" : null
},
"gnip" :
{
"language" :
{
"value" : "en"
}
},
"id" : "tag:gnip.stocktwits.com:2012:note/10047145",
"inReplyTo" :
{
"id" : "tag:gnip.stocktwits.com:2012:note/10046953",
"objectType" : "comment"
},
"link" : "http://stocktwits.com/myselfbtc/message/10047145",
"object" :
{
"id" : "note:stocktwits:10047145",
"link" : "http://stocktwits.com/myselfbtc/message/10047145",
"objectType" : "note",
"postedTime" : "2012-10-17T19:13:50Z",
"summary" : "$BCOIN and macd is going down ..... http://stks.co/iDEB",
"updatedTime" : "2012-10-17T19:13:50Z"
},
"provider" :
{
"displayName" : "StockTwits",
"link" : "http://stocktwits.com"
},
"verb" : "post"
}
I think you need something like a streaming parser. ijson may work:
https://changelog.com/ijson-parse-streams-of-json-in-python/
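ijson is a third-party package. As a dependency-free sketch of the same streaming idea, the standard library's `json.JSONDecoder.raw_decode` can pull one value at a time out of a text buffer. This assumes the file is a series of top-level objects like the one shown, one after another (the question does not say whether they are wrapped in an array). The helper names `iter_json_objects`, `week_key`, and `split_by_week`, and the `.jsonl` output layout, are made up for illustration.

```python
import json
import os
from datetime import datetime


def iter_json_objects(fp, chunk_size=1 << 20):
    """Yield top-level JSON objects one at a time from a text file object.

    Only one chunk plus any incomplete trailing object is held in memory.
    Assumes the top-level values are objects or arrays; a bare number split
    across two chunks could be misread.
    """
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = fp.read(chunk_size)
        at_eof = not chunk
        buf += chunk
        pos = 0
        while True:
            # Skip whitespace between consecutive values.
            while pos < len(buf) and buf[pos].isspace():
                pos += 1
            if pos >= len(buf):
                break
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if at_eof:
                    raise  # truncated or malformed input
                break  # value is incomplete; read another chunk
            yield obj
        buf = buf[pos:]  # keep only the unparsed tail
        if at_eof:
            return


def week_key(record):
    """Return an ISO year-week label such as '2012-W42' for one record."""
    ts = record["object"]["postedTime"]  # e.g. "2012-10-17T19:13:50Z"
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    year, week, _ = dt.isocalendar()
    return f"{year}-W{week:02d}"


def split_by_week(src_path, out_dir):
    """Append each record to out_dir/<year>-W<week>.jsonl, one object per line."""
    with open(src_path, encoding="utf-8") as src:
        for record in iter_json_objects(src):
            out_path = os.path.join(out_dir, week_key(record) + ".jsonl")
            # Reopening per record is slow but keeps the sketch short;
            # caching open file handles would be faster.
            with open(out_path, "a", encoding="utf-8") as dst:
                dst.write(json.dumps(record) + "\n")
```

Each weekly file is then newline-delimited JSON, which `pandas.read_json(path, lines=True)` can load.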
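jq 1.5 has a streaming parser (documented at http://stedolan.github.io/jq/manual/#Streaming). In one sense it is easy to use: for example, if your 1G file is named 1G.json, then the following command will produce a stream of lines, including one line per "leaf" value: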
jq -c --stream . 1G.json
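(The output is shown below. Note that each line is itself valid JSON.)
Using the streamed output may be less straightforward, though, depending on what you want to do :-)
The key to understanding the streamed output is that most lines have the form: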
[ PATH, VALUE ]
where "PATH" is an array representation of the path. (When using jq, this array can in fact be used as a path.)
[["actor","classification",0],"suggested"]
[["actor","classification",0]]
[["actor","displayName"],"myself"]
[["actor","followersCount"],0]
[["actor","followingCount"],0]
[["actor","followingStocksCount"],0]
[["actor","id"],"person:stocktwits:183087"]
[["actor","image"],"http://avatars.stocktwits.com/production/183087/thumb-1350332393.png"]
[["actor","link"],"http://stocktwits.com/myselfbtc"]
[["actor","links",0,"href"],null]
[["actor","links",0,"rel"],"me"]
[["actor","links",0,"rel"]]
[["actor","links",0]]
[["actor","objectType"],"person"]
[["actor","preferredUsername"],"myselfbtc"]
[["actor","statusesCount"],2]
[["actor","summary"],null]
[["actor","tradingStrategy","approach"],"Technical"]
[["actor","tradingStrategy","assetsFrequentlyTraded",0],"Forex"]
[["actor","tradingStrategy","assetsFrequentlyTraded",0]]
[["actor","tradingStrategy","experience"],"Novice"]
[["actor","tradingStrategy","holdingPeriod"],"Day Trader"]
[["actor","tradingStrategy","holdingPeriod"]]
[["actor","tradingStrategy"]]
[["body"],"$BCOIN and macd is going down ..... http://stks.co/iDEB"]
[["entities","chart","fullImage","link"],"http://charts.stocktwits.com/production/original_10047145.png"]
[["entities","chart","fullImage","link"]]
[["entities","chart","image","link"],"http://charts.stocktwits.com/production/small_10047145.png"]
[["entities","chart","image","link"]]
[["entities","chart","link"],"http://stks.co/iDEB"]
[["entities","chart","objectType"],"image"]
[["entities","chart","objectType"]]
[["entities","sentiment","basic"],"Bearish"]
[["entities","sentiment","basic"]]
[["entities","stocks",0,"displayName"],"Bitcoin"]
[["entities","stocks",0,"exchange"],"PRIVATE"]
[["entities","stocks",0,"industry"],null]
[["entities","stocks",0,"sector"],null]
[["entities","stocks",0,"stocktwits_id"],9659]
[["entities","stocks",0,"symbol"],"BCOIN"]
[["entities","stocks",0,"symbol"]]
[["entities","stocks",0]]
[["entities","video"],null]
[["entities","video"]]
[["gnip","language","value"],"en"]
[["gnip","language","value"]]
[["gnip","language"]]
[["id"],"tag:gnip.stocktwits.com:2012:note/10047145"]
[["inReplyTo","id"],"tag:gnip.stocktwits.com:2012:note/10046953"]
[["inReplyTo","objectType"],"comment"]
[["inReplyTo","objectType"]]
[["link"],"http://stocktwits.com/myselfbtc/message/10047145"]
[["object","id"],"note:stocktwits:10047145"]
[["object","link"],"http://stocktwits.com/myselfbtc/message/10047145"]
[["object","objectType"],"note"]
[["object","postedTime"],"2012-10-17T19:13:50Z"]
[["object","summary"],"$BCOIN and macd is going down ..... http://stks.co/iDEB"]
[["object","updatedTime"],"2012-10-17T19:13:50Z"]
[["object","updatedTime"]]
[["provider","displayName"],"StockTwits"]
[["provider","link"],"http://stocktwits.com"]
[["provider","link"]]
[["verb"],"post"]
[["verb"]]
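To connect this output back to the weekly split asked about in the question, here is a minimal Python sketch that filters `[ PATH, VALUE ]` lines for one path and maps the timestamp to an ISO week. The helper names `values_at` and `iso_week` are hypothetical, and the sample lines are copied from the output above. It assumes the paths look exactly as shown; if the objects were wrapped in one top-level array, each path would start with an array index.

```python
import json
from datetime import datetime


def values_at(lines, path):
    """Yield VALUE from every [PATH, VALUE] line whose PATH equals `path`."""
    for line in lines:
        event = json.loads(line)
        # Closing events contain only [PATH]; skip them.
        if len(event) == 2 and event[0] == path:
            yield event[1]


def iso_week(timestamp):
    """Map a timestamp such as '2012-10-17T19:13:50Z' to '2012-W42'."""
    dt = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    year, week, _ = dt.isocalendar()
    return f"{year}-W{week:02d}"


sample = [
    '[["object","objectType"],"note"]',
    '[["object","postedTime"],"2012-10-17T19:13:50Z"]',
    '[["object","updatedTime"],"2012-10-17T19:13:50Z"]',
    '[["object","updatedTime"]]',
]
weeks = [iso_week(ts) for ts in values_at(sample, ["object", "postedTime"])]
print(weeks)  # ['2012-W42']
```

In practice `lines` could be `sys.stdin`, with the output of `jq -c --stream . 1G.json` piped into the script.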