Converting slist to csv
A shell script that I run in IPython returns the following object:
results = ['{"url": "https://url.com", "date": "2020-10-02T21:25:20+00:00", "content": "mycontent\nmorecontent\nmorecontent", "renderedContent": "myrenderedcontent", "id": 123, "username": "somename", "user": {"username": "somename", "displayname": "some name", "id": 123, "description": "my description", "rawDescription": "my description", "descriptionUrls": [], "verified": false, "created": "2020-02-00T02:00:00+00:00", "followersCount": 1, "friendsCount": 1, "statusesCount": 1, "favouritesCount": 1, "listedCount": 1, "mediaCount": 1, "location": "", "protected": false, "linkUrl": null, "linkTcourl": null, "profileImageUrl": "https://myprofile.com/mypic.jpg", "profileBannerUrl": "https://myprofile.com/mypic.jpg"}, "outlinks": [], "outlinks2": "", "outlinks3": [], "outlinks4": "", "replyCount": 0, "retweetCount": 0, "likeCount": 0, "quoteCount": 0, "conversationId": 123, "lang": "en", "source": "<a href=\"mysource.com" rel=\"something\">Sometext</a>", "media": [{"previewUrl": "smallpic.jpg", "fullUrl": "largepic.jpg", "type": "photo"}], "forwarded": null, "quoted": null, "mentionedUsers": [{"username": "name1", "displayname": "name 1", "id": 345, "description": null, "rawDescription": null, "descriptionUrls": null, "verified": null, "created": null, "followersCount": null, "friendsCount": null, "statusesCount": null, "favouritesCount": null, "listedCount": null, "mediaCount": null, "location": null, "protected": null, "linkUrl": null, "link2url": null, "profileImageUrl": null, "profileBannerUrl": null}]}', ...]
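where ... stands for more entries like the one shown. According to type(), this is an slist. According to the documentation of the shell script, the output is a jsonlines file.
Ultimately, I want to convert it to a csv in which the keys are the columns and the values are the cell values, with each entry (like the one above) forming one row. So something like: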
url date content ...
https://url.com 2020-10-02T21:25:20+00:00 mycontent ...
I have already tried the solution proposed here, but I ended up with a dataframe of key-value pairs, as follows:
import pandas as pd
df = pd.DataFrame(data=results)
df = df[0].str.split(',',expand=True)
df = df.rename(columns=df.iloc[0])
Your example data contains a couple of problems, but once you fix those, this works:
import json
import pandas as pd
fragment = '{"url": "https://url.com", "date": "2020-10-02T21:25:20+00:00", "content": "mycontent\\nmorecontent\\nmorecontent", "renderedContent": "myrenderedcontent", "id": 123, "username": "somename", "user": {"username": "somename", "displayname": "some name", "id": 123, "description": "my description", "rawDescription": "my description", "descriptionUrls": [], "verified": false, "created": "2020-02-00T02:00:00+00:00", "followersCount": 1, "friendsCount": 1, "statusesCount": 1, "favouritesCount": 1, "listedCount": 1, "mediaCount": 1, "location": "", "protected": false, "linkUrl": null, "linkTcourl": null, "profileImageUrl": "https://myprofile.com/mypic.jpg", "profileBannerUrl": "https://myprofile.com/mypic.jpg"}, "outlinks": [], "outlinks2": "", "outlinks3": [], "outlinks4": "", "replyCount": 0, "retweetCount": 0, "likeCount": 0, "quoteCount": 0, "conversationId": 123, "lang": "en", "source": "<a href=\"mysource.com\" rel=\"something\">Sometext</a>", "media": [{"previewUrl": "smallpic.jpg", "fullUrl": "largepic.jpg", "type": "photo"}], "forwarded": null, "quoted": null, "mentionedUsers": [{"username": "name1", "displayname": "name 1", "id": 345, "description": null, "rawDescription": null, "descriptionUrls": null, "verified": null, "created": null, "followersCount": null, "friendsCount": null, "statusesCount": null, "favouritesCount": null, "listedCount": null, "mediaCount": null, "location": null, "protected": null, "linkUrl": null, "link2url": null, "profileImageUrl": null, "profileBannerUrl": null}]}'
data = json.loads(fragment)
df = pd.DataFrame([data])
df.to_csv('test_out.csv')
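Note: the example data was fixed for this example. The changes:
- the " characters in 'source' are properly escaped
- \n is escaped as \\n (it could also stay \n, but I assume you don't want line breaks in the csv)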
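To illustrate why the newline change matters, here is a minimal sketch using only the standard json module. The shortened strings and the try_parse helper are made up for illustration; they are not part of the original data or code.

```python
import json

# In a normal (non-raw) Python literal, "\n" is a real newline character,
# which JSON does not allow unescaped inside a string value.
broken = '{"content": "mycontent\nmorecontent"}'
# "\\n" leaves a backslash followed by "n" in the string, a valid JSON escape.
fixed = '{"content": "mycontent\\nmorecontent"}'

def try_parse(s):
    """Return the parsed object, or None if s is not valid JSON (hypothetical helper)."""
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        return None

broken_result = try_parse(broken)   # parsing fails on the raw control character
fixed_result = try_parse(fixed)     # parsing succeeds
# strict=False tells json.loads to accept control characters inside strings.
lenient_result = json.loads(broken, strict=False)

print(broken_result)
print(fixed_result)
print(lenient_result)
```

Note that in both successful cases the decoded value contains a real line break, so the csv field will still span two lines unless you replace it after parsing.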
If results is a list of such entries:
import json
import pandas as pd
results = get_results_somewhere()
df = pd.DataFrame([json.loads(r) for r in results])
df.to_csv('test_out.csv')
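With this approach, nested objects such as "user" end up as a single dict-valued cell. If you would rather have one column per nested key, pandas offers json_normalize for that; the sketch below shows the same idea with only the standard library. The flatten helper, the dotted column names, and the shortened sample entries are assumptions made for illustration.

```python
import csv
import io
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dotted keys; lists are kept as JSON text."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat

# Hypothetical, shortened stand-ins for the entries in `results`.
sample_results = [
    '{"url": "https://url.com", "id": 123, "user": {"username": "somename", "id": 123}, "outlinks": []}',
    '{"url": "https://url.com/2", "id": 124, "user": {"username": "other", "id": 5}, "outlinks": []}',
]

rows = [flatten(json.loads(line)) for line in sample_results]

# Collect column names in first-seen order, since entries may not share all keys.
columns = []
for row in rows:
    for name in row:
        if name not in columns:
            columns.append(name)

# Write to an in-memory buffer here; a real file opened with newline="" works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Lists such as "media" are written as JSON text in a single cell here; whether they should be expanded further depends on what you want to do with the csv.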
If the errors in your input are limited to the ones described above, you can fix them like this:
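def fix_input(s):
    # Needs the third-party regex module (import regex) for variable-length lookbehind.
    # Innermost call: escape raw newlines; middle: drop backslashes inside <>;
    # outermost: escape every " inside <>.
    return regex.sub('(?<=<[^>]*?)(")', r'\\"', regex.sub(r'(?<=<[^>]*?)(\\)', '', regex.sub('\n', '\\\\n', s)))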
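This first removes the escaping from any \" that was already escaped inside <>, then replaces every " inside <> with \", and it also 'fixes' the newlines. If you don't understand why the regexes work the way they do, that is probably a separate question.
Putting it all together: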
import json
import regex
import pandas as pd

def fix_input(s):
    return regex.sub('(?<=<[^>]*?)(")', r'\\"', regex.sub(r'(?<=<[^>]*?)(\\)', '', regex.sub('\n', '\\\\n', s)))

results = get_results_somewhere()
# fix_input works on a single string, so apply it to each entry of the list
fixed_results = [fix_input(r) for r in results]
df = pd.DataFrame([json.loads(r) for r in fixed_results])
df.to_csv('test_out.csv')
Note: this uses the third-party regex module instead of re, because it relies on variable-length lookbehind.
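If installing regex is not an option, a similar effect can probably be had with the standard re module by matching each <...> tag as a whole and rewriting its contents in a callback. This is a different technique from the lookbehind above, not a drop-in equivalent: it assumes every tag is closed by a > and contains no > of its own. The name fix_input_re and the shortened sample entry are made up for illustration.

```python
import json
import re

def fix_input_re(s):
    """Escape raw newlines, then normalise the quotes inside each <...> tag."""
    # Replace real newline characters with the two-character JSON escape \n.
    s = s.replace("\n", "\\n")

    def fix_tag(match):
        tag = match.group(0)
        # Drop existing backslashes so already-escaped quotes are not escaped twice,
        # then escape every double quote inside the tag.
        return tag.replace("\\", "").replace('"', '\\"')

    return re.sub(r"<[^>]*>", fix_tag, s)

# Hypothetical, shortened entry with the same two problems as the sample data.
broken = '{"content": "a\nb", "source": "<a href="mysource.com" rel="something">Sometext</a>"}'
parsed = json.loads(fix_input_re(broken))
print(parsed["source"])
```

As with the regex version, this only repairs the two specific problems seen in the sample; other kinds of malformed input would still fail in json.loads.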