How to reformat a corrupt json file with escaped ' and "?
Question
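I have a large JSON file (approx. 700.000 lines, 1.2GB file size) containing Twitter data that I need to preprocess for data and network analysis.
During data collection an error occurred: instead of using " as a separator, ' was used. Since this does not conform to the JSON standard, the file cannot be processed by R or Python.
Information about the dataset:
Roughly every 500 lines, a block starts with meta information plus meta information about the users, etc. The tweets then follow as JSON (the order of the fields is not stable), one tweet per line, each line starting with a space.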
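This is what I have tried so far:
- A simple
data.replace('\'', '\"')
is not possible, because the "text" fields contain tweets that may themselves contain ' or ".
- Using a regex, I was able to catch some of the instances, but it does not catch all of them:
re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
- Using
literal_eval(data)
from the ast package also throws an error.
Since the order of the fields and the length of each field are not stable, I am stuck on how to reformat the file so that it conforms to JSON.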
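To make the failure of the blind replacement concrete, here is a minimal sketch. The record is made up for illustration (it is not taken from the data set), and it is rendered with repr() on the assumption that the file was written as Python dict representations rather than with a JSON serializer.

```python
import json

# Hypothetical record whose text contains both kinds of quote.
record = {"text": "I don't \"like\" this", "possibly_sensitive": False}

# repr() yields the single-quoted, Python-style form seen in the file.
line = repr(record)

# Blind replacement turns every apostrophe into a double quote,
# including the one inside the tweet text.
naive = line.replace("'", '"')

try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    # The text field now ends too early, and False is not a JSON literal.
    naive_ok = False

print(line)
print(naive)
print("parses as JSON:", naive_ok)
```

Two separate problems show up here: the quotes inside the text field, and Python literals such as False, which JSON spells false.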
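Normal sample line of the data (options one and two work for this one, but note that the tweets are also in non-English languages, which use " or ' in their text):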
{'author_id': '1236888827605725186', 'entities': {'mentions': [{'start': 108, 'end': 124, 'username': 'realDonaldTrump'}], 'hashtags': [{'start': 49, 'end': 55, 'tag': 'QAnon'}, {'start': 56, 'end': 66, 'tag': 'ProudBoys'}]}, 'context_annotations': [{'domain': {'id': '10', 'name': 'Person', 'description': 'Named people in the world like Nelson Mandela'}, 'entity': {'id': '799022225751871488', 'name': 'Donald Trump', 'description': 'US President Donald Trump'}}, {'domain': {'id': '35', 'name': 'Politician', 'description': 'Politicians in the world, like Joe Biden'}, 'entity': {'id': '799022225751871488', 'name': 'Donald Trump', 'description': 'US President Donald Trump'}}], 'text': 'RT @NinjaHodon: Here’s an example of the average #QAnon #ProudBoys crackass trash that’s going to vote for @realDonaldTrump. \n\n https://t.…', 'referenced_tweets': [{'type': 'retweeted', 'id': '1315363137240010753'}], 'conversation_id': '1315441338427506689', 'id': '1315441338427506689', 'lang': 'en', 'public_metrics': {'retweet_count': 20, 'reply_count': 0, 'like_count': 0, 'quote_count': 0}, 'created_at': '20201011T23:57:09.000Z', 'source': 'Twitter for Android', 'possibly_sensitive': False}
Sample line where the reformatting causes problems:
{"users": [{"id": "437781219", "username": "HakesJon", "location": `"Wisconsin", "description": "#IndieFictionWriter. Husband. Father. Bearded.\n#BlackLivesMatter #DemilitarizeThePolice #DismantlePolicing", "name": "Jon Hakes", "created_at": "20111215T20:42:41.000Z"}, {"id": "1171947445841997824", "username": "FactNc", "location": "Under Carolina blue skies ", "description": "Defender of truth, justice and the American way. "I never give them hell. I just tell the truth and they think it\'s hell." Harry S. Truman", "name": "NCFactFinder", "created_at": "20190912T00:44:21.000Z"}, {"id": "315041625", "username": "o0rimbuk0o", "description": "Your desire to put pronouns here is not my issue. Get help.\n\n#resist #notmypresident\n#FBiden", "name": "Sick of it", "created_at": "20110611T06:16:11.000Z"}, {"id": "3141427487", "username": "theGeekSheek", "description": "I don't believe in your God. Don't tell me he hates me.", "name": "Chic Geek", "created_at": "20150406T18:34:45.000Z"}, {"id": "1084112678", "username": "KarinBorjeesson", "description": "Love to help people & animals in need. Love music. Fucking hate racists. 
#Anon #OpExposeCPS #BLM #FreePalestine #Yemen #OpSerenaShim #Animalrights #NoDAPL", "name": "AnonyMISSKarin", "created_at": "20130112T20:57:28.000Z"}, {"id": "1003712866011308033", "username": "persian_pesar", "description": "\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200fبه ستواری و سختی رشک پولاد/\nبه راه عشق سرها داده بر باد/\nقرین بیستون هم\u200cسنگ فرهاد/\nز کرمانشاهیان یاد اینچنین باد\n\u200e#Civil_Environment_Engineer", "name": "persianpesar\u200d", "created_at": "20180604T18:59:30.000Z"}, {"id": "814795859644809217", "username": "Aazadist", "description": "\u200f\u200e#Equality\n\u200e#Humanity\nخواهی نشوی همرنگ ، رسوای جماعت شو", "name": "Aazad ️\u200d آزاد", "created_at": "20161230T11:30:45.000Z"}, {"id": "790375699638915072", "username": "Isaihstewart", "location": "Los Angeles, CA", "description": "Part time assistant manager at “Sheets and Things”", "name": "Dey got the henessey ", "created_at": "20161024T02:13:46.000Z"}, {"id": "4846243708", "username": "williamvercetti", "location": "Virginia Beach, VA", "description": "vma. art. modelo papi. tpain to the dms.", "name": "William Vercetti", "created_at": "20160125T17:21:50.000Z"}, {"id": "1160723882", "username": "k_cawsey", "location": "Halifax, Nova Scotia", "description": "Chaucer, Malory, Arthur Tolkien. @Dal_English", "name": "Dr. Kathy Cawsey", "created_at": "20130208T17:15:30.000Z"}, {"id": "3789298943", "username": "solomonesther17", "location": "Lagos, Nigeria", "description": "FairBib Legal Practitioners", "name": "Esther Solomon", "created_at": "20150927T04:52:29.000Z"}, {"id": "14860380", "username": "Dejify", "location": "San Francisco", "description": "The Nigerian State is a festering boil that the world can't afford to ignore. 
Because, when it pops, its rancid ooze won't be pleasant nor easy to contain.", "name": "Buhari: Uber Ment (Dèjì Akọ́mọláfẹ́)", "created_at": "20080521T18:57:27.000Z"}, {"id": "1120883223070773248", "username": "Donna780780", "description": "", "name": "Donna Swidley", "created_at": "20190424T02:52:40.000Z"}, {"id": "1253742908487929858", "username": "Neros_sis", "location": "Florida", "description": "", "name": "@Nero's Fiddle GOP has a terrorism problem", "created_at": "20200424T17:50:00.000Z"}, {"id": "585090491", "username": "vickierae562", "location": "The LBC", "description": "That’s Right, I’m a Lefty and I don’t feed trolls! #resist #DumpTrump #DitchMitch #LooseLindsey", "name": "Vickie Rae", "created_at": "20120519T21:00:28.000Z"}, {"id": "1262122532607574022", "username": "EmilySi49944255", "description": "", "name": "Skylar Aubrey", "created_at": "20200517T20:47:34.000Z"}, {"id": "1401663176", "username": "mdeHummelchen", "location": "Tief im Westen", "description": "Pflegewissenschaftlerin,Pflegeberaterin,Dozentin,Lächeln und winken...Pro Pflegekammer", "name": "Madame Hummelchen ", "created_at": "20130504T07:44:32.000Z"}, {"id": "2381808114", "username": "mommy97giraffe", "location": "Antifa HQs/Mom Division Office", "description": "Follower of Jesus, Mennonite mom&wife, lover of books, world, peo, poetry&art. 
6 autoimmunes&fibroie Proud Mama Bear of 1gayD & 1pan&autistic son, in 20s", "name": "Mennonite Mom(she/her)", "created_at": "20140310T08:51:02.000Z"}, {"id": "2362182011", "username": "rd2glry", "location": "Washington, DC", "description": "", "name": "ateachr", "created_at": "20140224T04:07:21.000Z"}, {"id": "974917494870700032", "username": "GiraffeOld", "location": "Arizona, USA", "description": "", "name": "old man giraffe", "created_at": "20180317T07:56:58.000Z"}, {"id": "830939480", "username": "redz041", "description": "", "name": "Jan Mouzone", "created_at": "20120918T12:18:36.000Z"}, {"id": "3346032292", "username": "kumccaig44", "description": "", "name": "Katrine McCaig", "created_at": "20150625T21:25:21.000Z"}, {"id": "80630279", "username": "LuluTheCalm", "location": "Green Grass & Puddles, Canada", "description": "Mischief in My Eyes & Adventure in My Soul. \nLet's Have a Laugh &, you know, Make the World a Better Place. \nAus/Brit/Cdn", "name": "Lulu #BeKindBeCalmBeSafe ", "created_at": "20091007T17:26:56.000Z"}, {"id": "3252437864", "username": "engelhardterin", "location": "Houston, TX || Lubbock, TX", "description": "24 || Texas Tech || ♀️ || she/her", "name": "Erin Engelhardt", "created_at": "20150622T07:26:28.000Z"}, {"id": "93797267", "username": "mcbeaz", "location": "he/him", "description": "black lives matter.", "name": "mike", "created_at": "20091201T05:28:58.000Z"}, {"id": "2585773107", "username": "michiganington", "location": "Washington, D.C. ", "description": "", "name": "Allyoop", "created_at": "20140606T02:12:33.000Z"}, {"id": "27857135", "username": "JackRayher", "location": "Northport, NY", "description": "Senior Marketing Executive\nLifelong Democrat\n#BidenHarris", "name": "Jack Rayher", "created_at": "20090331T12:12:03.000Z"}, {"id": "1078457644736827392", "username": "RobertCooper58", "description": "Bilingual community advocate. Father of five wonderful kids. Lifelong progressive and proud member of @TheDemCoalition. 
Early supporter of President @JoeBiden.", "name": "Robert Cooper ", "created_at": "20181228T01:08:34.000Z"}, {"id": "206860139", "username": "MariaArtze", "location": "Münster, Deutschland", "description": "Nas trincheiras da ESO\nEmigrante a medio retornar. Womansplainer.\n(Sie vostede)\n\nTrans rights are human rights.", "name": "A Malvada Profe mediovacinada", "created_at": "20101023T22:27:26.000Z"}, {"id": "2903906123", "username": "lm1067", "location": "London, England", "description": "B A FINE ARTIST GRADUATED", "name": "Luis Pais", "created_at": "20141203T15:53:10.000Z"}, {"id": "64119853", "username": "IAM_SHAKESPEARE", "location": "Tweeting from the Grave", "description": "This bot has tweeted the complete works of Shakespeare (in order) 5 times over the last 12years. On hiatus for a bit. Created by @strebel", "name": "Willy Shakes", "created_at": "20090809T05:41:08.000Z"}, {"id": "3176623941", "username": "acastellich", "location": "Chicago, Il.", "description": "Abogado,Restaurantero,Immigrant , UVM. AD1 IPADE MBA. Restaurant Hospitality Industry, Chicago IL.", "name": "Alejandro Castelli", "created_at": "20150417T13:23:17.000Z"}, {"id": "782765390925533185", "username": "Diane_L_Espo", "location": "Florida, USA", "description": "", "name": "DianeEspo ", "created_at": "20161003T02:13:07.000Z"}, {"id": "67471020", "username": "thedcma", "location": "Fort Lauderdale, FL", "description": " Style is the only substance I abuse. I’m just a Gay Hillbilly Warlock Riding a \u200dVaporwave Fever Dream #blacklivesmatter", "name": "Grace Kelly on Steiroids", "created_at": "20090821T00:32:37.000Z"}, {"id": "78797635", "username": "graciosodiablo", "description": "Too much of a good thing can be bad. So too little of a bad thing must be good. 
160 characters or less of me should be perfect.", "name": "gracioso diabloint", "created_at": "20091001T03:59:16.000Z"}, {"id": "268314713", "username": "philppedurand", "location": "Auxerre", "description": "Je suis une personne gentille je milite pour la PMA. je suis militant communiste je suis aussi à l’association des Rosoirs je suis conseillé quartier", "name": "Philippe durand", "created_at": "20110318T14:37:36.000Z"}, {"id": "37996028", "username": "nicrawhide", "location": "Pinconning Michigan ", "description": "Just your average small town gay with big town sensibility!!", "name": "Nicholas Bean", "created_at": "20090505T19:20:37.000Z"}, {"id": "1236656342674407427", "username": "LadyJayPersists", "location": "Valhalla", "description": "USN Veteran | Shieldmaiden | Mom | Not here for a man, I have one | PTSD Warrior | My mind is a beautiful servant to a dangerous master", "name": "Jax", "created_at": "20200308T14:13:48.000Z"}, {"id": "171183306", "username": "dawndawnB", "location": "United States", "description": "Mrs. B, mother of 2 amazing kids, Substance Abuse Counselor, Volunteer, Music Lover. Born in DC but a VA Lo❤️er!", "name": "nwad", "created_at": "20100726T19:21:24.000Z"}, {"id": "817247846751555587", "username": "me2020_2021", "location": "Brisbane, Queensland", "description": "Proud Aussie, living a wonderful life with my wife, Australian Cricket , \U0001f9ae Alex", "name": "️\u200d "A girl has no Name”', 'created_at': '20170106T05:54:05.000Z'}, {'id': '879459933988585472', 'username': 'Davecl3069', 'location': 'San Francisco Bay Area', 'description': 'proud of my views, life long learner,& hopefully, that guy!\n#LowerTheFlagForCovidVictims #VoteBlue #BLM #SupportThePlayers #LGBTQ #WeNeedToDoBetter #ResistStill', 'name': 'David', 'created_at': '20170626T22:02:42.000Z'}`
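Code used
Regex:
import regex as re  # third-party module; the standard re module rejects (*SKIP)(*FAIL)

def convert_to_json(file):
    with open(file, "r", encoding="utf-8") as f:
        x = f.read()
    x = x.replace("-", "")
    # skip everything already inside double quotes, match the remaining '
    rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
    decoded = rx.sub('"', x)
    return decoded
literal_eval:
from ast import literal_eval
import json

def open_json():
    with open("data.json", "r", encoding="utf-8") as f:
        raw = f.read()  # literal_eval needs the string, not the file object
    data = literal_eval(raw)  # fails: the file holds one literal per line, not a single one
    data = json.loads(str(data))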
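What I want to achieve
- Reformat the data so that it conforms to JSON (this question), in order to be able to
- build a dataframe from the relevant tweet text, user information and metadata (secondary goal) for further analysis.
Thanks in advance for any suggestions! :)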
If the ' characters that cause problems occur only in the tweets and the descriptions, you can try this:
pre_tweet = "'text': '"
post_tweet = "', 'referenced_tweets':"
with open(file, encoding="utf-8") as f:
    data = f.readlines()
output = []
errors = []
for line in data:
    if pre_tweet in line and post_tweet in line:
        # split only once on each marker so the tweet text stays in one piece
        first_part, rest = line.split(pre_tweet, 1)
        tweet, last_part = rest.split(post_tweet, 1)
        # swap the quotes everywhere except inside the tweet; use new names
        # so the two markers are not overwritten for the next line
        head = first_part.replace('\'', '\"') + pre_tweet.replace('\'', '\"')
        tail = post_tweet.replace('\'', '\"') + last_part.replace('\'', '\"')
        output.append(head + tweet + tail)
    else:
        errors.append(line)
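If errors is not empty, it is either because the line contains no tweet (you can change the code slightly to add such lines to the output as well),
or because what follows the tweet is not 'referenced_tweets'. In the second case, try to work out which keys can follow instead and modify the code above to handle several post_tweet markers.
You can then do the same for the descriptions by changing pre_tweet and post_tweet to whatever usually comes before and after a description.
The number of keys that can follow a tweet or description has to be finite, so it may take a while to find all the possibilities, but in the end you should succeed.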
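The idea of several post_tweet markers could be sketched roughly as follows. The function name requote_line and the marker list are assumptions made for illustration; the real list has to be collected from the data. The sketch also assumes that the text value itself is single-quoted and that apostrophes outside the text field are never part of a value.

```python
import json

PRE = "'text': '"
# Hypothetical list of keys that may directly follow the text field.
POST_MARKERS = ["', 'referenced_tweets':", "', 'conversation_id':", "', 'id':"]


def requote_line(line, pre=PRE, post_markers=POST_MARKERS):
    """Swap ' for " outside the tweet text; return None if no marker fits."""
    if pre not in line:
        return None
    head, rest = line.split(pre, 1)

    # Use the marker that occurs first after the start of the tweet text.
    hits = [(rest.find(marker), marker) for marker in post_markers if marker in rest]
    if not hits:
        return None
    pos, post = min(hits)
    tweet, tail = rest[:pos], rest[pos + len(post):]

    # Inside the tweet: drop Python's \' escape and escape bare double quotes.
    tweet = tweet.replace("\\'", "'").replace('"', '\\"')

    def swap(part):
        return part.replace("'", '"')

    return swap(head) + swap(pre) + tweet + swap(post) + swap(tail)


sample = ("{'author_id': '1', 'text': 'He said \"hi\" and it\\'s fine', "
          "'referenced_tweets': [{'type': 'retweeted', 'id': '2'}], 'lang': 'en'}")
print(json.loads(requote_line(sample))["text"])
```

Lines for which the function returns None can be collected in the errors list as before. Python literals such as False, True and None are not touched by this sketch and would still need to be mapped to their JSON spellings.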
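So I figured out a way to handle the corrupted data.
The solution can be found here.
Using ast.literal_eval(input_string)
lets me read each corrupt json line into a dictionary. The only thing to watch is that the input string contains no leading or trailing whitespace, commas, etc.
Using ast.literal_eval():
Sample code for reading the data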
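from ast import literal_eval

dictlist = []
with open("inputdata.json", "r", encoding="utf-8") as f:
    for line in f:
        # use the line from the loop; calling f.readline() here as well would skip lines
        x: str = line.strip()
        if not x:
            continue  # ignore blank lines
        data = literal_eval(x)
        dictlist.append(data)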
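Building on the literal_eval approach, the parsed dictionaries can be written back out as valid JSON, one object per line. This is only a sketch under assumptions: the helper names convert_lines and write_json_lines are made up here, and stripping a trailing comma is a guess at the kind of stray character mentioned above.

```python
import json
from ast import literal_eval


def convert_lines(lines):
    """Parse Python-literal lines; return (records, failures)."""
    records, failures = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip().rstrip(",")  # drop surrounding whitespace and a trailing comma
        if not text:
            continue
        try:
            records.append(literal_eval(text))
        except (ValueError, SyntaxError) as exc:
            # Keep the line number so that bad lines can be inspected later.
            failures.append((number, str(exc)))
    return records, failures


def write_json_lines(records, path):
    """Write one JSON object per line; json.dumps takes care of the quoting."""
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record, ensure_ascii=False) + "\n")


sample = [" {'text': 'it\\'s \"ok\"', 'possibly_sensitive': False},\n", "{'broken': \n"]
records, failures = convert_lines(sample)
print(json.dumps(records[0]))
print(len(failures), "line(s) could not be parsed")
```

Passing the open file object to convert_lines keeps memory use lower than reading the whole file at once, since lines are consumed one at a time. Because json.dumps does the serialization, False becomes false and the quotes inside the text are escaped without any manual replacement.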