How do I parse data without using positional indexes when some values have different lengths?
I need to parse this data so that each value in the data_to_parse column ends up in its own column.
userid data_to_parse
0 54f3ad9a29ada "value":"N;U;A7;W"}]
1 54f69f2de6aec "value":"N;U;I6;W"}]
2 54f650f004474 "value":"Y;U;A7;W"}]
3 54f52e8872227 "value":"N;U;I1;W"}]
4 54f64d3075b72 "value":"Y;U;A7;W"}]
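For example, the four additional columns for the first entry would hold the values "N", "U", "A7", and "W". I first tried slicing by position, like this: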
parsing_df['value_one'] = parsing_df['data_to_parse'].str[9:10]
parsing_df['value_two'] = parsing_df['data_to_parse'].str[11:12]
parsing_df['value_three'] = parsing_df['data_to_parse'].str[13:15]
parsing_df['value_four'] = parsing_df['data_to_parse'].str[16:17]
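This works well, except that some values have a different length (rows 937 and 938, for example), which shifts everything after them: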
935 54f45edd13582 "value":"N;U;A7;W"}] N U A7 W
936 54f4d55080113 "value":"N;C;A7;L"}] N C A7 L
937 54f534614d44b "value":"N;U;U;W"}] N U U; "
938 54f383ee53069 "value":"N;U;U;W"}] N U U; "
939 54f40656a4be4 "value":"Y;U;A1;W"}] Y U A1 W
940 54f5d4e063d6a "value":"N;U;A4;W"}] N U A4 W
Does anyone have a solution that doesn't rely on hard-coded positions?
Thanks for your help!
A relatively simple way to solve this:
txt = """54f45edd13582 "value":"N;U;A7;W"}]
54f4d55080113 "value":"N;C;A7;L"}]
54f534614d44b "value":"N;U;U;W"}]
54f383ee53069 "value":"N;U;U;W"}]
54f40656a4be4 "value":"Y;U;A1;W"}]
54f5d4e063d6a "value":"N;U;A4;W"}]
"""
import pandas as pd
txt = txt.replace('}','').replace(']','').replace('"','') # first, clean up the data
# then, collect your data (a list comprehension would also work, but I prefer this):
rows = []
for line in txt.splitlines():
    # splitting on 'value:' works whether a tab or a space precedes it;
    # depending on your actual data, you may need a different separator
    userid, values = line.split('value:')
    # the userid keeps its trailing whitespace, so strip it before storing
    rows.append([userid.strip()] + values.split(';'))
# define your columns
columns = ['userid','value_one','value_two','value_three','value_four']
# finally, create your dataframe:
df = pd.DataFrame(rows, columns=columns)
print(df)
Output (please excuse the formatting):
userid value_one value_two value_three value_four
0 54f45edd13582 N U A7 W
1 54f4d55080113 N C A7 L
2 54f534614d44b N U U W
3 54f383ee53069 N U U W
4 54f40656a4be4 Y U A1 W
5 54f5d4e063d6a N U A4 W
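If the data is already in a DataFrame like the parsing_df from the question, the same idea (split on the ';' delimiter rather than on character positions) can probably be done without going through raw text. The following is only a sketch: the two sample rows are copied from the question, the regular expression assumes the payload always sits between the quotes after "value":, and it assumes every row holds exactly four values. The names payload, parts, and result are made up for illustration.

```python
import pandas as pd

# Small stand-in for the question's parsing_df (two rows copied from the post)
parsing_df = pd.DataFrame({
    "userid": ["54f45edd13582", "54f534614d44b"],
    "data_to_parse": ['"value":"N;U;A7;W"}]', '"value":"N;U;U;W"}]'],
})

# Capture whatever sits between the quotes that follow "value":
payload = parsing_df["data_to_parse"].str.extract(r'"value":"([^"]*)"', expand=False)

# Split on the delimiter, so the length of each value no longer matters
parts = payload.str.split(";", expand=True)

# Assumption: every row has exactly four values; otherwise this assignment fails
parts.columns = ["value_one", "value_two", "value_three", "value_four"]

result = parsing_df[["userid"]].join(parts)
print(result)
```

Because the split is driven by the delimiter, a row such as "N;U;U;W" lands in the same four columns as "N;U;A7;W".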