Extracting Quantitative Information from Strings
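I am analyzing the Open Food Facts dataset.
The dataset is very messy and has a column called 'quantity' that contains entries like these:
'100 g',
'5 oz (142 g)',
'12 oz',
'200 g',
'12 oz (340 g)',
'10 f oz (296ml)',
'750 ml',
'1l',
'250 ml',
'8 oz',
'10.5 oz (750 g)',
'1 gallon (3.78 l)',
'27 oz (1 lb 11 oz) 765 g',
'75 cl',
As you can see, the values and units of measurement are all over the place! Sometimes the quantity is even given in two different measurements...
My goal is to create a new column 'quantity_in_g' in my pandas DataFrame, where I extract the information from the string and create an integer value for the number of grams based on the 'quantity' column.
So if the quantity column says '200 g' I want the integer 200, and if it says '1 kg' I want the integer 1000. I also want to convert the other units of measurement to grams: for '2 oz' I want the integer 56, and for 1 L I want 1000.
Can someone help me convert this column?
Thanks in advance!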
raw_data_lst = ['100 g ', '5 oz (142 g)', '12 oz', '200 g ', '12 oz (340 g)', '10 f oz (296ml)', '750 ml', '1 l', '250 ml', '8 OZ']
# '10 f oz (296ml)': I don't know what 'f' is.
# If there is more data like this, loop over gram_conv_dict.keys() to find the unit
# instead of slicing at a fixed position as done below.

in_grams_colm = []
gram_conv_dict = {
    'g': 1,
    'oz': 28.3495,
    'kg': 1000,
    'l': 1000,  # assuming 1 litre of water --> grams
}
# ml --> g is tricky as density varies


def convert2num(string_num):
    # Parse as int when possible, otherwise as float (e.g. '10.5').
    try:
        return int(string_num)
    except ValueError:
        return float(string_num)


def get_in_grams(unit):
    # Look up grams per unit; unknown units print a warning and use a factor of 1.
    try:
        return gram_conv_dict[unit.lower()]
    except KeyError:
        print("don't know how much grams is present in 1", unit + '.')
        return 1


for data in raw_data_lst:
    i = 0
    quantity_str = ''
    quantity_num = 0
    # Collect the leading digits and decimal points; stop at the first other character.
    while i < len(data):
        if data[i] in '0123456789.':
            quantity_str += data[i]
        else:
            # Most abbreviations have at most 2 characters, hence data[i+1:i+3];
            # you could also pass the whole data[i+1:] and check directly
            # whether the key exists in gram_conv_dict.
            break
        i += 1
    # Assumes every entry has the format: number, one space, abbreviation of length <= 2.
    quantity_num = convert2num(quantity_str) * get_in_grams(data[i+1:i+3].strip())
    in_grams_colm.append(quantity_num)  # use int(quantity_num) if you only want integers

# print(in_grams_colm)


def nice_print():
    for value in in_grams_colm:
        print('{:.2f}'.format(value))


nice_print()
'''
output
don't know how much grams is present in 1 f.
don't know how much grams is present in 1 ml.
don't know how much grams is present in 1 ml.
100.00
141.75
340.19
200.00
340.19
10.00
750.00
1000.00
250.00
226.80'''
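The fixed-position slicing above breaks on entries such as '1l' or '10 f oz (296ml)'. Below is a minimal sketch of a regex-based variant, not a definitive solution. The names GRAMS_PER_UNIT, QUANTITY_RE and quantity_to_grams are made up for this example, and the volume factors (ml, cl, l) assume the density of water, which is only an approximation for real foods. It returns the first number/unit pair whose unit it recognises and truncates to an integer, so that '2 oz' gives 56 as asked in the question.

```python
import re

# Grams per unit. The volume units (ml, cl, l) ASSUME the density of water
# (1 ml ~ 1 g); that is an approximation, adjust it for your products.
GRAMS_PER_UNIT = {
    "g": 1.0,
    "kg": 1000.0,
    "oz": 28.3495,
    "lb": 453.592,
    "ml": 1.0,
    "cl": 10.0,
    "l": 1000.0,
}

# A number with an optional decimal part, optional whitespace, then letters.
QUANTITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([a-z]+)")


def quantity_to_grams(text):
    """Return the first recognised quantity in `text` as whole grams, else None."""
    if not isinstance(text, str):
        return None  # e.g. NaN or None in a messy column
    for number, unit in QUANTITY_RE.findall(text.lower()):
        if unit in GRAMS_PER_UNIT:
            # int() truncates, so 2 oz (56.699 g) becomes 56.
            return int(float(number) * GRAMS_PER_UNIT[unit])
    return None  # no known unit found


for sample in ["200 g", "1 kg", "2 oz", "1l", "75 cl", "10 f oz (296ml)"]:
    print(repr(sample), "->", quantity_to_grams(sample))
```

For the pandas column from the question, something like `df['quantity_in_g'] = df['quantity'].map(quantity_to_grams)` should work. Note that '5 oz (142 g)' yields 141 here because the ounce value comes first; if you trust the gram value more, check for a 'g' match before falling back to the other units.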