Is there a way to convert/standardize text into an integer in Python?
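I have a DataFrame with a column that shows the time (in minutes) spent organizing each inventory item. The goal is to show the minutes spent as an integer or float. However, the values in this column are not clean; see some examples below. Is there a way to standardize everything and convert it to an integer or float? (For example, 10 hours should be 600 minutes.)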
import pandas as pd
df1 = { 'min':['420','450','480','512','560','10 hours', '10.5 hours',
'420 (all inventory)','3h ', '4.1 hours', '60**','6h', '7hours ']}
df1=pd.DataFrame(df1)
The desired output looks like this:
I solved this kind of problem with regex.
import re
import numpy as np
import pandas as pd

df1 = {'min': ['420', '450', '480', '512', '560', '10 hours', '10.5 hours',
               '420 (all inventory)', '3h ', '4.1 hours', '60**', '6h', '7hours ']}
df1 = pd.DataFrame(df1)

# Empty NumPy array that receives the parsed value for each row index
arr = np.zeros(df1.shape[0])
# Iterate over a copy so the original column stays untouched
df1_copy = df1.copy()
for i, j in df1_copy.iterrows():
    if "h" in j["min"]:
        # Hour values: strip letters, parentheses and whitespace, then convert to minutes
        j["min"] = re.sub(r"[a-zA-Z()\s]", "", j["min"])
        j["min"] = float(j["min"])
        arr[i] = float(j["min"] * 60)
    else:
        # Minute values: strip letters, parentheses, asterisks and whitespace
        j["min"] = re.sub(r"[a-zA-Z()*\s]", "", j["min"])
        j["min"] = float(j["min"])
        arr[i] = float(j["min"])
df1["min_clean"] = arr
print(df1)
                    min  min_clean
0                   420      420.0
1                   450      450.0
2                   480      480.0
3                   512      512.0
4                   560      560.0
5              10 hours      600.0
6            10.5 hours      630.0
7   420 (all inventory)      420.0
8                   3h       180.0
9             4.1 hours      246.0
10                 60**       60.0
11                   6h      360.0
12              7hours       420.0
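If the loop turns out to be slow on a larger frame, the same idea can probably be expressed without iterating row by row. The following is only a sketch, not part of the answer above: it assumes every value starts with a number and that hour values have an "h" directly after that number (optionally separated by spaces). The column names `value` and `unit` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'min': ['420', '450', '480', '512', '560', '10 hours', '10.5 hours',
                            '420 (all inventory)', '3h ', '4.1 hours', '60**', '6h', '7hours ']})

# Extract the leading number and an optional "h" unit marker in one pass.
# Assumption: the number always comes first; anything after it is ignored
# unless it begins with "h".
parts = df1['min'].str.extract(r'^\s*(?P<value>\d*\.?\d+)\s*(?P<unit>h)?')

numbers = parts['value'].astype(float)
# Keep the number as-is where no unit was found (already minutes),
# otherwise treat it as hours and multiply by 60.
minutes = numbers.where(parts['unit'].isna(), numbers * 60)

# Round to hide floating-point noise such as 4.1 * 60 = 245.99999999999997
df1['min_clean'] = minutes.round(2)
print(df1)
```

Values that do not start with a number would come out as NaN here, which at least makes them easy to spot.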
I don't know pandas yet, but this solution (using regular expressions) might help.
import re

df1 = {'min': ['420', '450', '480', '512', '560', '10 hours', '10.5 hours',
               '420 (all inventory)', '3h ', '4.1 hours', '60**', '6h', '7hours ']}

def mins(s):
    # Hour values: a number (optionally with a decimal part) followed by "h" or "hours"
    match = re.match(r"\s*(\d*\.?\d+)\s*h", s)
    if match:
        # Convert hours to minutes; float() handles any number of decimal places
        return round(float(match.group(1)) * 60)
    # Otherwise keep only the digits and treat them as minutes
    return int(re.sub(r"\D", "", s))

min_clear = map(mins, df1["min"])
print(list(min_clear))
# output: [420, 450, 480, 512, 560, 600, 630, 420, 180, 246, 60, 360, 420]
You can add min_clear to the DataFrame later.
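For what it's worth, attaching the result to a DataFrame could look roughly like the sketch below. It restates a compact variant of `mins` so the snippet runs on its own, uses a shortened sample of the data, and assumes pandas is installed; the column name `min_clean` is just an illustrative choice.

```python
import re
import pandas as pd

def mins(s):
    """Return the value in minutes as an int (compact variant for illustration)."""
    match = re.match(r"\s*(\d*\.?\d+)\s*h", s)
    if match:
        # Hour value: convert to minutes
        return round(float(match.group(1)) * 60)
    # Minute value: keep only the digits
    return int(re.sub(r"\D", "", s))

# Shortened sample of the question's data
df = pd.DataFrame({'min': ['420', '10 hours', '10.5 hours',
                           '420 (all inventory)', '3h ', '60**']})

# Series.map applies the function to each value and returns a new Series,
# which is then assigned as a new column next to the raw text.
df['min_clean'] = df['min'].map(mins)
print(df)
```

Keeping the raw column alongside the cleaned one makes it easier to check by eye whether any value was parsed incorrectly.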
By the way, I'm only a beginner; if any use case fails, please let me know and I'll try to improve it.
Thanks