Deleting characters in arrays while converting rows to integers in a DataFrame
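I have a DataFrame in which each row of ColumnA holds an array.
The arrays differ in length from row to row.
I am trying to run the code below: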
for i in range(len(df)):
    df['ColumnA'][i] = df['ColumnA'][i].astype(int)
I get an error message because some rows contain string values such as "64B" and "64A".
For example,
df['ColumnA'][37]
Output:
array(['34', '35', '36', '38', '39', '40', '41', '56', '58', '59', '60',
'61', '62', '62A', '62B', '63', '64', '65', '88', '90', '94', '98'],
dtype='<U3')
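You can easily spot the "62A" and "62B" string values.
My goal is to convert all the strings to int (for example, from 62A to 62) by removing the characters that follow the number.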
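You can use the sub function from the built-in regex module (re) to replace every non-digit character (\D matches non-digits in a regular expression) with an empty string "".
A list comprehension lets you apply the same operation to every item in a list.
The last component you need is the pandas apply method, which applies the same function to every item in a Series.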
import re
import pandas as pd

# sample data
df = pd.Series([
    [61, '62a', 70, 'z8z8z'],
    [61, '62hello', 70],
], name='columnA').to_frame()

# show sample data
df
#> columnA
#> 0 [61, 62a, 70, z8z8z]
#> 1 [61, 62hello, 70]

# select the column and apply the function to each element
df.columnA.apply(
    lambda row: [
        # cast to string, remove every non-digit character, cast to int
        int(re.sub(r"\D", "", str(x)))
        for x in row  # for every item in the list
    ]
)
#> 0 [61, 62, 70, 88]
#> 1 [61, 62, 70]
#> Name: columnA, dtype: object
This leaves a list stored as each item of the Series "columnA". If you want these to be numpy arrays instead, you can wrap the list comprehension in a call to np.array.
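As a rough sketch of the np.array variant just described, the list comprehension can be moved into a small function that returns an array. The helper name strip_non_digits is hypothetical and not part of the original answer; the sample data is the same as above.

```python
import re

import numpy as np
import pandas as pd


def strip_non_digits(row):
    """Remove non-digit characters from each item and return an integer array."""
    # hypothetical helper: same regex as above, wrapped in np.array
    return np.array([int(re.sub(r"\D", "", str(x))) for x in row])


df = pd.Series([
    [61, '62a', 70, 'z8z8z'],
    [61, '62hello', 70],
], name='columnA').to_frame()

# each cell of columnA now holds a NumPy integer array instead of a list
df['columnA'] = df['columnA'].apply(strip_non_digits)
```

The column itself keeps dtype object, because each cell holds an array of a different length.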
You can do this with a list comprehension:
import re
import pandas as pd
import numpy as np

df = pd.DataFrame({'colA': [np.array(['34A', '35', '36A', '38', '39', '40', '41', '56', '58', '59', '60',
                                      '61', '62', '62A', '62B', '63', '64', '65', '88', '90', '94', '98'],
                                     dtype='<U3') for _ in range(10)]})
print('before : \n', df)
# remove every character that is not 0-9 from each element, then cast to int
df['colA'] = [[int(re.sub('[^0-9]', '', el)) for el in row] for row in df['colA']]
print('after : \n', df)
Output:
before :
colA
0 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
1 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
2 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
3 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
4 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
5 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
6 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
7 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
8 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
9 [34A, 35, 36A, 38, 39, 40, 41, 56, 58, 59, 60,...
after :
colA
0 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
1 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
2 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
3 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
4 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
5 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
6 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
7 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
8 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
9 [34, 35, 36, 38, 39, 40, 41, 56, 58, 59, 60, 6...
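One caveat with both approaches: they remove every non-digit character, so a value such as 'z8z8z' becomes 88, and a value with no digits at all would make int('') raise a ValueError. If only the leading number is wanted (as in 62A to 62), a minimal sketch could look like the following. The function leading_int, its default parameter, and the 'N/A' sample value are hypothetical illustrations, not part of the original question or answers.

```python
import re


def leading_int(value, default=None):
    """Return the leading run of digits in `value` as an int, or `default` if none."""
    # hypothetical helper: match digits only at the start of the string
    match = re.match(r"\s*(\d+)", str(value))
    return int(match.group(1)) if match else default


row = ['34', '62A', '62B', 'N/A']  # 'N/A' is an invented value with no digits
cleaned = [leading_int(x) for x in row]
print(cleaned)  # [34, 62, 62, None]
```

It can be dropped into either answer in place of the int(re.sub(...)) expression. Whether None, 0, or an exception is the right response to a value without digits depends on the data.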