Modifying strings in a pandas DataFrame by row
I have the following strings in a pandas DataFrame in Python 3, in columns string1 and string2:
import pandas as pd
datainput = [
{ 'string1': 'TTTABCDABCDTTTTT', 'string2': 'ABABABABABABABAA' },
{ 'string1': 'AAAAAAAA', 'string2': 'TTAAAATT' },
{ 'string1': 'TTABCDTTTTT', 'string2': 'ABABABABABA' }
]
df = pd.DataFrame(datainput)
df
            string1           string2
0  TTTABCDABCDTTTTT  ABABABABABABABAA
1          AAAAAAAA          TTAAAATT
2       TTABCDTTTTT       ABABABABABA
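For each row, the strings in columns string1 and string2 are defined to have the same length.
For each row of the DataFrame, the strings may need to be "cleaned" of leading/trailing letters 'T'. However, within a row, both strings must have the same number of characters removed at each end, so that they remain the same length.
The correct output is as follows: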
df
    string1   string2
0  ABCDABCD  BABABABA
1      AAAA      AAAA
2      ABCD      ABAB
If these were two variables, this would be straightforward to compute with strip(), e.g.
string1 = "TTTABCDABCDTTTTT"
string2 = "ABABABABABABABAA"
length_original = len(string1)
num_left_chars = len(string1) - len(string1.lstrip('T'))
num_right_chars = len(string1.rstrip('T'))
edited = string1[num_left_chars:num_right_chars]
## print(edited)
## 'ABCDABCD'
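However, in this case one needs to iterate over all rows and redefine both strings in each row at once. How can these strings be modified row by row?
EDIT: My main confusion is that either column may contain the 'T' characters, so how do I redefine both of them?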
raw_data = {'name': ['TWillard MorrisT', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
            'age': [20, 19, 22, 21],
            'color': ['TblueT', 'redT', 'yellow', 'green'],
            'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data, columns=['name', 'age', 'color', 'grade'])
print(df)
cols = ['name', 'color']
print("new df")
# the following line does the work: strip leading and trailing 'T' from the selected columns
df[cols] = df[cols].apply(lambda row: row.str.lstrip('T').str.rstrip('T'), axis=1)
print(df)
This prints:
               name  age   color  grade
0  TWillard MorrisT   20  TblueT     88
1       Al Jennings   19    redT     92
2      Omar Mullins   22  yellow     95
3  Spencer McDaniel   21   green     70
new df
               name  age   color  grade
0    Willard Morris   20    blue     88
1       Al Jennings   19     red     92
2      Omar Mullins   22  yellow     95
3  Spencer McDaniel   21   green     70
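As a side note, the same per-column stripping can probably be written without a row-wise lambda. The following is only a minimal sketch, assuming the selected columns hold plain strings; the two-row frame and the name `stripped` are made up for illustration. Note that it strips each value independently, so by itself it does not keep two columns at equal length.

```python
import pandas as pd

# Illustrative data only (a subset of the frame shown above).
df = pd.DataFrame({
    'name': ['TWillard MorrisT', 'Al Jennings'],
    'color': ['TblueT', 'redT'],
})
cols = ['name', 'color']

# Column-wise apply: each column is a Series, and Series.str.strip('T')
# removes leading and trailing 'T' characters from every value.
stripped = df[cols].apply(lambda col: col.str.strip('T'))
print(stripped)
```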
A bit verbose, but it gets the job done:
import re

def count_head(s):
    head = re.findall('^T+', s)
    if head:
        return len(head[0])
    return 0

def count_tail(s):
    tail = re.findall('T+$', s)
    if tail:
        return len(tail[0])
    return 0

df1 = df.copy()
df1['st1_head'] = df1['string1'].apply(count_head)
df1['st2_head'] = df1['string2'].apply(count_head)
df1['st1_tail'] = df1['string1'].apply(count_tail)
df1['st2_tail'] = df1['string2'].apply(count_tail)
df1['length'] = df1['string1'].str.len()

def trim_strings(row):
    head = max(row['st1_head'], row['st2_head'])
    tail = max(row['st1_tail'], row['st2_tail'])
    l = row['length']
    return {'string1': row['string1'][head:(l-tail)],
            'string2': row['string2'][head:(l-tail)]}

new_df = pd.DataFrame(list(df1.apply(trim_strings, axis=1)))
print(new_df)
Output:
    string1   string2
0  ABCDABCD  BABABABA
1      AAAA      AAAA
2      ABCD      ABAB
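The head and tail counts can also be obtained without regular expressions, by comparing string lengths before and after stripping. This is a rough sketch rather than a definitive implementation: it assumes both columns hold strings of equal length per row, and the names `lengths`, `lead`, `trail`, `head`, `end` and `trimmed` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'string1': ['TTTABCDABCDTTTTT', 'AAAAAAAA', 'TTABCDTTTTT'],
    'string2': ['ABABABABABABABAA', 'TTAAAATT', 'ABABABABABA'],
})
cols = ['string1', 'string2']

# Length of every string, and how many 'T's sit at each end of it.
lengths = df[cols].apply(lambda c: c.str.len())
lead = lengths - df[cols].apply(lambda c: c.str.lstrip('T').str.len())
trail = lengths - df[cols].apply(lambda c: c.str.rstrip('T').str.len())

# Per row, cut as much as the "worst" column needs on each side.
head = lead.max(axis=1)
end = lengths['string1'] - trail.max(axis=1)

# Slice both columns with the same start and end index per row.
trimmed = pd.DataFrame({
    c: [s[h:e] for s, h, e in zip(df[c], head, end)] for c in cols
})
print(trimmed)
```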
A more compact version:
def trim(st1, st2):
    l = len(st1)
    head = max(len(st1) - len(st1.lstrip('T')),
               len(st2) - len(st2.lstrip('T')))
    tail = max(len(st1) - len(st1.rstrip('T')),
               len(st2) - len(st2.rstrip('T')))
    return (st1[head:(l-tail)],
            st2[head:(l-tail)])

new_df = pd.DataFrame(list(
    df.apply(lambda r: trim(r['string1'], r['string2']),
             axis=1)), columns=['string1', 'string2'])
print(new_df)
The key thing to note is df.apply(<your function>, axis=1), which lets you run any function on each row (in this case operating on both columns at once).
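To illustrate the row-wise apply, here is a small self-contained sketch that returns both trimmed strings as a tuple and lets pandas expand them into columns through the `result_type='expand'` argument of `DataFrame.apply`. The helper name `trim_pair` and its `ch` parameter are hypothetical, chosen only for this example, and the sketch assumes the two strings in a row have equal length.

```python
import pandas as pd

def trim_pair(st1, st2, ch='T'):
    """Trim both strings by the larger leading/trailing run of `ch`."""
    head = max(len(st1) - len(st1.lstrip(ch)),
               len(st2) - len(st2.lstrip(ch)))
    tail = max(len(st1) - len(st1.rstrip(ch)),
               len(st2) - len(st2.rstrip(ch)))
    end = len(st1) - tail
    return st1[head:end], st2[head:end]

df = pd.DataFrame({
    'string1': ['TTTABCDABCDTTTTT', 'AAAAAAAA', 'TTABCDTTTTT'],
    'string2': ['ABABABABABABABAA', 'TTAAAATT', 'ABABABABABA'],
})

# axis=1 passes one row at a time; result_type='expand' turns the
# returned tuple into two columns (labelled 0 and 1 by default).
out = df.apply(lambda r: trim_pair(r['string1'], r['string2']),
               axis=1, result_type='expand')
out.columns = ['string1', 'string2']
print(out)
```

Returning a tuple and expanding it avoids building an intermediate list of rows, though for large frames a plain loop over zipped columns may well be faster than a row-wise apply.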