Find the difference between strings for each two rows of pandas data.frame
I'm new to Python and have been struggling with this for a while.
I have a file that looks like this:
name seq
1 a1 bbb
2 a2 bbc
3 b1 fff
4 b2 fff
5 c1 aaa
6 c2 acg
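Here name is the label of each string and seq is the string itself.
I want a new column, or a new data frame, giving the number of differences between each pair of rows, with no overlap between pairs. For example, I want the number of differences between the sequences named [a1-a2], then [b1-b2], and finally [c1-c2].
So I need something like this: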
name seq diff
1 a1 bbb NA
2 a2 bbc 1
3 b1 fff NA
4 b2 fff 0
5 c1 aaa NA
6 c2 acg 2
Any help is greatly appreciated.
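It looks like you want the Jaccard distance between each pair of strings. Here's one way, using groupby and scipy.spatial.distance.jaccard (note that SciPy documents this function for boolean vectors, and that a Jaccard distance is a proportion rather than a count):
from scipy.spatial.distance import jaccard

# Group the rows by the first character of name: 'a', 'b', 'c'
g = df.groupby(df.name.str[0])
# For each pair, emit NaN for the first row and the distance for the second
df['diff'] = [sim for _, seqs in g.seq for sim in
              [float('nan'), jaccard(*map(list, seqs))]]
print(df)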
name seq diff
1 a1 bbb NaN
2 a2 bbc 1.0
3 b1 fff NaN
4 b2 fff 0.0
5 c1 aaa NaN
6 c2 acg 2.0
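If a plain count of differing positions is what is needed, the same grouping idea works without SciPy. The following is only a sketch: `count_mismatches` is a hypothetical helper, not part of pandas or SciPy, and it assumes that every group holds exactly two rows with equally long strings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["a1", "a2", "b1", "b2", "c1", "c2"],
     "seq": ["bbb", "bbc", "fff", "fff", "aaa", "acg"]},
    index=range(1, 7),
)

def count_mismatches(a, b):
    # Hypothetical helper: number of positions at which a and b differ
    return sum(x != y for x, y in zip(a, b))

# Start with NaN everywhere, then fill in the second row of each pair
diff = pd.Series(np.nan, index=df.index)
# Group the seq column by the first character of name ("a", "b", "c")
for _, seqs in df.groupby(df["name"].str[0])["seq"]:
    first, second = seqs  # assumes exactly two rows per group
    diff.loc[seqs.index[-1]] = count_mismatches(first, second)
df["diff"] = diff
print(df)
```

Grouping by the name prefix means the rows of a pair do not have to be adjacent, as long as their names share the same first character.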
As a first step, I recreated your data:
#!/usr/bin/env python3
import pandas as pd
# Setup
data = {'name': {1: 'a1', 2: 'a2', 3: 'b1', 4: 'b2', 5: 'c1', 6: 'c2'}, 'seq': {1: 'bbb', 2: 'bbc', 3: 'fff', 4: 'fff', 5: 'aaa', 6: 'acg'}}
df = pd.DataFrame(data)
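Solution
You can iterate over the data frame and compare the seq value from the previous iteration with the value from the current one. To compare two strings (stored in the seq column of the data frame), you can use a simple generator expression, as in this function:
def diff_letters(a, b):
    # Count the positions at which a and b hold different characters
    return sum(a[i] != b[i] for i in range(len(a)))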
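The helper above does not check that the two strings have the same length. A minimal sketch of a stricter variant follows; the name `hamming` is chosen here for illustration only, and raising an error on unequal lengths is an assumption about the desired behaviour.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Assumption: unequal lengths are treated as an input error
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("bbb", "bbc"))  # 1
print(hamming("fff", "fff"))  # 0
print(hamming("aaa", "acg"))  # 2
```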
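Iterating over the DataFrame rows
diff = ['NA']
row_iterator = df.iterrows()
_, last = next(row_iterator)  # take the first row as the starting point
# Iterate over the remaining rows and collect the comparison results in a list
for i, row in row_iterator:
    if i % 2 == 0:
        # even index label: second row of a pair, compare it with the previous row
        diff.append(diff_letters(last['seq'], row['seq']))
    else:
        # odd index label: first row of a pair, so append an NA value
        diff.append('NA')
    last = row
df['diff'] = diff
The result looks like this: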
name seq diff
1 a1 bbb NA
2 a2 bbc 1
3 b1 fff NA
4 b2 fff 0
5 c1 aaa NA
6 c2 acg 2
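The loop above relies on the index labels running 1, 2, 3, ... to tell the two rows of a pair apart. As a sketch of an alternative that depends only on row position, the pairs can be formed by slicing every second row. `count_mismatches` is a hypothetical helper, and NaN is used instead of the string 'NA' so that the column stays numeric.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["a1", "a2", "b1", "b2", "c1", "c2"],
     "seq": ["bbb", "bbc", "fff", "fff", "aaa", "acg"]},
    index=range(1, 7),
)

def count_mismatches(a, b):
    # Hypothetical helper: number of positions at which a and b differ
    return sum(x != y for x, y in zip(a, b))

firsts = df["seq"].iloc[0::2]   # rows at positions 0, 2, 4, ...
seconds = df["seq"].iloc[1::2]  # rows at positions 1, 3, 5, ...
counts = [count_mismatches(a, b) for a, b in zip(firsts, seconds)]

df["diff"] = np.nan
# Write each count into the second row of its pair
df.iloc[1::2, df.columns.get_loc("diff")] = counts
print(df)
```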
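Alternatively, using the Levenshtein distance:
import Levenshtein

s = df['name'].str[0]  # group key: first character of name
# Compute one distance per group and map it onto the last row of each group
out = df.assign(Diff=s.drop_duplicates(keep='last').map(
    df.groupby(s)['seq']
      .apply(lambda x: Levenshtein.distance(x.iloc[0], x.iloc[-1]))))
print(out)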
name seq Diff
1 a1 bbb NaN
2 a2 bbc 1.0
3 b1 fff NaN
4 b2 fff 0.0
5 c1 aaa NaN
6 c2 acg 2.0
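The Levenshtein module used above is a third-party package. Where installing it is not an option, the distance can be computed with a short dynamic-programming routine. This is a sketch of the textbook algorithm, not the package's own implementation, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein("bbb", "bbc"))  # 1
print(levenshtein("aaa", "acg"))  # 2
print(levenshtein("abc", "bca"))  # 2
```

Because insertions and deletions are allowed, this distance can be smaller than a position-by-position count: "abc" and "bca" differ at all three positions, but their Levenshtein distance is 2.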
Try this:
import numpy as np
import pandas as pd

data = {'name': ['a1', 'a2', 'b1', 'b2', 'c1', 'c2'],
        'seq': ['bbb', 'bbc', 'fff', 'fff', 'aaa', 'acg']
        }
df = pd.DataFrame(data, columns=['name', 'seq'])
df['diff'] = np.nan
i = 0
while i < len(df) - 1:
    item = df.at[i, 'seq']
    df.at[i, 'diff'] = np.nan  # first row of the pair gets NaN
    diffCntr = 0
    # Count the characters of the second string that do not occur anywhere in the first
    for j in df.at[i + 1, 'seq']:
        if item.find(j) < 0:
            diffCntr += 1
    df.at[i + 1, 'diff'] = diffCntr
    i += 2
df
The result looks like this:
name seq diff
0 a1 bbb NaN
1 a2 bbc 1.0
2 b1 fff NaN
3 b2 fff 0.0
4 c1 aaa NaN
5 c2 acg 2.0
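Note that the loop above tests whether each character of the second string occurs anywhere in the first, which is not the same as comparing the strings position by position. The two counts agree on the sample data but can differ in general. A small sketch with two hypothetical helpers makes the difference visible:

```python
def missing_chars(first: str, second: str) -> int:
    # Same test as the loop above: characters of second absent from first
    return sum(1 for ch in second if first.find(ch) < 0)

def positional_diff(first: str, second: str) -> int:
    # Position-by-position comparison of two equal-length strings
    return sum(x != y for x, y in zip(first, second))

for a, b in [("bbb", "bbc"), ("aaa", "acg"), ("abc", "cba")]:
    print(a, b, missing_chars(a, b), positional_diff(a, b))
# bbb bbc 1 1
# aaa acg 2 2
# abc cba 0 2
```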