Python Pandas - compare column text and provide matched word count
I am trying to develop a string comparison tool. I have two sets of JSON data, as follows.
DF 1:
ID Subject
1 Angular JS : getting unexpected cross symbol with Image
2 Cordova debug: the specified file was not found
3 get custom mask for phone numbers
4 Remove files for the Xcode Bots Unit Test Coverage
5 "Upload to Mongodb collection in aldeed:autoform
6 Mask for phone numbers
DF 2:
ID Subject
1 Please provide custom mask for phone numbers
2 Files for the Xcode Bots Unit Test Coverage need to be removed
3 Upload to Mongodb collection
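Now, using Python + pandas, for each Table 2 ID I want to find a closely matching entry among the Table 1 rows. Word order does not matter, and special characters need to be removed before the comparison.
For example: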
For ID 1 - ID 2 has 5 matching words
For ID 1 - ID 6 has 4 matching words
For ID 2 - ID 4 has 8 matching words
For ID 3 - ID 4 has 4 matching words
Any pointers?
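I think you can strip the special characters from each Subject, lowercase it and split it into one word per row, then combine the two results with merge and groupby by ID1 and ID2, aggregating with size.
Another possible solution for stripping the special characters is to list them explicitly:
.str.replace(r'[\-\!\@\#$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]', '', regex=True)
The code below uses the shorter pattern [^a-zA-Z\s], which removes everything except letters and whitespace: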
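import pandas as pd

# df1 and df2 are the two tables above, each with columns ID and Subject
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')

# Strip non-letters, lowercase, split on whitespace, then stack to one word per row.
# regex=True is required on pandas >= 2.0, where str.replace is literal by default.
df3 = (df1['Subject'].str.replace(r'[^a-zA-Z\s]', '', regex=True)
                     .str.lower()
                     .str.split(r'\s+', expand=True)
                     .stack()
                     .reset_index(drop=True, level=1)
                     .reset_index(name='val'))
df4 = (df2['Subject'].str.replace(r'[^a-zA-Z\s]', '', regex=True)
                     .str.lower()
                     .str.split(r'\s+', expand=True)
                     .stack()
                     .reset_index(drop=True, level=1)
                     .reset_index(name='val'))

# Inner join on the word: one row per word shared by a (Table 1, Table 2) pair
df5 = pd.merge(df3, df4, on='val', suffixes=('1', '2'))
print(df5)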
ID1 val ID2
0 2 the 2
1 4 the 2
2 3 custom 1
3 3 mask 1
4 6 mask 1
5 3 for 1
6 3 for 2
7 4 for 1
8 4 for 2
9 6 for 1
10 6 for 2
11 3 phone 1
12 6 phone 1
13 3 numbers 1
14 6 numbers 1
15 4 files 2
16 4 xcode 2
17 4 bots 2
18 4 unit 2
19 4 test 2
20 4 coverage 2
21 5 upload 3
22 5 to 2
23 5 to 3
24 5 mongodb 3
25 5 collection 3
print(df5.groupby(['ID1', 'ID2']).size().reset_index(name='c'))
ID1 ID2 c
0 2 2 1
1 3 1 5
2 3 2 1
3 4 1 1
4 4 2 8
5 5 2 1
6 5 3 4
7 6 1 4
8 6 2 1
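The question asks for the closest Table 1 entry for each Table 2 ID, so one more step is needed after the counts. The following is a minimal, self-contained sketch of that step, not part of the original answer. It rebuilds the sample tables by hand, uses a hypothetical helper named to_words, swaps stack for explode, and drops repeated words within a subject so each shared word counts once per pair (an assumption about what "matching words" should mean).

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Subject": [
        "Angular JS : getting unexpected cross symbol with Image",
        "Cordova debug: the specified file was not found",
        "get custom mask for phone numbers",
        "Remove files for the Xcode Bots Unit Test Coverage",
        '"Upload to Mongodb collection in aldeed:autoform',
        "Mask for phone numbers",
    ],
})
df2 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Subject": [
        "Please provide custom mask for phone numbers",
        "Files for the Xcode Bots Unit Test Coverage need to be removed",
        "Upload to Mongodb collection",
    ],
})


def to_words(df, id_name):
    """Hypothetical helper: one row per distinct (ID, word) pair."""
    words = (df["Subject"]
             .str.replace(r"[^a-zA-Z\s]", "", regex=True)  # drop special characters
             .str.lower()
             .str.split())                                  # list of words per row
    out = pd.DataFrame({id_name: df["ID"], "val": words}).explode("val")
    # Subjects with no words explode to NaN; repeated words are counted once.
    return out.dropna(subset=["val"]).drop_duplicates()


# One row per word shared between a Table 1 row and a Table 2 row.
pairs = to_words(df1, "ID1").merge(to_words(df2, "ID2"), on="val")
counts = pairs.groupby(["ID1", "ID2"]).size().reset_index(name="c")

# For every ID2, keep the Table 1 row with the most shared words.
best = counts.loc[counts.groupby("ID2")["c"].idxmax()].reset_index(drop=True)
print(best)
```

With the sample data this selects ID1 3, 4 and 5 for ID2 1, 2 and 3, with 5, 8 and 4 shared words respectively. Ties are resolved in favour of the lowest ID1, because idxmax returns the first maximum. Common words such as "for" and "the" still count as matches, so filtering stop words before the merge may give cleaner results.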