Check values between two columns
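I need to perform the following steps on two columns of a df, A and B, and output the result in C:
1) check whether the value from B is present in A, on the same row, at any position
2) if it is present but in a different format (different case in the examples below), remove it
3) add the value from B to A and output the result in C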
A                  B          C
tshirt for women   TSHIRT     TSHIRT for women
Zaino Estensibile  SJ Gang    SJ Gang Zaino Estensibile
Air Optix plus     AIR OPTIX  AIR OPTIX plus
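For anyone who wants to reproduce this, the sample rows above could be built roughly as follows. This is only a sketch: the name `expected_c` is made up here for illustration and simply holds the desired C values from the table.

```python
import pandas as pd

# Sample rows copied from the table in the question
df = pd.DataFrame(
    {
        "A": ["tshirt for women", "Zaino Estensibile", "Air Optix plus"],
        "B": ["TSHIRT", "SJ Gang", "AIR OPTIX"],
    }
)

# Desired content of column C (hypothetical helper name, not part of the question)
expected_c = [
    "TSHIRT for women",
    "SJ Gang Zaino Estensibile",
    "AIR OPTIX plus",
]
```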
Workarounds I tried, concatenating A and B and then removing the duplicates:
Version 1
def uniqueList(row):
    words = str(row).split(" ")
    unique = words[0]
    for w in words:
        if w.lower() not in unique.lower():
            # note: my_list is not defined anywhere in the snippet as posted
            if w.lower() not in my_list:
                unique = unique + " " + w
    return unique

df["C"] = df["C"].apply(uniqueList)
Version 2
import itertools
import re

sentences = df["B"].to_list()
for s in sentences:
    s_split = s.split(' ')  # keep original sentence split by ' '
    s_split_without_comma = [i.strip(',') for i in s_split]
    # method 1: re
    compare_words = re.split(' |-', s)
    # method 2: itertools
    compare_words = list(itertools.chain.from_iterable([i.split('-') for i in s_split]))
    # method 3: DIY
    compare_words = []
    for i in s_split:
        compare_words += i.split('-')
    # strip ','
    compare_words_without_comma = [i.strip(',') for i in compare_words]
    # start to compare
    need_removed_index = []
    for word in compare_words_without_comma:
        matched_indexes = []
        for idx, w in enumerate(s_split_without_comma):
            if word.lower() in w.lower().split('-'):
                matched_indexes.append(idx)
        if len(matched_indexes) > 1:  # has_duplicates
            need_removed_index += matched_indexes[1:]
    need_removed_index = list(set(need_removed_index))
    # keep remain and join with ' '
    print(" ".join([i for idx, i in enumerate(s_split) if idx not in need_removed_index]))
    # print(sentences)
print(sentences)
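Neither of them works properly, and this is probably not the best approach anyway.
Here is a solution using regular expressions, assuming df is the name of the dataframe.
The idea is simple: if B occurs somewhere in A (ignoring case), replace that occurrence with the value of B. Otherwise, return the string B + A.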
import re

def create_c(row):
    if re.sub(row['B'], row['B'], row['A'], flags=re.IGNORECASE) == row['A']:
        return row['B'] + ' ' + row['A']
    return re.sub(row['B'], row['B'], row['A'], flags=re.IGNORECASE)

df['C'] = df.apply(create_c, axis=1)
Edit #1: I forgot to add the return keyword before the re.sub() statement.
Here is the code run in a shell:
>>> import pandas as pd
>>> data = [['tshirt for women', 'TSHIRT'], ['Zaino Estensibile', 'SJ Gang']]
>>> df = pd.DataFrame(data, columns=['A', 'B'])
>>> df
                   A        B
0   tshirt for women   TSHIRT
1  Zaino Estensibile  SJ Gang
>>>
>>>
>>> import re
>>> def create_c(row):
...     if re.sub(row['B'], row['B'], row['A'], flags=re.IGNORECASE) == row['A']:
...         return row['B'] + ' ' + row['A']
...     return re.sub(row['B'], row['B'], row['A'], flags=re.IGNORECASE)
...
>>>
>>> df['C'] = df.apply(create_c, axis=1)
>>> df
                   A        B                          C
0   tshirt for women   TSHIRT           TSHIRT for women
1  Zaino Estensibile  SJ Gang  SJ Gang Zaino Estensibile
>>>
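One caveat with create_c as written: row['B'] is passed to re.sub() as a regular expression, so a value of B containing characters such as + or ( may not match literally or may raise an error. A possible variant, offered only as a sketch, escapes B with re.escape() and tests for a match with search(). The name create_c_escaped is made up for this example. Note one behavioural difference: when B already appears in A with identical casing, this version returns A unchanged, whereas create_c would prepend B again.

```python
import re

import pandas as pd


def create_c_escaped(row):
    """Rewrite any case-insensitive occurrence of B inside A as B; otherwise return B + ' ' + A."""
    # re.escape makes B match literally, even if it contains regex metacharacters
    pattern = re.compile(re.escape(row["B"]), flags=re.IGNORECASE)
    if pattern.search(row["A"]):
        # A function replacement keeps backslashes in B from being interpreted
        return pattern.sub(lambda match: row["B"], row["A"])
    return row["B"] + " " + row["A"]


df = pd.DataFrame(
    {
        "A": ["tshirt for women", "Zaino Estensibile", "Air Optix plus"],
        "B": ["TSHIRT", "SJ Gang", "AIR OPTIX"],
    }
)
df["C"] = df.apply(create_c_escaped, axis=1)
print(df["C"].tolist())
# ['TSHIRT for women', 'SJ Gang Zaino Estensibile', 'AIR OPTIX plus']
```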
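Using sets, get the strings that are in A but not in B, and store them, as a set, in column C.
# note: sets are unordered, so the order of the words inside each set is not guaranteed
df['C'] = [(set(a).difference(b)) for a, b in zip(df['A'].str.upper().str.split(r'\s'), df['B'].str.upper().str.split(r'\s'))]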
If B is a substring of A, strip the brackets and commas from the new column C and concatenate the result with column B. If not, concatenate B and A.
Code below:
import numpy as np

df['C'] = np.where(
    [b in a for b, a in zip(df['B'].str.lower(), df['A'].str.lower())],
    df['B'] + ' ' + df['C'].str.join(',').str.replace(',', ' ').str.lower(),
    df['B'] + ' ' + df['A'],
)
print(df)
Output
                   A          B                          C
0   tshirt for women     TSHIRT           TSHIRT for women
1  Zaino Estensibile    SJ Gang  SJ Gang Zaino Estensibile
2     Air Optix plus  AIR OPTIX             AIR OPTIX plus
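Because Python sets are unordered, the set-based approach above does not guarantee the order of the leftover words when more than one remains (for example "for women"). A word-level variant that keeps A's original order could look like the sketch below. The helper name merge_b_into_a is invented for this example, and unlike the np.where version it keeps the original casing of A's remaining words and compares whole words only, not substrings.

```python
import pandas as pd


def merge_b_into_a(a: str, b: str) -> str:
    """Return B followed by the words of A that do not occur in B (case-insensitive)."""
    b_words = {word.lower() for word in b.split()}
    # Iterate over A as a list so the original word order is preserved
    remaining = [word for word in a.split() if word.lower() not in b_words]
    return " ".join([b] + remaining)


df = pd.DataFrame(
    {
        "A": ["tshirt for women", "Zaino Estensibile", "Air Optix plus"],
        "B": ["TSHIRT", "SJ Gang", "AIR OPTIX"],
    }
)
df["C"] = [merge_b_into_a(a, b) for a, b in zip(df["A"], df["B"])]
print(df["C"].tolist())
# ['TSHIRT for women', 'SJ Gang Zaino Estensibile', 'AIR OPTIX plus']
```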