How to delete rows from a CSV file whose first-column string also appears in the first column of another CSV?
I have 2 CSV files. I need to delete every row of the first file whose first-column string is found in the first column of the second file.
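The head of table1 is:
Genus | FAGR | MOCA | MUBR | MUHA |
---|---|---|---|---|
1-14-0-20-45-16 | 0 | 0 | 40 | 0 |
1-14-0-20-46-22 | 0 | 0 | 0 | 169 |
2-02-FULL-61-13 | 0 | 0 | 0 | 27 |
2-12-FULL-35-15 | 56 | 182 | 435 | 311 |
The head of table2 is:
Genus | FAGR | MOCA | MUBR |
---|---|---|---|
1-14-0-20-46-22 | 0 | 0 | 0 |
2-02-FULL-61-13 | 0 | 0 | 0 |
21-14-0-10-47-8-A | 0 | 0 | 0 |
AAA536-G1 | 0 | 0 | 0 |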
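The expected output file contains the rows of file 1, except those that match the first two rows of the second file (they share the following strings in the first column: 1-14-0-20-46-22 and 2-02-FULL-61-13). When the complete files are compared, every Genus that appears in file 2 must be removed from file 1.
I have been going through the pandas "Indexing and selecting data" documentation but still can't find a solution, probably because I am a newbie.
I tried the posted solution, and this is what it looked like: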
df1 = generagrouped_df
df2['drop_key'] = 'DROP'
output = pd.merge(
    left = df1,
    right = df2,
    how = 'left',
    left_on = ['Genus'],
    right_on = ['Genus']
)
output.drop(output[output['drop_key'] == 'DROP'].index, inplace = True)
The error message was KeyError: 'drop_key' (below):
KeyError Traceback (most recent call last)
<ipython-input-103-67d27afa824b> in <module>()
----> 1 output.drop(output[output['drop_key'] == 'DROP'].index, inplace = True)
/Users/AnaPaula/opt/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
2925 if self.columns.nlevels > 1:
2926 return self._getitem_multilevel(key)
-> 2927 indexer = self.columns.get_loc(key)
2928 if is_integer(indexer):
2929 indexer = [indexer]
/Users/AnaPaula/opt/anaconda2/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_loc(self, key, method, tolerance)
2657 return self._engine.get_loc(key)
2658 except KeyError:
-> 2659 return self._engine.get_loc(self._maybe_cast_indexer(key))
2660 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2661 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'drop_key'
Can you think of a solution?
Thanks,
AP
Try adding a new column to the CSV file that holds the keys you want to drop, then drop by index on that condition:
import pandas as pd
file1 = pd.read_csv('file_1.csv')
file2 = pd.read_csv('file_2.csv')
# Assign the keyword drop to the file with the strings you're looking
# to drop from your final solution.
file2['drop_key'] = 'DROP'
# Merge the files together
output = pd.merge(
    left = file1,
    right = file2,
    how = 'left',
    left_on = ['str_col'],
    right_on = ['str_col']
)
# Drop the rows that have the keyword 'DROP'
output.drop(output[output['drop_key'] == 'DROP'].index, inplace = True)
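Note that left_on and right_on should be the names of the columns containing the strings you want to match. Those names were not visible in the screenshot you provided, so I assumed the name str_col.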
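Adapted to the column names shown in the question, the same idea might look like the sketch below. It is only an illustration: the two DataFrames are typed in by hand from the question's tables instead of being read from CSV, and it merges only the `Genus` column of the second table (an assumption made here so that the shared FAGR/MOCA/MUBR columns do not come back with `_x`/`_y` suffixes).

```python
import pandas as pd

# Sample rows copied by hand from the two tables in the question.
df1 = pd.DataFrame({
    "Genus": ["1-14-0-20-45-16", "1-14-0-20-46-22",
              "2-02-FULL-61-13", "2-12-FULL-35-15"],
    "FAGR": [0, 0, 0, 56],
    "MOCA": [0, 0, 0, 182],
    "MUBR": [40, 0, 0, 435],
    "MUHA": [0, 169, 27, 311],
})
df2 = pd.DataFrame({
    "Genus": ["1-14-0-20-46-22", "2-02-FULL-61-13",
              "21-14-0-10-47-8-A", "AAA536-G1"],
    "FAGR": [0, 0, 0, 0],
    "MOCA": [0, 0, 0, 0],
    "MUBR": [0, 0, 0, 0],
})

# Keep only the key column of df2 (deduplicated) and tag every key with 'DROP'.
keys = df2[["Genus"]].drop_duplicates().assign(drop_key="DROP")

# Left merge: rows of df1 without a match get NaN in drop_key.
merged = df1.merge(keys, on="Genus", how="left")

# Keep the unmatched rows and remove the helper column again.
result = merged[merged["drop_key"].isna()].drop(columns="drop_key")
print(result)
```

With real files, `df1` and `df2` would come from `pd.read_csv`, as in the answer above.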
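I found a solution. Since every Genus in file 2 has to be removed from file 1, I ran the following command, which names only the first column as the one to compare, and it worked: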
df1.loc[pd.merge(df1, df2, on=['Genus'], how='left', indicator=True)['_merge'] == 'left_only']
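A possible alternative, offered only as a sketch: the one-liner above appears to depend on the merge result lining up row for row with `df1`, which holds when `df1` has the default integer index and `Genus` is unique in `df2`. Filtering with `Series.isin` avoids the merge altogether. The small frames below are made-up stand-ins that reuse the question's `Genus` values and a single count column.

```python
import pandas as pd

# Stand-in data: Genus values from the question, one count column only.
df1 = pd.DataFrame({
    "Genus": ["1-14-0-20-45-16", "1-14-0-20-46-22",
              "2-02-FULL-61-13", "2-12-FULL-35-15"],
    "MUHA": [0, 169, 27, 311],
})
df2 = pd.DataFrame({
    "Genus": ["1-14-0-20-46-22", "2-02-FULL-61-13",
              "21-14-0-10-47-8-A", "AAA536-G1"],
})

# True for rows of df1 whose Genus also occurs in df2.
in_df2 = df1["Genus"].isin(df2["Genus"])

# Keep the rows that are NOT in df2 (an anti-join on Genus).
result = df1[~in_df2]
print(result)
```

Because the mask is built from `df1` itself, it keeps `df1`'s own index, so this form should not depend on how the index looks or on duplicates in `df2`.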
Thank you for your time!
AP