How to filter text data containing keywords from an unnamed Excel column with Python pandas and write it to a txt file
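I'm very new to this, so please bear with me.
I have an Excel sheet containing certain text strings that I want to extract and copy to a text file. I have been doing this by hand for a long time and I'm tired of it.
So my plan is to write a script that extracts this data from the Excel sheet and creates a txt file.
This is how far I've got: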
#EXTRACT CLIPID FROM XCEL SHEET
import pandas as pd
from tkinter import Tk # from tkinter import Tk for Python 3.x
from tkinter.filedialog import askopenfilename
Tk().withdraw()
filename = askopenfilename()
data = pd.read_excel (filename)
df = pd.DataFrame(data)
print (df)
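The data I want is in column A (the first column), but not always in the same rows.
I'm looking for 3 separate keywords:
- "POP"
- "TVS"
- "PLANET"
The strings look like this: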
Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021
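This is the form in which I'd like the extracted data written to the txt file.
So essentially I want to search column A for strings containing POP and print them, then the strings containing TVS, and finally the strings containing PLANET.
Any help would be greatly appreciated!
Thank you!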
都山
PS:
This is the output of df:
Unnamed: 0 ... Unnamed: 16
0 NaN ... NaN
1 NaN ... NaN
2 Spot 1 15 s ... NaN
3 NaN ... Indicazioni
4 106290.01 ... dire tutto + grafica ITALIA
5 138575.01 ... NaN
6 142956.01 ... NaN
7 85146.01 ... NaN
8 Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021 ... NaN
9 Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021 ... NaN
10 Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021 ... NaN
11 NaN ... NaN
12 NaN ... NaN
13 Spot 2 15 s ... NaN
14 NaN ... Indicazioni
15 164171.01 ... dire tutto + grafica ITALIA
16 9003309.01 ... NaN
17 88310.01 ... NaN
18 Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021 ... NaN
19 Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021 ... NaN
20 Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021 ... NaN
21 NaN ... NaN
22 NaN ... NaN
23 Spot 3 15 s ... NaN
24 NaN ... Istruzione
25 800214.01 ... dire tutto + dire al kg dopo il prezzo per la ...
26 9001392.01 ... NaN
27 9002306.01 ... NaN
28 147804.01 ... NaN
29 Eurospin2021_16bis_3_POP_DRUZ_15s_24_06_2021 ... NaN
30 Eurospin2021_16bis_3_TVS_DRUZ_15s_24_06_2021 ... NaN
31 Eurospin2021_16bis_3_PLANET_DRUZ_15s_24_06_2021 ... NaN
[32 rows x 17 columns]
If you're still looking for a solution, here is a suggestion:
With the sample frame
df = pd.DataFrame({
    0: [
        'Channel2021_1_DRU_POP_15s_16062021',
        'Channel2021_2_FANT_POP_15s_16062021',
        'Channel2021_3_ITA_POP_15s_16062021',
        1.,
        2.,
        'Channel2021_1_DRU_TVS_15s_16062021',
        'Channel2021_2_FANT_TVS_15s_16062021',
        'Channel2021_3_ITA_TVS_15s_16062021',
        3.,
        4.,
        'Channel2021_1_DRU_PLANET_15s_16062021',
        'Channel2021_2_FANT_PLANET_15s_16062021',
        'Channel2021_3_ITA_PLANET_15s_16062021',
        5.
    ],
    1: '...',
})
0 1
0 Channel2021_1_DRU_POP_15s_16062021 ...
1 Channel2021_2_FANT_POP_15s_16062021 ...
2 Channel2021_3_ITA_POP_15s_16062021 ...
3 1 ...
4 2 ...
5 Channel2021_1_DRU_TVS_15s_16062021 ...
6 Channel2021_2_FANT_TVS_15s_16062021 ...
7 Channel2021_3_ITA_TVS_15s_16062021 ...
8 3 ...
9 4 ...
10 Channel2021_1_DRU_PLANET_15s_16062021 ...
11 Channel2021_2_FANT_PLANET_15s_16062021 ...
12 Channel2021_3_ITA_PLANET_15s_16062021 ...
13 5 ...
this
selection = df.iloc[:, 0].str.contains(r'POP|TVS|PLANET', na=False)
print(df.iloc[:, 0][selection])
df.iloc[:, 0][selection].to_csv('items.txt', index=False, header=False)
prints the entries you want
0 Channel2021_1_DRU_POP_15s_16062021
1 Channel2021_2_FANT_POP_15s_16062021
2 Channel2021_3_ITA_POP_15s_16062021
5 Channel2021_1_DRU_TVS_15s_16062021
6 Channel2021_2_FANT_TVS_15s_16062021
7 Channel2021_3_ITA_TVS_15s_16062021
10 Channel2021_1_DRU_PLANET_15s_16062021
11 Channel2021_2_FANT_PLANET_15s_16062021
12 Channel2021_3_ITA_PLANET_15s_16062021
and writes them to the file items.txt
Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021
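One thing to keep in mind is that the pattern 'POP|TVS|PLANET' matches those letters anywhere in a cell. If that turns out to be too loose for your sheet, a stricter variant could require the keyword to be a whole underscore-delimited field. The following is only a sketch under that assumption: the 'POPCORN' entry is invented for illustration, and the names column, tags and pattern are not from the question.

```python
import re

import pandas as pd

# Hypothetical sample column: strings, a number, an empty cell and a near-miss.
column = pd.Series([
    'Channel2021_1_DRU_POP_15s_16062021',
    'Channel2021_1_DRU_POPCORN_15s_16062021',  # invented near-miss, not from the question
    106290.01,
    None,
    'Channel2021_1_DRU_TVS_15s_16062021',
    'Channel2021_1_DRU_PLANET_15s_16062021',
])

tags = ['POP', 'TVS', 'PLANET']

# Builds '_(?:POP|TVS|PLANET)_' so a keyword only matches as a complete field.
# re.escape protects against keywords that contain regex metacharacters.
pattern = '_(?:' + '|'.join(re.escape(tag) for tag in tags) + ')_'

# na=False turns the missing results for non-string cells into False.
mask = column.str.contains(pattern, na=False)
matches = column[mask]
print(matches.tolist())
```

With the strings shown in the question both patterns select the same rows, so this only matters if other cells in the column could contain the keywords as part of a longer word.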
Since I'm not sure about the column names, I just used index-based selection (.iloc).
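To illustrate the difference, here is a small sketch comparing position-based and label-based selection. The label 'Unnamed: 0' is taken from the printed df in the question; the second label and the cell values are a made-up stand-in for the real sheet.

```python
import pandas as pd

# Stand-in for the sheet as read by pandas (assumed labels and values).
df = pd.DataFrame({
    'Unnamed: 0': [
        'Spot 1 15 s',
        106290.01,
        'Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021',
    ],
    'Unnamed: 1': [None, None, None],
})

by_position = df.iloc[:, 0]    # first column, whatever it happens to be called
by_label = df['Unnamed: 0']    # same column, but tied to the generated name

# Both expressions select the same data here.
print(by_position.equals(by_label))
```

The position-based form keeps working if the header row of the sheet changes and the column gets a different label, which is why it seems the safer choice when the names are unknown.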
If you want the results in the order you gave, then this
df = pd.concat([
    df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)]
    for tag in ('POP', 'TVS', 'PLANET')
])
should work (afterwards just print df or write it to a file).
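The reordering only becomes visible when the keywords are interleaved, as they are in the df printed in the question. Here is a sketch of the same concat approach on a shortened copy of that first column; the names column and ordered are chosen for illustration, and the result is kept in a separate variable instead of overwriting df.

```python
import pandas as pd

# Shortened copy of the first column from the question's printed df.
column = pd.Series([
    'Spot 1 15 s',
    106290.01,
    'Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021',
    'Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021',
    'Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021',
    None,
    'Spot 2 15 s',
    164171.01,
    'Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021',
    'Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021',
    'Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021',
])

# One filtered Series per keyword, glued together in keyword order.
ordered = pd.concat([
    column[column.str.contains(tag, na=False)]
    for tag in ('POP', 'TVS', 'PLANET')
])

# The original row labels are kept, so they show where each entry came from.
print(ordered.index.tolist())
print(ordered.tolist())
```

Within each keyword the entries stay in sheet order, so all POP rows come first (Spot 1, then Spot 2), followed by the TVS rows and then the PLANET rows.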
By the way: this is more complicated than it needs to be
data = pd.read_excel (filename)
df = pd.DataFrame(data)
You only need pd.read_excel:
df = pd.read_excel(filename)
EDIT:
Regarding the comment:
with open('items.txt', 'wt') as file:
    file.write('The following has been sent:')
    for tag in ('POP', 'TVS', 'PLANET'):
        file.write(f'\n{tag}:\n')
        items = df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)].to_list()
        file.write('\n'.join(items))
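If you need this for several files, the grouped output could be wrapped in a small function. This is a minimal sketch, not a definitive implementation: write_grouped, its parameters and the temporary output path are hypothetical names, and unlike the snippet above it ends the file with a newline.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_grouped(column, path, tags=('POP', 'TVS', 'PLANET')):
    """Write the entries of `column` that contain each tag, one headed block per tag."""
    lines = ['The following has been sent:']
    for tag in tags:
        # Non-string cells (numbers, empty cells) count as "no match".
        items = column[column.str.contains(tag, na=False)].to_list()
        lines.append(f'{tag}:')
        lines.extend(items)
    Path(path).write_text('\n'.join(lines) + '\n', encoding='utf-8')
    return lines


# Usage with a shortened sample column from the question.
column = pd.Series([
    'Channel2021_1_DRU_POP_15s_16062021',
    1.0,
    'Channel2021_1_DRU_TVS_15s_16062021',
    None,
    'Channel2021_1_DRU_PLANET_15s_16062021',
])
out_path = Path(tempfile.mkdtemp()) / 'items.txt'
write_grouped(column, out_path)
print(out_path.read_text(encoding='utf-8'))
```

With a real sheet you would pass df.iloc[:, 0] as the column and the path where the txt file should go.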