How to filter text data containing keywords from an unnamed Excel column with Python pandas and write it to a txt file

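I'm very new to this, so please bear with me.

I have an Excel sheet containing certain text strings that I want to extract and copy to a text file. I have been doing this manually for a long time and I'm tired of it.

So my plan is to write a script that extracts this data from the Excel sheet and creates a txt file.

This is how far I've got: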

#EXTRACT CLIPID FROM XCEL SHEET
import pandas as pd
from tkinter import Tk     # from tkinter import Tk for Python 3.x
from tkinter.filedialog import askopenfilename

Tk().withdraw() 
filename = askopenfilename()
data = pd.read_excel (filename)
df = pd.DataFrame(data)
print (df)

The data I want is in column A1, but not always in the same row. I'm looking for 3 separate keywords:

  1. "POP"
  2. "TVS"
  3. "PLANET"

The strings look like this:

Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021

Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021

Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021

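This is the form in which I want the extracted data written to the txt file.

So essentially I want to search column A1 for the strings containing POP and print them, then the strings containing TVS and print them, and finally the strings containing PLANET and print them.

Any help would be greatly appreciated!

Thanks!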

都山

PS: This is the output of df:

                                         Unnamed: 0  ...                                        Unnamed: 16
0                                               NaN  ...                                                NaN
1                                               NaN  ...                                                NaN
2                                       Spot 1 15 s  ...                                                NaN
3                                               NaN  ...                                        Indicazioni
4                                         106290.01  ...                        dire tutto + grafica ITALIA
5                                         138575.01  ...                                                NaN
6                                         142956.01  ...                                                NaN
7                                          85146.01  ...                                                NaN
8      Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021   ...                                                NaN
9       Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021  ...                                                NaN
10   Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021  ...                                                NaN
11                                              NaN  ...                                                NaN
12                                              NaN  ...                                                NaN
13                                      Spot 2 15 s  ...                                                NaN
14                                              NaN  ...                                        Indicazioni
15                                        164171.01  ...                       dire tutto +  grafica ITALIA
16                                       9003309.01  ...                                                NaN
17                                         88310.01  ...                                                NaN
18      Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021  ...                                                NaN
19      Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021  ...                                                NaN
20   Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021  ...                                                NaN
21                                              NaN  ...                                                NaN
22                                              NaN  ...                                                NaN
23                                      Spot 3 15 s  ...                                                NaN
24                                              NaN  ...                                         Istruzione
25                                        800214.01  ...  dire tutto + dire al kg dopo il prezzo per la ...
26                                       9001392.01  ...                                                NaN
27                                       9002306.01  ...                                                NaN
28                                        147804.01  ...                                                NaN
29     Eurospin2021_16bis_3_POP_DRUZ_15s_24_06_2021  ...                                                NaN
30     Eurospin2021_16bis_3_TVS_DRUZ_15s_24_06_2021  ...                                                NaN
31  Eurospin2021_16bis_3_PLANET_DRUZ_15s_24_06_2021  ...                                                NaN

[32 rows x 17 columns]

If you are still looking for a solution, here is a suggestion:

With the sample frame

df = pd.DataFrame({
    0: [
        'Channel2021_1_DRU_POP_15s_16062021',
        'Channel2021_2_FANT_POP_15s_16062021',
        'Channel2021_3_ITA_POP_15s_16062021',
        1.,
        2.,
        'Channel2021_1_DRU_TVS_15s_16062021',
        'Channel2021_2_FANT_TVS_15s_16062021',
        'Channel2021_3_ITA_TVS_15s_16062021',
        3.,
        4.,
        'Channel2021_1_DRU_PLANET_15s_16062021',
        'Channel2021_2_FANT_PLANET_15s_16062021',
        'Channel2021_3_ITA_PLANET_15s_16062021',
        5.
    ],
    1: '...',
})
                                         0    1
0       Channel2021_1_DRU_POP_15s_16062021  ...
1      Channel2021_2_FANT_POP_15s_16062021  ...
2       Channel2021_3_ITA_POP_15s_16062021  ...
3                                        1  ...
4                                        2  ...
5       Channel2021_1_DRU_TVS_15s_16062021  ...
6      Channel2021_2_FANT_TVS_15s_16062021  ...
7       Channel2021_3_ITA_TVS_15s_16062021  ...
8                                        3  ...
9                                        4  ...
10   Channel2021_1_DRU_PLANET_15s_16062021  ...
11  Channel2021_2_FANT_PLANET_15s_16062021  ...
12   Channel2021_3_ITA_PLANET_15s_16062021  ...
13                                       5  ...

this

selection = df.iloc[:, 0].str.contains(r'POP|TVS|PLANET', na=False)
print(df.iloc[:, 0][selection])
df.iloc[:, 0][selection].to_csv('items.txt', index=False, header=False)

prints the entries you want

0         Channel2021_1_DRU_POP_15s_16062021
1        Channel2021_2_FANT_POP_15s_16062021
2         Channel2021_3_ITA_POP_15s_16062021
5         Channel2021_1_DRU_TVS_15s_16062021
6        Channel2021_2_FANT_TVS_15s_16062021
7         Channel2021_3_ITA_TVS_15s_16062021
10     Channel2021_1_DRU_PLANET_15s_16062021
11    Channel2021_2_FANT_PLANET_15s_16062021
12     Channel2021_3_ITA_PLANET_15s_16062021

and writes them to the file items.txt

Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021

Since I'm not sure about the column names, I only used position-based selection (.iloc).
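If the first column really does carry the label shown in the question's printout (`Unnamed: 0`), selecting by label should give the same result. A minimal sketch, assuming that label and using a small made-up frame in place of the real sheet:

```python
import pandas as pd

# Hypothetical stand-in for the real sheet. The label 'Unnamed: 0' is taken
# from the printed df in the question and is an assumption about the file.
df = pd.DataFrame({
    'Unnamed: 0': [
        'Spot 1 15 s',
        106290.01,
        'Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021',
        None,
    ],
    'Unnamed: 1': ['...', '...', '...', '...'],
})

by_position = df.iloc[:, 0]    # first column, whatever it is called
by_label = df['Unnamed: 0']    # same column, but only if the label matches

# Non-string cells (numbers, missing values) count as "no match" via na=False
mask = by_label.str.contains(r'POP|TVS|PLANET', na=False)
matches = by_label[mask].to_list()
print(matches)
```

Position-based selection is the safer choice when the label might differ from file to file.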

If you want the results in the order you gave, then this

df = pd.concat([
         df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)]
         for tag in ('POP', 'TVS', 'PLANET')
     ])

should work (afterwards just print df or write it to a file).
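Put together, the ordered variant could look like the sketch below. It keeps the result under a separate name (`ordered`, introduced here for illustration) so the original frame is not overwritten. The sample rows are made up and deliberately out of order.

```python
import pandas as pd

# Made-up sample with the rows deliberately out of order
df = pd.DataFrame({0: [
    'Channel2021_1_DRU_TVS_15s_16062021',
    2.0,
    'Channel2021_1_DRU_PLANET_15s_16062021',
    'Channel2021_1_DRU_POP_15s_16062021',
]})

col = df.iloc[:, 0]

# One filtered Series per tag, concatenated in the order of the tags
ordered = pd.concat([
    col[col.str.contains(tag, na=False)]
    for tag in ('POP', 'TVS', 'PLANET')
])

print(ordered.to_list())
```

Writing the result out then works the same way as before, for example with `ordered.to_csv('items.txt', index=False, header=False)`.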

By the way: this is more complicated than it needs to be

data = pd.read_excel (filename)
df = pd.DataFrame(data)

pd.read_excel already returns a DataFrame, so this is all you need:

df = pd.read_excel(filename)

EDIT: Regarding the comment:

with open('items.txt', 'wt') as file:
    file.write('The following has been sent:')
    for tag in ('POP', 'TVS', 'PLANET'):
        file.write(f'\n{tag}:\n')
        items = df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)].to_list()
        file.write('\n'.join(items))
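
The same loop can be wrapped in a small helper so that the column, the tags, and the output path become parameters. This is only a sketch: `write_grouped` and its arguments are names made up here, and `regex=False` is added on the assumption that the tags are meant as literal substrings rather than patterns.

```python
import os
import tempfile

import pandas as pd


def write_grouped(series, tags, path):
    """Write matching strings from `series` to `path`, one section per tag."""
    with open(path, 'wt') as file:
        file.write('The following has been sent:')
        for tag in tags:
            # regex=False treats the tag as a plain substring, not a pattern
            mask = series.str.contains(tag, na=False, regex=False)
            file.write(f'\n{tag}:\n')
            file.write('\n'.join(series[mask].to_list()))


# Made-up sample data
df = pd.DataFrame({0: [
    'Channel2021_1_DRU_POP_15s_16062021',
    1.0,
    'Channel2021_1_DRU_TVS_15s_16062021',
    'Channel2021_1_DRU_PLANET_15s_16062021',
]})

# Written to a temporary directory here; any writable path would do
out_path = os.path.join(tempfile.mkdtemp(), 'items.txt')
write_grouped(df.iloc[:, 0], ('POP', 'TVS', 'PLANET'), out_path)

with open(out_path) as file:
    content = file.read()
print(content)
```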