Locate only the variable numeric data in a filename in a pandas column and place the numbers in a new column

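I am using this code to bring in two CSVs with a similar naming convention, put their file names in a "File" column, and concatenate the DataFrames into one DataFrame called NatHrs.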

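import glob
import os
from pathlib import Path

import pandas as pd

path = r'C:\Users\ThisUser\Desktop\AC Mbr Analysis'
# os.path.join avoids the invalid '\N' escape that path + '\Natl_...' would produce
all_files = glob.glob(os.path.join(path, 'Natl_hours_YTD_OC_*.csv'))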

Nat_dfs = []
for file in all_files:
    df = pd.read_csv(file, index_col=None, encoding='windows-1252', header=1 )
    df['File'] = file
    Nat_dfs.append(df)

NatHrs = pd.concat(Nat_dfs)

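Now I want to take the "File" column, which returns file name objects with entries that look like "C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019", extract only the end of the file name (in this case "2018-2019"), and put those characters in a new column "Program Year" reflecting the entry "2018-2019". I have had no luck manipulating the string or the Series. Should I be using path.replace? I'm lost. When I describe the column I'm trying to parse...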

NatHrs['File'].describe

...I get:

Name: File, dtype: object>

You can use a regular expression to find the substring in the string:

import re

string = r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019"
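# the backslash has to be escaped, otherwise the string literal never closes
short = string.split('\\')[-1]

# raw string so that \d reaches the regex engine unchanged
substring = re.search(r'\d+[-]*\d+', short).group()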
print(substring)

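You may want to elaborate on how the pattern varies. Will it always be "Year-Year"? Can it be just "Year"? In that case the regular expression might have to change.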
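To make the question about the pattern concrete, here is a minimal sketch comparing two patterns on base names like the ones in the question, plus a single-year name. The stricter `\d{4}(?:-\d{4})?` pattern assumes four-digit years, which the question does not confirm, and `full_path` is a made-up path with a digit in a folder name. It is there to show why it can matter whether the search runs on the full path or only on the base name.

```python
import re

# Base names modelled on the question, plus a single-year variant.
names = [
    "Natl_hours_YTD_OC_2018-2019.csv",
    "Natl_hours_YTD_OC_2019-2020.csv",
    "Natl_hours_YTD_OC_2020.csv",
]

# Digits, then optional hyphens, then optional digits.
loose = re.compile(r"\d+[-]*\d*")
# Assumed convention: a four-digit year, optionally followed by "-" and another one.
strict = re.compile(r"\d{4}(?:-\d{4})?")

loose_hits = [loose.search(n).group() for n in names]
strict_hits = [strict.search(n).group() for n in names]
print(loose_hits)
print(strict_hits)

# Hypothetical full path whose folder name contains a digit.
full_path = r"C:\data2\Natl_hours_YTD_OC_2018-2019.csv"
print(loose.search(full_path).group())   # stops at the first digit it meets
print(strict.search(full_path).group())  # skips ahead to a four-digit run
```

On plain base names the two patterns agree. They only diverge once other digits appear earlier in the string, which is one reason to strip the directory part before searching.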

Edit:

This was too inconvenient, so I made my own dummy files that I could use to do what you want. It works fine for me, but see for yourself:

import glob
import pandas as pd
import re
import os

all_files = glob.glob('Natl_hours_YTD_OC_*.csv')
full_paths = [os.path.abspath(file) for file in all_files]

print(full_paths)
>>> Out:
['C:\Users\Chris\Desktop\Natl_hours_YTD_OC_2018-2019.csv',
'C:\Users\Chris\Desktop\Natl_hours_YTD_OC_2019-2020.csv',
'C:\Users\Chris\Desktop\Natl_hours_YTD_OC_2020.csv']
Nat_dfs = []

for file in all_files:
    df = pd.read_csv(file, sep=r'\s+')  # whitespace-delimited; replaces the deprecated delim_whitespace=True
    print(df,'\n')
    df['File'] = file
    df['Year'] = re.search(r'\d+[-]*\d*', file).group()

    Nat_dfs.append(df)
>>> Out:
   A  B
0  7  7
1  8  8
2  9  9 

   A  B
0  4  4
1  5  5
2  6  6 

   A  B
0  1  1
1  2  2
2  3  3 
NatHrs = pd.concat(Nat_dfs)
print(NatHrs)
>>> Out:
   A  B                             File       Year
0  7  7  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
1  8  8  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
2  9  9  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
0  4  4  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
1  5  5  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
2  6  6  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
0  1  1       Natl_hours_YTD_OC_2020.csv       2020
1  2  2       Natl_hours_YTD_OC_2020.csv       2020
2  3  3       Natl_hours_YTD_OC_2020.csv       2020

I don't know what you did wrong, but this definitely works. I hope this is what you need.

I tried this:

import re

string = NatHrs['File']
short = string.split('\\')[-1]

substring = re.search('\d+[-]*\d+',short).group()
print(substring)

NatHrs['Program Year'] = substring
NatHrs

I got this:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-257-f8c37bb604e2> in <module>
      3 
      4 string = NatHrs['File']
----> 5 short = string.split('\\')[-1]
      6 
      7 substring = re.search('\d+[-]*\d+',short).group()

~\anaconda3\envs\PythonData\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5178                 return self[name]
-> 5179             return object.__getattribute__(self, name)
   5180 
   5181     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'split'
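The error arises because `NatHrs['File']` is a whole pandas Series, not a single string, so plain `str` methods such as `split` are not available on it directly. One way around this, sketched below on a small stand-in frame with made-up rows, is the `.str` accessor, which applies a string operation to every row.

```python
import pandas as pd

# Hypothetical stand-in for NatHrs; the paths mimic the question's format.
NatHrs = pd.DataFrame({
    "File": [
        r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv",
        r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv",
    ]
})

# .str runs the string method on each element of the Series.
# Split on the backslash and keep the last piece (the base name).
short = NatHrs["File"].str.split("\\").str[-1]

# Extract the year portion from each base name; expand=False returns a Series.
NatHrs["Program Year"] = short.str.extract(r"(\d+-*\d*)", expand=False)
print(NatHrs["Program Year"].tolist())
```

Because the extraction is done per row, each row gets the year that belongs to its own file name.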

This returned a file that shows inconsistencies like these between the file year and the program year:

File    Program Year
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019

I also tried this:

import re

string = file
short = string.split('\\')[-1]

substring = re.search('\d+[-]*\d+',short).group()
print(substring)

NatHrs['Program Year'] = substring
NatHrs

and got a "Program Year" column that only reflects 2019-2020, although I want both 2018-2019 and 2019-2020 to show.
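A likely explanation for that result, offered as a reading of the snippet rather than a confirmed diagnosis: once the loading loop has finished, `file` still holds the last path it iterated over, and assigning one string to a column fills every row with that same value. The sketch below reproduces this on a made-up two-row frame. `program_year` and the "Broadcast" column are names introduced here purely for illustration.

```python
import re

import pandas as pd

# Hypothetical frame standing in for NatHrs.
NatHrs = pd.DataFrame({
    "File": [
        "Natl_hours_YTD_OC_2018-2019.csv",
        "Natl_hours_YTD_OC_2019-2020.csv",
    ]
})

# Simulates the leftover loop variable: it keeps the last value iterated.
for file in NatHrs["File"]:
    pass

# Assigning a single scalar fills the whole column with that one value.
NatHrs["Broadcast"] = re.search(r"\d+-*\d*", file).group()


def program_year(name):
    """Return the year portion of one file name (hypothetical helper)."""
    return re.search(r"\d+-*\d*", name).group()


# Mapping the helper over the column computes one value per row.
NatHrs["Program Year"] = NatHrs["File"].map(program_year)
print(NatHrs[["Broadcast", "Program Year"]])
```

The "Broadcast" column repeats the last file's year on every row, while "Program Year" follows each row's own file name.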

This ended up working. Thank you so much for all your help along the way!

import glob
import os
import re

import pandas as pd

globbed_files = glob.glob("Natl_hours_YTD_OC_*.csv")
globbed_files

data = []
for csv in globbed_files:
    frame = pd.read_csv(csv, encoding='windows-1252', header=1)
    frame['filename'] = os.path.basename(csv)
    file = os.path.basename(csv)
    # create a new column to store the portion of the file name that denotes the Program Year to which the data belongs
    frame['Program Year'] = re.search(r'\d+[-]*\d*', file).group()
    data.append(frame)

NatHrs = pd.concat(data, ignore_index=True)  # don't want pandas to try to align row indexes
NatHrs.copy().head()
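For comparison, the same loop can be sketched with pathlib, where `Path.name` and `Path.stem` take the place of `os.path.basename`. So that the sketch runs anywhere, it writes two tiny dummy CSVs with made-up contents into a temporary folder. The `encoding` and `header=1` arguments from the real files are left out because the dummy files do not need them.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd

# Create two dummy CSVs in a temporary folder (made-up contents).
folder = Path(tempfile.mkdtemp())
for year in ("2018-2019", "2019-2020"):
    (folder / f"Natl_hours_YTD_OC_{year}.csv").write_text("A,B\n1,2\n3,4\n")

frames = []
# sorted() makes the file order deterministic.
for csv_path in sorted(folder.glob("Natl_hours_YTD_OC_*.csv")):
    frame = pd.read_csv(csv_path)
    frame["filename"] = csv_path.name  # base name only, no directory
    # Path.stem drops the ".csv" suffix before the search.
    frame["Program Year"] = re.search(r"\d+-*\d*", csv_path.stem).group()
    frames.append(frame)

NatHrs = pd.concat(frames, ignore_index=True)
print(NatHrs)
```

Searching `csv_path.stem` rather than the full path keeps digits in folder names from being picked up by the pattern.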