Extract only the variable numeric part of a file name in a pandas column and place it in a new column
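I am using this code to read in two CSVs with similar naming conventions, put each file's name in a "File" column, and concatenate the dataframes into a single dataframe named NatHrs.

    import glob
    from pathlib import Path

    import pandas as pd

    path = r'C:\Users\ThisUser\Desktop\AC Mbr Analysis'
    # raw string, so the backslash before "N" is not read as an escape
    all_files = glob.glob(path + r'\Natl_hours_YTD_OC_*.csv')

    Nat_dfs = []
    for file in all_files:
        df = pd.read_csv(file, index_col=None, encoding='windows-1252', header=1)
        df['File'] = file  # record which file each row came from
        Nat_dfs.append(df)

    NatHrs = pd.concat(Nat_dfs)
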
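Now I want to take the "File" column, which returns file names as objects with entries that look like "C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019", extract only the end of the file name (in this case "2018-2019"), and put those characters in a new column, "Program Year", so that the entry reads "2018-2019". I have had no success manipulating it as a string or as a Series. Should I be using path.replace? I'm lost. When I describe the column I want to parse...

    NatHrs['File'].describe

...I get:

    Name: File, dtype: object>
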
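You can use a regular expression to find a substring within a string:

    import re

    string = r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019"
    short = string.split('\\')[-1]  # keep only the file name
    substring = re.search(r'\d+[-]*\d+', short).group()
    print(substring)

You may want to elaborate on how the pattern varies. Will it always be "Year-Year"? Can it be just "Year"? If so, the regular expression may have to change.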
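
If both forms can occur, one possible pattern is sketched below. It assumes every year is exactly four digits and that a range is written with a single hyphen; the helper name `extract_program_year` and the `notes.csv` path are made up for illustration. `ntpath.basename` is used so that the Windows-style example paths split the same way on any operating system.

```python
import ntpath
import re

# Assumed pattern: a four-digit year, optionally followed by "-" and a second one.
YEAR_PATTERN = re.compile(r"\d{4}(?:-\d{4})?")


def extract_program_year(path):
    """Return the year or year range found in the file name, or None if absent."""
    name = ntpath.basename(path)  # splits on backslashes regardless of host OS
    match = YEAR_PATTERN.search(name)
    return match.group() if match else None


examples = [
    r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv",
    r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2020.csv",
    r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\notes.csv",  # hypothetical file with no year
]
for p in examples:
    print(extract_program_year(p))
```

Returning `None` rather than calling `.group()` directly avoids an `AttributeError` when a file name contains no year at all.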
Edit:
This was too inconvenient to test otherwise, so I made my own dummy files and used them to do what you want. It works fine for me, but see for yourself:

    import glob
    import os
    import re

    import pandas as pd

    all_files = glob.glob('Natl_hours_YTD_OC_*.csv')
    full_paths = [os.path.abspath(file) for file in all_files]
    print(full_paths)

    >>> Out:
    ['C:\\Users\\Chris\\Desktop\\Natl_hours_YTD_OC_2018-2019.csv',
     'C:\\Users\\Chris\\Desktop\\Natl_hours_YTD_OC_2019-2020.csv',
     'C:\\Users\\Chris\\Desktop\\Natl_hours_YTD_OC_2020.csv']

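    Nat_dfs = []
    for file in all_files:
        # sep=r'\s+' splits on whitespace, like the deprecated delim_whitespace=True
        df = pd.read_csv(file, sep=r'\s+')
        print(df, '\n')
        df['File'] = file
        # pull the year (or year range) out of this file's own name
        df['Year'] = re.search(r'\d+[-]*\d*', file).group()
        Nat_dfs.append(df)

    >>> Out:
       A  B
    0  7  7
    1  8  8
    2  9  9

       A  B
    0  4  4
    1  5  5
    2  6  6

       A  B
    0  1  1
    1  2  2
    2  3  3
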
    NatHrs = pd.concat(Nat_dfs)
    print(NatHrs)

    >>> Out:
       A  B                             File       Year
    0  7  7  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
    1  8  8  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
    2  9  9  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
    0  4  4  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
    1  5  5  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
    2  6  6  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
    0  1  1       Natl_hours_YTD_OC_2020.csv       2020
    1  2  2       Natl_hours_YTD_OC_2020.csv       2020
    2  3  3       Natl_hours_YTD_OC_2020.csv       2020

I don't know what you did wrong, but this definitely works. I hope this is what you need.
I tried this:

    import re
    string = NatHrs['File']
    short = string.split('\\')[-1]
    substring = re.search('\d+[-]*\d+',short).group()
    print(substring)
    NatHrs['Program Year'] = substring
    NatHrs

I got this:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-257-f8c37bb604e2> in <module>
          3
          4 string = NatHrs['File']
    ----> 5 short = string.split('\\')[-1]
          6
          7 substring = re.search('\d+[-]*\d+',short).group()

    ~\anaconda3\envs\PythonData\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
       5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
       5178                 return self[name]
    -> 5179             return object.__getattribute__(self, name)
       5180
       5181     def __setattr__(self, name, value):

    AttributeError: 'Series' object has no attribute 'split'

This returned a result with mismatches like these between the year in the file name and the Program Year:

    File                                                                        Program Year
    C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv  2018-2019
    C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv  2018-2019
    C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv  2018-2019
    C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv  2018-2019
    C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv  2018-2019
    C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv  2018-2019
    C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv  2018-2019
    C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv  2018-2019
    C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv  2018-2019

I also tried this:

    import re
    string = file
    short = string.split('\\')[-1]
    substring = re.search('\d+[-]*\d+',short).group()
    print(substring)
    NatHrs['Program Year'] = substring
    NatHrs

and got a "Program Year" column that reflects only 2019-2020, although I wanted it to show both 2018-2019 and 2019-2020.
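
Both results above have a likely explanation, under some assumptions. A Series has no `split` method of its own, because string methods are reached through its `.str` accessor. And assigning a single string to a column fills every row with that one value, which would account for a "Program Year" column showing only one year (presumably `file` still held the last path from the loop). The sketch below works on a made-up four-row stand-in for NatHrs and parses each row from its own "File" entry with `Series.str.extract`. The frame contents, the "Scalar" column, and the four-digit-year pattern are all assumptions.

```python
import pandas as pd

# Made-up stand-in for NatHrs: two rows from each source file.
NatHrs = pd.DataFrame({
    "File": [
        r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv",
        r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv",
        r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv",
        r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv",
    ]
})

# Assigning one string broadcasts it to every row, whatever "File" says.
NatHrs["Scalar"] = "2019-2020"

# String methods on a Series go through .str; rsplit keeps only the file name.
short = NatHrs["File"].str.rsplit("\\", n=1).str[-1]

# Assumed pattern: a four-digit year, optionally followed by "-" and another.
# extract() needs a capture group; expand=False returns a Series.
NatHrs["Program Year"] = short.str.extract(r"(\d{4}(?:-\d{4})?)", expand=False)

print(NatHrs[["Scalar", "Program Year"]])
```

This runs on the already concatenated frame, so it does not depend on any variable left over from the loading loop.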
This finally worked. Thank you so much for all your help!

    import glob
    import os
    import re

    import pandas as pd

    globbed_files = glob.glob("Natl_hours_YTD_OC_*.csv")
    globbed_files

    data = []
    for csv in globbed_files:
        frame = pd.read_csv(csv, encoding='windows-1252', header=1)
        file = os.path.basename(csv)
        frame['filename'] = file
        # create a new column to store the portion of the file name that denotes
        # the Program Year to which the data belongs
        frame['Program Year'] = re.search(r'\d+[-]*\d*', file).group()
        data.append(frame)

    NatHrs = pd.concat(data, ignore_index=True)  # don't want pandas to try to align row indexes
    NatHrs.copy().head()
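
The question imports `Path` from pathlib without using it. As an alternative sketch, and assuming the program year is always whatever follows the last underscore in the file name, the year can be read off the path without a regular expression. `PureWindowsPath` is used here only so that the Windows-style example paths parse the same way on any operating system; with real files on Windows, `Path` should behave the same. The helper name `program_year` is hypothetical.

```python
from pathlib import PureWindowsPath


def program_year(path):
    """Return the text after the last underscore of the file name, extension removed."""
    stem = PureWindowsPath(path).stem  # e.g. "Natl_hours_YTD_OC_2018-2019"
    return stem.rsplit("_", 1)[-1]


paths = [
    r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv",
    r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2020.csv",
]
for p in paths:
    print(program_year(p))
```

Unlike the regular expression, this returns whatever follows the last underscore even if it is not a year, so it only suits file names that strictly follow the naming convention.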