How to read multiple ann files (from brat annotation) within a folder into one pandas dataframe?
I can read a single ann file into a pandas dataframe like this:
import pandas as pd
df = pd.read_csv('something/something.ann', sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
df.head()
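To see what that regular-expression separator does, here is a minimal sketch using plain `re.split`. It assumes the usual brat layout in which each line is an ID, a tab, and then the rest of the annotation. The sample line is made up for illustration. Splitting on the pattern gives an empty leading field, the captured ID, and the remainder, which appears to be why the call above drops column 0.

```python
import re

# Hypothetical brat-style line: ID, tab, "type start end", tab, covered text.
sample_line = "T1\tOrganization 0 4\tSony"

# Same pattern as the `sep` argument above, written as a raw string.
pattern = re.compile(r'^([^\s]*)\s')

# Because the pattern has a capture group, the matched ID is kept as its own field.
parts = pattern.split(sample_line)
print(parts)  # ['', 'T1', 'Organization 0 4\tSony']
```

Note that only the first whitespace character is treated as a separator, so the type, offsets, and text all stay together in the last field.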
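But I don't know how to read multiple ann files into one pandas dataframe. I tried using concat, but the result was not what I expected.
How can I read multiple ann files into one pandas dataframe?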
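It sounds like you need to use glob to pull all the .ann files from the folder and add them to a list of dataframes. After that, you may need to join/merge/concat them, depending on what you want.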
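I don't know your exact requirements, but the code below should get you close. As it stands, the script assumes that, relative to where you run the Python script, you have a subfolder called files and that you want to pull in all the .ann files in it (it will not look at anything else). Review and change it as required; every step is commented.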
import glob

import pandas as pd

path = r'./files'  # use your path
all_files = glob.glob(path + "/*.ann")

# create an empty list to hold the dataframes from the files found
dfs = []

# for each file in the path above ending in .ann
for file in all_files:
    # open the file (raw string so the regex backslashes are passed through as-is)
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    # add this new (temporary, per-iteration) frame to the end of the list
    dfs.append(df)

# at this point you have a list of frames, one list item per .ann file,
# like [annFile1, annFile2, etc.] - just not those names.

# handle a list that is empty
if len(dfs) == 0:
    print('No files found.')
    # create a dummy frame
    df = pd.DataFrame()
# or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
# or concatenate more than one frame together
else:  # modify this join as required.
    # ignore_index=True already gives the result a fresh 0..n-1 index
    df = pd.concat(dfs, ignore_index=True)

# check what you've got
print(df.head())
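If the concatenated result is hard to interpret because you can no longer tell which rows came from which file, one possible variation is to tag each frame before concatenating. This is only a sketch under a few assumptions: the function name `read_ann_folder` and the `source_file` column are made up for illustration, and zero-byte .ann files are skipped because `pd.read_csv` raises an error on empty input.

```python
from pathlib import Path

import pandas as pd


def read_ann_folder(folder):
    """Read every .ann file in `folder` into one frame, tagging rows by file."""
    frames = []
    # sorted() keeps the row order reproducible across runs
    for ann_path in sorted(Path(folder).glob("*.ann")):
        # skip zero-byte files, which pd.read_csv cannot parse
        if ann_path.stat().st_size == 0:
            continue
        frame = pd.read_csv(
            ann_path,
            sep=r'^([^\s]*)\s',
            engine='python',
            header=None,
        ).drop(0, axis=1)
        # hypothetical extra column so each row can be traced back to its file
        frame["source_file"] = ann_path.name
        frames.append(frame)
    if not frames:
        # pd.concat raises ValueError on an empty list, so return early
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Calling `read_ann_folder('./files')` would then give one frame with the ID in column 1, the rest of each annotation in column 2, and the originating file name in `source_file`.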