Is there a way in Python to delete several rows in a CSV file?
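I am currently downloading the form.idx file for the first quarter of 2016 from sec.gov. Since I am only interested in 10-Ks, I want to save the file as a .csv file and delete the rows I don't need. I tried to filter by form type, but that didn't work.
My code so far is the following: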
import requests
import os
import pandas as pd

years = [2016]
quarters = ['QTR1']
base_path = '/Users/xyz/Desktop'
current_dirs = os.listdir(path=base_path)
for yr in years:
    if str(yr) not in current_dirs:
        os.mkdir('/'.join([base_path, str(yr)]))
    current_files = os.listdir('/'.join([base_path, str(yr)]))
    for qtr in quarters:
        local_filename = f'{yr}-{qtr}.csv'
        local_file_path = '/'.join([base_path, str(yr), local_filename])
        if local_filename in current_files:
            print(f'Skipping file for {yr}, {qtr} because it is already saved.')
            continue
        url = f'https://www.sec.gov/Archives/edgar/full-index/{yr}/{qtr}/form.idx'
        r = requests.get(url, stream=True)
        with open(local_file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
r2 = pd.read_csv('/Users/xyz/Desktop/2016-QTR1.csv', sep=";", encoding="utf-8")
r2.head()
filt = (r2 ['Form Type'] == '10-K')
r2_10K = r2.loc[filt]
r2_10K.head()
r2_10K.to_csv('/Users/xyz/Desktop/modified.csv')
The error message I get is:
Traceback (most recent call last):
  File "<ipython-input-5-f84e3f81f3d1>", line 61, in <module>
    filt = (r2 ['Form Type'] == '10-K')
  File "/Users/xyz/opt/anaconda3/envs/spyder-4.1.5_1/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/xyz/opt/anaconda3/envs/spyder-4.1.5_1/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: 'Form Type'
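Maybe there is a way to just delete the rows in the file that I don't need?
Otherwise, I would also appreciate any help with this problem.
Thanks a lot.
Kind regards,
Elena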
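There are several ways to delete rows from a CSV file. The pandas library for Python provides many functions for changing the data loaded from a CSV file.
First, import the pandas library:
import pandas as pd
Then read your CSV file:
df = pd.read_csv("filename.csv")
For example, say you have a DataFrame named df that holds the contents of your CSV file. You can delete rows by index like this:
df1 = df.drop([df.index[1], df.index[2]])
With pandas you can delete rows in many ways, for example by row value, by null value, by data type, and so on.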
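To make those options concrete, here is a minimal sketch of three of them: dropping by index, keeping rows by value, and dropping rows with nulls. The sample data and the column names (form_type, company, cik) are made up for illustration and stand in for whatever "filename.csv" contains.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for "filename.csv".
csv_text = """form_type,company,cik
10-K,Alpha Corp,1001
10-Q,Beta Inc,1002
8-K,,1003
10-K,Delta LLC,1004
"""
df = pd.read_csv(io.StringIO(csv_text))

# 1. Delete rows by index label: drop the second and third rows.
by_index = df.drop([df.index[1], df.index[2]])

# 2. Delete rows by value: keep only the rows where form_type is "10-K".
by_value = df[df["form_type"] == "10-K"]

# 3. Delete rows that contain at least one missing value.
no_nulls = df.dropna()

# Serialize the filtered rows back to CSV text; index=False leaves out
# the extra index column. Pass a file path as the first argument to
# write to disk instead.
csv_out = by_value.to_csv(index=False)
print(csv_out)
```

None of these calls change the file on disk. The rows are only "deleted" from the file once you write the result back out with to_csv.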
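Here is complete working code for you. The main problem is the format of the file you get from the web: form.idx is a whitespace-aligned text file with a header block, not a semicolon-separated CSV. With sep=";" every line ends up in a single column, so there is no 'Form Type' column and you get the KeyError. Full code: https://rextester.com/QUGF24653
What I did:
- Skipped the first 10 rows
- Set the column names explicitly and used a separator of three or more whitespace characters
- Split the last parsed column into two new columns
- Filtered the form type on "10-K"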
import requests
import os
import pandas as pd

years = [2016]
quarters = ['QTR1']
base_path = '/Users/xyz/Desktop'
current_dirs = os.listdir(path=base_path)
for yr in years:
    # Create the folder for this year if it does not exist yet
    if str(yr) not in current_dirs:
        os.mkdir('/'.join([base_path, str(yr)]))
    current_files = os.listdir('/'.join([base_path, str(yr)]))
    for qtr in quarters:
        local_filename = f'{yr}-{qtr}.csv'
        local_file_path = '/'.join([base_path, str(yr), local_filename])
        if local_filename in current_files:
            print(f'Skipping file for {yr}, {qtr} because it is already saved.')
            continue
        # Download the quarterly index and write it to disk in chunks
        url = f'https://www.sec.gov/Archives/edgar/full-index/{yr}/{qtr}/form.idx'
        r = requests.get(url, stream=True)
        with open(local_file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)

# The header block is skipped, so the column names are set explicitly
colnames = ['Form Type', 'Company Name', 'CIK', 'Date Filed', 'File Name']
# Split on runs of 3 or more whitespace characters; a regex separator
# needs the python parsing engine
r2 = pd.read_csv('/Users/xyz/Desktop/2016-QTR1.csv', sep=r'\s{3,}', skiprows=10,
                 encoding="utf-8", names=colnames, header=None, engine='python')
# The date and the file name are parsed as one field; split them on whitespace
r2[['Date Filed', 'File Name']] = r2['Date Filed'].str.split(expand=True)
# Keep only the 10-K filings
filtered = (r2['Form Type'] == '10-K')
r2_10K = r2.loc[filtered]
print(r2_10K.head())
Output:
     Form Type                            Company Name      CIK  Date Filed                                    File Name
2181      10-K                       1347 Capital Corp  1606163  2016-03-21  edgar/data/1606163/0001144204-16-089184.txt
2182      10-K  1347 Property Insurance Holdings, Inc.  1591890  2016-03-17  edgar/data/1591890/0001387131-16-004603.txt
2183      10-K                1ST CONSTITUTION BANCORP  1141807  2016-03-22  edgar/data/1141807/0001141807-16-000010.txt
2184      10-K                         1ST SOURCE CORP    34782  2016-02-19    edgar/data/34782/0000034782-16-000102.txt
2185      10-K            1st Century Bancshares, Inc.  1420525  2016-03-04  edgar/data/1420525/0001437749-16-026765.txt
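If you want to see the effect without downloading anything, the following sketch reproduces both the failure and the fix on a few in-memory lines, and then writes the filtered rows out as CSV, which is what the question asked for. The sample text is a simplified stand-in for form.idx: it has a two-line header instead of the real preamble (hence skiprows=2 rather than 10), the spacing is an assumption modeled on the output above, and the "Example Holdings Inc" row is invented.

```python
import io
import pandas as pd

# Simplified, partly hypothetical stand-in for form.idx: columns are
# padded with spaces, and only two spaces separate date and file name.
sample = (
    "Form Type   Company Name              CIK         Date Filed  File Name\n"
    "---------------------------------------------------------------------\n"
    "10-K        1347 Capital Corp         1606163     2016-03-21  edgar/data/1606163/0001144204-16-089184.txt\n"
    "10-Q        Example Holdings Inc      1234567     2016-02-10  edgar/data/1234567/0000000000-16-000001.txt\n"
    "10-K        1ST SOURCE CORP           34782       2016-02-19  edgar/data/34782/0000034782-16-000102.txt\n"
)

# With sep=";" nothing is split: the whole header line becomes the single
# column name, so looking up 'Form Type' raises KeyError.
wrong = pd.read_csv(io.StringIO(sample), sep=";")
print(len(wrong.columns), "Form Type" in wrong.columns)

# The fix: skip the header lines, name the columns, split on 3+ whitespace.
colnames = ["Form Type", "Company Name", "CIK", "Date Filed", "File Name"]
df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s{3,}",
    skiprows=2,
    names=colnames,
    header=None,
    engine="python",  # regex separators need the python engine
)

# Date and file name arrive as one field; split that field on whitespace.
df[["Date Filed", "File Name"]] = df["Date Filed"].str.split(expand=True)

# Keep only the 10-K rows and serialize them without the index column.
ten_k = df[df["Form Type"] == "10-K"]
csv_out = ten_k.to_csv(index=False)
print(csv_out)
```

On the real data you would pass a file path to to_csv, for example the modified.csv path from the question, again with index=False so the row numbers are not written as an extra column.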