Import Kaggle csv from download url to pandas DataFrame
I have been trying different methods to import the SpaceX missions csv file on Kaggle directly into a pandas DataFrame, without success.
I need to send a login request. This is what I have so far:
import requests
import pandas as pd
from io import StringIO
# Link to the Kaggle data set & name of zip file
login_url = 'http://www.kaggle.com/account/login?ReturnUrl=/spacex/spacex-missions/downloads/database.csv'
# Kaggle Username and Password
kaggle_info = {'UserName': "user", 'Password': "pwd"}
# Login to Kaggle and retrieve the data.
r = requests.post(login_url, data=kaggle_info, stream=True)
df = pd.read_csv(StringIO(r.text))
r is returning the html content of the page.
df = pd.read_csv(url)
gives a CParser error:
CParserError: Error tokenizing data. C error: Expected 1 fields in line 13, saw 6
I have searched for a solution, but nothing I have tried so far has worked.
You are creating a stream and passing it directly to pandas. I think you need to pass a file-like object to pandas instead. Take a look here for a possible solution (using post, but not requests).
I also think the login url with the redirect that you are using is not working as it is. I know I suggested that here, but I ended up not using it because the post request call was not handling the redirect (I suspect).
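For illustration, a minimal sketch of the file-like-object idea might look like the following. It is only a sketch: the login_url and credentials are placeholders, and it assumes the POST actually returns the CSV bytes rather than an HTML login page.

import io
import requests
import pandas as pd

login_url = 'https://www.kaggle.com/account/login?ReturnUrl=/spacex/spacex-missions/downloads/database.csv'
payload = {'UserName': 'user', 'Password': 'pwd'}

with requests.Session() as s:
    # Log in and follow the redirect to the file (assumed to work here).
    r = s.post(login_url, data=payload)
    r.raise_for_status()
    # Wrap the raw bytes in a seekable, file-like object for pandas.
    df = pd.read_csv(io.BytesIO(r.content))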
The code I ended up using in my project is this:
import requests
from os import path

import config  # project-specific settings module (kaggle credentials, raw_data_dir)


def from_kaggle(data_sets, competition):
    """Fetches data from Kaggle

    Parameters
    ----------
    data_sets : (array)
        list of dataset filenames on kaggle. (e.g. train.csv.zip)
    competition : (string)
        name of kaggle competition as it appears in url
        (e.g. 'rossmann-store-sales')
    """
    kaggle_dataset_url = "https://www.kaggle.com/c/{}/download/".format(competition)
    KAGGLE_INFO = {'UserName': config.kaggle_username,
                   'Password': config.kaggle_password}
    for data_set in data_sets:
        data_url = path.join(kaggle_dataset_url, data_set)
        data_output = path.join(config.raw_data_dir, data_set)
        # Attempts to download the CSV file. Gets rejected because we are not logged in.
        r = requests.get(data_url)
        # Login to Kaggle and retrieve the data.
        r = requests.post(r.url, data=KAGGLE_INFO, stream=True)
        # Writes the data to a local file one chunk at a time.
        with open(data_output, 'wb') as f:
            # Reads 512KB at a time into memory
            for chunk in r.iter_content(chunk_size=(512 * 1024)):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
Example usage:
sets = ['train.csv.zip',
        'test.csv.zip',
        'store.csv.zip',
        'sample_submission.csv.zip', ]

from_kaggle(sets, 'rossmann-store-sales')
You may need to unzip the files.
import zipfile


def _unzip_folder(destination):
    """Unzip without regard to the folder structure.

    Parameters
    ----------
    destination : (str)
        Local path and filename where the zip file is stored.
    """
    with zipfile.ZipFile(destination, "r") as z:
        z.extractall(config.raw_data_dir)
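For example (hypothetical call, reusing the sets list and config module from above):

for data_set in sets:
    _unzip_folder(path.join(config.raw_data_dir, data_set))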
So I never actually loaded it directly into a DataFrame; I stored it to disk first. But you could modify this to use a temporary directory and delete the files after you have read them.
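A minimal sketch of that temporary-directory variant, reusing the same login trick as from_kaggle above and built around a hypothetical kaggle_zip_to_df helper, could look like this:

import tempfile
import zipfile
from os import path

import pandas as pd
import requests


def kaggle_zip_to_df(data_url, csv_name, kaggle_info):
    # Hypothetical helper: download one zipped dataset into a temporary
    # directory, extract it, load the CSV into a DataFrame, and let the
    # temporary files be removed automatically.
    with tempfile.TemporaryDirectory() as tmp_dir:
        zip_path = path.join(tmp_dir, 'data.zip')
        # The first GET is redirected to the login page; the POST to that
        # page returns the actual file (same pattern as from_kaggle above).
        r = requests.get(data_url)
        r = requests.post(r.url, data=kaggle_info, stream=True)
        with open(zip_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=512 * 1024):
                if chunk:
                    f.write(chunk)
        with zipfile.ZipFile(zip_path, 'r') as z:
            z.extractall(tmp_dir)
        return pd.read_csv(path.join(tmp_dir, csv_name))

Everything inside the tempfile.TemporaryDirectory() block is deleted when the block exits, so nothing is left on disk once the DataFrame has been returned.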