Combining all csv files from a GitHub repository link into one csv file

I want to collect all the csv files from the following GitHub repository link and combine them into one new csv file (for data-cleaning purposes):

https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports

so that my new csv file will contain the data for all dates.

With the following code, I can only load 01-01-2021.csv:

import numpy as np
import pandas as pd
import requests

df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-01-2021.csv')

df.head()

How can I load all of the csv files at once?

Have a look at pd.concat?

Assuming you have all the file links:

dfs = []
for l in links:
    df = pd.read_csv(l)
    dfs.append(df)
final_df = pd.concat(dfs)
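
If you don't already have such a list, it could be built, for example, from the raw-file URL pattern used in the question (the three dates below are only an illustrative assumption):

base = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/'
        'master/csse_covid_19_data/csse_covid_19_daily_reports')
dates = ['01-01-2021', '01-02-2021', '01-03-2021']  # assumed sample dates
links = [f'{base}/{d}.csv' for d in dates]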

The csv files at the link you provided are named in month-day-year format (mm-dd-yyyy.csv). So I wrote a loop that builds each filename and loads the csv directly from the given URL. This should work unless the site uses a random naming convention for its csv files.

import pandas as pd

years = [2020, 2021]
months = range(1, 13)
days = range(1, 32)
URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports'

all_files = []
for year in years:
    for month in months:
        month = str(month).zfill(2)
        for day in days:
            day = str(day).zfill(2)
            print(f"{month}-{day}-{year}.csv")
            try:
                df = pd.read_csv(URL + f"/{month}-{day}-{year}.csv")
            except Exception:
                continue  # no file for this date (invalid date or missing report), skip it
            all_files.append(df)

final_csv_file = pd.concat(all_files, axis=0, ignore_index=True)

Here is a snapshot of the output I got from the code above. Here, though, I only looped over the two elements 1 and 2 for the day and the month, and over the year 2021. As long as the site uses a non-random naming convention, this should work.

Here you go! You can specify a start date and an end date to fetch all the data for the dates between them. This also checks whether the url for a particular date exists, and only adds the data to the final dataframe if the url is valid.

import requests
import pandas as pd


def is_leap_year(year):
    # checks if the current year is leap year

    """
        params:
            year - int
        
        returns:
            bool
    """

    if((year%4==0 and year%100!=0) or (year%400==0)):
        return True
    else:
        return False


def split_date(date_str):
    # Splits the date into month, day and year

    """
        params:
            date_str - str (mm-dd-yyyy)
        
        returns:
            month - int
            day - int
            year - int
    """

    month, day, year = list(int(x) for x in date_str.split("-")) # For US standards, for rest of the world feel free to swap month and day
    return month, day, year


def generate_dates(start_date, end_date):
    # This doesn't validate the dates and it is assumed that the start_date and end_dates both are valid dates with the end date > start_date
    # This generates all dates bw start date and end date and also takes into account leap year as well

    """
        params:
            start_date - str (mm-dd-yyyy)
            end_date - str (mm-dd-yyyy)
        
        returns:
            dates - list of strings of dates between start_date and end_date
    """

    dates = []
    start_month, start_day, start_year = split_date(start_date)
    end_month, end_day, end_year = split_date(end_date)
    
    year = start_year
    while(year<=end_year):
        month = start_month if(year==start_year) else 1
        max_month = end_month if(year==end_year) else 12
        while(month<=max_month):
            day = start_day if(year==start_year and month==start_month) else 1
            if(month==2):
                max_day = 29 if(is_leap_year(year)) else 28
            else:
                max_day = 31 if(month in [1,3,5,7,8,10,12]) else 30
            if(year==end_year and month==end_month):
                max_day = end_day
            while(day<=max_day):
                new_date = f"{month:02d}-{day:02d}-{year}"  # zero-pad to match the mm-dd-yyyy.csv file names
                dates.append(new_date)
                day+=1
            month+=1
        year+=1

    return dates


def check_if_url_is_valid(url):
    # This checks if the url is valid through the python requests library, by making a GET request. if the url is present and valid then it returns status code in (200-300)

    """
        params:
            url - str
        
        returns:
            bool
    """

    r = requests.get(url)
    if(r.status_code in range(200,300)):
        return True
    else:
        return False
        

def to_df(base_url, start_date, end_date):
    # Takes all the generated dates, creates a url for each date through the base url and then tries to download it, else prints out an error message

    """
        params:
            base_url - str it should be of the format "https://github.com/{}.csv" where the {} will be used for string formatting and different dates will be put into it

            returns:
                final_df - pd.DataFrame 
    """

    files = []
    dates = generate_dates(start_date, end_date)
    for date in dates:
        url = base_url.format(date)
        valid_url = check_if_url_is_valid(url)
        if(valid_url):
            df = pd.read_csv(url)
            files.append(df)
        else:
            print(f"Could not download {date} data as it may be unavailable")
    final_df = pd.concat(files)
    print(f"\n Downloaded {len(files)} files!\n")
    return final_df
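
A hypothetical call, using the raw-file URL pattern from the question (the date range is only an illustrative assumption):

base_url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/'
            'master/csse_covid_19_data/csse_covid_19_daily_reports/{}.csv')
final_df = to_df(base_url, start_date="01-01-2021", end_date="01-31-2021")
final_df.head()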

Update:

Here is the Google Colab link for the same - https://colab.research.google.com/drive/19ysmJ2wWaiEpzGae7XqOSPa-FfNZqza3?usp=sharing

Here is a short solution that uses pandas, requests and BeautifulSoup to filter out all the csv links:

import pandas as pd
import requests
from bs4 import BeautifulSoup, SoupStrainer

html = requests.get('https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports')

dfs = []
for link in BeautifulSoup(html.text, 'html.parser', parse_only=SoupStrainer('a')):
    if hasattr(link, 'href') and link.get('href', '').endswith('.csv'):
        url = 'https://github.com'+link['href'].replace('/blob/', '/raw/')
        dfs.append(pd.read_csv(url))
df = pd.concat(dfs)

NB: tested code; this ran in ~12 minutes and produced a final dataframe of 2,300,506 rows × 21 columns. Ideally, multithreading should be added to it to download several files in parallel (within reason, so as not to get kicked by the server).
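
A minimal sketch of that idea with concurrent.futures from the standard library, assuming the csv links have first been collected into a list `urls` (e.g. by appending the url strings in the loop above instead of reading them immediately); the worker count of 8 is an arbitrary choice:

from concurrent.futures import ThreadPoolExecutor

def download(url):
    # each worker downloads and parses one csv file
    return pd.read_csv(url)

# keep the pool small so the server is not hammered with parallel requests
with ThreadPoolExecutor(max_workers=8) as pool:
    dfs = list(pool.map(download, urls))
df = pd.concat(dfs)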