How to Access Private Github Repo File (.csv) in Python using Pandas or Requests

I had to switch my public GitHub repository to private and can no longer access its files, not even with an access token, the way I could while the repository was public.

I can access the CSV in my private repo with curl:
'''
curl -s https://{token}@raw.githubusercontent.com/username/repo/master/file.csv
'''

However, I want to access this data from my Python file. When the repo was public, I could simply use:
'''
url = 'https://raw.githubusercontent.com/username/repo/master/file.csv'
df = pd.read_csv(url, error_bad_lines=False)
'''

This no longer works now that the repo is private, and I cannot find a way to download this CSV from Python rather than pulling it from the terminal.

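If I try:
'''
requests.get('https://{token}@raw.githubusercontent.com/username/repo/master/file.csv')
'''
I get a 404 response, which is basically what happens with pd.read_csv() as well. If I click through to the raw file, I see that a temporary token is created and the URL is:
'''
https://raw.githubusercontent.com/username/repo/master/file.csv?token=TEMPTOKEN
'''
Is there a way to attach my permanent private access token so that I can always pull this data from GitHub?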

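Have you looked at PyGithub? It is very useful for accessing repositories, files, pull requests, history, and so on; see the PyGithub documentation for details. Here is an example script that creates a new branch off the base branch (you will need that access token, or generate a new one!), deletes a file on it, and opens a pull request: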

from github import Github

my_reviewers = ['usernames', 'of_reviewers']
gh = Github("<token string>")  # personal access token
repo_name = '<my_org>/<my_repo>'
repo = gh.get_repo(repo_name)

# Create a new branch off the tip of the default branch
default_branch_name = repo.default_branch
base = repo.get_branch(default_branch_name)
new_branch_name = "my_new_branchname"
new_branch = repo.create_git_ref(ref=f'refs/heads/{new_branch_name}', sha=base.commit.sha)

# Delete a file on the new branch
contents = repo.get_contents("some_script_in_repo.sh", ref=new_branch_name)
repo.delete_file(contents.path, "commit message", contents.sha, branch=new_branch_name)

# Open a pull request from the new branch and request reviews
pr = repo.create_pull(
    title="PR to Remove some_script_in_repo.sh",
    body="This is the text in the main body of your pull request",
    head=new_branch_name,
    base=default_branch_name,
)
pr.create_review_request(reviewers=my_reviewers)
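
For the CSV in the question, the same library can also read a file's contents. The following is only a minimal sketch and not part of the original script: the helper names `fetch_csv_buffer` and `load_csv_with_pygithub` are made up for illustration, and it assumes the file is small enough for the contents API to return it inline (larger files may come back without content). It relies on PyGithub's `get_contents` and the `decoded_content` attribute of the returned file object.

```python
import io


def fetch_csv_buffer(repo, path, ref=None):
    """Return a text buffer with the file's contents, ready for pandas.read_csv.

    `repo` is expected to be a PyGithub Repository (the result of Github.get_repo).
    """
    # Only pass `ref` when a branch, tag or commit was actually requested
    if ref is None:
        contents = repo.get_contents(path)
    else:
        contents = repo.get_contents(path, ref=ref)
    # decoded_content holds the file's bytes after base64 decoding
    return io.StringIO(contents.decoded_content.decode("utf-8"))


def load_csv_with_pygithub(token, repo_name, path, ref=None):
    """Hypothetical convenience wrapper: token + 'owner/repo' + path -> DataFrame."""
    # Third-party imports are kept local so fetch_csv_buffer() works without them
    import pandas as pd
    from github import Github

    repo = Github(token).get_repo(repo_name)
    return pd.read_csv(fetch_csv_buffer(repo, path, ref))
```

Usage would look like `df = load_csv_with_pygithub("<token string>", "<my_org>/<my_repo>", "file.csv")`.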

Hope this helps, happy coding!

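Yes, you can download the CSV file in Python instead of pulling it from the terminal. To do so, you can use the GitHub API v3 with the help of the 'requests' and 'io' modules. A reproducible example follows.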

import numpy as np
import pandas as pd
import requests
from io import StringIO

# Create CSV file
df = pd.DataFrame(np.random.randint(2,size=10_000).reshape(1_000,10))
df.to_csv('filename.csv') 

# -> now upload file to private github repo

# define parameters for a request
token = 'paste-there-your-personal-access-token' 
owner = 'repository-owner-name'
repo = 'repository-name-where-data-is-stored'
path = 'filename.csv'

# send a request
r = requests.get(
    'https://api.github.com/repos/{owner}/{repo}/contents/{path}'.format(
    owner=owner, repo=repo, path=path),
    headers={
        'accept': 'application/vnd.github.v3.raw',
        'authorization': 'token {}'.format(token)
            }
    )

# convert string to StringIO object
string_io_obj = StringIO(r.text)

# Load data to df
df = pd.read_csv(string_io_obj, sep=",", index_col=0)

# optionally write df to CSV
df.to_csv("file_name_02.csv")
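
The request above can be wrapped in a small reusable helper. This is a sketch under a few assumptions rather than a definitive implementation: the function names `contents_request` and `load_private_csv` are hypothetical, the optional `ref` query parameter is used to pick a branch, tag, or commit other than the default branch, and `raise_for_status()` is added so that a rejected token surfaces as an error instead of being parsed as CSV.

```python
from io import StringIO
from urllib.parse import quote


def contents_request(owner, repo, path, token, ref=None):
    """Build the URL, headers and query parameters for the contents endpoint."""
    # quote() keeps "/" but escapes spaces and other unsafe characters in the path
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{quote(path)}"
    headers = {
        "Accept": "application/vnd.github.v3.raw",  # ask for the raw file body
        "Authorization": f"token {token}",
    }
    # "ref" selects a branch, tag or commit; leaving it out means the default branch
    params = {"ref": ref} if ref else {}
    return url, headers, params


def load_private_csv(owner, repo, path, token, ref=None, **read_csv_kwargs):
    """Download a CSV through the contents endpoint and parse it with pandas."""
    # Third-party imports are kept local so contents_request() works without them
    import pandas as pd
    import requests

    url, headers, params = contents_request(owner, repo, path, token, ref)
    r = requests.get(url, headers=headers, params=params, timeout=30)
    r.raise_for_status()  # fail loudly on 401/403/404 instead of parsing an error body
    return pd.read_csv(StringIO(r.text), **read_csv_kwargs)
```

With the parameters defined earlier, the call would be `df = load_private_csv(owner, repo, path, token, index_col=0)`.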

This is what ultimately worked for me. Leaving it here in case anyone runs into the same problem. Thanks for the help!

    import io

    import pandas
    import requests

    user = 'my_github_username'
    pao = 'my_pao'  # personal access token

    # a session that sends the username and token with every request
    github_session = requests.Session()
    github_session.auth = (user, pao)

    # providing raw url to download csv from github
    csv_url = 'https://raw.githubusercontent.com/user/repo/master/csv_name.csv'

    download = github_session.get(csv_url).content
    # on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
    downloaded_csv = pandas.read_csv(io.StringIO(download.decode('utf-8')), on_bad_lines='skip')

This approach works really well for me:

    import os

    import requests

    def _github(url: str, mode: str = "private"):
        # turn a github.com page URL (/blob/ or /raw/) into a raw.githubusercontent.com URL
        url = url.replace("/blob/", "/")
        url = url.replace("/raw/", "/")
        url = url.replace("github.com/", "raw.githubusercontent.com/")

        if mode == "public":
            return requests.get(url)
        else:
            # '...' is a placeholder default; set GITHUB_TOKEN in the environment
            token = os.getenv('GITHUB_TOKEN', '...')
            headers = {
                'Authorization': f'token {token}',
                'Accept': 'application/vnd.github.v3.raw'}
            return requests.get(url, headers=headers)
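
One caveat with the plain string replacements: they also rewrite a `/blob/` or `/raw/` that happens to appear deeper in the file path. A slightly more defensive variant could parse the URL instead. This is just a sketch; the name `to_raw_url` is hypothetical, and it assumes page URLs of the shape `https://github.com/<owner>/<repo>/blob/<ref>/<path>`.

```python
from urllib.parse import urlsplit


def to_raw_url(url: str) -> str:
    """Convert a github.com file page URL into its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    if parts.netloc != "github.com":
        # Already a raw URL, or not a GitHub page URL: leave it untouched
        return url
    segments = parts.path.strip("/").split("/")
    # Expected shape: owner / repo / ("blob" | "raw") / ref / path...
    if len(segments) >= 5 and segments[2] in ("blob", "raw"):
        owner, repo = segments[0], segments[1]
        rest = "/".join(segments[3:])  # ref plus the file path, unchanged
        return f"https://raw.githubusercontent.com/{owner}/{repo}/{rest}"
    return url
```

Only the third path segment is inspected, so a folder that is itself named `raw` or `blob` stays intact.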

Adding another working example:

import os

import requests
from requests.structures import CaseInsensitiveDict

# Variables
GH_PREFIX = "https://raw.githubusercontent.com"
ORG = "my-user-name"
REPO = "my-repo-name"
BRANCH = "main"
FOLDER = "some-folder"
FILE = "some-file.csv"
URL = GH_PREFIX + "/" + ORG + "/" + REPO + "/" + BRANCH + "/" + FOLDER + "/" + FILE

# Assumption: the token is read from the environment (the original snippet
# does not show where GITHUB_TOKEN is defined)
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]

# Headers setup
headers = CaseInsensitiveDict()
headers["Authorization"] = "token " + GITHUB_TOKEN

# Execute and view status
resp = requests.get(URL, headers=headers)
if resp.status_code == 200:
    print(resp.content)
else:
    print("Request failed!")

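Apparently, raw.githubusercontent.com links can nowadays be used with just a token, but with Python's requests they need a username:token combination, which used to be the norm before GitHub changed it so that a token alone is enough.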

So:

https://{token}@raw.githubusercontent.com/username/repo/master/file.csv

becomes

https://{username}:{token}@raw.githubusercontent.com/username/repo/master/file.csv

Sample code for the above:

from requests import get as rget

# same URL shape as above: <username>:<token>@ in front of the raw host
res = rget("https://<username>:<token>@raw.githubusercontent.com/<username>/<repo>/master/file.csv")
with open('file.csv', 'wb') as f:
    f.write(res.content)
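
Credentials embedded in a URL as `username:token@` are sent by requests as an HTTP Basic `Authorization` header, so the same request can be made without putting the token in the URL itself, where it could end up in logs or tracebacks. The sketch below illustrates this; the helper names `basic_auth_header` and `download_private_file` are hypothetical and not taken from the answer above.

```python
import base64


def basic_auth_header(username, token):
    """Return the Authorization header value that HTTP Basic auth produces."""
    raw = f"{username}:{token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def download_private_file(url, username, token, dest):
    """Fetch `url` with Basic auth passed separately and save the body to `dest`."""
    # Imported locally so basic_auth_header() can be used without requests installed
    import requests

    # auth=(user, password) makes requests build the Basic header for us
    res = requests.get(url, auth=(username, token), timeout=30)
    res.raise_for_status()  # a 404 here often means the credentials were not accepted
    with open(dest, "wb") as f:
        f.write(res.content)
```

The first helper is only there to show what ends up on the wire; in practice passing `auth=(username, token)` is enough.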