Shallow AND Sparse GIT Repository Clone

I have a shallow-cloned git repository that is over 1 GB. I use sparse checkout for the files/dirs I need.

How can I reduce the repository clone to just the sparse-checkout files/dirs?

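Initially I was able to limit the cloned repository to only the sparse checkout by disabling checkout when cloning, then setting up sparse checkout before doing the initial checkout. That limited the repository to only about 200 MB, which is much more manageable. However, updating the remote branch information at some point in the future causes the rest of the files and dirs to be included in the repository clone. That sends the repo clone size back to over 1 GB, and I don't know how to get back to just the sparse-checkout files and dirs.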
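One plausible reading of the workflow described above can be sketched as follows. Everything here is an assumption made for illustration: the throwaway repository under `$tmp`, the `docs/` and `src/` directories, and the use of the classic `core.sparseCheckout` setting with `.git/info/sparse-checkout`. The original post does not show its exact commands.

```shell
set -e
tmp=$(mktemp -d)

# Hypothetical upstream repository with two directories and two commits.
git init -q "$tmp/upstream"
mkdir -p "$tmp/upstream/docs" "$tmp/upstream/src"
echo guide > "$tmp/upstream/docs/guide.md"
echo code > "$tmp/upstream/src/main.c"
git -C "$tmp/upstream" add .
git -C "$tmp/upstream" -c user.name=demo -c user.email=demo@example.com commit -q -m 'first'
echo more >> "$tmp/upstream/src/main.c"
git -C "$tmp/upstream" -c user.name=demo -c user.email=demo@example.com commit -q -am 'second'

# Shallow clone with the checkout disabled (file:// so that --depth is honored).
git clone -q --depth 1 --no-checkout "file://$tmp/upstream" "$tmp/wc"

# Configure the sparse checkout before the first checkout.
git -C "$tmp/wc" config core.sparseCheckout true
mkdir -p "$tmp/wc/.git/info"
echo 'docs/' > "$tmp/wc/.git/info/sparse-checkout"

# The initial checkout only populates docs/ in the working tree.
git -C "$tmp/wc" checkout -q
ls "$tmp/wc"
```

In this sketch only the working tree is narrowed: the blob for `src/main.c` is still present in the clone's object database, which is the gap the question is about.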

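In short, what I want is a shallow AND sparse repository clone, not just a sparse checkout of a shallow repo clone. The full repo is a waste of space, and the performance of certain tasks suffers.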

I hope someone can share a solution. Thanks.

Shallow and sparse means "partial" or "narrow".

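A partial clone (or "narrow clone") is possible in theory, and was first implemented in Dec. 2017 with Git 2.16, as seen here.
But:

  • Only as of Git 2.18 can you make such a partial clone: see here for a test example.
  • Only with … and Git 2.19: this ensures that only a minimal amount of data is transferred.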
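As a rough illustration of what a partial clone leaves out, the sketch below builds a throwaway local repository, clones it with `--filter=blob:none`, and counts the objects that are referenced but not yet downloaded. The paths under `$demo` and the file names are placeholders, and the sketch assumes a reasonably recent Git and a serving side that permits filters through `uploadpack.allowFilter`.

```shell
set -e
demo=$(mktemp -d)

# Hypothetical origin repository holding two small blobs.
git init -q "$demo/origin"
echo one > "$demo/origin/a.txt"
echo two > "$demo/origin/b.txt"
git -C "$demo/origin" add .
git -C "$demo/origin" -c user.name=demo -c user.email=demo@example.com commit -q -m 'two blobs'

# The serving side has to allow object filters, otherwise the filter is ignored.
git -C "$demo/origin" config uploadpack.allowFilter true

# Partial clone: commits and trees are transferred, blobs are left out.
git clone -q --no-checkout --filter=blob:none "file://$demo/origin" "$demo/partial"

# --missing=print lists absent objects with a leading '?' instead of fetching them.
missing=$(git -C "$demo/partial" rev-list --objects --missing=print HEAD | grep -c '^?' || true)
echo "blobs not downloaded yet: $missing"
```

Because the checkout is disabled, nothing triggers a lazy fetch here; checking out a path later would download only the blobs that path needs.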

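This was further optimized in Git 2.20 (Q4 2018): in a partial clone that will be lazily hydrated from the originating repository, we generally want to avoid "does this object exist (locally)?" checks on objects that we deliberately omitted when we created the (partial/sparse) clone.
The cache-tree codepath (which is used to write a tree object out of the index), however, insisted that the object exists, even for paths that are outside of the partial checkout area.
The code has been updated to avoid such a check.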

See commit 2f215ff (09 Oct 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit a08b1d6, 19 Oct 2018)

cache-tree: skip some blob checks in partial clone

In a partial clone, whenever a sparse checkout occurs, the existence of all blobs in the index is verified, whether they are included or excluded by the .git/info/sparse-checkout specification.
This significantly degrades performance because a lazy fetch occurs whenever the existence of a missing blob is checked.


With Git 2.24 (Q4 2019), the cache-tree code has been taught to be less aggressive in attempting to see if a tree object it computed already exists in the repository.

See commit f981ec1 (03 Sep 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit ae203ba, 07 Oct 2019)

cache-tree: do not lazy-fetch tentative tree

The cache-tree datastructure is used to speed up the comparison between the HEAD and the index, and when the index is updated by a cherry-pick (for example), a tree object that would represent the paths in the index in a directory is constructed in-core, to see if such a tree object exists already in the object store.

When the lazy-fetch mechanism was introduced, we converted this "does the tree exist?" check into an "if it does not, and if we lazily cloned, see if the remote has it" call by mistake.
Since the whole point of this check is to repair the cache-tree by recording an already existing tree object opportunistically, we shouldn't even try to fetch one from the remote.

Pass the OBJECT_INFO_SKIP_FETCH_OBJECT flag to make sure we only check for existence in the local object store without triggering the lazy fetch mechanism.


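Before Git 2.25 (Q1 2020), the "git fetch" codepath had a big "do not lazily fetch missing objects when I ask if something exists" switch.

This has been corrected by marking the "does this thing exist?" calls with an "if not, please do not lazily fetch it" flag.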

See commit 603960b, commit e362fad (13 Nov 2019), and commit 6462d5e (05 Nov 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit fce9e83, 01 Dec 2019)

clone: remove fetch_if_missing=0

Signed-off-by: Jonathan Tan

Commit 6462d5eb9a ("fetch: remove fetch_if_missing=0", 2019-11-08) strove to remove the need for fetch_if_missing=0 from the fetching mechanism, so it is plausible to attempt removing fetch_if_missing=0 from clone as well. But doing so reveals a bug - when the server does not send an object directly pointed to by a ref, this should be an error, not a trigger for a lazy fetch. (This case in the fetching mechanism was covered by a test using "git clone", not "git fetch", which is why the aforementioned commit didn't uncover the bug.)

The bug can be fixed by suppressing lazy-fetching during the connectivity check. Fix this bug, and remove fetch_if_missing from clone.

And:

promisor-remote: remove fetch_if_missing=0

Signed-off-by: Jonathan Tan

Commit 6462d5eb9a ("fetch: remove fetch_if_missing=0", 2019-11-08) strove to remove the need for fetch_if_missing=0 from the fetching mechanism, so it is plausible to attempt removing fetch_if_missing=0 from the lazy-fetching mechanism in promisor-remote as well.

But doing so reveals a bug - when the server does not send an object pointed to by a tag object, an infinite loop occurs: Git attempts to fetch the missing object, which causes a deferencing of all refs (for negotiation), which causes a lazy fetch of that missing object, and so on.
This bug is because of unnecessary use of the fetch negotiator during lazy fetching - it is not used after initialization, but it is still initialized (which causes the dereferencing of all refs).

Thus, when the negotiator is not used during fetching, refrain from initializing it. Then, remove fetch_if_missing from promisor-remote.


See more with "Bring your monorepo down to size with sparse-checkout" from Derrick Stolee:

Pairing sparse-checkout with the partial clone feature accelerates these workflows even more.
This combination speeds up the data transfer process since you don’t need every reachable Git object, and instead, can download only those you need to populate your cone of the working directory

$ git clone --filter=blob:none --no-checkout https://github.com/derrickstolee/sparse-checkout-example
Cloning into 'sparse-checkout-example'...
Receiving objects: 100% (373/373), 75.98 KiB | 2.71 MiB/s, done.
Resolving deltas: 100% (23/23), done.
 
$ cd sparse-checkout-example/
 
$ git sparse-checkout init --cone
Receiving objects: 100% (3/3), 1.41 KiB | 1.41 MiB/s, done.
 
$ git sparse-checkout set client/android
Receiving objects: 100% (26/26), 985.91 KiB | 5.76 MiB/s, done.
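To connect this back to the question, the same idea can be combined with `--depth 1` so that the clone is shallow, partial, and sparse at once. The sketch below is a local, self-contained approximation rather than the blog post's own commands: the repository under `$work`, its file names, and the choice of `--sparse` instead of `--no-checkout` are assumptions, and it presumes Git 2.25 or later.

```shell
set -e
work=$(mktemp -d)

# Hypothetical source repository: one root file, two directories, two commits.
git init -q "$work/src"
mkdir -p "$work/src/client/android" "$work/src/server"
echo readme > "$work/src/README"
echo app > "$work/src/client/android/app.txt"
echo api > "$work/src/server/api.txt"
git -C "$work/src" add .
git -C "$work/src" -c user.name=demo -c user.email=demo@example.com commit -q -m 'first'
echo more >> "$work/src/server/api.txt"
git -C "$work/src" -c user.name=demo -c user.email=demo@example.com commit -q -am 'second'

# Let the local "server" accept filters and on-demand object requests.
git -C "$work/src" config uploadpack.allowFilter true
git -C "$work/src" config uploadpack.allowAnySHA1InWant true

# Shallow (--depth 1), partial (--filter=blob:none) and sparse (--sparse) in one clone.
git clone -q --depth 1 --filter=blob:none --sparse "file://$work/src" "$work/clone"

# Narrow the working tree to a single cone; the blobs it needs are fetched on demand.
git -C "$work/clone" sparse-checkout init --cone
git -C "$work/clone" sparse-checkout set client/android

ls "$work/clone"
git -C "$work/clone" rev-parse --is-shallow-repository
```

With this layout, `server/api.txt` is never written to the working tree, and its blob is not requested from the source repository unless something asks for it later.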

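Before Git 2.25.1 (Feb. 2020), has_object_file() said "no" when given an object registered to the system via pretend_object_file(), making it inconsistent with read_object_file() and causing lazy fetching to attempt to fetch the empty tree from promisor remotes.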

See discussion.

I tried to reproduce this with

empty_tree=$(git mktree </dev/null)
git init --bare x
git clone --filter=blob:none file://$(pwd)/x y
cd y
echo hi >README
git add README
git commit -m 'nonempty tree'
GIT_TRACE=1 git diff-tree "$empty_tree" HEAD

and indeed, it looks like Git serves the empty tree even from repositories that don't contain it.

See commit 9c8a294 (02 Jan 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit e26bd14, 22 Jan 2020)

sha1-file: remove OBJECT_INFO_SKIP_CACHED

Signed-off-by: Jonathan Tan

In a partial clone, if a user provides the hash of the empty tree ("git mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which requires that that object be parsed, for example:

git diff-tree 4b825d <a non-empty tree>

then Git will lazily fetch the empty tree, unnecessarily, because parsing of that object invokes repo_has_object_file(), which does not special-case the empty tree.

Instead, teach repo_has_object_file() to consult find_cached_object() (which handles the empty tree), thus bringing it in line with the rest of the object-store-accessing functions.
A cost is that repo_has_object_file() will now need to oideq upon each invocation, but that is trivial compared to the filesystem lookup or the pack index search required anyway. (And if find_cached_object() needs to do more because of previous invocations to pretend_object_file(), all the more reason to be consistent in whether we present cached objects.)

As a historical note, the function now known as repo_read_object_file() was taught the empty tree in 346245a1bb ("hard-code the empty tree object", 2008-02-13, Git v1.5.5-rc0 -- merge), and the function now known as oid_object_info() was taught the empty tree in c4d9986f5f ("sha1_object_info: examine cached_object store too", 2011-02-07, Git v1.7.4.1).

repo_has_object_file() was never updated, perhaps due to oversight.
The flag OBJECT_INFO_SKIP_CACHED, introduced later in dfdd4afcf9 ("sha1_file: teach sha1_object_info_extended more flags", 2017-06-26, Git v2.14.0-rc0) and used in e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26, Git v2.14.0-rc0), was introduced to preserve this difference in empty-tree handling, but now it can be removed.


Git 2.25.1 also warns programmers about pretend_object_file(), which allows the code to tentatively use in-core objects.

See commit 60440d7 (04 Jan 2020) by Jonathan Nieder (artagnon).
(Merged by Junio C Hamano -- gitster -- in commit b486d2e, 12 Feb 2020)

sha1-file: document how to use pretend_object_file

Inspired-by: Junio C Hamano
Signed-off-by: Jonathan Nieder

Like in-memory alternates, pretend_object_file contains a trap for the unwary: careless callers can use it to create references to an object that does not exist in the on-disk object store.

Add a comment documenting how to use the function without risking such problems.

The only current caller is blame, which uses pretend_object_file to create an in-memory commit representing the working tree state. Noticed during a discussion of how to safely use this function in operations like "git merge" which, unlike blame, are not read-only.

So the comment is now:

/*
 * Add an object file to the in-memory object store, without writing it
 * to disk.
 *
 * Callers are responsible for calling write_object_file to record the
 * object in persistent storage before writing any other new objects
 * that reference it.
 */
int pretend_object_file(void *, unsigned long, enum object_type,
            struct object_id *oid);

Git 2.25.1 (Feb. 2020) also includes a futureproofing change to make sure that a test does not depend on a current implementation detail.

See commit b54128b (13 Jan 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 3f7553a, 12 Feb 2020)

t5616: make robust to delta base change

Signed-off-by: Jonathan Tan

Commit 6462d5eb9a ("fetch: remove fetch_if_missing=0", 2019-11-08) contains a test that relies on having to lazily fetch the delta base of a blob, but assumes that the tree being fetched (as part of the test) is sent as a non-delta object.
This assumption may not hold in the future; for example, a change in the length of the object hash might result in the tree being sent as a delta instead.

Make the test more robust by relying on having to lazily fetch the delta base of the tree instead, and by making no assumptions on whether the blobs are sent as delta or non-delta.


Git 2.25.2 (March 2020) fixes a bug revealed by a recent change that made protocol v2 the default.

See commit 3e96c66, commit d0badf8 (21 Feb 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 444cff6, 02 Mar 2020)

partial-clone: avoid fetching when looking for objects

Signed-off-by: Derrick Stolee

While testing partial clone, I noticed some odd behavior. I was testing a way of running 'git init', followed by manually configuring the remote for partial clone, and then running 'git fetch'.
Astonishingly, I saw the 'git fetch' process start asking the server for multiple rounds of pack-file downloads! When tweaking the situation a little more, I discovered that I could cause the remote to hang up with an error.

Add two tests that demonstrate these two issues.

In the first test, we find that when fetching with blob filters from a repository that previously did not have any tags, the 'git fetch --tags origin' command fails because the server sends "multiple filter-specs cannot be combined". This only happens when using protocol v2.

In the second test, we see that a 'git fetch origin' request with several ref updates results in multiple pack-file downloads.
This must be due to Git trying to fault-in the objects pointed by the refs. What makes this matter particularly nasty is that this goes through the do_oid_object_info_extended() method, so there are no "haves" in the negotiation.
This leads the remote to send every reachable commit and tree from each new ref, providing a quadratic amount of data transfer! This test is fixed if we revert 6462d5eb9a (fetch: remove fetch_if_missing=0, 2019-11-05, Git v2.25.0-rc0), but that revert causes other test failures.
The real fix will need more care.
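The setup described in that commit message ('git init', then manually configuring the remote for partial clone, then 'git fetch') might look roughly like the sketch below. The configuration keys `remote.origin.promisor` and `remote.origin.partialclonefilter` are the ones Git uses for partial clones in recent versions (assumed here to be 2.24 or later); the repositories under `$lab` and their contents are invented for the example and are not taken from the commit's tests.

```shell
set -e
lab=$(mktemp -d)

# Hypothetical server-side repository with two blobs.
git init -q "$lab/server"
echo alpha > "$lab/server/alpha.txt"
echo beta > "$lab/server/beta.txt"
git -C "$lab/server" add .
git -C "$lab/server" -c user.name=demo -c user.email=demo@example.com commit -q -m 'seed'
git -C "$lab/server" config uploadpack.allowFilter true
branch=$(git -C "$lab/server" symbolic-ref --short HEAD)

# Client side: plain init, then mark the remote as a promisor with a blob filter.
git init -q "$lab/client"
git -C "$lab/client" remote add origin "file://$lab/server"
git -C "$lab/client" config remote.origin.promisor true
git -C "$lab/client" config remote.origin.partialclonefilter blob:none

# This fetch inherits the configured filter, so blobs stay on the server.
git -C "$lab/client" fetch -q origin

absent=$(git -C "$lab/client" rev-list --objects --missing=print "origin/$branch" | grep -c '^?' || true)
echo "blobs left on the server: $absent"
```

On a Git version affected by the issues above, the fetch step is where the extra pack downloads would show up; adding `GIT_TRACE=1` in front of that command is one way to watch for them.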

The fix:

When using partial clone, find_non_local_tags() in builtin/fetch.c checks each remote tag to see if its object also exists locally. There is no expectation that the object exist locally, but this function nevertheless triggers a lazy fetch if the object does not exist. This can be extremely expensive when asking for a commit, as we are completely removed from the context of the non-existent object and thus supply no "haves" in the request.

6462d5eb9a (fetch: remove fetch_if_missing=0, 2019-11-05, Git v2.25.0-rc0) removed a global variable that prevented these fetches in favor of a bitflag. However, some object existence checks were not updated to use this flag.

Update find_non_local_tags() to use OBJECT_INFO_SKIP_FETCH_OBJECT in addition to OBJECT_INFO_QUICK.
The _QUICK option only prevents repreparing the pack-file structures. We need to be extremely careful about supplying _SKIP_FETCH_OBJECT when we expect an object to not exist due to updated refs.

This resolves a broken test in t5616-partial-clone.sh.


The logic "git clone --single-branch" uses to auto-follow tags was not careful to avoid lazy-fetching unnecessary tags; this has been corrected with Git 2.27 (Q2 2020).

See commit 167a575 (01 Apr 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 3ea2b46, 22 Apr 2020)

clone: use "quick" lookup while following tags

Signed-off-by: Jeff King

When cloning with --single-branch, we implement git fetch's usual tag-following behavior, grabbing any tag objects that point to objects we have locally.

When we're a partial clone, though, our has_object_file() check will actually lazy-fetch each tag.

That not only defeats the purpose of --single-branch, but it does it incredibly slowly, potentially kicking off a new fetch for each tag.
This is even worse for a shallow clone, which implies --single-branch, because even tags which are supersets of each other will be fetched individually.

We can fix this by passing OBJECT_INFO_SKIP_FETCH_OBJECT to the call, which is what git fetch does in this case.

Likewise, let's include OBJECT_INFO_QUICK, as that's what git fetch does.
The rationale is discussed in 5827a03545 (fetch: use "quick" has_sha1_file for tag following, 2016-10-13, Git v2.10.2), but here the tradeoff would apply even more so because clone is very unlikely to be racing with another process repacking our newly-created repository.

This may provide a very small speedup even in the non-partial case, as we'd avoid calling reprepare_packed_git() for each tag (though in practice, we'd only have a single packfile, so that reprepare should be quite cheap).


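Before Git 2.27 (Q2 2020), serving a "git fetch" client over the "git://" and "ssh://" protocols using on-wire protocol version 2 was buggy on the server end when the client needed to make a follow-up request, for example to auto-follow tags.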

See commit 08450ef (08 May 2020) by Christian Couder (chriscool).
(Merged by Junio C Hamano -- gitster -- in commit a012588, 13 May 2020)

upload-pack: clear filter_options for each v2 fetch command

Helped-by: Derrick Stolee
Helped-by: Jeff King
Helped-by: Taylor Blau
Signed-off-by: Christian Couder

Because of the request/response model of protocol v2, the upload_pack_v2() function is sometimes called twice in the same process, while 'struct list_objects_filter_options filter_options' was declared as static at the beginning of 'upload-pack.c'.

This made the check in list_objects_filter_die_if_populated(), which is called by process_args(), fail the second time upload_pack_v2() is called, as filter_options had already been populated the first time.

To fix that, filter_options is not static any more. It's now owned directly by upload_pack(). It's now also part of 'struct upload_pack_data', so that it's owned indirectly by upload_pack_v2().

In the long term, the goal is to also have upload_pack() use 'struct upload_pack_data', so adding filter_options to this struct makes more sense than to have it owned directly by upload_pack_v2().

This fixes the first of the 2 bugs documented by d0badf8797 ("partial-clone: demonstrate bugs in partial fetch", 2020-02-21, Git v2.26.0-rc0 -- merge listed in batch #8).


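Git 2.29 (Q4 2020) fixes one more case: the pretend-object mechanism checks whether the given object already exists in the object store before deciding to keep the data in-core, but that check used to trigger a lazy fetch of such an object from a promisor remote.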

See commit a64d2aa (21 Jul 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 5b137e8, 04 Aug 2020)

sha1-file: make pretend_object_file() not prefetch

Signed-off-by: Jonathan Tan

When pretend_object_file() is invoked with an object that does not exist (as is the typical case), there is no need to fetch anything from the promisor remote, because the caller already knows what the object is supposed to contain. Therefore, suppress the fetch. (The OBJECT_INFO_QUICK flag is added for the same reason.)

This was noticed at $DAYJOB when "blame" was run on a file that had uncommitted modifications.