- Shallow cloning
- Git strategy
- Git clone path
- Git clean flags
- Git fetch extra flags
- Fork-based workflow
- Git fetch caching or pre-clone step
Optimize GitLab for large repositories
Large repositories consisting of more than 50k files in a worktree may require more optimizations beyond pipeline efficiency because of the time required to clone and check out.
GitLab and GitLab Runner handle this scenario well but require an optimized configuration to perform their operations efficiently.
The general guidelines for handling big repositories are simple. Each guideline is described in more detail in the sections below:
- Always fetch incrementally. Do not clone in a way that results in recreating all of the worktree.
- Always use shallow clone to reduce data transfer. Be aware that this puts more burden on the GitLab instance because of the higher CPU load.
- Control the clone directory if you heavily use a fork-based workflow.
- Optimize git clean flags to ensure that you remove or keep data that might affect or speed up your build.
Shallow cloning
Introduced in GitLab Runner 8.9.
GitLab and GitLab Runner perform a shallow clone by default.
Ideally, you should always use GIT_DEPTH with a small number like 10. This instructs GitLab Runner to perform shallow clones.
Shallow clones make Git request only the latest set of changes for a given branch, up to the desired number of commits as defined by the GIT_DEPTH variable.
This significantly speeds up fetching of changes from Git repositories, especially if the repository has a very long history containing a number of big files, because it reduces the amount of data transferred.
The following example makes the runner shallow clone to fetch only a given branch; it does not fetch any other branches nor tags.
variables:
  GIT_DEPTH: 10
test:
  script:
    - ls -al
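As a minimal sketch, assuming one job needs more history than the rest of the pipeline, a job-level variable can override the global GIT_DEPTH. The job names and the depth of 100 are hypothetical values chosen for illustration.

```yaml
variables:
  GIT_DEPTH: 10          # default for every job in the pipeline

test:
  script:
    - ls -al

# Hypothetical job that needs a deeper history than the default.
changelog:
  variables:
    GIT_DEPTH: 100       # job-level value overrides the global one
  script:
    - git log --oneline | wc -l
```

Keeping the global value small and raising it only where needed preserves the data-transfer savings for most jobs.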
Git strategy
Introduced in GitLab Runner 8.9.
By default, GitLab is configured to use the fetch Git strategy, which is recommended for large repositories.
This strategy reduces the amount of data to transfer and
does not really impact the operations that you might do on a repository from CI.
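The strategy can also be stated explicitly with the GIT_STRATEGY variable. This is a sketch only: because fetch is described above as the default, setting it is optional and mainly documents the intent in the pipeline file.

```yaml
variables:
  GIT_STRATEGY: fetch    # reuse the existing worktree instead of re-cloning

test:
  script:
    - ls -al
```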
Git clone path
Introduced in GitLab Runner 11.10.
GIT_CLONE_PATH allows you to control where you clone your sources. This can have implications if you heavily use big repositories with a fork workflow.
From GitLab Runner's perspective, a fork is stored as a separate repository with a separate worktree. That means that GitLab Runner cannot optimize the usage of worktrees on its own, so you might have to explicitly instruct GitLab Runner to reuse a worktree.
In such cases, ideally dedicate the GitLab Runner executor to the given project instead of sharing it across different projects, which makes this process more efficient.
The GIT_CLONE_PATH has to be within the $CI_BUILDS_DIR. Currently, it is impossible to pick any path from disk.
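A minimal sketch of a clone path that satisfies this constraint follows. The directory name my-project is hypothetical, and the example assumes the runner is configured to allow custom build directories.

```yaml
variables:
  # Must resolve to a location under $CI_BUILDS_DIR.
  # "my-project" is a hypothetical directory name.
  GIT_CLONE_PATH: $CI_BUILDS_DIR/my-project

test:
  script:
    - pwd    # prints the directory the sources were cloned into
```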
Git clean flags
Introduced in GitLab Runner 11.10.
GIT_CLEAN_FLAGS allows you to control whether or not you require the git clean command to be executed for each CI job. By default, GitLab ensures that you have your worktree on the given SHA, and that your repository is clean.
git clean is disabled when GIT_CLEAN_FLAGS is set to none. On very big repositories, this might be desired because git clean is disk I/O intensive. Setting GIT_CLEAN_FLAGS: -ffdx -e .build/ (for example) lets you keep some directories within the worktree between subsequent runs, which can speed up incremental builds. This has the biggest effect if you re-use existing machines and have an existing worktree that you can re-use for builds.
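The two settings described above can be sketched in a pipeline as follows. The job names are hypothetical, and .build/ is the example directory from the text, assumed to hold incremental build output.

```yaml
variables:
  # Clean the worktree but keep the .build/ directory between runs.
  GIT_CLEAN_FLAGS: -ffdx -e .build/

build:
  script:
    - ls -al

# Hypothetical job that skips git clean entirely to avoid the disk I/O cost.
quick-check:
  variables:
    GIT_CLEAN_FLAGS: none
  script:
    - ls -al
```

Skipping the clean step means leftover files from earlier jobs stay in the worktree, so it suits jobs that tolerate a dirty checkout.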
For exact parameters accepted by GIT_CLEAN_FLAGS, see the documentation for git clean.
Git fetch extra flags
GIT_FETCH_EXTRA_FLAGS allows you to modify git fetch behavior by passing extra flags.
For example, if your project contains a large number of tags that your CI jobs don't rely on, you could add --no-tags to the extra flags to make your fetches faster and more compact.
Even if your repository does not contain a lot of tags, --no-tags can still help. If your CI builds do not depend on Git tags, it is worth trying.
See the GIT_FETCH_EXTRA_FLAGS documentation for more information.
Fork-based workflow
Introduced in GitLab Runner 11.10.
Following the guidelines above, let's imagine that we want to optimize a big project that uses a fork-based workflow.
Let's consider the following two examples, one using the shell executor and the other using the docker executor.
shell executor example
Let's assume that you have the following config.toml.
concurrent = 4
[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "shell"
  builds_dir = "/builds"
  cache_dir = "/cache"
  [runners.custom_build_dir]
    enabled = true
This config.toml:
- Uses the shell executor.
- Specifies a custom /builds directory where all clones are stored.
- Enables the ability to specify GIT_CLONE_PATH.
docker executor example
Let's assume that you have the following config.toml.
concurrent = 4
[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "docker"
  builds_dir = "/builds"
  cache_dir = "/cache"
  [runners.docker]
    volumes = ["/builds:/builds", "/cache:/cache"]
This config.toml:
- Uses the docker executor.
- Specifies a custom /builds directory on disk where all clones are stored. We host mount the /builds directory to make it reusable between subsequent runs and to be allowed to override the cloning strategy.
- Does not explicitly enable the ability to specify GIT_CLONE_PATH, as it is enabled by default.
Our .gitlab-ci.yml
Once we have the executor configured, we need to fine tune our .gitlab-ci.yml.
Our pipeline is most performant if we use the following .gitlab-ci.yml:
variables:
  GIT_DEPTH: 10
  GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME
build:
  script: ls -al
The above configures a:
- Shallow clone with a depth of 10, to speed up subsequent git fetch commands.
- Custom clone path, so that existing worktrees can be re-used.
Why use $CI_CONCURRENT_ID? The main reason is to ensure that the worktrees used do not conflict between projects. The $CI_CONCURRENT_ID represents a unique identifier within the given executor. When we use it to construct the path, this directory does not conflict with other concurrent jobs running.
Store custom clone options in config.toml
Ideally, all job-related configuration should be stored in .gitlab-ci.yml. However, sometimes it is desirable to make these schemes part of the runner's configuration.
In the above example of forks, making this configuration discoverable for users may be preferred, but this brings administrative overhead as the .gitlab-ci.yml needs to be updated for each branch. In such cases, it might be desirable to keep the .gitlab-ci.yml clone path agnostic, but make it a configuration of the runner.
We can extend our config.toml with the following specification that is used by the runner if .gitlab-ci.yml does not override it:
concurrent = 4
[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "docker"
  builds_dir = "/builds"
  cache_dir = "/cache"
  environment = [
    "GIT_DEPTH=10",
    "GIT_CLONE_PATH=$CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME"
  ]
  [runners.docker]
    volumes = ["/builds:/builds", "/cache:/cache"]
This makes the cloning configuration part of the given runner and does not require us to update each .gitlab-ci.yml.
Git fetch caching or pre-clone step
For very active repositories with a large number of references and files, you can either (or both):
- Consider Git fetch caching, as used for gitlab-org/gitlab development.
- Seed repository data in a pre-clone step with the pre_clone_script of GitLab Runner. See SaaS runners on Linux for details.