The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.
The Delivery Group enables GitLab Engineering to deliver features in a safe, scalable, and efficient fashion to both GitLab.com and self-managed customers. The Group deploys changes to GitLab.com continuously throughout the day and also ensures that GitLab's monthly, security, and patch releases are made available to our SaaS Platforms and self-managed customers on a predictable schedule.
Our 3-year Delivery strategy contributes to the company strategy primarily by maturing the platform. In order to meet our planned growth (internal), GitLab is evolving its architecture toward Cells. To support this, Delivery must scale rollout strategies to provide safe and efficient rollouts to a fleet of potentially thousands of Cells in the future. With deep expertise in the application, its components, and how to deploy GitLab at a scale that serves millions, the Delivery Group is uniquely positioned to become the orchestrator of the fleet, ensuring that we can roll out changes safely at scale.
To ensure that the evolution to Cells is successful, technical as well as cultural changes will be needed. For example, enabling gradual rollout of changes across an entire fleet will require much stronger backward and forward compatibility than GitLab.com currently requires.
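To make that compatibility requirement concrete, here is a minimal Python sketch, using entirely hypothetical names that are not taken from the GitLab codebase, of the additive, tolerant-reader style of change that lets old and new application versions coexist during a gradual rollout:

```python
import json
from typing import Optional


def serialize_event_v2(project_id: int, action: str, cell_id: Optional[str] = None) -> str:
    """New writer: keeps every v1 field and only adds an optional 'cell_id'."""
    event = {"schema": 2, "project_id": project_id, "action": action}
    if cell_id is not None:
        event["cell_id"] = cell_id  # additive change; nothing removed or renamed
    return json.dumps(event)


def handle_event_v1(raw: str) -> None:
    """Old reader: ignores fields it does not recognize (forward compatible)."""
    event = json.loads(raw)
    print(f"project {event['project_id']}: {event['action']}")


def handle_event_v2(raw: str) -> None:
    """New reader: treats 'cell_id' as optional so v1 payloads still parse (backward compatible)."""
    event = json.loads(raw)
    cell = event.get("cell_id", "unknown-cell")
    print(f"[{cell}] project {event['project_id']}: {event['action']}")


if __name__ == "__main__":
    newer_payload = serialize_event_v2(42, "deploy", cell_id="cell-7")
    handle_event_v1(newer_payload)  # old code still understands the new payload
    handle_event_v2('{"schema": 1, "project_id": 42, "action": "deploy"}')  # and vice versa
```

The design point is that the new writer only adds optional fields and the old reader ignores fields it does not recognize, so either version can process data produced by the other while a rollout is in flight.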
To facilitate this longer-term organizational change, Delivery has changed how it thinks about deployments and releases and now considers GitLab.com to be a “fleet of one”. This change allows us to focus on learning about how best to apply our tools and processes to effectively coordinate rollouts across a fleet of Cells.
For deployments to be efficient across a fleet of Cells, our tools and processes must be fully automated (automatic rollout & rollback, deployment observability, self-healing, scalable). To achieve this goal iteratively, we will keep learning as much as we can about operating effective rollouts across a fleet of Cells while operating effective continuous delivery to GitLab.com.
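As one hedged illustration of what that level of automation could look like, the sketch below (hypothetical names and a simulated health signal, not the actual Delivery tooling) deploys a new version to a fleet in waves, evaluates health after each wave, and rolls back automatically when the signal regresses:

```python
import random
import time
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    version: str


def deploy(cell: Cell, version: str) -> None:
    print(f"deploying {version} to {cell.name}")
    cell.version = version


def healthy(cell: Cell) -> bool:
    # Stand-in for real deployment observability signals (error rates, apdex, saturation).
    return random.random() > 0.05


def rollout(fleet: list, new_version: str, old_version: str, wave_size: int = 2) -> bool:
    for start in range(0, len(fleet), wave_size):
        wave = fleet[start:start + wave_size]
        for cell in wave:
            deploy(cell, new_version)
        time.sleep(0.1)  # bake time before evaluating health
        if not all(healthy(cell) for cell in wave):
            print("health signal regressed; rolling back automatically")
            for cell in fleet:
                if cell.version == new_version:
                    deploy(cell, old_version)
            return False
    return True


if __name__ == "__main__":
    fleet = [Cell(name=f"cell-{i}", version="16.9.0") for i in range(6)]
    succeeded = rollout(fleet, new_version="16.9.1", old_version="16.9.0")
    print("rollout complete" if succeeded else "fleet rolled back")
```

In practice the health check would be backed by real monitoring and the rollback would need to handle Cells in mixed states, but the wave/observe/rollback loop is the shape of automation this goal describes.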
Over the course of the next 3 years, we expect to see strong strategic partnerships evolve across the Platforms Section to drive prioritization, swarm to solve the most important problem, and clear roadblocks for other groups in the section. The Dedicated Group has solved a considerable number of the problems encountered when operating a single tenant of GitLab in an automated way. Additionally, the Scalability Group supports our SaaS platforms and has expertise in the kind of monitoring and logging that will be required to run Cells.
Finally, we will require assistance from other groups within the company where their domain expertise is invaluable as part of moving toward the next iteration of deployments. For example, the tests that we run may be prohibitively expensive to scale horizontally, and health checks may be a better fit. As the DRI for our test coverage, the Quality Department is positioned to help us navigate this. Another example is the Database Group, which can help us with the longer-term direction of post-deployment migrations to unblock rollbacks for all users.
Since GitLab is public by default and operates both SaaS platforms and a self-managed offering, there are some unique challenges in day-to-day operations. GitLab's business model is based on an open core, and we believe in maintaining transparency over the source code that is part of GitLab, even where this introduces additional challenges.
However, as a typical part of business for a software company in 2023, we are required to implement security fixes and remediations on a regular basis. To keep both ourselves and our customers safe and secure, we must discuss and implement these security fixes while maintaining confidentiality until the fixed release is made available to all of our customers. As a result, we have two streams of code flowing into GitLab, and divergence between them can add complexity.
GitLab is made up of a series of components that have a tight logical coupling and strong forward/backward compatibility requirements. Our GitLab.com deployments and managed versioning release processes span many parts of the organization and reflect our organization structure (see Conway’s Law). This can make the process highly resistant to change, as there is a significant organizational burden to coordinate and align on the changes that need to be made across a process where each department has its own metrics and responsibilities. Visibility across processes is low, which has prevented Delivery from truly evolving the process rather than just iterating on sections of it over time.
Delivery models the complicated subsystem team pattern and is responsible for ensuring that GitLab is delivered to customers, without outages, multiple times a day. The team has deep expertise in the architecture, deployment patterns, and hands-on remediations involved in deploying and releasing GitLab. Onboarding, operating, and maintaining these systems carries a high cognitive and operational load. As a consequence, project work that could evolve our deployment capabilities and unlock new business opportunities can be deprioritized in favor of “keep the lights on” work that prevents or mitigates user impact.
For FY24 we've identified the following four key themes and aligned them with the Infrastructure & Quality department’s direction.
Release management currently operates through a combination of manual and automated steps. As GitLab grows and our feature velocity increases, we need to evolve the current process to handle the increased demand for scale. Our current release processes are unlikely to scale much further, and we risk becoming a bottleneck to throughput if we do not invest in evolving and streamlining the release and deployment tools & processes. This should also contribute to the infrastructure goal of achieving 50% year-on-year growth in engagement survey results compared to FY23.
Our specific goal is to reduce stress and cognitive load by measuring and reducing Release Manager workload, and as a result increase the amount of time we can spend improving our tools and removing tech debt. We'll achieve that by:
Our release processes are complex, involve a lot of manual touch points, and currently require deep domain expertise to execute. As a result, the Delivery group can become a bottleneck or a gate for many teams wanting to make changes to their deployments, fix bugs, and support older versions. Additionally, things like backports and major dependency upgrades often represent sudden and unplanned work that has to be prioritized against current needs. As a result, non-critical fixes and upgrades can be rejected, and their benefit is therefore not realized by customers.
Moving toward self-service, with the Delivery group as a maintainer of the tools rather than an executor of the process, will allow Stage teams to deploy independently and own their features across the entire feature development lifecycle, increasing their efficiency and removing bottlenecks. This also supports our department goal of preparing self-servicing for stage group teams to enable end-to-end development. In order to get there, the first steps we have to take are:
Delivery's mean time to production (MTTP) performance indicator has tracked our work to make deployments to GitLab.com fast and reliable, but it is limited to an overview of whether things are going well or not so well. It doesn't give us the level of insight we need to make adjustments that improve our processes and tools. For us to effectively and efficiently improve the way we do things, we need a more granular level of detail and instrumentation. This will help us increase the efficiency of the deployment and release processes, and every feature released will benefit from it.
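As a simple illustration of the gap between an overview metric and granular instrumentation, the sketch below (hypothetical data, not the real PI pipeline) derives a coarse MTTP-style figure from merge and production-deployment timestamps; the granularity this theme calls for comes from recording per-stage timestamps rather than a single mean:

```python
from datetime import datetime
from statistics import mean

# (merged_at, deployed_to_production_at) pairs for a handful of merge requests.
changes = [
    (datetime(2023, 6, 1, 9, 0), datetime(2023, 6, 1, 13, 30)),
    (datetime(2023, 6, 1, 10, 15), datetime(2023, 6, 1, 13, 30)),
    (datetime(2023, 6, 2, 8, 45), datetime(2023, 6, 2, 19, 5)),
]

lead_times_hours = [
    (deployed - merged).total_seconds() / 3600 for merged, deployed in changes
]

print(f"MTTP: {mean(lead_times_hours):.1f} hours")
# A single mean tells us whether things are going well or not, but per-stage
# timestamps (build, staging, canary, production) are needed to see where the
# time is actually spent, which is the granularity this theme aims to add.
```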
In FY24 we'll review existing group metrics and create a set that represents all of the Delivery group's responsibilities, allowing us to measure our impact. Additional metrics will include:
In addition, we are improving deployment pipeline observability to increase the insight we get into areas for improvement, reliability, and deployment duration. We will:
As we review and increase the number of metrics in use, we'll need to focus on how and where we track them so that we maintain a usable overview of the Delivery group.
GitLab is growing consistently, and the demands on the platform are constantly evolving. As part of the move toward becoming the best-in-class AI-enabled DevSecOps platform, we have a renewed focus on experimentation. We get the best data and insights for our experiments by exposing them to real customer traffic and getting feedback in a production environment. However, this can introduce risk and friction in the deployment process, and our approach to experimentation is more manual than we would like. To increase the number of concurrent experiments we can run and remove risk, we have to add more flexibility to our deployment options.
This approach will also allow us to trial major platform upgrades (Ruby 2.x to 3.x, Rails 6.x to 7.x, etc.) in a way that is safe and gives us confidence. This strategy could drastically reduce the coordination needed between teams, as well as the time and risk involved in making this type of change. It's also an area where our competitors are able to leverage the latest features in a way that we can't yet. In order to continue to win against GitHub, we must reduce the cost of change to our platform.
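One building block that could provide this kind of flexibility is deterministic percentage-based assignment, sketched below in illustrative form (the function name, experiment name, and percentages are assumptions, not an existing GitLab mechanism), which routes a stable slice of real traffic to an experiment or to a canary running a newer platform stack:

```python
import hashlib


def in_rollout(user_id: int, experiment: str, percentage: int) -> bool:
    """Stable assignment: a given user always lands in the same bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage


if __name__ == "__main__":
    # Route roughly 5% of a simulated user base to a canary running a newer stack.
    enrolled = sum(in_rollout(uid, "rails-upgrade-canary", percentage=5) for uid in range(10_000))
    print(f"{enrolled / 100:.1f}% of users routed to the canary")
```

Because the assignment is a pure function of the user and experiment identifiers, the exposed slice stays consistent across requests and can be widened gradually as confidence grows.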
We’ll start by:
As part of being a complicated subsystem team with a high operational load, we have to be deliberate about the work that we take on. There are a few things that we’re interested in, but can’t take on right now:
Because Delivery is responsible for deploying our multi-tenant SaaS offering (GitLab.com) as well as releasing GitLab packages for Dedicated and self-managed customers, we prioritize "Keep the lights on" activities (e.g. deployment failures, incidents, release management) above all else to ensure we provide customers with a high level of service that continually meets our reliability and performance SLAs. Aside from this, our work follows the normal product prioritization process, and the top priority is simply a reflection of our operational responsibilities.