The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.
| | |
| --- | --- |
| Stage | Verify |
| Maturity | Complete |
| Content Last Reviewed | 2024-01-25 |
Thanks for visiting this direction page on the Fleet Visibility category at GitLab. This page belongs to the Runner Group within the Verify Stage and is maintained by Darren Eastman.
Our vision is that, as customers integrate AI more deeply into their development processes, they can manage a GitLab Runner Fleet at scale from one unified dashboard and gain deep visibility into CI/CD pipeline execution metrics that correlate directly to the specific CI/CD build environment.
Adequate Fleet Visibility starts with at-a-glance insight into the status (online, offline, stale) of every runner build server in your organization. In addition, Fleet Visibility will allow you to determine the group or project a runner is associated with and will surface critical metrics such as runner build queue performance, failure rates, and the most heavily used runners.
Combining Fleet Management visibility with CI/CD visibility will give platform administrators and developers the metrics they need to identify issues with CI pipeline performance or reliability and to determine which component to focus on. These solutions eliminate trial-and-error approaches to optimizing CI job execution speed and performance, and they reduce the effort needed to troubleshoot and resolve CI/CD job failures.
By correlating the insights exposed to platform administrators at the admin or group level in the Fleet Dashboard with the CI/CD job and pipeline execution metrics (job duration trends, job failure rates, pipeline reliability) exposed at the project level in CI Insights, organizations will spend less time building custom observability tooling for CI/CD pipelines. Developers and platform administrators will have a shared understanding of CI/CD performance trends across the platform.
For executives, as the AI-powered GitLab DevSecOps platform enables your development teams to deliver secure software faster, Fleet Visibility gives your operations team the visibility it needs to operate CI/CD build infrastructure at scale cost-effectively and efficiently.
**Runner Fleet Dashboard**
The Runner Fleet Dashboard - Admin View: Starter Metrics was released to GitLab.com (admin view only) for internal testing in 16.5 with an initial set of metrics widgets.
In the first quarter of FY25, a critical goal is to release the Runner Fleet Dashboard to top-level group namespaces on GitLab.com. We have heard from several customers in the past few weeks that they need this capability, along with APIs, to augment the observability tooling and custom dashboards they currently rely on to monitor and operate GitLab CI and Runners.
The next major goal for FY25 is to ensure that we have a supported solution for self-managed customers who need to use the Runner Fleet Dashboard, and specifically the more advanced metrics, such as the wait time to pick up a job, that rely on ClickHouse, an open-source columnar database that provides fast query performance on large datasets.
With these foundational elements in place, we plan to incorporate feedback from customers who adopt the Fleet Dashboard in the first half of FY25 to determine the next evolution of the Fleet Dashboard strategy. While we have already identified additional metrics, such as runner failure trends, that could be valuable to include in the dashboard, recent customer feedback suggests that simply extending the metrics data model and enabling customers to create their own reports and visualizations may be the most valuable future iteration.
Regarding prediction, one central theme in customer conversations is determining when there may be a slowdown in runner queue performance. This is a classic prediction problem, so we aim to explore whether we can reduce the cost of prediction and fleet operational costs for our customers by incorporating ML/AI into the Fleet Dashboard. With ClickHouse as the database layer and a new analytics database table structure for Runner Fleet, we believe the foundational elements are in place to make this next evolution a reality in FY25.
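To make the prediction theme concrete, the sketch below shows one simple way a queue slowdown signal could be derived from historical wait-time data: compare a short rolling average of job queue wait times against a longer baseline and flag the queue when the ratio crosses a threshold. This is an illustrative heuristic with synthetic sample data, not the model or data schema planned for the Fleet Dashboard; the function name, window sizes, and threshold are placeholder assumptions.

```python
from statistics import mean

def detect_queue_slowdown(wait_times, baseline_window=50, recent_window=10, threshold=1.5):
    """Flag a runner queue when the recent average wait time exceeds
    the longer-term baseline by a given factor.

    wait_times: chronological per-job queue wait times, in seconds.
    """
    if len(wait_times) < baseline_window + recent_window:
        return False, None  # not enough history to compare
    baseline = mean(wait_times[-(baseline_window + recent_window):-recent_window])
    recent = mean(wait_times[-recent_window:])
    ratio = recent / baseline if baseline else float("inf")
    return ratio >= threshold, ratio

# Synthetic example: steady ~30s waits followed by a spike toward ~90s.
history = [30.0] * 60 + [60, 70, 80, 85, 90, 95, 90, 88, 92, 96]
slow, ratio = detect_queue_slowdown(history)
print(f"slowdown={slow}, recent/baseline ratio={ratio:.2f}")
```

A production approach would likely replace this heuristic with a model trained on the ClickHouse-backed fleet metrics, but the input data and the alerting question it answers are the same.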
**CI Insights**
The first phase in the unified Fleet Visibility strategy is solving the critical visibility problems developers and platform administrators face when using GitLab CI/CD. Chief among them: GitLab CI/CD analytics has no built-in report that lets a developer determine whether a CI job is running as expected from a duration perspective, or whether a job is unhealthy, as indicated by an increase in job failures. As a result, customers have had to build custom reporting systems or adopt third-party observability tools using data exposed in the GitLab jobs API.
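As a concrete illustration of the kind of custom reporting customers build on top of the jobs API today, the minimal Python sketch below pulls recent jobs for one project from `GET /projects/:id/jobs` and aggregates a failure rate and average duration per job name. The instance URL, project ID, token, and page count are placeholders, and a real reporter would add date filtering, error handling, and persistence; this is a sketch of the integration pattern, not a supported GitLab tool.

```python
import requests
from collections import defaultdict

GITLAB_URL = "https://gitlab.com"   # or your self-managed instance URL
PROJECT_ID = "12345"                # placeholder project ID
TOKEN = "glpat-..."                 # placeholder personal access token

def fetch_jobs(pages=5, per_page=100):
    """Fetch recent jobs for one project via the GitLab jobs API."""
    jobs = []
    for page in range(1, pages + 1):
        resp = requests.get(
            f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/jobs",
            headers={"PRIVATE-TOKEN": TOKEN},
            params={"per_page": per_page, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        jobs.extend(batch)
    return jobs

def summarize(jobs):
    """Aggregate failure rate and average duration per job name."""
    stats = defaultdict(lambda: {"runs": 0, "failed": 0, "duration_sum": 0.0})
    for job in jobs:
        s = stats[job["name"]]
        s["runs"] += 1
        if job["status"] == "failed":
            s["failed"] += 1
        if job.get("duration"):
            s["duration_sum"] += job["duration"]
    for name, s in sorted(stats.items()):
        failure_rate = 100.0 * s["failed"] / s["runs"]
        avg_duration = s["duration_sum"] / s["runs"]
        print(f"{name}: {s['runs']} runs, "
              f"{failure_rate:.1f}% failed, avg {avg_duration:.1f}s")

if __name__ == "__main__":
    summarize(fetch_jobs())
```

CI Insights aims to surface exactly these kinds of per-job duration and failure trends natively, so teams do not need to maintain scripts like this themselves.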
Our goal for the first half of FY25 is to refactor the GitLab CI/CD analytics view to incorporate the critical metrics our customers tell us they need to use, monitor, and optimize GitLab CI/CD efficiently.
In the next three months (February to April), we are focused on the following:
Runner Fleet Dashboard
- Add compute minute breakdown by project to the card in the Fleet Dashboard
In the past three months, we have shipped the following key feature:
In the near term, we are not focused on design or development efforts to improve the usability of Runners in project-level CI/CD settings.
While improvements in this view could be valuable to the software developer persona, customer feedback indicates that providing meaningful CI insights covering vital metrics, such as CI job success and failure rates, job duration metrics, average job retries, and average queue time for each job, is more valuable and an enabler of broader CI adoption.
BIC (Best In Class) is an indicator of forecasted near-term market performance based on a combination of factors, including analyst views, market news, and feedback from the sales and product teams. It is critical that we understand where GitLab appears in the BIC landscape.
At GitLab, a critical challenge is simplifying the administration and management of a CI/CD build fleet at an enterprise scale. This effort is one foundational pillar to realizing the vision of GitLab Duo AI-optimized DevSecOps. Competitors are also investing in this general category. Earlier this year, GitHub announced a new management experience that provides a summary view of GitHub-hosted runners. This is a signal that, across the industry, there will be a focus on reducing the maintenance and configuration overhead of managing a CI/CD build environment at scale.
We also now see additional features on the GitHub public roadmap signaling increased investment in the category we coined here at GitLab, 'Runner Fleet.' These features suggest that GitHub aims to provide a first-class experience for managing GitHub Actions runners, including features in the UI to simplify runner queue management and resolve performance bottlenecks. With this level of planned investment, it is clear that the market recognizes that simplifying the administrative maintenance and overhead of the CI build fleet is critical for large customers and will help enable deeper product adoption.
Indirect competitor Actuated is the first solution we have seen whose product includes a dashboard for runner and build queue visibility. This is another strong signal that solutions which reduce the management overhead of CI/CD build infrastructure are valuable for organizations with mature DevOps practices.
In the CI Insights arena, a few startups, for example Trunk.io, are providing CI visibility solutions for GitHub Actions. The Datadog CI Visibility product is a mature, full-featured offering that provides CI/CD insights for GitLab CI/CD using the GitLab jobs API as the foundational layer.
To ensure that our customers can fully realize the value of GitLab's product vision, we must provide solutions that eliminate the complexities, manual tasks, and operational overhead of delivering a CI build environment at scale, and that reduce its cost. Our goal in FY25 is to include Fleet Visibility solutions that are good enough that customers not yet fully invested in third-party observability or custom tooling can use them out of the box to observe, analyze, and optimize CI jobs, and to troubleshoot CI job failures, natively in GitLab.
The key capabilities customers ask for when describing fleet management pain points are as follows:
Runner Fleet is still a nascent category; competitors like GitHub are beginning to invest in this area. On its future roadmap, GitHub plans to introduce seamless management of GitHub-hosted and self-hosted runners. This feature aims to deliver a "single management plane to manage all runners for a team using GitHub." GitHub also plans to offer Actions Performance Metrics to provide organizations with deep insights into critical CI/CD performance metrics. One example of how the cloud infrastructure market can evolve is Active Assist for Google Cloud, a solution that provides recommendations to reduce cost and optimize cloud operations. We can therefore imagine a future where Microsoft and GitHub bring to market AI-based solutions that integrate GitHub Actions with infrastructure on Azure. GitLab's competitive position is solid: we will continue to invest in features and capabilities to ensure that customers can use GitLab Runners efficiently on any cloud provider.
In the insights space, Datadog has a CI and Test Visibility offering, and CircleCI has had an Insights offering for some time. While GitHub Actions has no comparable built-in functionality, there are several offerings in the marketplace for collecting test and run data and displaying it on a dashboard.
In the Test Reduction area, Sealights offers a CutTests solution, and Redefine.dev is a newer player in the space taking advantage of AI to reduce future test runs for faster pipelines.