Product Section Direction - Data Science

The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.

Section Overview

The Data Science section comprises three stages:

AI-powered - boost efficiency and reduce cycle times with AI-powered workflows.
ModelOps - enable GitLab to be used for machine learning and artificial intelligence use cases.
Anti-Abuse - leverage GitLab usage behavior to produce automated actions that reduce risks from insider threats and increase the stability of GitLab instances.

Team and Investment

GitLab's Data Science section was introduced in late 2021 and grew throughout 2022, laying the foundation for data science at GitLab. In 2023 we will continue to invest in data science use cases across the platform and build features that enable our customers to build ML/AI into their products more effectively and efficiently.

To learn more about GitLab’s investment areas, please visit the Product Investments section of the GitLab Handbook.

Evolution of Businesses in Digital Transformation

Over the last decade, GitLab has helped companies navigate their digital transformation into software companies.

Digital Transformation is about creating new opportunities for your business to drive innovation and efficiency, to improve how your teams work, and to leapfrog ahead of competitors, all with the goal of delivering new and improved customer experiences.

Every industry is undergoing a transformation. Customers expect more, and retaining them is increasingly difficult because your competitor is now just a click away. Irrespective of your industry, technology now needs to be front and center in your offering, as competition is coming from unexpected sources.

Companies are increasingly leveraging data within their businesses to power next-generation software built on machine learning (ML) and artificial intelligence (AI). We believe that the next stage of digital transformation is software companies adopting ML/AI to power next-generation, data-rich applications. This brings new challenges in managing the big data needed to power these algorithms, and unique challenges in running ML/AI at scale, including data cleaning, job orchestration, model training/testing/deployment, and observability.

Leveraging over a decade of experience with DevOps best practices, we're aiming to support businesses making this data science transformation. This section focuses on the new challenges of building these data-rich, highly interactive ML/AI applications. Our ModelOps stage will extend the GitLab platform, enriching existing features with data science capabilities while also enabling customers to build ML/AI workloads with GitLab.

Drowning in data

Most businesses today generate a lot of data: data about their customers, their products, metadata, and more. Businesses are drowning in data and struggle to extract value from it to power next-generation applications and experiences for their customers.

As businesses advance their digital transformations, they create more applications that generate more and more data. Just managing all that data is a challenge: storing, aggregating, cleaning, organizing, and even deleting it. And that is only the management of the data, not actually doing anything with it. Many organizations also have data in many different locations: within their applications, in bespoke data stores, or in the cloud. This leads organizations to build data warehouses where they can manage and unify disparate data sources. This is where the concept of Extract, Load, Transform (ELT), or the related Extract, Transform, Load (ETL), comes from. ELT platforms have become big businesses, with organizations spending lots of money just to store and organize all their data. Data comes at a cost, so organizations need to extract value from it, which leads to the next challenge.

Extracting insights

With businesses generating endless data streams and spending money to store, manage, and organize them, it's easy to understand why organizations want to extract value from that data. Most businesses today have internal business intelligence groups or data analysts who comb through this data looking for insights. These insights might be used to answer business questions about what product features to build next, or to power next-generation customer experiences. It all comes down to extracting value from data. This is usually how data science gets started within an organization.

Data analysts and data scientists within organizations work with the vast data held in their data warehouses, cleaning, organizing, and deriving it into more useful forms. As organizations become more data driven, they tend to integrate more data into their customer-facing applications. This introduces new software development lifecycle (SDLC) challenges. Data-rich applications usually need connections to data, and the data flowing through them has to be managed, which makes software development more complex. The most modern organizations are now even embedding real-time data science into their applications, further complicating software stacks. Live data flows through applications and through ML and AI models that make real-time decisions based on it, leading to even more complex applications. All of this introduces new SDLC challenges that have to be managed by the engineering teams that build, deploy, and run these customer-facing applications.

Lots of Moving Pieces

Looking back over the past decade of software engineering, we've seen companies go through digital transformations to become software companies. Today most companies are software companies. Part of GitLab's historical success has been helping companies streamline complex software development lifecycles into our single-application DevOps platform, reducing complexity and speeding up time to value. We're now seeing these software companies embrace data science with many of the same challenges as before:

Complex data science toolchains
Many expensive specialized vendors
Lack of integration with existing tools

Reduce Complexity

Our Data Science section aims to help organizations solve these new challenges as they add ML/AI into their applications. But it's not just our customers' software that's going through this transformation. GitLab itself is transforming our software to become more intelligent. With our ModelOps stage, we're integrating machine learning and artificial intelligence into the GitLab product itself to allow GitLab to offer suggestions and recommendations. We're also leveraging the data our platform generates to provide new and advanced features to our platform customers. Our Anti-Abuse stage is using GitLab data within the platform to make real-time decisions to keep the platform running smoothly. In the future, we'll also use this data to make the platform more reactive to real-time insider threats.

Building GitLab with GitLab

GitLab builds GitLab with GitLab: we dogfood all our own features. As we enrich our platform with ML/AI, we experience the same challenges our customers experience building ML/AI into their applications. These insights will inform the features we build into the GitLab DevSecOps platform to support these ML/AI workloads, making it easier for our customers and GitLab itself to integrate ML/AI into applications built with GitLab.

Holistic Approach

The work of the Data Science section cuts across the entire GitLab DevOps platform, from our reliance on features like source code management (SCM) and CI/CD to support machine learning (ML) and artificial intelligence (AI) workloads, to how we enhance platform features with ML/AI to make them more intelligent and automated. The section is unique in that the value it creates slices horizontally across all other GitLab sections and stages, providing a holistic approach to data science use cases across the software development lifecycle.

Both ModelOps and Anti-Abuse are components of GitLab's Data Science product strategy. ModelOps focuses on enabling data scientists to use GitLab effectively. Anti-Abuse will use data science techniques to build a user activity data system and automation to protect GitLab from abuse and misuse. Initially, Anti-Abuse's work is focused on stabilizing GitLab against abuse, but it will also build new revenue-generating products related to Insider Threat detection and User and Entity Behavior Analytics (UEBA) tooling, both of which will rely on data science techniques.

Aligning Use Cases

This section aligns cross-functional teams and organizational structures across Product, Engineering, UX, and technical writing teams. This streamlines the management chain of all individuals across functions and aligns unique product development areas of focus and challenges. Both the ModelOps and Anti-Abuse stages share some unique properties that other GitLab sections/stages do not:

Unlike our vertical groups and stages, both ModelOps and Anti-Abuse horizontally cross all other stages and sections at GitLab. Both stages will interact with features across the platform and the data that underlies those features in order to provide their core value.
A key focus on consuming and leveraging product usage data to provide customer value. That data spans the entire platform and includes user activity and repository metadata. This will be used by the AI Assisted group to enrich GitLab features to make them more intelligent and automated. Anti-Abuse will use this data to protect the platform and offer insider threat detection capabilities to customers.
Automation and Action are key tenets for these product areas. Smart defaults will enable customers to discover new features, recommendations will surface features across the platform based on usage heuristics, and automations will reduce the overhead of managing and operating a GitLab instance.
Wide surface area. Both stages require 'T'-shaped knowledge that is both broad across all GitLab features and deep in the specific knowledge areas of data science and anti-abuse.

Important PI milestones

We've established a Data Science internal handbook PI page (internal link) which will be updated monthly as part of PI review meetings. We're still working to actively orchestrate all our performance indicator metrics.

3 Year Section Themes

Reduce complexity

With complex toolchains and new vendors emerging every day, the data science landscape is a lot of glue and duct tape holding many systems together. We want to consolidate these toolchains into the GitLab platform to reduce complexity, remove maintenance burden, and enable faster model development and exploration.

As examples, GitLab will provide:

Native integrations to popular data science toolchains and open-source frameworks.
First-party solutions for DataOps and MLOps workloads.
Open APIs to allow flexibility through the platforms.

Repeatability for Collaboration

Many data science teams struggle with a lack of repeatability, cobbling together environments on local machines. These environments rarely have source code management or CI. We want to bring the best practices of DevOps, with SCM and CI/CD, to data scientists and make it easy for them to start with repeatable and stable environments.

As examples, GitLab will provide:

Improved Python Notebook experience across GitLab
Support for more powerful compute within GitLab Runner
Simplified CI configuration for popular data science toolchains

Smooth Handoffs

Model handoffs are only one part of the collaboration needed to make data science handoffs smooth. We want to create seamless handoffs across the software development lifecycle of data science workloads, from connecting data to pipelines, to managing model code, to deploying to production. GitLab is already critical for modern software developers managing production applications. We'll bring the best of our existing DevOps platform to data scientists.

As examples, GitLab will provide:

Model registry for management and versioning of ML/AI models
Open APIs for smooth handoffs whether you are using GitLab tools or integrating your choice of toolchains
Integrations across existing GitLab features to better support data science workloads
Intelligent recommendations and suggestions across existing GitLab features to increase velocity and efficiency

Data in Motion

Long gone are the days of stale data. Today data is in motion. It's always being created, moved, transformed, and drifting. It's in the cloud and sometimes many clouds. Modern data science toolchains need to support cloud-native data in motion.

As examples, GitLab will provide:

Native data connectors to cloud-native data warehouses
Basic ELT tools to prepare data for data science workloads
Integrated data versioning and feature stores for tracking data definitions
Real time platform usage insights

Pricing

We expect the Data Science section will provide multiple monetization strategies across all GitLab plans with features targeted for data science use cases and Insider Threat detection capabilities. These paid features will follow GitLab's pricing themes to determine how to package various features we develop.

Ultimate

Data Science aims to make GitLab smarter and more automated using ML. Features we develop will help organizations automate their portfolio management, improve their security posture, and detect Insider Threats.

As a general rule of thumb, features will fall in the Ultimate/Gold tier when they meet one or more of the following criteria:

The feature is focused on enabling an organization or enterprise to operate at scale rather than an individual with a few smaller personal projects
The feature is natively developed or acquired by GitLab rather than being provided by an open-source project
The feature has a significant ongoing cost for GitLab to maintain and update

Some examples include:

Features provided by our acquisition of UnReview

Premium

Features targeted at Premium will focus on enabling data science use cases across existing GitLab features like source code management (SCM) and CI/CD, as well as helping protect precious intellectual property like source code hosted within GitLab. We want GitLab to natively support data science workloads, and much of the value of managing workloads is found in the Premium tier, which ModelOps will seek to enhance.

Free

Although paid features are the primary focus, there are several reasons why features for unpaid tiers might be prioritized above paid features:

Data science workloads are increasing across all industries and verticals, though many organizations are still only dabbling in ML/AI. We want to ensure we support these organizations at every stage of the software development lifecycle, which in turn will encourage them to find more value in our paid tiers as they become more advanced with their use cases.
Data science is still very new. The wider open-source community has contributed greatly to many frameworks and tools to enable the foundations of ML/AI as we currently know them. To be good stewards in the open-source community, the basic integrations we support with popular open-source data science tools will be available in an unpaid tier by default, along with the "table stakes" set of functionality required to make each feature usable with GitLab.

As a general rule of thumb, features will fall in the Core/Free tier when they meet one or more of the following criteria:

The feature is primarily for an individual with a few small projects rather than meeting the needs of an organization or enterprise that is operating at scale
The feature is provided by an integration with an open-source project rather than being natively developed by GitLab
The ongoing cost for GitLab to maintain and update the feature is relatively minimal

Some examples include:

Basic support for Python notebooks in source code management (SCM)
Basic GPU support in GitLab Runner

Target audience

GitLab uses the following categorization to identify who our DevSecOps application is built for. We list the personas we will support, and when, in priority order:

🟩 - Targeted with strong support
🟨 - Targeted but incomplete support
⬜️ - Not targeted but might find value

Today

To capitalize on the potential opportunities, the ModelOps Stage has features that make it useful to the following personas today:

🟨 - Developers
🟨 - Data scientists
🟨 - Data analysts
⬜️ - Security Teams
⬜️ - QA engineers / QA Teams

Medium Term (1-2 years)

As we execute our 3 year strategy, our medium-term (1-2 year) goal is to provide a single DevSecOps application that enables collaboration between developers, data teams, data scientists, and engineers across organizations.

🟩 - Developers
🟩 - Data scientists
🟩 - Data analysts
🟨 - Security Teams
🟨 - QA engineers / QA Teams

Developers

Data Science workloads can be complicated and can leverage specialized hardware and development environments not common to traditional software development teams. The ModelOps stage is focused on the intersection of data scientists exploring models and feature development and the developers who must then deploy those data science features into production.

Personas

Data Scientists

Data scientists have unique roles within organizations. They are more scientists than developers, following hypotheses and data to explore models and develop data science-powered features.

We aim to serve data scientists as they balance art and science within software engineering teams. Data scientists wear a lot of hats to get from a hypothesis to a data science feature that generates value. GitLab is not yet a tool of choice for data scientists, and we aim to change that by making it easy to configure, build, and execute data science feature development within GitLab.

Personas

Daphne - Data Scientist - a new persona for GitLab we are actively exploring for use cases and workflows.

Security Teams

The larger the organization, the harder it is for security teams to stay on top of everything happening in complex, ever-changing environments. As an organization's source code management and DevSecOps platform, GitLab holds a lot of sensitive, high-value data. We want to help security teams secure that data. This is a job to which automated data science features can be well suited, including monitoring high-value assets around the clock.

Personas

Last Reviewed: 2022-12-12
Last Updated: 2022-12-12
