The promise of DataOps remains compelling, even though the term has been moving through the hype cycle for years and many organizations have yet to see much impact. DataOps is defined as the application of Agile DevOps methods and tools to our data engineering disciplines. The concept has become the preferred approach for creating quality data products. However, the early hype promised 10x productivity gains for data engineering teams adopting this method. That created urgency in the early stages and may now be generating disappointment for those not experiencing a step change. The potential remains strong, though. If impact is elusive, there are likely two main causes.
First, organizational change management requires time and commitment. The larger the data engineering organization, the greater the challenge in designing and executing this change. Scale defeats many good intentions of changing organizations. If “DataOps” seems to have become another fancy name for what we have always done, look carefully at how the change has been designed for scale and how the transformation was rolled out.
Another leading cause of frustration is the lack of effective automation and tooling. Consistent, successful data engineering practices not only enable tooling but also require it. The complexity of data operations can be overwhelming. Just listing the considerations of a data shop – data extractions, data warehouses, data lakes, data pipelines, streaming data, data governance, data lineage, data quality, data security, etc. – can be exhausting. This complexity is well illustrated in the following chart:
Deploying new data products to production can break the code, the data, or both. The variety of data structures within a distributed data landscape makes this harder still.
The best organizations avoid wallowing in this complexity. Instead, they simplify.
The key to simplifying is automating the flow of information and specification between these many considerations with well-designed, well-integrated tools. This improvement happens in an iterative life cycle in which neither process nor tools come first. Instead, both are designed together to serve DataOps objectives, using the same continuous integration, continuous delivery (CI/CD) virtuous cycle that DataOps enables for the creation of data products. When automation is neglected and tooling deemphasized, data engineering processes become brittle and full of gaps, and teams are undermined by complication. When tooling is prioritized without proper consideration of process, the same might occur. Instead, DataOps must be viewed as a holistic system in which automation and simplification are continuously refined.
We learned how this works within our application development programs, as we became skilled at the development of microservices. The application development world is more self-contained and driven by the clearer goal of creating and sustaining new application features. The world of enterprise data challenges is driven more by our search for insights than features, and complicated by a more volatile world. Rather than being self-contained, enterprise data must conform to a data model, data standards, data quality measures, etc. Achieving the same tight integration of the end-to-end development life cycle has proven to be another level of challenge.
Consider how your organization might simplify DataOps complexity, modernizing data engineering with an architecture that resembles the following:
DataOps automation today is best founded on a data cloud platform, primarily Snowflake, Databricks, Redshift, or BigQuery, that provides cloud-native, scale-out compute resources together with optimized storage models. These platforms might also enable zero-copy cloning, time travel, versioning, semi-structured and unstructured data access, as well as integrated security and governance. Fundamentally, the data cloud’s ability to scale compute and storage independently, with neither constrained by the other, is foundational. Such offerings are essential building blocks of DataOps automation.
Further, zero-copy cloning, the ability of a data cloud to create shadow copies of production data without physically duplicating it, represents another key breakthrough. In the past, we could not test our new code against production data until the last step. Even creating full copies of production data for tests was inhibited by privacy and compliance considerations, such as HIPAA (Health Insurance Portability and Accountability Act). This meant going live was the first time our new code experienced a real-world test, a dicey, troubling moment that led to many unpleasant surprises.
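To make this concrete, a CI job might create a short-lived clone for each code branch and drop it once the tests finish. The sketch below only builds the SQL text and does not connect to anything. It assumes Snowflake's `CREATE DATABASE ... CLONE` syntax; the database name `ANALYTICS_PROD`, the `_CI_` naming convention, and the `clone_statements` helper are hypothetical, and other platforms expose cloning differently.

```python
import re


def clone_statements(source_db: str, branch: str) -> dict:
    """Build SQL text to create and later drop a per-branch clone.

    Assumes Snowflake-style cloning syntax; nothing is executed here.
    """
    # Reduce the branch name to characters that are safe in an identifier.
    suffix = re.sub(r"[^A-Za-z0-9]+", "_", branch).strip("_").upper()
    clone_db = f"{source_db}_CI_{suffix}"  # hypothetical naming convention
    return {
        "create": f"CREATE OR REPLACE DATABASE {clone_db} CLONE {source_db};",
        "drop": f"DROP DATABASE IF EXISTS {clone_db};",
    }


stmts = clone_statements("ANALYTICS_PROD", "feature/new-orders-model")
print(stmts["create"])
print(stmts["drop"])
```

New transformation code can then be run against the clone, so the first contact with production-shaped data happens well before go-live.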
The spine of a well-designed DataOps implementation can be provided by a technology such as dbt Cloud, which can expand raw data through transformations into traditional star schema, snowflake schema, or data vault models. An accelerator like dbt deploys code using a SQL templating engine, providing model building, testing, profiling, and validating through SQL and YAML (YAML Ain’t Markup Language) scripts. A tool this powerful keeps many interrelated data concerns automatically synchronized and consistent, and it is often integrated with a developer automation technology such as GitHub, which can engineer the entire workflow and manage the code repository.
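The idea of SQL templating can be sketched in a few lines. The example below uses Python's standard-library `string.Template` purely as a stand-in for the templating engine of a tool like dbt, so it is not that tool's API. The table, column, and schema names are hypothetical. The same model text renders against different environments, and a simple "not null" check is expressed as a query that should return no rows when the data is healthy.

```python
from string import Template

# One model definition, parameterized by the target schema (hypothetical names).
MODEL_SQL = Template(
    "SELECT order_id, customer_id, amount\n"
    "FROM ${schema}.raw_orders\n"
    "WHERE amount IS NOT NULL"
)

# A check written as a query: any row it returns is a failing record.
NOT_NULL_TEST = Template("SELECT * FROM ${relation} WHERE ${column} IS NULL")


def render_model(schema: str) -> str:
    """Render the model SQL for a given environment's schema."""
    return MODEL_SQL.substitute(schema=schema)


def render_not_null_test(relation: str, column: str) -> str:
    """Render a query that selects rows violating a not-null expectation."""
    return NOT_NULL_TEST.substitute(relation=relation, column=column)


dev_sql = render_model("dev_analytics")
prod_sql = render_model("analytics")
test_sql = render_not_null_test("analytics.orders", "order_id")
print(dev_sql)
print(test_sql)
```

Because the model and its checks come from the same templates, a change made once is reflected in every environment that renders them.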
We should note that dbt, short for “data build tool,” provides another fundamental change in data engineering. The standard extract, transform, load (ETL) tools of the past, such as Informatica, had us manage two important considerations separately: the data model and the code that implemented the source-to-target mappings. Change one of these and you needed an accurate, corresponding change to the other. Making it worse, these two considerations were usually managed by different people, one by an analyst and the other by a developer. The dbt approach enables one to be derived from the other. We build our data model in a way that automatically generates the corresponding code. This eliminates one of data engineering’s historically most problematic complexities.
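As an illustration of deriving code from the model, the sketch below keeps a single declarative specification and generates the source-to-target SQL from it. This is not how any particular tool is implemented; the `ModelSpec` and `Column` types, the `generate_sql` function, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    target: str  # column name in the target model
    source: str  # source column or SQL expression it is mapped from


@dataclass(frozen=True)
class ModelSpec:
    name: str
    source_table: str
    columns: tuple


def generate_sql(spec: ModelSpec) -> str:
    """Generate the source-to-target mapping SQL from the model spec."""
    select_list = ",\n".join(
        f"    {col.source} AS {col.target}" for col in spec.columns
    )
    return (
        f"CREATE OR REPLACE VIEW {spec.name} AS\n"
        f"SELECT\n{select_list}\n"
        f"FROM {spec.source_table};"
    )


dim_customer = ModelSpec(
    name="dim_customer",
    source_table="raw.customers",
    columns=(
        Column("customer_key", "id"),
        Column("customer_name", "TRIM(full_name)"),
    ),
)

sql = generate_sql(dim_customer)
print(sql)
```

Adding or renaming a column means editing the specification only; the mapping code cannot drift out of step because it is never written by hand.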
Additional data pipeline technology will be needed to cover a variety of ingestion types, including batch files, web APIs, and data streaming. Many good options exist, including solid offerings native to the chosen data cloud. These data pipelines will also require scheduling and orchestration. Tools such as Prefect and Airflow provide scalable orchestration services capable of supporting complex dependencies. Use of orchestration technology can extend beyond the data pipeline challenge to help with cloud infrastructure code and the overall DataOps automation solution. Remember the many forms data might take on the way to serving our insights. Data can be used in its raw, source-level form, or at various stages of transformation, extending all the way to dimensional models used for advanced visualizations or machine learning models used to drive predictions. Orchestrating this variety requires sophisticated tooling.
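The core of dependency-aware orchestration can be sketched with Python's standard-library `graphlib` module (Python 3.9+). This is not the Prefect or Airflow API, and the task names are hypothetical. The sketch groups tasks into waves, where every task in a wave has all of its upstream dependencies satisfied and could therefore run in parallel.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
PIPELINE = {
    "ingest_batch_files": set(),
    "ingest_web_api": set(),
    "stage_raw": {"ingest_batch_files", "ingest_web_api"},
    "build_dimensional_model": {"stage_raw"},
    "train_ml_model": {"stage_raw"},
    "refresh_dashboards": {"build_dimensional_model"},
}


def execution_waves(graph: dict) -> list:
    """Group tasks into waves whose members can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave complete to unlock successors
    return waves


waves = execution_waves(PIPELINE)
for number, tasks in enumerate(waves, start=1):
    print(number, tasks)
```

A production orchestrator adds scheduling, retries, and alerting on top of this ordering, but the dependency graph remains the central abstraction.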
Data observability and development monitoring can also be crucial. Tools such as Grafana can provide a dashboard view of our team’s data engineering progress and help us monitor patterns of success within our data life cycle.
These examples clarify some strong approaches to enhancing implementations of DataOps. The promise of step change in productivity and velocity can still be realized for the organization that feels past impact has been low. Effective tooling and automation are key. DataOps must be viewed as a continuously improving system, expanding, scaling, and enhancing to address changing business needs as well as innovative capabilities introduced to the data engineering marketplace. The perfect DataOps will never be reached. The automation journey is essential to the DataOps destination.