The MLOps Glossary

Dark Data
Dark Data is data that an organization collects and stores but that ends up useless or impossible to leverage, often because the organization does not know how to access it or is unaware that it exists. A significant part of a company's data is actually dark, leading to a huge waste of resources.

Data Annotation
Just like Data Labeling, Data Annotation is the process of analyzing raw data and adding metadata to provide context about each record. Unlike labels, however, annotations are not simple one-dimensional variables, but more complex objects or lists of objects. An example of an annotation is the bounding box, whose data structure consists of two coordinates (typically the top-left and bottom-right corners of the smallest rectangle containing a particular object) and a textual label (the name of the object).
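
The bounding box described above can be sketched as a small data structure. This is only an illustration: the class name `BoundingBox`, its field names and the sample values are assumptions, not a standard annotation format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    """One annotation: two corner coordinates plus a textual label."""
    x_min: float  # x of the top-left corner
    y_min: float  # y of the top-left corner
    x_max: float  # x of the bottom-right corner
    y_max: float  # y of the bottom-right corner
    label: str    # name of the object inside the box

    def area(self) -> float:
        """Area of the rectangle, clamped at zero for degenerate boxes."""
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


# A label is a single value per record (hypothetical example).
image_label: str = "street scene"

# An annotation can be a list of structured objects for the same record.
image_annotations: List[BoundingBox] = [
    BoundingBox(10, 20, 110, 220, "pedestrian"),
    BoundingBox(150, 40, 400, 200, "car"),
]
```

The contrast between `image_label` (one value) and `image_annotations` (a list of objects) mirrors the distinction between labels and annotations made above.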

Data Annotator
Refer to Annotator.

Data Architecture
Data Architecture is the framework that describes how a company's infrastructure supports the acquisition, transfer, storage, querying and securing of its data.

Data Augmentation
Data Augmentation is the process of artificially expanding the number of records in a training dataset by applying a set of techniques that modify the records already present in that dataset. Some of the most common Data Augmentation techniques for Computer Vision include horizontal and vertical flips, translation, rotation, zoom and other types of distortions.
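
As a rough illustration, a few of these transformations can be written in plain Python on a tiny grayscale image stored as a list of rows. The function names are hypothetical, and real pipelines would normally rely on an image library rather than nested lists.

```python
from typing import List

Image = List[List[int]]  # grayscale image: rows of pixel values


def horizontal_flip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vertical_flip(img: Image) -> Image:
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rotate_90_clockwise(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img: Image) -> List[Image]:
    """Return the original image plus three modified copies."""
    return [
        img,
        horizontal_flip(img),
        vertical_flip(img),
        rotate_90_clockwise(img),
    ]
```

Calling `augment` on one record yields four records, which is how the dataset grows without any new data being collected.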

Data Blending
Refer to Data Fusion.

Data Cataloging
Data Cataloging is the process of creating an organized inventory of an organization's data with the intent to make it searchable and easily usable by data scientists and other data consumers.

Data Center
A Data Center is a building, or a space within a building, that is dedicated solely to housing computer systems and storage units, along with other related components.

Data-Centric AI
Data-Centric AI is an approach to AI and Machine Learning in which the performance of an ML model is improved via (usually incremental) modifications made to the training data or to its labels. In contrast to Model-Centric AI, which treats the training dataset as immutable, it places an emphasis on tuning the dataset until it provides a reliable representation of the information that the model is meant to learn.

Data Cleaning
Data Cleaning, a.k.a. Data Cleansing, is the process of identifying and removing or fixing incorrect, incomplete, duplicated, corrupted or poorly formatted data in a dataset, most frequently with the goal of obtaining a dataset appropriate for training a Machine Learning model, or for other business-related applications. The Data Cleaning process is usually regarded as requiring heavy manual intervention and as extremely difficult to standardize and automate.
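
A minimal sketch of a cleaning pass is shown below. The record layout (`email` and `age` fields stored as strings) and the rules applied are assumptions chosen for illustration; real cleaning rules depend heavily on the dataset, which is part of why the process is hard to standardize.

```python
from typing import Dict, List, Optional


def clean(records: List[Dict[str, Optional[str]]]) -> List[Dict[str, object]]:
    """Fix formatting, then drop incomplete, incorrect and duplicated records."""
    seen_emails = set()
    cleaned = []
    for rec in records:
        # Fix poor formatting: trim whitespace and normalize case.
        email = (rec.get("email") or "").strip().lower()
        age_raw = (rec.get("age") or "").strip()

        # Remove incomplete or incorrect records.
        if not email or "@" not in email:
            continue
        if not age_raw.isdigit():
            continue

        # Remove duplicates, keeping the first occurrence.
        if email in seen_emails:
            continue
        seen_emails.add(email)

        cleaned.append({"email": email, "age": int(age_raw)})
    return cleaned
```

For example, a record with the email `" Ann@Example.com "` is normalized to `ann@example.com`, and a later record that normalizes to the same email is dropped as a duplicate.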

Data Cleansing
Refer to Data Cleaning.

Data Collection
Data Collection is the process of gathering data, usually with the intent of using that data to train a Machine Learning model, or for other business-related purposes.

Data Compliance
Data Compliance is the practice of abiding by the rules and best practices set forth by corporate Data Governance and governments to ensure that confidential or sensitive data is protected from loss, theft and inappropriate usage.

Data Compression
Refer to Compression.

Data Curation
Data Curation is the process of identifying and prioritizing the most useful data for a model dynamically during the training process.

Data Drift
Data Drift is a (usually slow and progressive) change in production data that causes its current distribution to deviate from the distribution of the data that was used to train, test and validate the model before it was deployed. Failing to detect and address data drift can cause the model in production to make faulty predictions, because it was not meant to operate under the conditions reflected in the new data and no longer reflects real-world conditions.
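
One possible way to quantify such a deviation for a single numeric feature is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical distributions of the training data and the production data. The sketch below is one option among many; the function names and the `threshold` value are assumptions, not recommended settings.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], production: Sequence[float]) -> float:
    """Largest absolute gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    if not reference or not production:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    prod = sorted(production)
    largest_gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in ref + prod:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_prod))
    return largest_gap


def drift_detected(
    reference: Sequence[float],
    production: Sequence[float],
    threshold: float = 0.3,  # arbitrary illustrative cut-off
) -> bool:
    """Flag drift when the distributions differ by more than the threshold."""
    return ks_statistic(reference, production) > threshold
```

In practice such a check would run periodically on recent production data, with the training data as the reference sample.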

Data Engineer
A Data Engineer is a software engineer whose focus is on the design and development of systems that collect, store, and analyze data at scale.

Data Enrichment
Data Enrichment is the process of combining the records of a dataset with data from other sources in order to expand the features available for those records. An example of Data Enrichment would be scraping social media channels to enhance customer data with additional information such as location or job title. Data Enrichment is similar to Data Labeling, but differs from it in that Data Labeling consists of creating additional information rather than retrieving it from another data source.

Data Ethics
Data Ethics is the area of ethics that focuses on the evaluation of data practices, including the way that data is collected, processed and analyzed, and on the potential that those practices have to adversely impact society.

Data Filter
A Data Filter is a lightweight (usually binary) classification model capable of separating useful data from useless or harmful data. It can be deployed on an edge device in order to select data at the point of collection.

Data Filtering
Data Filtering is a selective sampling technique based on applying a pre-trained data filter to streaming data, or to batch data prior to the training of an ML model. Data Filtering has the advantage of selecting data at the point of collection, and hence of reducing data collection and data storage costs, but it usually leads to a lower compression than Data Curation.
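
The idea of selecting records at the point of collection can be sketched with a generator. Here `toy_filter` is a hypothetical stand-in for a pre-trained binary classifier, and the `brightness` field and its cut-off are made up for the example.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, float]


def filter_stream(
    records: Iterable[Record],
    is_useful: Callable[[Record], bool],
) -> Iterator[Record]:
    """Yield only the records the filter keeps; rejected records are never stored."""
    for record in records:
        if is_useful(record):
            yield record


def toy_filter(record: Record) -> bool:
    """Stand-in for a pre-trained data filter: reject frames that are too dark."""
    return record["brightness"] >= 0.2


stream = [
    {"id": 1, "brightness": 0.05},
    {"id": 2, "brightness": 0.70},
    {"id": 3, "brightness": 0.19},
    {"id": 4, "brightness": 0.20},
]
kept = list(filter_stream(stream, toy_filter))
```

Because the generator processes one record at a time, the same function works on a live stream and on a batch read from storage.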

Data Fusion
Data Fusion, a.k.a. Data Blending, is the process of combining and integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any single data source. Data Fusion can often be performed by a simple join of two data tables, but can also take more sophisticated forms when the relationship between the rows of two different sources is unclear or uncertain.
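
The simple case, a join of two tables on a shared key, might look like the sketch below. Tables are represented as lists of dictionaries; the function name `inner_join` and the sample tables are assumptions for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]


def inner_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Combine rows from both tables that share the same value for `key`."""
    # Index the right-hand table by key so each lookup is fast.
    index: Dict[object, List[Row]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined: List[Row] = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            # On a column name clash, the right-hand value wins.
            joined.append({**left_row, **right_row})
    return joined


customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]
orders = [
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 1, "total": 12.5},
    {"customer_id": 3, "total": 99.0},
]
fused = inner_join(customers, orders, "customer_id")
```

This only covers the case where the key matches exactly. When the relationship between rows is uncertain, for example names spelled differently in each source, a matching step is needed before any join.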

Data Governance
Data Governance is a set of principles, policies and best practices that ensure the effective use of data in enabling an organization to achieve its goals.

Data Hub
A Data Hub is a novel data-centric storage architecture that allows the producers and consumers of data within an organization to share data frictionlessly in order to power AI workloads. A Data Hub differs from a Data Lake in that it is a data store that acts as an integration point, while a Data Lake is a central repository of raw data.

Data Ingestion
Data Ingestion is the process of importing and transferring data from the source into a database or another data storage system, or directly into a data application or product.

Data Integration
Data Integration is the process of combining and consolidating data from different sources in order to provide data consumers (data analysts and data scientists) with a unified view of all data available to them.

Data Integrity
Data Integrity is the process of maintaining and ensuring the accuracy and consistency of data throughout its lifecycle. Data Integrity is critical to the design and development of any data product or system.

Data Labeling
Data Labeling is the process of analyzing raw data and adding meaningful supplemental metadata, called labels, in order to provide context. A label can, for example, be the name of an object shown in a picture: while the picture might have been collected programmatically, the interpretation of its content requires additional processing, either by a human or by a machine.

Data Lake
A Data Lake is a centralized system or repository that allows an organization to store all of its structured and unstructured data in its original format, at any scale.

Data Mart
A Data Mart is a partition of a Data Warehouse focused on a specific subject, line of business or department.

Data Mesh
Data Mesh is a type of data platform architecture built to support and combine a company's ubiquitous data sources by leveraging a domain-oriented and self-serve design.

Data Mining
Data Mining is the process of sifting through large datasets in order to discover patterns, correlations and anomalies and to predict outcomes.

Data Orchestration
Data Orchestration is the automated process of managing data, combining data from multiple sources and making it ML-ready.

Data Poisoning
Data Poisoning is the practice of tampering with ML training data with the intent of causing undesirable outcomes. Data Poisoning is expected to represent a significant fraction of cybersecurity attacks in the next few years.

Data Preparation
Data Preparation is the process of converting raw data into a dataset suitable for training a Machine Learning model. Data Preparation involves, among other things, selecting and engineering features, addressing missing data, annotating data (if that data is unstructured), and augmenting and/or synthesizing data.

Data Privacy
Data Privacy is a field of Data Management that deals with the usage and governance of personal data in compliance with data protection laws, regulations and best practices.

Data Quality
Data Quality is a measurement of the condition of data based on its validity, completeness, consistency, reliability and recency.
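
Two of these dimensions, completeness and validity, lend themselves to simple ratios. The sketch below shows one possible way to compute them; the function names, the record layout and the validity rule are assumptions, and the definition above does not prescribe any particular formula.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, object]


def completeness(records: List[Record], fields: Sequence[str]) -> float:
    """Fraction of expected field values that are present (not None or empty)."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0  # nothing to measure; reported as 0 by convention here
    present = sum(
        1
        for rec in records
        for field in fields
        if rec.get(field) not in (None, "")
    )
    return present / total


def validity(records: List[Record], is_valid: Callable[[Record], bool]) -> float:
    """Fraction of records that pass a caller-supplied validity rule."""
    if not records:
        return 0.0
    return sum(1 for rec in records if is_valid(rec)) / len(records)


customers = [
    {"name": "Ann", "age": 34},
    {"name": "", "age": -5},
]
completeness_score = completeness(customers, ["name", "age"])
validity_score = validity(
    customers,
    lambda rec: isinstance(rec.get("age"), int) and rec["age"] >= 0,
)
```

Consistency, reliability and recency would need their own rules, for example cross-field checks or timestamp comparisons, and are left out of this sketch.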

Data Scraping
Data Scraping is a technique in which a computer program extracts data from a web page or from other human-readable output generated by a computer process (such as computer log files). Data Scraping is one of many ways to collect data.
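
As a small illustration, the sketch below extracts link targets from a piece of HTML using Python's standard `html.parser` module. The class name `LinkExtractor`, the helper `extract_links` and the sample page are hypothetical, and the example parses a string rather than fetching a live page.

```python
from html.parser import HTMLParser
from typing import List


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html: str) -> List[str]:
    """Return all link targets found in the given HTML text, in order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


page = '<p><a href="/docs">Docs</a> and <a href="https://example.com">Example</a></p>'
links = extract_links(page)
```

The same pattern, reading human-readable output and pulling out structured values, applies to log files, where a regular expression usually replaces the HTML parser.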

Data Security
Data Security refers to the controls, standard policies and procedures implemented by an organization in order to protect its data from data breaches and attacks and to prevent data loss through unauthorized access.

Data Selection (similar to Selective Sampling)
Data Selection is the process of reducing the size (number of records) of a dataset, usually strategically, in order to reduce operational costs, such as the amount of compute resources required for training a Machine Learning model, or the cost of Data Preparation. The two approaches to Data Selection are Data Curation (in-training Data Selection) and Data Filtering (pre-training Data Selection).

Data Storage
Data Storage refers to the process of capturing and recording digital information on electromagnetic, optical or silicon-based storage, and by extension, to the various methods and technologies enabling this process.

Data Strategy
A Data Strategy is the collection of policies set by an organization to ensure that the data it collects and stores can be properly managed as the quantity of data grows, with the underlying hope that this data can be leveraged down the line for decision making and the development of future data applications.

Data Value
Data Value refers to a measurable quantity (typically measured on a scale from 0 to 1) describing the impact of a specific data record on the training process of a model trained with that data. Data Value is a model-specific metric.

Data Versioning
Data Versioning is the practice of governing and organizing training datasets (and by extension, the Machine Learning models that are trained on those datasets) in order to ensure the reproducibility of Machine Learning experiments. A Data Versioning system is also necessary to make sure that a Machine Learning model can be rolled back to an older version in case of a problem in production.
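
One common building block for this practice is a content-based identifier: the same dataset always maps to the same version string, and any change produces a new one. The sketch below is an illustration under simple assumptions (small datasets that can be serialized to JSON); the names `dataset_version` and `model_registry` are hypothetical and do not refer to any specific tool.

```python
import hashlib
import json
from typing import Dict, List


def dataset_version(records: List[Dict[str, object]]) -> str:
    """Short fingerprint of the dataset content; record order is significant."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


train_v1 = [{"text": "good product", "label": 1}, {"text": "bad", "label": 0}]
train_v2 = train_v1 + [{"text": "okay", "label": 1}]

# Record which dataset version each model was trained on, so that an
# experiment can be reproduced or a model rolled back later.
model_registry = {
    "sentiment-model-1": dataset_version(train_v1),
    "sentiment-model-2": dataset_version(train_v2),
}
```

Storing the fingerprint next to each trained model makes it possible to tell exactly which data a given model version saw.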

Data Warehousing
Data Warehousing is the process of integrating data collected from various sources into one consolidated database.

Database
A Database is an organized collection of structured data stored in a computer system and set up for easy access, management and updating.

DataOps
DataOps is essentially the evolution of the Agile Manifesto for Software Development extended to Data Engineering. It combines best practices and technical tools into a collaborative data management methodology focused on improving the communication, integration and automation of data flows between the people managing the data (the data engineers) and those consuming it (the data analysts and data scientists) within an organization.

DataPrepOps
DataPrepOps is the subfield of MLOps that concerns itself with the creation and maintenance of ML pipelines meant to prepare ML data; by extension, the term also covers any tool or framework that is part of such pipelines. At a high level, a DataPrepOps pipeline is an ML pipeline designed for Data-Centric AI.

Deep Learning
Deep Learning is a category of Machine Learning algorithms leveraging neural networks with representation learning to imitate the way that the human brain works and gathers knowledge. Deep Learning is commonly used and has enabled tremendous progress in the fields of Computer Vision and Natural Language Processing.

Deep Reinforcement Learning
Deep Reinforcement Learning is the field of Machine Learning that combines Deep Learning and Reinforcement Learning. It is essentially an implementation of Reinforcement Learning where agents learn how to reach their target goals instead of receiving those targets as an arbitrary rule.

Deterministic Label
A Deterministic Label is a data label with a clear, indisputable value. In reality, no label is ever completely deterministic, as even the most objective cases come with a little bit of uncertainty and require a certain level of interpretation. That being said, in most use cases, deterministic labels can be assigned without a significant impact on the performance of the model that will consume them.

DevOps
DevOps is a portmanteau term which refers to the combination of Software Development and IT Operations. As a methodology, DevOps aims to integrate the work performed by a software development team and an IT team by promoting collaboration and shared responsibility.

Diffusion Models
Diffusion Models are a class of mathematical models used to analyze decision-making processes and response times in cognitive psychology and neuroscience. These models assume that decision-making involves the accumulation of noisy information over time, and that responses are triggered when the accumulated evidence reaches a certain threshold. Diffusion Models allow researchers to estimate various parameters of the decision-making process, such as the speed of information accumulation, the amount of noise in the decision process, and the decision threshold. They are widely used in fields such as experimental psychology, behavioral economics, and neuroimaging. In Machine Learning, the same term more commonly refers to a class of generative models that learn to reverse a gradual noising process in order to generate new data, such as images.

Distributed Computing
Distributed Computing is a technique that consists of grouping several computer systems in order to coordinate their processing power, so that those systems appear as a single computer to the end user.