Accuracy vs. Explainability Tradeoff
The Accuracy vs. Explainability Tradeoff (sometimes called the Predictivity vs. Explainability Tradeoff) refers to the tradeoff that exists between the performance of a Machine Learning model and its level of explainability. In short, explainable models tend to perform worse than black-box ones (such as Deep Learning models).
An Activation Function is a function that determines how the weighted sum of the inputs of a node is transformed into an output, which is then fed as an input to the next layer of the network. Common examples of Activation Functions include the rectified linear activation function (ReLU) and the sigmoid activation function.
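As a minimal sketch of the two activation functions named above, in plain Python (the function names, inputs and weights are illustrative assumptions, not taken from any particular library):

```python
import math

def relu(z: float) -> float:
    """Rectified linear unit: keeps positive values, maps the rest to zero."""
    return max(0.0, z)

def sigmoid(z: float) -> float:
    """Logistic sigmoid: squashes any real value into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative inputs
    return e / (1.0 + e)

def node_output(inputs, weights, bias, activation):
    """Weighted sum of a node's inputs, passed through an activation function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical inputs and weights whose weighted sum is exactly 0.0.
print(node_output([1.0, -2.0], [0.5, 0.25], 0.0, relu))     # 0.0
print(node_output([1.0, -2.0], [0.5, 0.25], 0.0, sigmoid))  # 0.5
```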
In the Alectio ecosystem, Active Autotuning refers to the iterative retuning of the hyperparameters of a neural network trained with Active Learning. With Active Learning, the model is retrained with increasingly large amounts of training data. This means that the hyperparameters that were best suited for the early loops might not be appropriate for the later ones, and that retuning the model before resuming the training process might be necessary to avoid underfitting. Active Autotuning is a necessary condition for high-performance Deep Active Learning.
Active Learning is a Machine Learning training paradigm, part of the wider Semi-Supervised ML category, which consists in incrementally selecting data from a larger pool of unlabeled data and retraining the model, most often with the intent of reducing the required amount of labeled records.
Active Learning Loop
One full cycle in an Active Learning process. Active Learning is based on the iteration of a selective sampling process, a labeling task (to get the selected data annotated), a training process (using the union of all selected samples as training data) and an inference process (applied on the unselected batch of data, the process being meant to generate inference metadata to be used in the next selective sampling process).
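The four steps of a loop can be sketched on a toy one-dimensional problem. Everything below is a simplifying assumption made for illustration: the `oracle` function standing in for human annotators, the threshold classifier, and the distance-to-boundary querying strategy.

```python
def oracle(x):
    """Stand-in for the labeling task (hypothetical ground truth: x >= 0.6)."""
    return int(x >= 0.6)

def train(labeled):
    """Fit a 1-D threshold classifier on the union of all selected samples."""
    neg = [x for x, y in labeled if y == 0]
    pos = [x for x, y in labeled if y == 1]
    if not neg or not pos:
        return 0.5  # fallback while only one class has been seen
    return (max(neg) + min(pos)) / 2.0

def infer(threshold, unlabeled):
    """Inference metadata: distance of each unselected record to the boundary."""
    return {x: abs(x - threshold) for x in unlabeled}  # assumes unique records

def select(metadata, batch_size):
    """Selective sampling: pick the records closest to the decision boundary."""
    return sorted(metadata, key=metadata.get)[:batch_size]

def active_learning(pool, n_loops=5, batch_size=4):
    unlabeled = list(pool)
    # First loop: no model exists yet, so start from evenly spaced records.
    batch = unlabeled[::len(unlabeled) // batch_size][:batch_size]
    labeled, threshold = [], 0.5
    for _ in range(n_loops):
        labeled += [(x, oracle(x)) for x in batch]            # labeling task
        unlabeled = [x for x in unlabeled if x not in batch]
        threshold = train(labeled)                            # training process
        metadata = infer(threshold, unlabeled)                # inference process
        batch = select(metadata, batch_size)                  # selective sampling
    return threshold, len(labeled)

threshold, n_labeled = active_learning([i / 100 for i in range(100)])
print(threshold, n_labeled)  # boundary estimate close to 0.6 using 20 labels
```

In this toy setting, 20 labeled records out of 100 are enough to place the boundary close to its true position, which is the kind of labeling reduction Active Learning aims for.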
Active Synthetic Data Generation
Active Synthetic Data Generation is the process of dynamically guiding the generation of synthetic data, usually in order to increase the informational value and impactfulness of the data on the learning process of a Machine Learning model. While technologies like GANs are getting ever better at generating high quality synthetic data, they do not provide specific information about what data should be generated to fulfill a specific purpose.
AGI, a.k.a., Artificial General Intelligence, is the term used to refer to the ability for an intelligent agent, such as a robot or computer, to understand or learn any intellectual task at the same level as a human being.
Agile Data Labeling
Agile Data Labeling is an approach to Data Labeling which relies on resource allocation and improved operations to reduce the latency and increase the flexibility of an otherwise slow, manual and rigid process. Agile Data Labeling leverages micro-jobs and other Real-Time Labeling techniques, attempts to evaluate the performance and real-time availability of annotators to optimize the flow and aims to remove other points of friction (like the generation of labeling instructions and the audit of labels) to simplify the process of labeling.
AI, a.k.a., Artificial Intelligence, is the simulation of human intelligence by machines and computer systems programmed to mimic the actions of human beings and to think like them.
An AI accelerator is a high-performance parallel computation machine that is specifically designed for the efficient processing of AI workloads, and in particular, of neural networks.
AIOps is the process of leveraging advanced analytics, Machine Learning and Artificial Intelligence to automate operations, enabling an organization's IT team to move at the pace required by that organization's business goals.
AI Strategy is the roadmap that an organization sets to progress towards its internal adoption and implementation of Artificial Intelligence.
The term AI Winter refers to a period of time, typically lasting several years, when support for and interest in Artificial Intelligence research and commercial ventures dry up, usually due to the global economic situation, or to the market losing confidence that researchers are able to deliver on the promises of AI.
An Annotator, a.k.a., Data Annotator, is a person whose job is to provide labels or annotations for a proprietary dataset. Annotators can be sourced through a crowdsourced annotation platform or can be highly trained professionals specialized in generating labels for ML training datasets. Because annotating data for highly specialized applications (such as medical imaging or the analysis of legal documents) requires specific expertise, annotators can sometimes also be experts in a highly technical domain (such as medical doctors, surgeons, lawyers, translators or scientists).
In Machine Learning, Anomaly Detection is the process of identifying data records (called outliers) which deviate from the expected distribution of a dataset.
Artificial General Intelligence
Refer to AGI.
Refer to AI.
Auto Active Learning
Auto Active Learning is the automated tuning of an Active Learning process to achieve maximum compression while ensuring the constraints set by the user are respected. Auto Active Learning does not necessarily involve the use of ML-driven querying strategies and is hence different from ML-Driven Active Learning. It involves the automated choice of the optimal querying strategy, the tuning of the size of the loops and the tuning of the number of loops.
Auto ML, a.k.a., Automated Machine Learning, is the automation of the steps typically required to build Machine Learning models and to apply ML to real-world problems. The goal of Auto ML is to reduce or even eliminate the need for highly skilled data scientists when building Machine Learning models.
Autolabeling is the process of using a pre-trained ML model or an automated process to generate (synthetic) labels for a training dataset. It can be a great way to build a continuous labeling pipeline, but can only be successfully implemented for common, objective use cases.
Automated Infrastructure Provisioning
Automated Infrastructure Provisioning is the process of reducing the need for engineers to manually provision and manage servers, storage units and other infrastructure whenever they develop or deploy an application.
Automated Machine Learning
Refer to Auto ML.
Autotuning refers to the process of automatically calibrating and adjusting the hyperparameters of a Machine Learning model, so that it can generate more accurate predictions.
Backpropagation is the most widely used algorithm to train feedforward neural nets. The Backpropagation Algorithm backpropagates the errors from output nodes to the input nodes by computing the gradient of the loss function with respect to each weight and iterating backward starting from the last layer.
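To make the backward pass concrete, here is a minimal sketch on a deliberately tiny network (one hidden sigmoid unit and one linear output unit, with a squared-error loss). The network shape, the loss, the starting weights and the learning rate are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """Hidden activation h and output y of the two-weight network."""
    h = sigmoid(w1 * x)
    return h, w2 * h

def loss(x, target, w1, w2):
    _, y = forward(x, w1, w2)
    return 0.5 * (y - target) ** 2

def backward(x, target, w1, w2):
    """Gradient of the loss with respect to each weight, last layer first."""
    h, y = forward(x, w1, w2)
    dloss_dy = y - target                     # error at the output node
    grad_w2 = dloss_dy * h                    # gradient for the last layer
    dloss_dh = dloss_dy * w2                  # error propagated to the hidden node
    grad_w1 = dloss_dh * h * (1.0 - h) * x    # chain rule through the sigmoid
    return grad_w1, grad_w2

def train(x, target, w1=0.5, w2=-0.3, lr=0.5, steps=200):
    """Plain gradient descent using the backpropagated gradients."""
    for _ in range(steps):
        g1, g2 = backward(x, target, w1, w2)
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2

w1, w2 = train(x=1.0, target=0.8)
print(loss(1.0, 0.8, 0.5, -0.3), loss(1.0, 0.8, w1, w2))  # loss before and after
```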
The term Batch Data refers to (static) data that have already been stored over a period of time.
Bayesian Optimization is a method used for the hyperparameter optimization of Machine Learning algorithms, and in particular, of neural networks. The algorithm works by maintaining a Gaussian process model of the objective function and using evaluations of this objective function to update the model.
Big Data refers to very large, hard-to-manage volumes of data (which can be either structured or unstructured) that grow at ever-increasing rates, and are typically characterized by the "3 Vs": Volume, Velocity, Variety. In the industry, Big Data and Machine Learning have a symbiotic relationship, as many people see Big Data as a necessity for building high-performance ML models, though this position is starting to change thanks to the Data-Centric AI proposition.
Black Box Model
A Black Box Model is a model capable of generating useful predictions without revealing any details about its internal workings or how a prediction is made, i.e., a model that is not explainable. Black Box Models are considered undesirable in some industries (such as FinTech) where companies can be held liable for making arbitrary decisions, such as refusing a loan to a customer. Deep Learning models are the most common example of Black Box Models.
Bounding Boxes are a type of annotation used to annotate training data for object detection. A bounding box is essentially a rectangle that is drawn around the object of interest that the model is built to identify.
Catastrophic Forgetting is a problem in Machine Learning (in particular in Deep Learning) that occurs when a model forgets a learned pattern when learning another one. Catastrophic Forgetting typically happens when the model's parameters which are important for one task, class or pattern are being changed during the training phase in order to meet the objectives of another task, class or pattern.
Cloud Computing refers to the on-demand availability of computer system resources (such as data storage and computing power). Cloud Computing is appealing to many categories of users whose needs for such resources vary across time and/or are unpredictable, and who do not want to actively manage them.
Compression, a.k.a. Data Compression, is a key metric used at Alectio to measure the performance of a Data Selection algorithm. Compression is the same as compression at 0% degradation.
Compression at X% Degradation
Compression at X% degradation is a key metric used at Alectio to measure the performance of a Data Selection algorithm. If a Data Selection process has a 70% compression at 0.2% degradation, it means that the user could leverage a selection algorithm to reach a performance only 0.2% lower than the performance that was expected when training on the entire dataset while training on only 30% of the data.
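The metric can be read off a learning curve, as in the rough sketch below. It assumes that degradation is measured as an absolute difference in percentage points and that the curve is available as (fraction of data used, performance) pairs; the function name and the sample numbers are hypothetical.

```python
def compression_at_degradation(curve, full_performance, max_degradation):
    """Highest compression (1 - fraction of data used) whose performance stays
    within max_degradation of the performance obtained on the full dataset.

    curve: list of (fraction_of_data_used, performance) pairs.
    Returns None if no point on the curve qualifies.
    """
    eligible = [fraction for fraction, performance in curve
                if full_performance - performance <= max_degradation]
    if not eligible:
        return None
    return 1.0 - min(eligible)

# Hypothetical learning curve: accuracy (in %) versus fraction of data used.
curve = [(0.1, 88.0), (0.3, 91.4), (0.5, 91.45), (1.0, 91.5)]
print(compression_at_degradation(curve, 91.5, 0.2))  # about 0.7, i.e. 70%
print(compression_at_degradation(curve, 91.5, 0.0))  # 0.0 for this curve
```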
Compute-Saving Active Learning
Compute-Saving Active Learning is a category of Active Learning algorithms that attempt to reduce or control the increased need for compute resources usually involved when using Active Learning. Active Learning tends to be compute-greedy by default as it requires the model to be retrained from scratch at each loop, which leads to a quadratic relationship between the amount of data used and the amount of computing resources used. The optimal Compute-Saving Active Learning algorithm aims at reducing the amount of computing resources used for training by a ratio equivalent to the achieved reduction in labeling costs.
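A back-of-the-envelope sketch of that quadratic growth, assuming equally sized loops and a training cost proportional to the number of records processed (both simplifying assumptions; the function name is illustrative):

```python
def records_processed(batch_size, n_loops):
    """Total records seen when the model is retrained from scratch at every
    loop on all data selected so far: batch_size * (1 + 2 + ... + n_loops)."""
    return sum(batch_size * k for k in range(1, n_loops + 1))

# Training once on 10,000 records processes 10,000 records per epoch.
# Reaching the same 10,000 records through 10 loops of 1,000 costs far more:
print(records_processed(1000, 10))  # 55000
# Doubling the amount of data (20 loops) roughly quadruples the cost:
print(records_processed(1000, 20))  # 210000
```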
Containerization is the process of packaging an application with just the libraries and dependencies required to run its code into an executable unit (called a container) that runs in a consistent manner on any infrastructure, which ensures its portability and reproducibility.
Continual Learning is the subfield of Online Learning that deals with Catastrophic Forgetting and attempts to strategically filter or re-order streaming training data that might cause the model to forget knowledge it had previously gathered.
Generally speaking, Continuous Delivery is the ability to push any type of change (such as bug fixes, new features or configuration changes) easily, quickly and safely to production. In the context of Machine Learning, the concept of Continuous Delivery focuses more particularly on the uninterrupted delivery of predictions by a Machine Learning model when the model needs to be retrained, refreshed or updated in production.
In the context of Machine Learning, Continuous Integration is the ability to automatically and seamlessly integrate model and/or data updates generated by multiple contributors into a single software codebase.
Continuous Labeling is the concept of getting data annotated in near real-time, as data get collected and ingested in a continuous ML pipeline. Without a continuous labeling pipeline, it is impossible for an organization to continuously deploy an ML model built with unstructured data in production. The huge majority of labeling companies on the market focus on large-scale labeling but tend not to provide continuous labeling solutions.
Continuous Monitoring is a process of continuously monitoring the performance of a Machine Learning model in production, the validity and relevance of its outputs, as well as the quality and freshness of the data that is run through it. It is a key aspect of ML Observability.
Continuous Training (sometimes called Continual Learning) is a subfield of MLOps focused on supporting the automatic and continuous retraining of a Machine Learning model in production to enable that model to adapt to real-time changes in the data, or to continuously learn from a stream of data.
Curse of Dimensionality
The Curse of Dimensionality is a phenomenon that happens when the dimensionality of the feature space is too large relative to the size of the training dataset, causing the number of training records to be too low to compensate for the size and sophistication of the model, which leads to overfitting.
Dark Data is the data collected and stored by an organization which ends up useless or impossible to leverage, oftentimes because the organization doesn't know how to access it, or is even unaware of its existence. A significant part of a company's data is actually dark, leading to a huge waste of resources.
Just like Data Labeling, Data Annotation is the process of analyzing raw data and adding metadata to provide context about each record. However, unlike labels, annotations are not simple unidimensional variables, but more complex objects or lists of objects. An example of an annotation is the bounding box, whose data structure is characterized by a set of two coordinates (typically, the top left and bottom right corners of the smallest rectangle containing a particular object) and a textual label (the name of the object).
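One possible way to represent that bounding box structure is sketched below. The class and field names are illustrative assumptions; real annotation formats differ in their coordinate conventions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundingBox:
    x_min: float  # top-left corner
    y_min: float
    x_max: float  # bottom-right corner
    y_max: float
    label: str    # name of the object inside the rectangle

    def area(self) -> float:
        """Area of the rectangle, in squared pixels."""
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)

# A label is a single value per record; an annotation can be a list of objects.
image_label = "street scene"
image_annotation = [
    BoundingBox(10, 20, 110, 70, "car"),
    BoundingBox(150, 30, 180, 120, "pedestrian"),
]
print([(box.label, box.area()) for box in image_annotation])
```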
Refer to Annotator.
Data Architecture is the framework that describes how a company's infrastructure supports the acquisition, transfer, storage, querying and security of its data.
Data Augmentation is the process of artificially expanding the number of records in a training dataset by applying a set of techniques that modify the records already present in that dataset. Some of the most common Data Augmentation techniques for Computer Vision include horizontal and vertical flips, translation, rotation, zoom and other types of distortions.
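A minimal sketch of two such transformations, assuming images are represented as nested lists of pixel values (a simplification; the function names are illustrative):

```python
def horizontal_flip(image):
    """Mirror the image left to right by reversing each row."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Mirror the image top to bottom by reversing the order of the rows."""
    return [list(row) for row in image[::-1]]

def augment(dataset):
    """Return the original images followed by their flipped copies."""
    augmented = []
    for image in dataset:
        augmented += [image, horizontal_flip(image), vertical_flip(image)]
    return augmented

image = [[1, 2],
         [3, 4]]
print(horizontal_flip(image))  # [[2, 1], [4, 3]]
print(vertical_flip(image))    # [[3, 4], [1, 2]]
print(len(augment([image])))   # 3
```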
Refer to Data Fusion.
Data Cataloging is the process of creating an organized inventory of an organization's data with the intent to make it searchable and easily usable by data scientists and other data consumers.
A Data Center is a building or a dedicated space within a building that is solely dedicated to the purpose of housing computer systems and storage units, along with other related components.
Data-Centric AI is an approach to AI and Machine Learning in which the performance of an ML model is improved via (usually incremental) modifications made to the training data or to its labels. In contrast to Model-Centric AI, which treats the training dataset as immutable, it places an emphasis on ensuring that the dataset is tuned until it provides a reliable representation of the information that the model is meant to learn.
Data Cleaning, a.k.a. Data Cleansing, is the process of identifying and removing or fixing incorrect, incomplete, duplicated, corrupted or poorly formatted data from a dataset, most frequently with the goal of obtaining a dataset appropriate for training a Machine Learning model, or for other business-related applications. The Data Cleaning process is usually regarded as requiring heavy manual intervention, and as extremely difficult to standardize and automate.
Refer to Data Cleaning.
Data Collection is the process of gathering data, usually with the intent of using that data to train a Machine Learning model, or for other business-related purposes.
Data Compliance is the practice of abiding by the rules and best practices set forth by corporate Data Governance and governments to ensure that confidential or sensitive data is protected from loss, theft and inappropriate usage.
Refer to Compression.
Data Curation is the process of identifying and prioritizing the most useful data for a model dynamically during the training process.
Data Drift is a (usually slow and progressive) change in production data that causes a deviation between its current distribution and the distribution of the data that was used to train, test and validate the model before it was deployed. Failing to detect and address data drift can cause the model in production to make faulty predictions, because it was not meant to operate under the conditions reflected in the new data and no longer reflects real-world conditions.
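One simple way to quantify such a deviation for a single numeric feature is the two-sample Kolmogorov-Smirnov statistic, sketched below. The choice of this statistic, the 0.3 alert threshold and the sample values are assumptions made for illustration; production monitoring tools typically combine several tests.

```python
from bisect import bisect_right

def ks_statistic(reference, production):
    """Largest gap between the empirical cumulative distributions of two samples.
    Returns 0.0 for identical samples and 1.0 for samples that do not overlap."""
    ref, prod = sorted(reference), sorted(production)
    gap = 0.0
    for value in ref + prod:
        cdf_ref = bisect_right(ref, value) / len(ref)
        cdf_prod = bisect_right(prod, value) / len(prod)
        gap = max(gap, abs(cdf_ref - cdf_prod))
    return gap

def has_drifted(reference, production, threshold=0.3):
    """Flag drift when the gap exceeds a (hypothetical) alert threshold."""
    return ks_statistic(reference, production) > threshold

training_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
production_feature = [4.0, 5.0, 6.0, 7.0, 8.0]
print(ks_statistic(training_feature, production_feature))
print(has_drifted(training_feature, production_feature))
```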
A Data Engineer is a software engineer whose focus is on the design and development of systems that collect, store, and analyze data at scale.
Data Enrichment is the process of combining the records of a dataset with data from other sources with the purpose of expanding the features available for those records. An example of Data Enrichment would be to scrape social media channels to enhance customer data with additional information like location or job title. Data Enrichment is similar to, but different from Data Labeling in that Data Labeling consists in creating additional information instead of retrieving it from another data source.
Data Ethics is the area of ethics that focuses on the evaluation of data practices, including the way that data is collected, processed and analyzed, and on the potential that those practices have to adversely impact society.
A Data Filter is a light-weight (usually binary) classification model capable of separating useful data from useless and harmful data that can be deployed on the edge of a device in order to select data at the point of collection.
Data Filtering is a selective sampling technique based on applying a pre-trained data filter on streaming data, or on batch data prior to the training of an ML model. Data Filtering has the advantage of selecting data at the point of collection and hence of reducing data collection and data storage costs, but usually leads to a lower compression than Data Curation.
Data Fusion, a.k.a. Data Blending, is the process of combining and integrating multiple data sources to produce more consistent, accurate, and useful information than the one provided by any single data source. Data Fusion can often be performed by a simple join of two data tables, but can also take more sophisticated forms when the relationship between the rows of two different sources is unclear or uncertain.
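The simple case, a join of two tables on a shared key, might look like the sketch below. Tables are represented as lists of dictionaries, the key is assumed to be unique in the second table, and all names and records are made up for illustration.

```python
def inner_join(left, right, key):
    """Combine rows from two tables that share the same value for `key`.
    Assumes `key` is unique within `right`."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

# Two hypothetical data sources describing the same customers.
crm = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Lin"}]
billing = [{"customer_id": 1, "plan": "pro"}, {"customer_id": 3, "plan": "free"}]

print(inner_join(crm, billing, "customer_id"))
# [{'customer_id': 1, 'name': 'Ada', 'plan': 'pro'}]
```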
Data Governance is a set of principles, policies and best practices that ensure the effective use of data in enabling an organization to achieve its goals.
A Data Hub is a novel data-centric storage architecture that allows the producers and consumers of data within an organization to share data frictionlessly in order to power AI workloads. A Data Hub differs from a Data Lake in that it is a data store that acts as an integration point while a Data Lake is a central repository of raw data.
Data Ingestion is the process of importing and transferring data from the source into a database or another data storage system, or directly into a data application or product.
Data Integration is the process of combining and consolidating data from different sources in order to provide data consumers (data analysts and data scientists) with a unified view of all data available to them.
Data Integrity is the process of maintaining and ensuring the accuracy and consistency of data throughout its lifecycle. Data Integrity is critical to the design and development of any data product or system.
Data Labeling is the process of analyzing raw data and adding supplemental meaningful metadata called labels in order to provide context. A label can for example be the name of an object represented on a picture; while the picture might have been collected programmatically, the interpretation of its content requires additional processing, either by a human or a machine.
A Data Lake is a centralized system or repository that allows an organization to store all of its structured and unstructured data in its original format, at any scale.
A Data Mart is a partition of a Data Warehouse focused on a specific subject, line of business or department.
Data Mesh is a type of data platform architecture built to support and combine a company's ubiquitous data sources by leveraging a domain-oriented and self-serve design.
Data Mining is the process of sifting through large datasets in order to discover patterns, correlations and anomalies and to predict outcomes.
Data Orchestration is the automated process of managing data, combining data from multiple sources and making it ML-ready.
Data Poisoning is the practice of tampering with ML training data with the intent of causing undesirable outcomes. Data poisoning is expected to represent a significant fraction of Cybersecurity attacks in the coming years.
Data Preparation is the process of converting raw data into a dataset suitable to train a Machine Learning model. Data Preparation involves, among other steps, selecting and engineering features, addressing missing data, annotating data (if that data is unstructured), and augmenting and/or synthesizing data.
Data privacy is a field of Data Management that deals with the usage and governance of personal data in compliance with data protection laws, regulations and best practices.
Data Quality is a measurement of the condition of data based on its validity, completeness, consistency, reliability and recency.
Data Scraping is a technique in which a computer program extracts data from a web page or from another human-readable output typically generated by a computer process (such as computer log files). Data Scraping is one of many ways to collect data.
Data Security refers to the controls, standard policies and procedures implemented by an organization in order to protect its data from data breaches and attacks and to prevent data loss through unauthorized access.
Data Selection (similar to Selective Sampling)
Data Selection is the process of reducing the size (number of records) of a dataset, usually strategically, in order to reduce operational costs, such as the amount of compute resources required for training a Machine Learning model, or the cost of Data Preparation. The two approaches to Data Selection are Data Curation (in-training Data Selection) and Data Filtering (pre-training Data Selection).
Data Storage refers to the process of capturing and recording digital information on electromagnetic, optical or silicon-based storage, and by extension, to the various methods and technologies enabling this process.
A Data Strategy is the collection of policies set by an organization to ensure that the data it collects and stores can be properly managed as the quantity of data grows, with the underlying hope that that data can be leveraged down the line for decision making and the development of future data applications.
Data Value refers to a measurable quantity (typically measured on a scale from 0 to 1) describing the impact of a specific data record on the training process of a model trained with that data. Data Value is a model-specific metric.
Data Versioning is the practice of governing and organizing training datasets (and by extension, the Machine Learning models that are trained on those datasets) in order to ensure the reproducibility of Machine Learning experiments. A Data Versioning system is also necessary to make sure that a Machine Learning model can be rolled back to an older version in case of a problem in production.
Data Warehousing is the process of integrating data collected from various sources into one consolidated database.
A Database is an organized collection of structured data stored in a computer system and set up for easy access, management and updating.
DataOps is essentially the evolution of the Agile Manifesto for Software Development extended to Data Engineering. It combines best practices and technical tools into a collaborative data management methodology focused on improving the communication, integration and automation of data flows between the people managing the data (the data engineers) and those consuming it (the data analysts and data scientists) within an organization.
DataPrepOps is the subfield of MLOps that concerns itself with the creation and maintenance of ML pipelines meant to prepare ML data; by extension, the term covers any tool or framework which is part of such pipelines. At a high level, a DataPrepOps pipeline is an ML pipeline designed for Data-Centric AI.
Deep Learning is a category of Machine Learning algorithms leveraging neural networks with representation learning to imitate the way that the human brain works and gathers knowledge. Deep Learning is commonly used and has enabled tremendous progress in the fields of Computer Vision and Natural Language Processing.
Deep Reinforcement Learning
Deep Reinforcement Learning is the field of Machine Learning that combines Deep Learning and Reinforcement Learning. It is essentially an implementation of Reinforcement Learning where agents learn how to reach their target goals instead of receiving those targets as an arbitrary rule.
A deterministic label is a data label with a clear, indisputable value. In reality, no label is ever completely deterministic, as even the most objective cases come with a little bit of uncertainty and require a certain level of interpretation. That being said, in most use cases, deterministic labels can be assigned without a significant impact on the performance of the model that will consume them.
DevOps is a portmanteau term which refers to the combination of Software Development and IT Operations. As a methodology, DevOps aims to integrate the work performed by a software development team and an IT team by promoting collaboration and shared responsibility.
Distributed Computing is a technique that consists in grouping several computer systems with the purpose of coordinating processing power so that those systems appear as a single computer to the end-user.
Early Abort is an algorithm that identifies when an Active Learning algorithm is unlikely to yield good results and is essentially "doomed to fail", most frequently because of a poor initialization of the parameters and of the training dataset, so that the process can be stopped, reinitialized and restarted.
Early Stopping (Active Learning)
In Active Learning, Early Stopping is an algorithm that identifies the stage at which an Active Learning process can be stopped without negatively impacting the final performance of the model, by detecting when the remaining data does not contain additional useful records.
Early Stopping (Deep Learning)
In Deep Learning, Early Stopping is a form of regularization meant to avoid overfitting by halting the training process at the point when the performance on a validation set begins to worsen.
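A common way to implement this idea is a patience counter, as in the minimal sketch below. The `patience` parameter and the list of validation losses are assumptions made for illustration; in practice the losses would be computed epoch by epoch during training.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Index of the epoch whose weights should be kept.

    Training is halted once the validation loss has failed to improve for
    `patience` consecutive epochs; later epochs are never evaluated.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # validation performance keeps worsening: stop here
    return best_epoch

# Hypothetical validation losses: improvement stalls after epoch 2.
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.5]))  # 2
```

In this example training stops at epoch 4, so the lower loss at epoch 5 is never observed; a larger patience trades extra compute for a lower risk of stopping too soon.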
Edge Computing refers to a distributed computing framework (which includes compute and data storage resources) that brings applications closer to the source of the data that they are built on top of. Saying that something is computed on the edge essentially means that the computation happens directly on the device (usually, an IoT device) where the data got generated / collected, instead of having that data sent back to the Cloud where processing would happen otherwise.
ETL (an acronym for Extract-Transform-Load) is a three-step process through which data is extracted from different sources, transformed into a usable resource, and loaded into systems that data consumers can access and use downstream to solve business problems or build data products.
Explainability is the ability to understand what knowledge a specific parameter or node of an ML model refers to, and its importance and impact relative to the overall performance of the model.
Feature Engineering is the process of selecting and combining raw data features into new (usually more predictive) ones that can be used to train a Machine Learning model.
Feature Selection is the process of selecting a subset of relevant input variables and reducing their total number when developing a Machine Learning model.
A Feature Store is a service or platform designed to ingest large volumes of data (either streaming or batch) and engineer, compute, store, register and monitor features so that they can be easily consumed by data scientists and ML processes down the line.
FLOPS (also sometimes written flops or flop/s) stands for floating-point operations per second and is a measure of computer performance used in fields where scientific computations require floating-point calculations, such as AI.
Refer to Generative Adversarial Network.
Generative Adversarial Network
Generative Adversarial Networks, a.k.a. GANs, constitute a category of Machine Learning frameworks in which two neural networks compete against one another in a zero-sum game, forcing each other to continuously improve. More specifically, a first neural net, the generator, attempts to generate candidates that match a specific input data distribution, while the second neural net, called the discriminator, evaluates how realistic those candidates look by measuring their distance to the original data distribution. Their best-known application is the generation of images: StyleGAN is one such model, known for generating photorealistic faces. Other well-known image generators such as DALL-E and Imagen rely on autoregressive transformer or diffusion techniques rather than on adversarial training.
Graphics Processing Unit (GPU)
A Graphics Processing Unit (more commonly referred to as a GPU) is a processor whose main function is to render graphics and images by performing rapid mathematical calculations. Though originally designed for gaming, GPUs have quickly become a critical tool for Machine Learning experts due to their fast parallel processing capabilities.
The Ground Truth of a data record is the desired output of the prediction generated by the Machine Learning model, according to the person in charge of annotating the dataset. Ground Truth is paradoxically always subjective, even in seemingly objective use cases like the classification of pictures of everyday objects, because there is always a small uncertainty about what the picture actually represents, including for a human being.
Harmful Data is the part of the training data that causes issues to the learning process of an ML model, and can potentially cause a drop in that model's performance. Corrupted data or records that trigger Catastrophic Forgetting are examples of Harmful Data. The concept of harmfulness can be model-specific, meaning that a record that is harmful to the learning process of model A might not be harmful to the learning process of model B.
In the context of Supervised Learning, a Holdout Dataset is a labeled dataset which is set aside before the training process, and which is used to evaluate the model rather than to train it. The validation dataset and the test dataset are both holdout datasets.
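A minimal sketch of setting aside a validation dataset and a test dataset before training. The 10% fractions, the fixed seed and the function name are arbitrary assumptions for illustration.

```python
import random

def holdout_split(records, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the labeled records and set aside validation and test datasets.
    The remaining records are the only ones made available for training."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    validation = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, validation, test

train, validation, test = holdout_split(range(100))
print(len(train), len(validation), len(test))  # 80 10 10
```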
Human-in-the-Loop Machine Learning
Human-in-the-Loop Machine Learning is a subfield of Machine Learning where human agents are actively involved in the optimization of the learning process of an ML model, most frequently by acting on the training data.
A Hyperparameter is a parameter whose value controls the learning process. Hyperparameters are to be distinguished from regular model parameters whose values are being learned during training.
Refer to Hyperparameter Tuning.
Hyperparameter Tuning, a.k.a. Hyperparameter Optimization, is the process of choosing the optimal set of hyperparameters for a learning algorithm (usually, a Deep Learning model). Some of the most common hyperparameter tuning techniques include grid search, random search, Bayesian optimization and gradient-based optimization.
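The first two techniques can be sketched in a few lines. The `toy_validation_loss` function below is a made-up stand-in for the expensive step of training a model and measuring its validation loss; the hyperparameter names and ranges are likewise illustrative.

```python
import itertools
import random

def toy_validation_loss(learning_rate, batch_size):
    """Hypothetical objective, minimal at learning_rate=0.01 and batch_size=32."""
    return (learning_rate - 0.01) ** 2 + (batch_size - 32) ** 2 / 1000

def grid_search(objective, grid):
    """Evaluate every combination of the listed values and keep the best one."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

def random_search(objective, ranges, n_trials=20, seed=0):
    """Evaluate randomly drawn combinations within the given (low, high) ranges."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        score = objective(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

grid = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [16, 32, 64]}
print(grid_search(toy_validation_loss, grid))
print(random_search(toy_validation_loss,
                    {"learning_rate": (0.001, 0.1), "batch_size": (16, 64)}))
```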
Incremental Learning is the subfield of Online Learning where the scope of the problem (such as the number of classes in a classification problem) is expanded over time. Incremental Learning offers a framework that allows a model to learn new concepts, and is a very important research area on the way to Artificial General Intelligence. It can be considered a part of Continual Learning because of the likelihood that expanding the scope of learning will induce Catastrophic Forgetting if no special precautions are taken.
In Machine Learning, Inference is the process of feeding novel data (different from the data which was used to train the model) into a Machine Learning model to compute predictions.
Inference Metadata refers to any information that gets generated by a Machine Learning model at inference time, that is not the prediction itself. Inference Metadata is sometimes gathered by data scientists, mostly in the context of an Active Learning process. Common examples of Inference Metadata include the confidence level of the prediction as derived from the logits, the logits themselves, or the values of the activation functions when inferring.
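As an illustration of deriving a confidence level from logits, the sketch below applies a softmax and computes a few quantities commonly used as inference metadata. The function names and the particular choice of metadata (margin, entropy) are assumptions made for this example.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inference_metadata(logits):
    """Return the prediction plus metadata derived from the logits."""
    probs = softmax(logits)
    ranked = sorted(probs, reverse=True)
    return {
        "prediction": probs.index(ranked[0]),  # index of the most likely class
        "confidence": ranked[0],               # probability of that class
        "margin": ranked[0] - ranked[1],       # gap to the runner-up class
        "entropy": -sum(p * math.log(p) for p in probs if p > 0),
    }

metadata = inference_metadata([2.0, 1.0, 0.1])
```

A low confidence, a small margin or a high entropy all suggest that the model is unsure about the record.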
ML Interpretability refers to the ability to associate a cause to an effect within a Machine Learning model. A model is said to be interpretable if a human can accurately predict the output of the model based on the input.
Judgment Aggregation is the process of combining multiple judgments generated for the same data record into a consolidated label or annotation value to be used when training a Machine Learning model.
A Judgment is the opinion of a specific human annotator or an automated annotation system regarding the label or annotation to be attributed to a specific record. It is essentially a single instance of annotation. In practice, in order to avoid outliers generated through malpractice or to mitigate subjectivity, data scientists tend to aggregate multiple judgments made on the same record into a single label to be used for training.
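One simple way to aggregate judgments is a majority vote. The sketch below is only one possible aggregation scheme (weighted votes or probabilistic models are alternatives), and the function name aggregate_judgments is an assumption.

```python
from collections import Counter

def aggregate_judgments(judgments):
    """Return the most frequent judgment and the share of annotators who gave it."""
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]  # ties go to the first judgment seen
    return label, votes / len(judgments)

label, agreement = aggregate_judgments(["cat", "cat", "dog"])
```

The agreement ratio can serve as a rough indicator of how subjective or ambiguous the record is.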
Label Auditing, a.k.a. Label Quality Auditing, is the process of checking and validating the annotations generated for a training dataset. The best way to have annotations validated is to rely on a third party different from the party in charge of providing the labels, in order to ensure the truthfulness of the results.
Label Quality Auditing
Refer to Label Auditing.
Label Resource Optimization
Labeling Resource Optimization is the process of optimally allocating and managing all human resources and computer systems available to annotate data, taking into consideration annotators' schedules and computer resources on one hand, and customer budget and constraints on the other hand, in order to balance the real-time supply and demand for labeling tasks.
The use of complex labeling workflows that allow faulty records to be identified and re-annotated leads to the existence of successive versions of a label for the same data record. Label Versioning is the process of storing and managing those multiple versions, and of making them easily accessible to the user.
Labeling instructions are instructions logged in a document that is shared with a labeling provider when a labeling task is submitted, designed to explain in detail what is expected of the human annotators in charge of labeling the data. Labeling instructions usually contain class definitions as well as examples and counter-examples in order to educate the annotators on the challenges they might encounter when working on the user's proprietary dataset.
A marketplace of labeling providers, designed to enable ML teams to find the most appropriate way to get their data annotated, based on their use case, expectations and constraints. Using a labeling marketplace allows users to compare labeling companies, avoid long and expensive POCs, leverage specific expertise based on their particular project, and get full transparency on pricing and timelines. The Alectio Labeling Marketplace provides real-time access to both manual labeling solution providers and autolabeling models.
A Labeling Provider is a third-party organization that provides annotation services for Machine Learning teams that do not have the skills or the bandwidth to annotate the data themselves.
A Labeling Task is a request sent to a labeling provider or to an automated labeling system to get a batch of raw data annotated according to the provided labeling instructions.
A Labeling Tool is an application that facilitates or automates the generation of labels for unstructured training data. Most tools today are interfaces developed to enable a data annotator to generate, visualize and store the annotations (the exact functionalities depending on the type of data and task), but more and more tools are also capable of pre-generating synthetic labels which the annotator needs to either validate or modify (this strategy is one way to achieve human-in-the-loop labeling), or even of fully automating the data annotation process.
A Labeling Workflow is a data workflow meant to transfer raw data, data labels and labeling instructions from one system to the next in an organized manner with the purpose of getting data annotated. A labeling workflow can for example route raw data to a labeling provider and retrieve, store and organize the generated labels. Complex labeling workflows can include feedback loops, human-in-the-loop logic, label validation components or even a succession of conditional labeling pipelines; an example of that would be for a data scientist working on a license plate transcription model to send images to an object detection job first in order to detect the ones containing a license plate so that only those can be sent to a secondary transcription task.
LabelOps is the subfield of MLOps focused on the storage, processing and management of data labels and annotations. LabelOps includes, among others, label resource optimization, label quality auditing and label versioning.
In Machine Learning, a Learning Curve is a graphical representation of the relationship between a Machine Learning model's performance and the amount of training data that the model was trained on. Learning Curves are of particular importance in Active Learning, which aims at reducing the quantity of training data without reducing the performance of the model.
The Loss Function of a Machine Learning algorithm is the function that computes the distance between the current prediction of the algorithm and the ground truth.
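Two widely used loss functions can be sketched in a few lines: mean squared error for regression and cross-entropy for classification. The function names and signatures below are illustrative assumptions; ML libraries ship their own optimized versions.

```python
import math

def mean_squared_error(predictions, targets):
    """Average squared distance between predictions and ground-truth values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(predicted_probs, true_class):
    """Negative log-probability that the model assigned to the correct class."""
    return -math.log(predicted_probs[true_class])

regression_loss = mean_squared_error([1.0, 2.0], [1.0, 4.0])
classification_loss = cross_entropy([0.7, 0.2, 0.1], true_class=0)
```

In both cases the loss shrinks toward zero as the prediction gets closer to the ground truth.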
Machine Learning is a field of Computer Science focused on getting computers and machines to perform specific actions which they haven't been explicitly programmed to do.
Machine Learning Pipeline
A Machine Learning Pipeline is a sequence of steps meant to orchestrate the flow of data in and out of a Machine Learning model.
Machine Teaching is a novel field of Computer Science that aims at improving the learning speed or performance of a Machine Learning model by intervening during the learning process or taking educated actions to alter the training algorithm. The most common Machine Teaching approach consists in building an optimal training dataset to allow an algorithm to learn more efficiently. Active Learning (which consists in dynamically selecting data) and Human-in-the-Loop Machine Learning (which consists in dynamically fixing the labels) are the two most popular approaches to Machine Teaching.
Meta Learning is a category of Machine Learning algorithms that aim to learn from the output of other Machine Learning algorithms.
A Microtask (or micro labeling task) is a labeling task involving a small number of records. Microtasks offer the benefit of a lower latency, which is necessary to Machine Learning applications requiring continuous labeling.
Missing Data occurs when one or several variables are missing within an observation. It can also refer to cases where entire records, sequences or files are missing from a data store. Dealing with missing data is one of the trickiest challenges for a data scientist to solve, and a Machine Learning model trained on such data can behave unexpectedly.
ML-Driven Active Learning
ML-Driven Active Learning is a category of Active Learning algorithms where the querying strategy uses Machine Learning. Traditionally, Active Learning relies on arbitrary, static, rules-based querying strategies which are tuned on neither the type and size of the dataset nor the type and state of the model.
An ML Engineer is a software engineer whose focus is on the design and development of systems and pipelines meant to deploy and monitor ML models in production.
An ML Framework is an interface that allows developers to build and deploy ML models faster and more easily. Such an interface allows organizations to scale their Machine Learning initiatives securely while maintaining a healthy ML lifecycle.
An ML Library is a compilation of learning algorithms and other peripheral utility functions and routines (such as dimensionality reduction tools) readily available for use by data scientists so that they do not have to reimplement those functionalities from scratch when developing a Machine Learning model. Some of the most famous ML Libraries are Scikit Learn, Keras, Tensorflow and PyTorch.
Observability (or ML Observability) is the subfield of MLOps focused on monitoring data quality and the performance of Machine Learning models in production, with the intent of detecting data drifts, identifying and addressing issues and improving their explainability across their entire lifecycle.
An ML Platform is a system designed to manage the lifecycle of a Machine Learning model, with an emphasis on experimentation, reproducibility, monitoring and deployment.
MLOps is short for Machine Learning Operations. MLOps is the field of Machine Learning Engineering that focuses on streamlining the process of deploying, monitoring, maintaining and updating Machine Learning models in production.
An MLOps Engineer is an engineer specialized in building MLOps pipelines and systems designed to facilitate and automate the deployment, monitoring and maintenance of Machine Learning models in production.
Model-Centric AI is an approach to AI and Machine Learning in which the performance of an ML model is improved by making changes to the model (for example, to its architecture, optimization functions or hyperparameters). Model-Centric AI is the approach used by the huge majority of ML practitioners.
Model Degradation is a key metric used at Alectio to refer to the drop in performance caused by a Data Selection process. Degradation values are relative to the performance metric they refer to (such as accuracy or mAP score), and can be negative when the selection process actually caused an improvement in model performance.
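As a rough illustration only, one plausible reading of this definition is the difference between the metric reached without data selection and the metric reached with it. The function name, its arguments and the choice of reference are assumptions for this sketch, not Alectio's documented formula.

```python
def model_degradation(reference_metric, selected_data_metric):
    """Drop in a performance metric attributed to data selection.

    Assumption: reference_metric is the score of the model trained without
    selection; selected_data_metric is the score after selection.
    """
    return reference_metric - selected_data_metric

drop = model_degradation(0.92, 0.90)         # positive: selection cost performance
improvement = model_degradation(0.90, 0.93)  # negative: selection helped
```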
Model Deployment is the process of integrating (people often use the term "pushing") a Machine Learning model into an existing production environment in order to leverage its decision capabilities for business purposes.
Refer to Inference.
A Model Library is a compilation of Machine Learning models, sometimes pre-trained, available for data scientists to use off-the-shelf, without the need to code one from scratch. A model library is the low-code approach to Machine Learning.
Model Serving is the process of hosting a Machine Learning model and making it available via APIs, so that data products can use the predictions of that model.
Model Testing is the process of measuring the final performance of a Machine Learning model on a holdout labeled dataset after the training process is completed. Model Testing is the last step in the development of an ML model before it can be deployed to production.
Model Training is the concept of injecting (training) data into a Machine Learning algorithm and attempting to fit the model to the data by tuning its parameters to minimize a loss function over the prediction range. Model Training is at the core of Supervised Machine Learning.
Model Validation is the process of confirming the accuracy and performance of a Machine Learning model on a holdout labeled dataset before the data scientist can decide if the model's training process is complete, or if the model needs to be modified or updated. Model Validation differs from Model Testing, even though both operations are carried out on holdout labeled datasets, in that the model can be validated multiple times before it is considered trained, while it is only tested once, after the end of the training process, to measure its final performance.
Multi-Generational Querying Strategy
A Multi-Generational Querying Strategy is a querying strategy that leverages the training and inference metadata generated over all past loops, instead of just focusing on the metadata generated during the previous loop.
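A minimal sketch of the idea, assuming the metadata kept for each loop is a mapping from record ID to prediction confidence: instead of ranking records by the latest confidence only, the score below averages confidence over every past loop. The data layout and the averaging rule are assumptions chosen for illustration; real strategies can combine the history in many other ways.

```python
def multi_generational_scores(history):
    """Average each record's confidence over all loops in which it was scored."""
    totals, counts = {}, {}
    for loop_confidences in history:  # one dict per past Active Learning loop
        for record_id, confidence in loop_confidences.items():
            totals[record_id] = totals.get(record_id, 0.0) + confidence
            counts[record_id] = counts.get(record_id, 0) + 1
    return {record_id: totals[record_id] / counts[record_id]
            for record_id in totals}

history = [{"a": 0.5, "b": 0.9}, {"a": 0.7, "b": 0.3}]
scores = multi_generational_scores(history)
```

Here record "b" looks very uncertain in the latest loop alone, but over both loops it is no more uncertain than record "a".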
Refer to ML Observability.
Online Learning is an approach used in Machine Learning where training data is injected into the model one observation at a time because it is computationally infeasible to train over the entire dataset at once. Data scientists typically rely on Online Learning when the dataset is too large but they still want to extract information from the entire dataset, or when they are dealing with streaming data.
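The sketch below illustrates the one-observation-at-a-time pattern with a one-variable linear model updated by stochastic gradient descent. The model, the learning rate and the synthetic stream are assumptions chosen to keep the example small.

```python
def online_linear_regression(stream, learning_rate=0.05):
    """Fit y = weight * x + bias, updating after every single observation."""
    weight, bias = 0.0, 0.0
    for x, y in stream:  # the full dataset is never held in memory
        error = (weight * x + bias) - y
        weight -= learning_rate * error * x  # gradient step for the weight
        bias -= learning_rate * error        # gradient step for the bias
    return weight, bias

# Synthetic stream generated lazily from the relationship y = 2x + 1.
stream = ((x, 2.0 * x + 1.0) for _ in range(2000) for x in (0.0, 0.5, 1.0))
weight, bias = online_linear_regression(stream)
```

Because each update only needs the current observation, the same loop works on data that is too large to load at once or that arrives as a stream.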
An Optimizer is an algorithm or a method that updates the attributes of a neural network, such as its weights and learning rate.
Parallelization is the process of executing jobs and tasks simultaneously in parallel, in order to achieve a significant speedup and boost in performance.
Pooling Active Learning
Pooling Active Learning is a type of Active Learning algorithm where records are selected only after evaluating all of the unselected data pool. Pooling Active Learning tends to be more compute-greedy, but provides a better chance for the best records to be selected for the following loop.
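A minimal sketch of pool-based selection, assuming a least-confidence rule: every record in the unselected pool is scored first, and only then are the records with the lowest confidence picked. The function name, the input layout and the selection rule are illustrative assumptions.

```python
def select_from_pool(confidences, budget):
    """Rank the whole pool by confidence and return the least confident records."""
    ranked = sorted(confidences, key=confidences.get)  # lowest confidence first
    return ranked[:budget]

pool_confidences = {"a": 0.9, "b": 0.4, "c": 0.6, "d": 0.55}
selected = select_from_pool(pool_confidences, budget=2)
```

The loop size is fully controlled by the budget argument, at the cost of running inference on the entire pool before any record is selected.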
Probabilistic Data Label
Refer to Probabilistic Label.
A Probabilistic Label is a data label with no clear-cut value. Instead of a single value, a Probabilistic Label takes a mathematical distribution as a value, which can be used as such to train a Machine Learning model.
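For a classification task, one common way to train on such a label is to compute the cross-entropy against the full label distribution rather than against a single class. The sketch below assumes a discrete distribution over classes; the function name is an assumption.

```python
import math

def soft_cross_entropy(predicted_probs, label_distribution):
    """Cross-entropy between a predicted distribution and a probabilistic label."""
    return -sum(q * math.log(p)
                for p, q in zip(predicted_probs, label_distribution)
                if q > 0)  # classes with zero label mass contribute nothing

# A record that annotators consider 70% "cat" and 30% "dog".
loss = soft_cross_entropy([0.6, 0.4], [0.7, 0.3])
```

With a one-hot label distribution this reduces to the usual cross-entropy for a clear-cut label.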
A Querying Strategy is a selective sampling algorithm or strategy used iteratively for each loop of an Active Learning process.
Most labeling providers focus on their ability to label data at scale, but rarely provide guarantees in terms of timeframe, which can represent a major issue for projects on a strict deadline. Real-Time Labeling, on the contrary, is a framework which allows ML teams to get their data labeled in a timely manner, if not immediately, by leveraging multiple vendors, automated labeling systems and Labeling Resource Optimization algorithms.
Reinforcement Learning is a category of Machine Learning algorithms that enables a robot or a computer (called an agent) to learn in an interactive environment through trial and error, by using the feedback generated from its own actions and experiences.
Responsible AI is the practice of developing AI applications that fairly impact customers and society as a whole and inspire sufficient trust among the general public as to lead to the global adoption of AI.
ROT Data (a.k.a. R.O.T. Data)
ROT Data is the data that an organization retains even though the information it contains no longer has any business value. R.O.T. is the acronym for Redundant - Obsolete - Trivial.
Refer to Querying Strategy.
Semi-Supervised Learning is a Machine Learning training paradigm that leverages partially labeled data.
Specialized Data Labeling
Specialized Data Labeling (also sometimes called Specialty Data Labeling) refers to the process of having subject matter experts (such as doctors, surgeons, lawyers, translators or scientists) label specialty data (for example, x-rays, legal or scientific documents). Most labeling companies do not provide Specialized Data Labeling services, as finding expert annotators is relatively difficult.
Streaming Active Learning
Streaming Active Learning is a type of Active Learning algorithm where records are selected in a streaming fashion, i.e., record by record, without requiring the entire remaining data to be evaluated and inferred before the next Active Learning loop can be started. Streaming Active Learning presents the advantage of lower operational costs but, unlike Pooling Active Learning, does not guarantee a controlled progression of the training process, as loop size cannot be guaranteed.
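A minimal sketch of stream-based selection, assuming a fixed confidence threshold: each record is accepted or rejected as soon as it is seen, without looking at the rest of the data. The function name, the threshold value and the input layout are illustrative assumptions.

```python
def select_from_stream(stream, threshold=0.6):
    """Yield record IDs one at a time when the model's confidence is below the threshold."""
    for record_id, confidence in stream:
        if confidence < threshold:  # decision made on this record alone
            yield record_id

stream = [("a", 0.9), ("b", 0.4), ("c", 0.6), ("d", 0.55)]
selected = list(select_from_stream(stream))
```

The number of selected records depends entirely on how many fall below the threshold, which is why the loop size cannot be guaranteed in advance.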
Streaming Data is data that flows into a system and is continuously collected and stored. Streaming data needs to be processed in real time or incrementally, without having access to the totality of the data.
The term Structured Data refers to data that can be organized into a standardized format, meaning that it can easily be fitted into a database, so that its elements can be accessed more effectively for Data Analysis and Machine Learning. Some people may use the term loosely to refer to data that does not need to be labeled or annotated before it can be used for Supervised Learning, though in some rare cases labeling such data might still be necessary or desired; for example, if the data refers to the test results of a patient, a physician might still be required to provide a prognosis as a label.
Subjective Data Labeling
Subjective Data Labeling refers to the process of labeling data for use cases where a clear and indisputable label cannot be determined, and a label is subject to personal judgment and preferences. Content moderation is an example of a use case where subjective labeling is involved. Technically, no label can ever be 100% objective.
Supervised Learning is a category of Machine Learning algorithms that are trained on labeled datasets to learn how to predict outcomes or classify data accurately. These algorithms learn by being explicitly fed input - desired output pairs.
Sustainable Machine Learning
Sustainable Machine Learning refers to a category of Machine Learning algorithms or techniques aiming at reducing the carbon footprint typically associated with training or serving a Machine Learning model. Transfer Learning or Few-Shot Learning are examples of Sustainable Machine Learning approaches.
Synthetic Data Generation
Synthetic Data Generation is the generation of annotated information by computer simulations or algorithms (for example, GANs) as an alternative to real-world data. Synthetic Data Generation is appealing to many organizations because they see it as a more cost-effective alternative to the collection and annotation of natural data.
A Synthetic Label is a data label that has been generated by a Machine Learning model.
Training Metadata refers to any peripheral information that gets generated by a Machine Learning model during the training process. Training Metadata is almost never gathered by data scientists, which leads to losing precious information that could otherwise be used to better understand how Machine Learning models learn and even to actively teach other models how to perform similar tasks. The values taken by the loss function of a Deep Learning model, or the values of the weights and biases after each epoch, constitute valid forms of Training Metadata.
Unstructured data is data for which a preset data model or schema cannot be established and which therefore cannot be stored in a relational database. Text, image, audio and video data all fit into this category because the information they contain isn't columnar. The huge majority of unstructured data needs to be labeled before it can be used for Supervised Learning.
Unsupervised Learning is a category of Machine Learning algorithms that analyzes, clusters and learns patterns from unlabeled data.
Useful data is the part of a training dataset that is dense in relevant information and, hence, is the most likely to positively impact the learning process of a Machine Learning model and boost its performance. Prioritizing the most useful data when training a Machine Learning model aims at obtaining a steeper learning curve. The concept of usefulness is specific to the Machine Learning model that is to be trained with the data.
Useless data is the part of a training dataset which has no impact (neither positive nor negative) on the learning process of a Machine Learning model. Data is useless either because its informational content is irrelevant to the task at hand, or because it is redundant with the information already retained by the model.
Weak Supervision is a branch of Machine Learning and Data Preparation methodology that combines rules or imprecise data (which can be either human- or machine-generated) to power a scalable labeling process capable of efficiently annotating large amounts of training data.
A White-Box Model is a model whose predictions can be explained and whose predictive features can be identified.