Pipelines and Dockers
- Updated on 15 Feb 2023
Data registration occurs through the execution of a data ingestion pipeline, which consists of the following modules.
- Image Preprocessor: Any processing applied to the image/frame before it is fed to the featurizer.
- Featurizer: Featurizes each image/frame, typically using a deep neural network (DNN).
- Thumbnail generator: Generates a compact representation of the image/frame for display in the Data Explorer UI.
- Attribute generator: Generates a CSV file with attributes for each image/frame; these attributes are ingested into the Data Explorer catalog. More than one attribute generator can be attached to a pipeline.
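The flow of a frame through these four modules can be sketched as below. The function names and signatures here are illustrative stand-ins, not the actual interface specification that Data Explorer expects from each module.

```python
# Hypothetical stand-ins for the four pipeline modules; the real
# interface specification is defined by Data Explorer and not shown here.
# A frame is modelled as a 2D list of 8-bit pixel intensities.

def preprocess(frame):
    # e.g. scale 8-bit pixel values to [0, 1] before featurization
    return [[p / 255.0 for p in row] for row in frame]

def featurize(frame):
    # a real featurizer would run a DNN; here we just flatten the frame
    return [p for row in frame for p in row]

def make_thumbnail(frame, step=8):
    # crude downsample standing in for a compact UI representation
    return [row[::step] for row in frame[::step]]

def generate_attributes(frame):
    # attributes end up in a CSV that is ingested into the catalog
    flat = [p for row in frame for p in row]
    return {"mean_intensity": sum(flat) / len(flat)}

def ingest(frame):
    # one frame's trip through the ingestion pipeline
    pre = preprocess(frame)
    return {
        "embedding": featurize(pre),
        "thumbnail": make_thumbnail(frame),
        "attributes": generate_attributes(frame),
    }
```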
Customizing pipelines
Custom pipelines let you build a pipeline from one or more of the above modules, supplied by the user as Docker images that comply with the interface specification expected by each module type. Some use cases enabled by this support are as below:
- Bring your own featurizer (BYOF) that is tuned for your dataset.
- Domain-specific preprocessors that improve the behaviour of the featurizer.
- Custom logic to extract attributes from each image/frame. For example, a lens dirt detection program can be packaged as an attribute generator docker; Data Explorer stores the result of this program as an attribute against each frame/image, available for querying and joining with other internal and external catalog tables.
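The lens dirt example above could look roughly like this: a small script, packaged in a docker image, that emits one CSV row of attributes per frame. The heuristic, the column names, and the output schema are assumptions for illustration, not the actual contract Data Explorer defines for attribute generators.

```python
import csv

def dirt_score(pixels):
    # placeholder heuristic (fraction of dark pixels) standing in
    # for a real lens-dirt detection model
    dark = sum(1 for p in pixels if p < 16)
    return dark / len(pixels)

def write_attributes(frames, out_path):
    # one CSV row per frame; column names are illustrative,
    # not the actual schema Data Explorer expects
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["frame_id", "lens_dirt_score"])
        for frame_id, pixels in frames.items():
            writer.writerow([frame_id, f"{dirt_score(pixels):.3f}"])
```

The CSV this produces would be ingested into the catalog, making `lens_dirt_score` queryable like any other attribute.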
The section below provides a brief overview of the entities involved in defining pipelines, with later articles describing the details.
- Docker repositories: A docker repository hosted on Docker Hub or AWS ECR (Elastic Container Registry) that is owned and managed by your organization. Only users with the OrganizationAdmin role are allowed to create docker repositories.
- Docker images: A docker image that provides preprocessor, featurizer, thumbnail generator, or attribute generator functionality. Data Explorer comes with a few pre-registered docker images out of the box that are sufficient for most general use cases.
- Pipelines: A pipeline is a directed acyclic graph (DAG) of docker images drawn from docker repositories. Data Explorer provides a few pre-registered pipelines for both video and image data types out of the box that are sufficient for most general-purpose visual datasets.
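To make the DAG idea concrete, a pipeline can be modelled as named docker-image references plus edges between them, with acyclicity checked before registration. The structure, image names, and validation logic below are assumptions for illustration, not the actual Data Explorer pipeline definition API.

```python
# Illustrative pipeline: one preprocessor feeding a featurizer,
# a thumbnail generator, and an attribute generator. Repository and
# tag names are made up.
PIPELINE = {
    "nodes": {
        "preprocess": "my-org/image-preprocessor:1.0",
        "featurize": "my-org/byof-featurizer:2.1",
        "thumbnail": "my-org/thumbnailer:1.0",
        "attributes": "my-org/lens-dirt-detector:0.3",
    },
    "edges": [
        ("preprocess", "featurize"),
        ("preprocess", "thumbnail"),
        ("preprocess", "attributes"),
    ],
}

def is_acyclic(pipeline):
    # DFS cycle check over the declared edges; a valid pipeline
    # must be a DAG
    graph = {n: [] for n in pipeline["nodes"]}
    for src, dst in pipeline["edges"]:
        graph[src].append(dst)
    visiting, done = set(), set()

    def visit(n):
        if n in done:
            return True
        if n in visiting:
            return False  # back edge means a cycle
        visiting.add(n)
        ok = all(visit(m) for m in graph[n])
        visiting.discard(n)
        done.add(n)
        return ok

    return all(visit(n) for n in graph)
```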