Overview

System Overview

Data Explorer consists of the following broad set of operations:

  1. Data location registration: Registering a data container that captures the location of input images and videos. Multiple types of data containers are supported, as shown in the above diagram.
  2. Data ingestion: This step scans the data in the registered data container and produces features (embeddings) and summaries that are registered to the Data Explorer web portal. Data ingestion can be performed using any of the following three modes.
    1. adectl command line utility: adectl is a command line utility supported on Linux and Mac machines. The command is executed on a self-provisioned Linux/Mac machine (including a laptop) and hence provides the highest level of data privacy. This mode is easy to start with but cannot scale out to use compute resources across multiple machines.
    2. Self-hosted Kubernetes cluster (coming soon): Data ingestion can be executed on a self-provisioned Kubernetes cluster on-prem or in your VPC in the cloud.
    3. Auto-provisioned compute: For a fully hosted SaaS experience, Data Explorer auto-provisions compute resources to ingest your data.
  3. Catalog registration: A catalog is registered to Data Explorer using one of the following modes.
    1. External catalog connect: Catalogs stored in self-hosted catalog stores, such as a MySQL database, can be connected to Data Explorer to make the full catalog available in Data Explorer.
    2. Import through CSV (comma-separated values): The Data Explorer UI allows importing catalog CSV files into a table (a small illustrative example follows this list).
    3. adectl based import (deprecated): The adectl command line utility allows importing a catalog from a CSV file.
  4. Data exploration and analysis: Once the above preparatory operations are complete, the data is available for exploration, curation, model analysis and other operations supported by Data Explorer.
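
For illustration only, a catalog CSV is a plain table keyed by the object's file name plus whatever metadata columns you maintain. The column names below are hypothetical and are not a schema mandated by Data Explorer; the snippet simply shows what such a file looks like when read back.

import csv
from io import StringIO

# Hypothetical catalog rows: a file name column plus user-defined metadata columns.
catalog_csv = StringIO(
    "file_name,camera,captured_at,label\n"
    "front_0001.jpg,front,2023-05-01T10:00:00Z,pedestrian\n"
    "back_0001.jpg,back,2023-05-01T10:00:02Z,vehicle\n"
)

for row in csv.DictReader(catalog_csv):
    print(row["file_name"], row["label"])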
File updates are skipped
Data ingestion scans the registered source for new files to ingest based on file names. Hence, if a file was ingested and later updated with new content, the new content will not be re-ingested.
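
A minimal sketch of this name-based behaviour, with hypothetical helper and variable names (not the actual ingestion code):

from pathlib import Path

def files_to_ingest(source_dir: str, already_ingested: set) -> list:
    """Return files not ingested yet, keyed only by file name.

    Because the check is by name, a file whose contents changed after its
    first ingestion is not returned again: its name is already known.
    """
    return [
        p for p in sorted(Path(source_dir).rglob("*"))
        if p.is_file() and p.name not in already_ingested
    ]

# 'front_0001.jpg' was ingested earlier; even if it is later overwritten
# with new content, it is skipped on the next scan.
print(files_to_ingest("/data/CAMERA-FRONT", {"front_0001.jpg"}))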


Your data and catalog information stay where they are
Only extracted features and thumbnails are uploaded to the web portal. Full data and catalog are not copied.

Local mode setup

For evaluation purposes, the software running on the web portal can be installed on the user-provisioned Linux/Mac machine where adectl is installed. This setup supports limited scale and capabilities while ensuring extracted features and thumbnails stay within the user-provisioned machine. 

Self-hosting
The software components powering the web portal are deployed in a Kubernetes cluster and can therefore be self-hosted on any cloud or on-prem Kubernetes cluster. Please contact us for more details.

User Roles

The following user roles are available.

  1. Organization Admin: This role is a super user who has access to all capabilities on the web portal. The following capabilities are available only to this role.
    1. Registering secrets (credentials) to access data and catalog stores.
    2. Registering data stores (Containers) and External catalogs by providing the necessary URLs and secrets.
    3. User and group management.
  2. User: This role maps to a data engineer/data scientist persona who is responsible for
    1. Defining datasets - Specifying the Dataset and the choice of pre-processing and featurization to be run on the objects in the dataset.
    2. Creating different types of data analysis jobs to curate data objects and create a Resultset.
  3. Finance Admin: This role maps to a finance team member responsible for keeping track of invoices and payments.

Terminology

Container

A container describes a storage location from which data is ingested into the system. A container can be an S3 bucket, an Azure blob store, a Google Cloud Storage bucket or a directory on the local file system. The container is registered through the web UI, with the user providing details such as the endpoint URL and credentials.

Local Container
If the data to be ingested is present on the local file system, then explicit container creation is not required.

Dataset

A dataset is an entity that specifies a selector on the contents of the container. A dataset can be of Image or Video type.

For example, suppose an S3 bucket has two directories, CAMERA-FRONT and CAMERA-BACK, with images from the front and back cameras, respectively, and each camera's images call for a different feature extraction model. In such a case, you can define two datasets with the glob patterns CAMERA-FRONT/**/*.jpg and CAMERA-BACK/**/*.jpg, respectively, to logically group the images from the two cameras.
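
As a rough local illustration of how such glob selectors partition the same container into logical datasets (a Python sketch; the bucket path below is hypothetical, and the real selection is performed by Data Explorer against the registered container):

from pathlib import Path

# Hypothetical local mirror of the bucket layout described above.
bucket_root = Path("/data/my-bucket")

# Each dataset is just a named glob selector over the same container.
datasets = {
    "front-camera": "CAMERA-FRONT/**/*.jpg",
    "back-camera": "CAMERA-BACK/**/*.jpg",
}

for name, pattern in datasets.items():
    files = sorted(bucket_root.glob(pattern))
    print(f"{name}: {len(files)} images selected by {pattern}")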

Pipelines

A pipeline is an abstraction that captures the ingest processing routines. A pipeline typically has pre-processing, feature extraction, thumbnail generation and feature summarization stages.

Pipelines are triggered through the following command.

adectl run -n <dataset-name> -i <directory-with-input-objects>

The above processing registers features and produces a catalog. The catalog is accessed using the 'Catalog' button on the dataset card, as shown below.

Data Explorer has pre-registered pipelines available out of the box. Additionally, you can define your own pipelines, either using pre-registered Docker images or uploading your own Docker image and using it in your pipelines. Please go through Pre-registered pipelines for more details.

Data Visualization Job

Once data is ingested, a data visualization job can be created by browsing the catalog through the 'Catalog' button in the dataset card, as shown in the previous section.

 

Resultset

The data visualization UI provides capabilities to explore, drill down and curate the data using cluster views, nearest neighbour searches and similarity searches. The curated subset of the data objects is referred to as a resultset. A resultset can be downloaded to a local directory or exported to an S3 bucket, Azure blob store, or Google Cloud Storage bucket for downstream processing like labelling or machine learning training.

Clustering and Embedding

When a job is submitted, low-dimensional representations of the embeddings and coresets are used to cluster the data objects to enable exploration and curation.
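
Conceptually, this step resembles the following sketch, which projects high-dimensional embeddings into a low-dimensional space and clusters them. This is illustrative only (random stand-in embeddings, scikit-learn PCA and k-means); the actual models, coreset construction and parameters used by Data Explorer are not described here.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Stand-in for per-object embeddings produced by the ingestion pipeline.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))  # 1000 objects, 512-dim features

# Low-dimensional representation for visual exploration (e.g. a 2-D scatter view).
low_dim = PCA(n_components=2).fit_transform(embeddings)

# Cluster assignments used to group similar objects in the cluster views.
labels = KMeans(n_clusters=8, random_state=0).fit_predict(embeddings)

print(low_dim.shape, labels[:10])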

