- Updated on 31 May 2023
- 7 Minutes to read
Data Explorer consists of the following broad set of operations:
- Data location registration: Registering a data container that captures the location of input images and videos. Multiple types of data containers are supported, as shown in the above diagram.
- Data ingestion: This step scans the data in the registered data container and produces features (embeddings) and summaries that are registered to the Data Explorer web portal. Data ingestion can be performed using any of the following three modes.
- adectl command line utility: adectl is a command line utility supported on Linux and Mac machines. The command is executed on a self-provisioned Linux/Mac machine (including a laptop) and hence provides the highest level of data privacy. This mode is easy to start with, but it cannot scale out to compute resources across multiple machines.
- Self-hosted Kubernetes cluster (coming soon): Data ingestion can be executed on a self-provisioned Kubernetes cluster on-prem or in your VPC in the cloud.
- Auto-provisioned compute: For a fully hosted SaaS experience, Data Explorer auto-provisions compute resources to ingest your data.
- Catalog registration: A catalog is registered with Data Explorer using one of the following modes.
- External catalog connect: Catalogs stored in self-hosted catalog stores, such as a MySQL database, can be connected to Data Explorer to make the full catalog available in Data Explorer.
- Import through CSV (comma-separated values): The Data Explorer UI allows importing catalog CSV files into a table.
- adectl-based import (deprecated): The adectl command line utility allows importing a catalog from a CSV file.
- Data Exploration and analysis: Once the above preparatory operations are complete, the data is available for exploration, curation, model analysis and other operations supported by Data Explorer.
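To make the CSV import mode above concrete, a catalog file is just a table of per-object metadata keyed by each object's path. The column names below (`path`, `camera`, `timestamp`) are hypothetical, a minimal sketch rather than Data Explorer's required schema:

```python
import csv
import io

# Hypothetical catalog rows: one row per data object, keyed by its path.
# Column names are illustrative, not Data Explorer's actual schema.
rows = [
    {"path": "CAMERA-FRONT/drive01/0001.jpg", "camera": "front", "timestamp": "2023-05-01T10:00:00"},
    {"path": "CAMERA-BACK/drive01/0001.jpg", "camera": "back", "timestamp": "2023-05-01T10:00:00"},
]

def write_catalog_csv(rows):
    """Serialize catalog rows to CSV text suitable for import through the UI."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["path", "camera", "timestamp"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(write_catalog_csv(rows))
```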
Local mode setup
For evaluation purposes, the software running on the web portal can be installed on the user-provisioned Linux/Mac machine where adectl is installed. This setup supports limited scale and capabilities while ensuring extracted features and thumbnails stay within the user-provisioned machine.
The following user roles are available.
- Organization Admin: This role is a super user with access to all capabilities on the web portal. The following capabilities are exclusive to this role.
- Registering secrets (credentials) to access data and catalog stores.
- Registering data stores (containers) and external catalogs by providing the necessary URLs and secrets.
- User and group management.
- User: This role maps to a data engineer/data scientist persona who is responsible for
- Defining datasets: specifying a dataset and the choice of pre-processing and featurization to be run on the objects in the dataset.
- Creating different types of data analysis jobs to curate data objects and create a Resultset.
- Finance Admin: This role maps to the finance team member responsible for tracking invoices and payments.
A container describes a storage location from which data is ingested into the system. A container can be an S3 bucket, Azure blob store, Google Cloud Storage bucket, or a directory on the local file system. The container is registered through the web UI, with the user providing details such as the endpoint URL and credentials.
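The exact fields the registration UI asks for are product-specific; the sketch below merely illustrates, with made-up field names, the kind of information (endpoint URL, bucket, credential reference) a container registration typically captures:

```python
# Hypothetical container registration payloads -- field names are
# illustrative, not Data Explorer's actual schema.
containers = {
    "s3": {
        "type": "s3",
        "endpoint_url": "https://s3.us-east-1.amazonaws.com",
        "bucket": "my-vision-data",
        "secret_ref": "s3-readonly-creds",  # secret registered separately by the Org Admin
    },
    "local": {
        "type": "filesystem",
        "root": "/data/vision",
    },
}

def validate_container(cfg):
    """Minimal sanity check: each container type needs its location fields."""
    required = {"s3": ["endpoint_url", "bucket"], "filesystem": ["root"]}
    missing = [f for f in required[cfg["type"]] if f not in cfg]
    return missing == []

assert all(validate_container(c) for c in containers.values())
```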
A dataset is an entity that specifies a selector on the contents of the container. A dataset can be of Image or Video type.
For example, suppose an S3 bucket has two directories, CAMERA-FRONT and CAMERA-BACK, containing images from the front and back cameras, respectively, and a different feature extraction model is most appropriate for each camera. In that case, you can define two datasets with the glob patterns CAMERA-FRONT/**/*.jpg and CAMERA-BACK/**/*.jpg, respectively, to logically group the images from the two cameras.
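The selector semantics can be previewed with Python's standard `glob` module as a stand-in for however Data Explorer evaluates dataset selectors: with recursive matching enabled, `**` matches any chain of subdirectories, including none, so the pattern picks up JPEGs at every depth under CAMERA-FRONT while excluding everything else:

```python
import glob
import os
import tempfile

# Build a throwaway directory tree mimicking the two-camera bucket layout.
tree = [
    "CAMERA-FRONT/drive01/0001.jpg",
    "CAMERA-FRONT/0002.jpg",
    "CAMERA-BACK/drive01/0001.jpg",
    "notes/readme.txt",
]
root = tempfile.mkdtemp()
for rel in tree:
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

# The same glob pattern used when defining the dataset; with recursive=True,
# "**" matches zero or more intermediate directories.
front = sorted(
    os.path.relpath(p, root)
    for p in glob.glob(os.path.join(root, "CAMERA-FRONT/**/*.jpg"), recursive=True)
)
print(front)
```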
A pipeline is an abstraction that captures the ingest processing routines. A pipeline typically has pre-processing, feature extraction, thumbnail generation and feature summarization stages.
Pipelines are triggered through the following command:

```
adectl run -n <dataset-name> -i <directory-with-input-objects>
```
The above processing registers features and produces a catalog. The catalog is accessed using the 'Catalog' button on the dataset card, as shown below.
Data Explorer has pre-registered pipelines available out of the box. Additionally, you can define your own pipelines, either by using pre-registered Docker images or by uploading your own Docker image and using it in your pipelines. Please go through Pre-registered pipelines for more details.
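The stage sequence named above (pre-processing, feature extraction, thumbnail generation, feature summarization) can be pictured as a simple function chain. The stage bodies below are placeholders standing in for real processing, not Data Explorer's actual pipeline code:

```python
# Placeholder stages -- each per-object stage takes and returns a record dict.
def preprocess(record):
    record["normalized"] = True  # stand-in for resize / color-space conversion
    return record

def extract_features(record):
    record["embedding"] = [0.1, 0.2, 0.3]  # stand-in for a model's embedding
    return record

def make_thumbnail(record):
    record["thumbnail"] = record["path"] + ".thumb.jpg"
    return record

def summarize(records):
    """Feature summarization runs over the whole dataset, not per object."""
    n = len(records)
    dim = len(records[0]["embedding"])
    mean = [sum(r["embedding"][i] for r in records) / n for i in range(dim)]
    return {"count": n, "mean_embedding": mean}

def run_pipeline(paths):
    records = [{"path": p} for p in paths]
    records = [make_thumbnail(extract_features(preprocess(r))) for r in records]
    return records, summarize(records)

records, summary = run_pipeline(["a.jpg", "b.jpg"])
print(summary["count"])
```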
Data Visualization Job
Once data is ingested, a data visualization job can be created by browsing the catalog through the 'Catalog' button in the dataset card, as shown in the previous section.
The data visualization UI provides capabilities to explore, drill down and curate the data using cluster views, nearest neighbour searches and similarity searches. The curated subset of the data objects is referred to as a resultset. A resultset can be downloaded to a local directory or exported to an S3 bucket, Azure blob store, or Google Cloud Storage bucket for downstream processing like labelling or machine learning training.
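Conceptually, downloading a resultset to a local directory amounts to copying the curated objects while preserving their relative paths. The sketch below is a filesystem-only illustration under that assumption (the `resultset` variable and layout are hypothetical; a cloud export would go through the relevant storage SDK instead):

```python
import os
import shutil
import tempfile

def export_resultset(resultset, src_root, dst_root):
    """Copy each curated object from src_root to dst_root, keeping relative paths."""
    for rel in resultset:
        src = os.path.join(src_root, rel)
        dst = os.path.join(dst_root, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(src, dst)

# Demo on a throwaway tree.
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
resultset = ["CAMERA-FRONT/0001.jpg"]  # hypothetical curated subset
os.makedirs(os.path.join(src, "CAMERA-FRONT"))
open(os.path.join(src, "CAMERA-FRONT", "0001.jpg"), "w").close()
export_resultset(resultset, src, dst)
```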
Clustering and Embedding
When a job is submitted, low-dimensional representations and coresets are computed to cluster the data objects, enabling exploration and curation.