Overview
  • 21 Dec 2022

System Overview

Akridata Data Explorer consists of two major components:

  • A web portal accessible through a web browser.
  • A command-line tool named adectl and a set of feature extraction Docker images. This software is installed on a user-provisioned Linux or Mac machine. It reads data from one of the supported cloud data stores or from the local file system, extracts features, and uploads those features to the web portal. Additionally, adectl supports ingesting externally generated features and catalog information provided as comma-separated value (CSV) files.

Your data and catalog information stay where they are
Only extracted features and thumbnails are uploaded to the web portal. The full data and catalog are not copied.

Local mode setup

For evaluation purposes, the software that powers the web portal can be installed on the same user-provisioned Linux/Mac machine where adectl is installed. This setup supports limited scale and capabilities, but ensures that extracted features and thumbnails stay within the user-provisioned machine.

Self-hosting
The software components powering the web portal are deployed in a Kubernetes cluster and can therefore be self-hosted on any cloud or on-premises Kubernetes cluster. Please contact us for more details.

User Roles

The following user roles are available:

  1. Organization Admin: A superuser with access to all capabilities on the web portal. The following capabilities are available only to this role:
    1. Registering secrets (credentials) to access data and catalog stores.
    2. Registering data stores (containers) and external catalogs by providing the necessary URLs and secrets.
    3. User and group management.
  2. User: This role maps to a data engineer or data scientist persona who is responsible for:
    1. Defining datasets: specifying the dataset and the choice of pre-processing and featurization to run on the objects in the dataset.
    2. Creating different types of data analysis jobs to curate data objects and create a resultset.
  3. Finance Admin: This role maps to a finance team member responsible for keeping track of invoices and payments.
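
The role descriptions above can be summarized as a role-to-capability table. The following is a minimal Python sketch for illustration only: the names Role, Capability, ROLE_CAPABILITIES and can are hypothetical and are not part of the web portal or adectl.

```python
from enum import Enum, auto


class Role(Enum):
    ORGANIZATION_ADMIN = auto()
    USER = auto()
    FINANCE_ADMIN = auto()


class Capability(Enum):
    REGISTER_SECRETS = auto()
    REGISTER_DATA_STORES_AND_CATALOGS = auto()
    MANAGE_USERS_AND_GROUPS = auto()
    DEFINE_DATASETS = auto()
    CREATE_ANALYSIS_JOBS = auto()
    TRACK_INVOICES_AND_PAYMENTS = auto()


# Hypothetical mapping that mirrors the role list in this article.
ROLE_CAPABILITIES = {
    # The Organization Admin has access to all capabilities.
    Role.ORGANIZATION_ADMIN: set(Capability),
    Role.USER: {Capability.DEFINE_DATASETS, Capability.CREATE_ANALYSIS_JOBS},
    Role.FINANCE_ADMIN: {Capability.TRACK_INVOICES_AND_PAYMENTS},
}


def can(role: Role, capability: Capability) -> bool:
    """Return True if the given role includes the given capability."""
    return capability in ROLE_CAPABILITIES[role]


if __name__ == "__main__":
    print(can(Role.USER, Capability.REGISTER_SECRETS))
    print(can(Role.ORGANIZATION_ADMIN, Capability.REGISTER_SECRETS))
```

In this sketch, registering secrets, registering data stores and managing users are reachable only through the Organization Admin entry, which matches the list above.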

Terminology

Container

A container describes a storage location from which data is ingested into the system. A container can be an S3 bucket, an Azure blob store, a Google Cloud Storage bucket, or a directory on the local file system. A container is registered through the web UI, where the user provides details such as the endpoint URL and credentials.

Local Container
If the data to be ingested is on the local file system, explicit container creation is not required.

Dataset

A dataset is an entity that specifies a selector on the contents of the container. A dataset can be of Image or Video type.

For example, suppose an S3 bucket has two directories, CAMERA-FRONT and CAMERA-BACK, that hold images from the front and back cameras respectively, and each camera's images are best served by a different feature extraction model. In that case, you can define two datasets with the glob patterns CAMERA-FRONT/**/*.jpg and CAMERA-BACK/**/*.jpg to logically group the images from the two cameras.
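
To illustrate how such glob patterns partition the contents of a container, here is a minimal Python sketch. It is not the matcher used by Data Explorer: the functions glob_to_regex and group_by_dataset, the dataset names, and the object keys are all hypothetical, and the sketch assumes that ** matches zero or more directory levels.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a glob into a regex (assumed semantics, for illustration).

    '**/' matches zero or more directories, '*' matches within one
    path component, and '?' matches a single non-slash character.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")
            i += 3
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def group_by_dataset(keys, dataset_globs):
    """Group object keys under every dataset whose glob they match."""
    compiled = {name: glob_to_regex(g) for name, g in dataset_globs.items()}
    groups = {name: [] for name in dataset_globs}
    for key in keys:
        for name, regex in compiled.items():
            if regex.match(key):
                groups[name].append(key)
    return groups


if __name__ == "__main__":
    # Hypothetical object keys inside the bucket.
    keys = [
        "CAMERA-FRONT/2022/12/img001.jpg",
        "CAMERA-FRONT/img002.jpg",
        "CAMERA-BACK/2022/12/img003.jpg",
        "CAMERA-FRONT/notes.txt",
    ]
    datasets = {
        "front": "CAMERA-FRONT/**/*.jpg",
        "back": "CAMERA-BACK/**/*.jpg",
    }
    for name, matched in group_by_dataset(keys, datasets).items():
        print(name, matched)
```

Each key lands in at most one group here because the two patterns start with different directory names; the text file matches neither pattern and is left out of both datasets.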

Pipelines

A pipeline is an abstraction that captures the ingest processing routines. A pipeline typically has feature extraction, thumbnail generation and feature summarization stages.

The pipeline is triggered with the following command:

adectl run -n <dataset-name> -i <directory-with-input-objects>
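
When ingestion is scripted, the same command can be assembled programmatically. The sketch below is one possible way to do that in Python and uses only the -n and -i options shown above. The helper build_adectl_run, the dataset name, and the input path are hypothetical, and the actual call is left commented out because it requires adectl to be installed.

```python
import shlex
from typing import List


def build_adectl_run(dataset_name: str, input_dir: str) -> List[str]:
    """Build the argument list for 'adectl run -n <name> -i <dir>'."""
    if not dataset_name or not input_dir:
        raise ValueError("dataset name and input directory are required")
    return ["adectl", "run", "-n", dataset_name, "-i", input_dir]


if __name__ == "__main__":
    # Hypothetical dataset name and input directory.
    cmd = build_adectl_run("camera-front", "/data/CAMERA-FRONT")
    print(shlex.join(cmd))
    # On a machine where adectl is installed, the command could be run with:
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a dataset name or path contains spaces.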

Running this command registers features and produces a catalog. To access the catalog, use the 'Catalog' button on the dataset card, as shown below.

[Figure: dataset-catalog-access]

Coming Soon - Pipeline customization
The default pipeline may not be the best fit for data in every domain. In upcoming releases, the pipeline abstraction will be extended to support user-provided featurizers.

Data Visualization Job

Once data is ingested, you can create a data visualization job by browsing the catalog, which is opened with the 'Catalog' button on the dataset card as shown in the previous section.

[Figure: Catalog browsing to create a data visualization job]

[Figure: Visualization job before submission]

[Figure: Visualization job]

Resultset

The data visualization UI provides capabilities to explore, drill down into, and curate the data using cluster views, nearest neighbour searches and similarity searches. The curated subset of the data objects is referred to as a resultset. A resultset can be downloaded to a local directory or exported to an S3 bucket, an Azure blob store, or a Google Cloud Storage bucket for downstream processing such as labelling or machine learning training.
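
As a rough illustration of what a nearest neighbour search over extracted features involves, the following Python sketch ranks objects by cosine similarity to a query feature vector. This is a generic technique and not necessarily the one used by Data Explorer; the function names, file names, and two-dimensional feature vectors are all made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, features, k=3):
    """Return the names of the k objects most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in features.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


if __name__ == "__main__":
    # Hypothetical feature vectors keyed by object name.
    features = {
        "a.jpg": [1.0, 0.0],
        "b.jpg": [0.9, 0.1],
        "c.jpg": [0.0, 1.0],
    }
    print(nearest_neighbours([1.0, 0.0], features, k=2))
```

A curated resultset could then be thought of as the list of object names selected through searches like this one.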

Clustering and Embedding

When a job is submitted, low-dimensional representations and coresets of the data are used to cluster the data objects, which enables exploration and curation.
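
This article does not specify which coreset or clustering algorithms are used. As one common approach, the Python sketch below picks a small set of representative points by greedy farthest-point (k-center) selection and then assigns every point to its nearest representative. The two-dimensional points stand in for low-dimensional representations, and the function names are hypothetical.

```python
import math


def farthest_point_coreset(points, k):
    """Greedy k-center selection: return indices of up to k representatives."""
    if not points or k <= 0:
        return []
    chosen = [0]  # start from the first point (an arbitrary choice)
    # Distance from every point to its closest chosen representative.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(idx)
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return chosen


def assign_clusters(points, representatives):
    """Label each point with the position of its nearest representative."""
    return [
        min(
            range(len(representatives)),
            key=lambda j: math.dist(p, points[representatives[j]]),
        )
        for p in points
    ]


if __name__ == "__main__":
    # Hypothetical low-dimensional representations of five objects.
    points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (0.0, 10.0)]
    reps = farthest_point_coreset(points, 3)
    print(reps)
    print(assign_clusters(points, reps))
```

Working from a few representatives instead of the full set is what keeps interactive cluster views practical on larger datasets, at the cost of some approximation.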

