- 21 Dec 2022
- 6 Minutes to read
- Updated on 21 Dec 2022
Akridata Data Explorer consists of two major components:
- A web portal accessible through a web browser.
- A command line tool named adectl and a set of feature extraction Docker images. This software is installed on a user-provisioned Linux or Mac machine. It reads data from one of the supported cloud data stores or the local file system, extracts features, and uploads these features to the web portal. Additionally, adectl supports ingesting externally generated features and catalog information provided through comma-separated value (CSV) files.
Local mode setup
For evaluation purposes, the software that runs on the web portal can be installed on the same user-provisioned Linux/Mac machine where adectl is installed. This setup supports limited scale and capabilities, but it ensures that extracted features and thumbnails stay on the user-provisioned machine.
The following user roles are available:
- Organization Admin: A super user who has access to all capabilities on the web portal. The following capabilities are available only to this role:
  - Registering secrets (credentials) to access data and catalog stores.
  - Registering data stores (containers) and external catalogs by providing the necessary URLs and secrets.
  - User and group management.
- User: This role maps to a data engineer or data scientist persona who is responsible for:
  - Defining datasets: specifying the dataset and the choice of pre-processing and featurization to run on the objects in the dataset.
  - Creating different types of data analysis jobs to curate data objects and create a resultset.
- Finance Admin: This role maps to a finance team member responsible for keeping track of invoices and payments.
A container describes a storage location from which data is ingested into the system. A container can be an S3 bucket, an Azure blob store, a Google Cloud Storage bucket, or a directory on the local file system. The container is registered through the web UI, where the user provides details such as the endpoint URL and credentials.
A dataset is an entity that specifies a selector on the contents of the container. A dataset can be of Image or Video type.
For example, suppose an S3 bucket has two directories, CAMERA-FRONT and CAMERA-BACK, containing images from the front and back cameras respectively, and a different feature extraction model is most appropriate for each camera. In that case, you can define two datasets with the glob patterns CAMERA-FRONT/**/*.jpg and CAMERA-BACK/**/*.jpg to logically group the images from the two cameras.
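To make the selector idea concrete, the following is a minimal sketch of how the two glob patterns split a directory tree into two groups. It uses Python's `pathlib` globbing on a temporary local directory as a stand-in for the bucket. The dataset names `camera-front` and `camera-back`, the sample file names, and the helper functions are hypothetical, and the exact glob semantics applied by Data Explorer may differ from `pathlib`'s.

```python
import tempfile
from pathlib import Path

# Hypothetical dataset names mapped to the glob patterns from the example.
DATASET_PATTERNS = {
    "camera-front": "CAMERA-FRONT/**/*.jpg",
    "camera-back": "CAMERA-BACK/**/*.jpg",
}


def select_objects(root: Path, pattern: str) -> list[str]:
    """Return the sorted relative paths of files under root matching pattern."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


def build_sample_tree(root: Path) -> None:
    """Create a few empty files that mimic the layout in the example."""
    for rel in (
        "CAMERA-FRONT/2022-12-01/a.jpg",
        "CAMERA-FRONT/b.jpg",
        "CAMERA-BACK/2022-12-01/c.jpg",
        "CAMERA-BACK/notes.txt",  # not a .jpg, so no pattern selects it
    ):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()


def demo() -> dict[str, list[str]]:
    """Build the sample tree and return the objects selected per dataset."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_sample_tree(root)
        return {
            name: select_objects(root, pattern)
            for name, pattern in DATASET_PATTERNS.items()
        }


if __name__ == "__main__":
    for name, objects in demo().items():
        print(name, objects)
```

Each pattern selects only the .jpg files under its own directory, at any depth, so the two datasets never overlap and each can be paired with its own feature extraction model.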
A pipeline is an abstraction that captures the ingest processing routines. A pipeline typically has feature extraction, thumbnail generation and feature summarization stages.
The pipeline is triggered through the following command.
adectl run -n <dataset-name> -i <directory-with-input-objects>
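When ingestion is scripted, the command can be assembled programmatically. The sketch below builds the argument list using only the `-n` and `-i` flags shown above; the dataset name `camera-front` and the directory `/data/CAMERA-FRONT` are placeholder values, and the helper `build_run_command` is hypothetical rather than part of adectl.

```python
import shlex


def build_run_command(dataset_name: str, input_dir: str) -> list[str]:
    """Build the argument list for `adectl run` with the -n and -i flags."""
    return ["adectl", "run", "-n", dataset_name, "-i", input_dir]


# Placeholder dataset name and input directory, for illustration only.
cmd = build_run_command("camera-front", "/data/CAMERA-FRONT")

# Print the command in shell-quoted form for review before running it.
print(shlex.join(cmd))

# On a machine where adectl is installed, the command could be started with:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a dataset name or directory contains spaces.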
The above processing registers features and produces a catalog. The catalog is accessed using the 'Catalog' button on the dataset card as shown below.
Data Visualization Job
Once data is ingested, a data visualization job can be created by browsing the catalog through the 'Catalog' button in the dataset card as shown in the previous section.
The data visualization UI provides capabilities to explore, drill down into, and curate the data using cluster views, nearest neighbour searches and similarity searches. The curated subset of the data objects is referred to as a resultset. A resultset can be downloaded to a local directory or exported to an S3 bucket, an Azure blob store, or a Google Cloud Storage bucket for downstream processing such as labelling or machine learning training.
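As an illustration of what a nearest neighbour search over extracted features involves, here is a minimal sketch that ranks objects by cosine similarity to a query object. This is a generic technique, not a description of how Data Explorer implements its searches; the feature vectors, object names, and function names are made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbours(
    query_id: str, features: dict[str, list[float]], k: int
) -> list[str]:
    """Return the ids of the k objects most similar to the query object."""
    query = features[query_id]
    scored = [
        (cosine_similarity(query, vec), obj_id)
        for obj_id, vec in features.items()
        if obj_id != query_id
    ]
    # Highest similarity first; ties are broken by object id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [obj_id for _, obj_id in scored[:k]]


# Made-up three-dimensional feature vectors, for illustration only.
FEATURES = {
    "front_001.jpg": [0.9, 0.1, 0.0],
    "front_002.jpg": [0.8, 0.2, 0.1],
    "back_001.jpg": [0.1, 0.9, 0.2],
    "back_002.jpg": [0.0, 0.8, 0.3],
}

if __name__ == "__main__":
    print(nearest_neighbours("front_001.jpg", FEATURES, k=2))
```

In this toy data the closest match to `front_001.jpg` is `front_002.jpg`. A curated selection built from such matches corresponds to what the text calls a resultset.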
Clustering and Embedding
When a job is submitted, low-dimensional representations and coresets of the data objects are used to cluster them, which enables exploration and curation.
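The source does not name the clustering algorithm, so the sketch below uses plain k-means as a stand-in to show how low-dimensional points can be grouped. Coreset construction is omitted, the two-dimensional embeddings are made up, and the first-k-points initialisation is a simplification chosen to keep the example deterministic.

```python
Point = tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(
    points: list[Point], k: int, iterations: int = 10
) -> tuple[list[int], list[Point]]:
    """Cluster points with k-means; returns (labels, centroids)."""
    # Naive initialisation: use the first k points as starting centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Made-up two-dimensional embeddings forming two well-separated groups.
EMBEDDINGS = [
    (0.1, 0.2), (5.0, 5.1), (0.2, 0.1),
    (5.2, 4.9), (0.0, 0.3), (4.8, 5.0),
]

if __name__ == "__main__":
    labels, centroids = kmeans(EMBEDDINGS, k=2)
    print(labels)
    print(centroids)
```

Points that share a label would appear together in a cluster view, which is what makes it practical to inspect and curate groups of similar objects at once.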