System components
- Updated on 15 Jul 2022
The Akridata System enables the construction, deployment, and operation of user-defined edge-to-core data pipelines.
Data pipelines encapsulate data management operations: the ingestion of data close to its point of generation, the transformation and selection of data subsets according to user requirements, the prioritized transfer of the selected subsets to one or more target locations, and finally the storage and access mechanisms used to interact with the data. For example, in the autonomous vehicle domain, a data pipeline can specify how sensor data generated by a test vehicle equipped with camera, radar, and LiDAR sensors is ingested, transferred, stored, and accessed by data scientists building autonomous vehicle perception and planning models.
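The stages described above can be sketched as a chain of small functions. This is only an illustrative model, not Akridata's pipeline specification format: the `Record` type, the stage helpers, and the priority values are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Record:
    """One unit of ingested sensor data (all fields are illustrative)."""
    sensor: str
    size_bytes: int
    priority: int = 0  # higher values are transferred first


# A stage consumes a stream of records and yields a new stream.
Stage = Callable[[Iterable[Record]], Iterable[Record]]


def select_sensors(wanted: set) -> Stage:
    """Selection stage: keep only records from the requested sensors."""
    def stage(records: Iterable[Record]) -> Iterable[Record]:
        return (r for r in records if r.sensor in wanted)
    return stage


def prioritize(urgent: set) -> Stage:
    """Prioritization stage: mark records from urgent sensors for early transfer."""
    def stage(records: Iterable[Record]) -> Iterable[Record]:
        for r in records:
            yield Record(r.sensor, r.size_bytes, 10 if r.sensor in urgent else 1)
    return stage


def run_pipeline(source: Iterable[Record], stages: List[Stage]) -> List[Record]:
    """Apply each stage in order, then return records in transfer order."""
    data = source
    for stage in stages:
        data = stage(data)
    # sorted() is stable, so equal-priority records keep their ingest order.
    return sorted(data, key=lambda r: r.priority, reverse=True)


ingested = [
    Record("camera", 4_000_000),
    Record("radar", 200_000),
    Record("lidar", 9_000_000),
    Record("gps", 1_000),
]
result = run_pipeline(
    ingested,
    [select_sensors({"camera", "radar", "lidar"}), prioritize({"lidar"})],
)
print([r.sensor for r in result])
```

In this sketch the GPS record is dropped by the selection stage, and the LiDAR record moves to the front of the transfer order.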
The Akridata System consists of the following components:
- Edge Microsite(s) - These systems are intended to be deployed in edge data centers close to the data sources, and provide functionality for ingesting source data into the pipeline. The Akridata System also supports ingestion of data using Soft Edges, which are non-physical manifestations (e.g. virtual machines, cloud instances) of the microsite functionality.
- (External) Catalogs and Data Stores - The Akridata System routes data collected from one or more edge microsites into data sinks: a catalog store for meta-data, and one or more tiers of scalable-capacity data stores for data objects. A key feature of the Akridata System is that data pipelines produce meta-data and data objects and keep the two in sync.
- Akridata Cloud Software Services - These services together host the management and control plane, which orchestrates operations across multiple data pipelines. The Akridata Cloud Software services also provide simplified, standard methods of access to the meta-data and data objects produced by a pipeline, insulating users from details about how the data is stored and whether or not it is directly accessible.
The figure below provides an overview of the Akridata System components. It illustrates a setup with an Edge Microsite cluster deployed in a corporate data center, a Soft Edge cluster deployed on AWS, and the catalog and data stores hosted on AWS using the corresponding database and storage services.
Edge Microsite
A microsite is physical hardware deployed in an edge location close to where data is collected. In the autonomous vehicle domain, the edge location could be a parking garage, service center, charging facility, or mini data center where data is ingested.
A microsite typically consists of the following sub-components:
- Ingest Dock: A JBOD with slots for hot-swappable hard disk drives (HDDs) or solid state drives (SSDs). This choice is aligned with the requirements of the autonomous vehicle domain, where test vehicles produce such high volumes of data that the most efficient way to offload sensor data from the vehicle is to physically move the storage media. The Akridata System can interface with other ingest mechanisms if required, e.g., network-based ingest.
- AkriNode(s): These are standard servers that host and run Akridata’s Data Filter Framework software for executing data pipelines, and are optimized for stream processing on complex data. AkriNode(s) come in multiple configurations — Extra Small, Small, Medium, and Large — which differ in the amount of CPU, memory, and GPU resources, and provide correspondingly varied ingest and processing capabilities. Multiple AkriNode(s) in a microsite can be organized into a fault-tolerant cluster, and can cooperate to ingest from the data sources targeted by the microsite. As part of executing the data pipeline, an AkriNode performs user-defined operations for data ingest, cleaning, tagging, and prioritization, and produces meta-data and data objects that are transferred to the catalog and data stores associated with the pipeline.
- Cartridge: Data pipeline specifications govern how the produced meta-data and data objects are transferred to the data sinks. Urgent or high-priority data is transferred over the network, using standard connections between the edge data center and the target data sink environment. The Akridata System also supports the physical transfer of processed data from edge data centers using a high-density removable storage cartridge. This cartridge is integrated into the AkriNode and, when filled, can be shipped using standard delivery services to a gateway data center. At the gateway data center, which is expected to have higher-bandwidth connectivity to the target data sinks (e.g., AWS Direct Connect, Azure ExpressRoute), the cartridge is inserted into another AkriNode and its contents are uploaded over the network.
- Edge Store: A microsite can optionally be configured with edge-local storage. The Akridata System can interface with either file-system (POSIX, NFS, HDFS) or object-store (S3 compatible) forms of edge stores, and uses them to provide two capabilities:
  - Retain a subset of the data objects produced by edge processing on the edge, avoiding their transfer to and storage in the target data sink. This data can be dropped from the edge according to configured data lifecycle management policies.
  - Provide data safety by keeping a redundant copy of the data, in case a cartridge is lost or damaged in shipment.
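The choice between network transfer, cartridge shipment, and edge retention described above can be pictured as a small routing function. This is a minimal sketch under stated assumptions: the `Route` names, the `retain_on_edge` flag, and the numeric `urgent_threshold` are hypothetical, and the real policy is defined by the data pipeline specification.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    NETWORK = "network"        # sent immediately over the edge uplink
    CARTRIDGE = "cartridge"    # written to the removable cartridge for shipment
    EDGE_STORE = "edge_store"  # kept on edge-local storage, not transferred


@dataclass(frozen=True)
class DataObject:
    name: str
    priority: int               # hypothetical scale: 0 (low) to 10 (urgent)
    retain_on_edge: bool = False


def choose_route(obj: DataObject, urgent_threshold: int = 8) -> Route:
    """Pick a transfer path for one data object.

    Assumed policy: objects flagged for edge retention stay local, urgent
    objects use the network, and everything else goes on the cartridge.
    """
    if obj.retain_on_edge:
        return Route.EDGE_STORE
    if obj.priority >= urgent_threshold:
        return Route.NETWORK
    return Route.CARTRIDGE


objects = [
    DataObject("near-miss-clip.bin", priority=9),
    DataObject("routine-drive.bin", priority=2),
    DataObject("debug-trace.bin", priority=5, retain_on_edge=True),
]
for obj in objects:
    print(obj.name, "->", choose_route(obj).value)
```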
Soft Edges
The Data Filter Framework software described above can also be deployed on virtual machines (e.g., available in public cloud environments such as AWS EC2 instances or Azure VMs) or on containers/pods running in a public or private cloud Kubernetes cluster.
Physical AkriNode(s) and Soft Edge(s) are functionally equivalent and are managed in a similar fashion: both run data pipelines and can be organized into clusters of nodes. The difference is that the data sources and sinks for soft-edge configurations are expected to be network-accessible data stores (e.g., AWS S3 for an AWS EC2 based soft-edge cluster). Additionally, a soft-edge cluster can be dynamically provisioned, and scaled up or down depending on the data ingest load.
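As a rough illustration of load-based scaling, the sketch below computes how many soft-edge nodes would be needed to drain a backlog within a target time. The function name, its parameters, and the sizing formula are assumptions made for this example; they do not describe Akridata's actual scaling logic.

```python
import math


def desired_nodes(
    pending_bytes: int,
    per_node_bytes_per_hour: int,
    target_hours: float,
    min_nodes: int = 1,
    max_nodes: int = 16,
) -> int:
    """Return a node count that can ingest the backlog within target_hours.

    The result is clamped to [min_nodes, max_nodes] so the cluster never
    scales to zero and never exceeds its configured ceiling.
    """
    if per_node_bytes_per_hour <= 0 or target_hours <= 0:
        raise ValueError("throughput and target time must be positive")
    needed = math.ceil(pending_bytes / (per_node_bytes_per_hour * target_hours))
    return max(min_nodes, min(max_nodes, needed))


TB = 10**12
# 10 TB backlog, 1 TB/hour per node, drain within 2 hours.
print(desired_nodes(10 * TB, 1 * TB, target_hours=2))
```

A controller could call such a function periodically and resize the cluster when the result differs from the current node count.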
Catalogs and Data Stores
Data pipelines specify required processing on the edge sites (or on soft edges), which results in the production of meta-data and data objects (or blobs).
The Akridata System stores the meta-data in a relational database such as MySQL, enabling the user to interact with meta-data using standard and familiar SQL queries. The data objects can be stored across a multi-tier storage hierarchy, which can span one or more tiers of file-system (POSIX, NFS, HDFS) or object-store (S3 compatible) storage.
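To show what querying the catalog with standard SQL might look like, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a MySQL catalog. The `object_catalog` table, its columns, and the sample rows are hypothetical; the actual catalog schema is not described here.

```python
import sqlite3

# In-memory SQLite database standing in for the relational catalog store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE object_catalog (
        object_key   TEXT PRIMARY KEY,  -- name of the data object (blob)
        sensor       TEXT NOT NULL,
        captured_at  TEXT NOT NULL,     -- ISO 8601 timestamp
        storage_tier TEXT NOT NULL      -- tier currently holding the object
    )
    """
)
conn.executemany(
    "INSERT INTO object_catalog VALUES (?, ?, ?, ?)",
    [
        ("drive-001/lidar/0001.bin", "lidar", "2022-06-30T23:59:00", "archive"),
        ("drive-002/lidar/0001.bin", "lidar", "2022-07-02T10:00:00", "hot"),
        ("drive-002/camera/0001.jpg", "camera", "2022-07-02T10:00:01", "hot"),
        ("drive-003/lidar/0001.bin", "lidar", "2022-07-10T08:30:00", "hot"),
    ],
)

# Find LiDAR objects captured on or after 1 July 2022 and where they live.
rows = conn.execute(
    """
    SELECT object_key, storage_tier
    FROM object_catalog
    WHERE sensor = ? AND captured_at >= ?
    ORDER BY captured_at
    """,
    ("lidar", "2022-07-01"),
).fetchall()

for object_key, tier in rows:
    print(object_key, tier)
```

Because the timestamps are stored as ISO 8601 strings, plain string comparison orders them chronologically, which keeps the query portable across SQL engines.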
Both the catalog and data stores are configured by the user and provided to the Akridata System, which only requires that the services be network-accessible with the supplied access credentials. The figure above shows a catalog store comprising an AWS RDS deployment in high-availability mode across multiple availability zones. The figure also shows a multi-tier object store spanning AWS S3, AWS Glacier, and AWS Deep Archive.
The Akridata System handles the fault-tolerant and consistent transfer of meta-data and data objects between the edge and core data centers, respecting any tiering or prioritization requirements specified by the user.
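One common way to keep meta-data and data objects consistent during transfer is to record an object in the catalog only after its upload has been verified. The sketch below illustrates that general pattern with retries and a checksum check; it is not a description of Akridata's transfer protocol, and `transfer_object`, the `upload` callback, and the dictionary-based catalog are hypothetical.

```python
import hashlib
from typing import Callable, Dict


def transfer_object(
    key: str,
    payload: bytes,
    upload: Callable[[str, bytes], str],
    catalog: Dict[str, str],
    max_attempts: int = 3,
) -> bool:
    """Upload payload, then record it in the catalog only if verified.

    `upload` is assumed to return the SHA-256 hex digest of what the data
    store actually received. The catalog entry is written last, so the
    catalog never references an object that failed to arrive intact.
    """
    expected = hashlib.sha256(payload).hexdigest()
    for _ in range(max_attempts):
        try:
            stored_digest = upload(key, payload)
        except ConnectionError:
            continue  # transient failure: retry
        if stored_digest == expected:
            catalog[key] = expected
            return True
    return False


# Simulated data store whose first upload attempt fails.
store: Dict[str, bytes] = {}
calls = {"n": 0}


def flaky_upload(key: str, payload: bytes) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated network drop")
    store[key] = payload
    return hashlib.sha256(payload).hexdigest()


catalog: Dict[str, str] = {}
ok = transfer_object("drive-002/lidar/0001.bin", b"points", flaky_upload, catalog)
print(ok, calls["n"], list(catalog))
```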
Akridata Cloud Software Services
Akridata Cloud Software services form a fault-tolerant and highly available management plane for the Akridata System. The services are packaged as Kubernetes microservices, and hence can be deployed in any public cloud or in private data centers.
The main services making up the Akridata Cloud Software module include:
- AkriManager: This component provides a single pane of glass to administrators and users for management and monitoring of the Akridata System. It can be used to configure, deploy, and operate data pipelines.
- Unified Storage Gateway (USG): In general, the data objects produced by a data pipeline can be stored (or be in transit) across multiple storage tiers, which can additionally be distributed across multiple edge and core locations. The USG service provides a global namespace for these objects, which insulates the users from tracking and carrying out per-location and per-tier operations. USG provides location-independent access, eliminates redundant data transfers, and supports policy-driven data lifecycle management to optimize storage costs.
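The idea of a global namespace over multiple locations and tiers can be sketched as a lookup table that maps one logical object key to all of its physical copies. This is an illustrative model only: `GlobalNamespace`, `Replica`, the `access_cost` field, and the selection rule are hypothetical and do not represent the USG API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Replica:
    location: str     # e.g. an edge site or a cloud region (illustrative)
    tier: str         # e.g. "edge-store", "s3", "archive"
    access_cost: int  # relative cost of reading from this copy


class GlobalNamespace:
    """Maps a logical object key to every physical copy of that object."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[Replica]] = {}

    def register(self, key: str, replica: Replica) -> None:
        self._replicas.setdefault(key, []).append(replica)

    def resolve(self, key: str, caller_location: str) -> Replica:
        """Return the best copy for the caller.

        Assumed rule: prefer a copy in the caller's own location (avoiding
        a transfer), then fall back to the cheapest copy elsewhere.
        """
        replicas = self._replicas.get(key)
        if not replicas:
            raise KeyError(key)
        return min(
            replicas,
            key=lambda r: (r.location != caller_location, r.access_cost),
        )


ns = GlobalNamespace()
key = "drive-002/lidar/0001.bin"
ns.register(key, Replica("edge-garage-1", "edge-store", access_cost=1))
ns.register(key, Replica("core-aws", "s3", access_cost=2))
ns.register(key, Replica("core-aws", "archive", access_cost=10))

print(ns.resolve(key, "core-aws"))
print(ns.resolve(key, "edge-garage-1"))
```

Callers refer to the object by its key alone; which copy is read is decided inside the namespace, which is the kind of insulation from per-location and per-tier details that the USG description refers to.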