Example Workflow
Updated on 15 Jul 2022
The figure below shows an example workflow consisting of a single filter graph bound to input and output containers. This example filter graph processes camera images, runs an object detector on each image, and specifies how the resulting data transfers to blob stores are prioritized.
The filter graph consists of the following filter nodes:
- Partitioner: This filter scans the files/objects in the input container and partitions them into smaller chunks (tokens) for parallel processing by downstream filters. An example of a partitioning scheme could be to chunk input data based on the timestamp and create chunks representing a small contiguous time window (e.g. 10 seconds per chunk).
- DataReader: This filter reads files/objects in each chunk and creates primary data object blobs. Primary blobs contain raw input data.
- ObjDetector: This filter runs an object detection model and generates a count per object type detected. The filter generates one row per chunk (10 seconds in the example) in the catalog table. In addition, it generates a per-image triplet of [object detected, confidence score, area of bounding box] as a metadata blob in Arrow or Parquet format that is written out into a catalog database.
- HumanDetector: This filter runs a human detector function on all images in a chunk (10 seconds in the example) and computes the percentage of images containing humans. This percentage is used first to classify the chunk into one of two priorities, depending on whether the score crosses a threshold, and then to route the data to one of two outputs corresponding to those priorities. The priorities determine the order in which data is uploaded to destination containers, with a lower priority number representing a higher priority for upload (see the sketch after this list).
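The chunking and routing logic above can be pictured with a short sketch. The names below (Frame, CHUNK_SECONDS, HUMAN_THRESHOLD, has_human) are illustrative assumptions rather than the actual filter implementations; the sketch only shows how a Partitioner might group frames into 10-second chunks and how the HumanDetector's threshold test could map a chunk to one of two upload priorities.

```python
# Minimal sketch of the chunking and priority routing described above.
# All names here are illustrative assumptions, not the Akridata filter API.
from dataclasses import dataclass
from typing import Callable, Iterable

CHUNK_SECONDS = 10      # contiguous time window per chunk (assumed)
HUMAN_THRESHOLD = 0.5   # fraction of images with humans that triggers priority 1 (assumed)

@dataclass
class Frame:
    timestamp: float    # seconds since start of capture
    path: str           # location of the raw image in the input container

def partition(frames: Iterable[Frame]) -> dict[int, list[Frame]]:
    """Partitioner: group frames into CHUNK_SECONDS-wide chunks keyed by window index."""
    chunks: dict[int, list[Frame]] = {}
    for frame in frames:
        chunks.setdefault(int(frame.timestamp // CHUNK_SECONDS), []).append(frame)
    return chunks

def route_chunk(frames: list[Frame], has_human: Callable[[Frame], bool]) -> int:
    """HumanDetector routing: priority 1 (uploaded sooner) if the fraction of images
    containing humans crosses the threshold, otherwise priority 2."""
    human_fraction = sum(has_human(f) for f in frames) / len(frames)
    return 1 if human_fraction >= HUMAN_THRESHOLD else 2
```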
The Akridata Edge Software runtime system handles the uploading of metadata records and data object blobs to the catalog and data stores identified in the output container. The catalog data has the highest priority and is uploaded ahead of any data blobs. A derived object can additionally be expanded and written as individual records into the catalog to enable faster querying.
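As a rough illustration of that upload ordering, the sketch below uses a priority queue in which catalog records carry the lowest priority number and are therefore uploaded ahead of data blobs, which follow in ascending priority-number order. The UploadItem type and the blob names are assumptions made for illustration, not the Akridata runtime's actual scheduler.

```python
# Illustrative upload ordering: catalog records first, then blobs by priority number.
import heapq
from dataclasses import dataclass, field

CATALOG_PRIORITY = 0  # assumed: catalog metadata always goes ahead of data blobs

@dataclass(order=True)
class UploadItem:
    priority: int                     # lower number = uploaded sooner
    name: str = field(compare=False)  # blob or catalog-record identifier (illustrative)

queue: list[UploadItem] = []
heapq.heappush(queue, UploadItem(2, "chunk-0042/raw-images.blob"))        # lower priority
heapq.heappush(queue, UploadItem(CATALOG_PRIORITY, "chunk-0042/catalog-rows"))
heapq.heappush(queue, UploadItem(1, "chunk-0042/human-detections.blob"))  # higher priority

while queue:
    item = heapq.heappop(queue)
    print(f"uploading {item.name} (priority {item.priority})")
```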
A workflow produces as its outputs (1) metadata entries that are added to relational database tables in a catalog; and (2) raw data and derived object blobs that can be categorized and uploaded to different destination data stores at different priorities. The names of the data object blobs are stored in the catalog, which enables the catalog to act as a gateway to accessing the data set.
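The sketch below illustrates, under assumed table and column names (chunk_catalog, primary_blob, metadata_blob), how catalog rows can reference blob names so that the catalog acts as a gateway to the data set; the real Akridata catalog schema may differ.

```python
# Sketch of catalog rows referencing data object blobs (schema names are assumptions).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE chunk_catalog (
        chunk_id       TEXT PRIMARY KEY,
        start_ts       REAL,     -- start of the 10-second window
        object_counts  TEXT,     -- per-object-type counts from ObjDetector
        priority       INTEGER,  -- upload priority assigned by HumanDetector
        primary_blob   TEXT,     -- name of the raw-data blob in the data store
        metadata_blob  TEXT      -- name of the Arrow/Parquet detection blob
    )
""")
conn.execute(
    "INSERT INTO chunk_catalog VALUES (?, ?, ?, ?, ?, ?)",
    ("chunk-0042", 420.0, '{"person": 3, "car": 1}', 1,
     "blobs/chunk-0042/raw.bin", "blobs/chunk-0042/detections.parquet"),
)
# Querying the catalog returns blob names, which is how the catalog acts as a
# gateway to retrieving the underlying data objects.
print(conn.execute("SELECT primary_blob FROM chunk_catalog WHERE priority = 1").fetchall())
```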
Accessing Workflow Data
As the workflow executes, catalog entries and high-priority data begin appearing in real time in the centralized catalog and data stores, respectively.
The user can browse the catalog using the AkriManager UI and can optionally retrieve selected blobs to operate on them, using a populate operation. The populate operation encapsulates all of the complexity of retrieving the requested data objects, including scenarios where the objects reside in a storage tier that is not directly accessible, or where the requested data is still in transit from the edge location to the core storage tiers. The operation results in a request being sent to the Akridata USG service, which orchestrates the movement of data from wherever it resides to tier 1 storage for any downstream processing the user intends to execute. This operation can happen while the edge is still ingesting data, and it is optimized to support concurrent access to an overlapping set of objects from multiple users.
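A hypothetical sketch of the populate flow from a client's point of view is shown below. The function and parameter names (populate, request_blob_copy, is_available) are invented for illustration and are not the AkriManager or USG API; the point is only that the caller selects catalog rows and the service stages the referenced blobs into directly accessible storage, waiting if data is still in transit.

```python
# Hypothetical populate flow; names and signatures are assumptions, not the real API.
import time

def populate(catalog_rows, request_blob_copy, is_available, poll_seconds=30):
    """Ask the data service to stage the blobs referenced by the selected catalog
    rows into directly accessible (tier 1) storage, then wait until they arrive."""
    blob_names = [row["primary_blob"] for row in catalog_rows]
    request_blob_copy(blob_names)                 # orchestration request (e.g., to USG)
    while not all(is_available(name) for name in blob_names):
        time.sleep(poll_seconds)                  # blobs may still be in transit from the edge
    return blob_names
```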
Workflow Data Lifecycle
After the workflow finishes execution, its raw and derived data objects continue to be transferred to the specified destination data stores. The transfer is determined by user-defined policies for transfer priority and storage tiering. The Akridata system includes mechanisms that track the completion of this activity, including handling any error recovery. Depending on the chosen prioritization and data transfer policies (e.g., network-based or physical transfer), it may take anywhere from a few hours to several days for all of the workflow data objects to be stored in their eventual destinations.
As described above, all of the data objects are accessible as soon as the global catalog is updated. This is possible even while the workflow is running, independent of whether or not the data transfer operations have completed. As users browse the catalog and request specific objects, the Akridata system handles the inter-tier and inter-site movement of data to satisfy the request in the most efficient fashion, including creating a copy of the objects when appropriate. The lifecycle of such object copies created on demand is managed entirely by the Akridata system — copies exist for a certain lease period, which can be extended by the user, and are automatically deleted upon expiration of the lease.
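The lease bookkeeping for these on-demand copies can be sketched as below. The class, method names, and the default lease length are assumptions made for illustration; they mirror the behavior described above, in which a copy carries an expiry time, the lease can be extended by the user, and expired copies are removed automatically.

```python
# Sketch of lease-managed, on-demand object copies (names and defaults are assumptions).
from datetime import datetime, timedelta, timezone

class LeasedCopy:
    def __init__(self, blob_name: str, lease_days: int = 7):  # 7-day lease is assumed
        self.blob_name = blob_name
        self.expires_at = datetime.now(timezone.utc) + timedelta(days=lease_days)

    def extend(self, extra_days: int) -> None:
        """User-requested lease extension."""
        self.expires_at += timedelta(days=extra_days)

    def expired(self) -> bool:
        return datetime.now(timezone.utc) >= self.expires_at

def reap(copies: list[LeasedCopy]) -> list[LeasedCopy]:
    """Drop copies whose lease has expired; the system deletes these automatically."""
    return [c for c in copies if not c.expired()]
```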
On a longer timescale, the Akridata system provides three mechanisms to enable tiered storage capacity management:
- Data container-level auto-tiering policies, which govern how data objects in a tier are either demoted to a lower tier or permanently deleted as they age. The auto-tiering policy associated with a data container can be overridden for a workflow using that container (a sketch of such a policy follows this list).
- Explicit inter-tier movement of data objects requested by the customer admin, which enables flexibility in retaining a subset of older objects that remain relevant.
- Archival workflows, which are a general-purpose mechanism that can be used in conjunction with the auto-tiering policies above to retain and optionally transform data objects before writing them to new storage tiers. For example, such workflows can be used to replace older data objects with lower resolution versions so as to save on storage costs without completely losing access to the information contained in those objects.
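An age-based auto-tiering policy of the kind described in the first mechanism can be sketched as follows. The policy structure, thresholds, and override example are assumptions rather than the actual Akridata policy format; the sketch only shows objects being demoted after one age threshold, deleted after another, and a workflow overriding its container's default policy.

```python
# Illustrative age-based auto-tiering policy (structure and thresholds are assumptions).
from dataclasses import dataclass

@dataclass
class AutoTierPolicy:
    demote_after_days: int  # age at which an object moves to the next lower tier
    delete_after_days: int  # age at which an object is permanently deleted

def apply_policy(age_days: int, policy: AutoTierPolicy) -> str:
    """Return the action the policy would take for an object of the given age."""
    if age_days >= policy.delete_after_days:
        return "delete"
    if age_days >= policy.demote_after_days:
        return "demote"
    return "keep"

# A workflow can override the policy attached to its data container, e.g. keep its
# objects in the current tier longer than the container default.
container_default = AutoTierPolicy(demote_after_days=30, delete_after_days=365)
workflow_override = AutoTierPolicy(demote_after_days=90, delete_after_days=365)
print(apply_policy(45, container_default), apply_policy(45, workflow_override))
```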