Pipeline operations on a dataset


As described in the Register dataset article, you can attach pipelines to a dataset when the dataset is created. This article describes how to attach and detach pipelines after the dataset has been registered.

Attach pipeline to the dataset

On the dataset listing page, the 3-dots button on the dataset card shows a list of operations available on the dataset.

Execute Pipeline

This option lets you run the selected pipeline to ingest data into the dataset.

  1. Click the 3-dots icon and select Pipeline > Execute pipelines.
    The Ingest Now window opens, displaying the Base URL of the pipeline. This field is not editable.

  2. Enter the subdirectory path.
    Leave this field blank to ingest all data, or provide the path to the subfolder from which the data should be ingested (see the example after these steps).

  3. Select the maximum number of files that should be ingested.

  4. Click Ingest.
    The application starts the ingestion process.
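The following minimal sketch illustrates how these inputs combine. The bucket name, subdirectory, and file limit are hypothetical values used only for illustration; they are not taken from this article or the product.

    # Minimal sketch with hypothetical values: the Base URL, subdirectory, and file limit are assumptions.
    from posixpath import join

    base_url = "s3://example-bucket/raw-images"   # shown read-only in the Ingest Now window
    subdirectory = "2024/batch-01"                # leave empty ("") to ingest everything under the Base URL
    max_files = 500                               # maximum number of files to ingest in this run

    ingest_prefix = join(base_url, subdirectory) if subdirectory else base_url
    print(f"Ingesting up to {max_files} files from {ingest_prefix}")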

Attach Pipeline 

  1. Select Pipeline > Attach pipelines.

  2. In the Attach Pipeline screen, select the pipeline from the drop-down list of available pipelines. Starred pipelines are the recommended pipelines.

  3. Select the policy for this attachment. The policy determines the mode of ingestion (scheduled vs. triggered) and the compute resources on which ingestion runs.

    1. Schedule policy (BETA): Ingestion is triggered according to the provided schedule on the selected cluster. Currently, only one pre-provisioned cluster, 'AkridataEdgeCluster', is available for selection; this list will be extended with user-registered clusters in the future. The schedule is specified as a cron string (see the example after these steps).

    2. On-demand policy (BETA): In this mode, ingestion is triggered by the user as needed on the selected cluster.

    3. Manual adectl run: In this mode, the compute resource for ingestion is provisioned by the user, and ingestion must be triggered using the adectl command-line utility.

  4. Click the Attach button.
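If you choose the Schedule policy, the schedule is expressed as a standard five-field cron string (minute, hour, day of month, month, day of week). The strings below are illustrative examples only, not schedules taken from this article.

    # Illustrative cron strings for a Schedule policy (hypothetical schedules).
    # Field order: minute  hour  day-of-month  month  day-of-week
    hourly_ingest  = "0 * * * *"      # at the top of every hour
    nightly_ingest = "30 2 * * *"     # every day at 02:30
    weekday_ingest = "0 9 * * 1-5"    # 09:00 Monday through Friday

    print(hourly_ingest, nightly_ingest, weekday_ingest)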

View pipeline attachment details

The list of attachments for a dataset can be viewed on the dataset details page.

  • On the dataset card, click View.

  • On the dataset page, click the PIPELINE tab.
    This opens the dataset details page with a listing of all pipelines attached to the dataset, the featurizer type, and the policy attached to each pipeline.

Operations on pipeline attachments

Under the Actions column, you can perform the following:

  1. For Schedule and On-demand policy attachments, click the View Details arrowhead to view details of the last ingestion session that was scheduled or triggered by the user. The details section shows the progress percentage (for in-progress sessions) and other details.

  2. Click Catalog to view the catalog for the pipeline.

  3. For Schedule and On-demand policy attachments, click Ingest to execute the pipeline against the Base URL.
    You can specify the subdirectory and the maximum number of files for ingestion.

  4. Click the 3-dots icon to perform the following:

    1. Detach the pipeline.

    2. Edit the policy attached to the pipeline.

Detach pipeline from dataset

  1. On the dataset card, click the 3-dots icon and select Detach Pipeline.

  2. Select the pipeline to detach from the drop-down list of attached pipelines.

  3. Click the Detach button.

Ingested data stays after the 'Detach' operation

Any data ingested by the detached pipeline will remain in the system and be accessible for catalog browsing and job creation. The detached pipeline will not be executed on new data. If the same pipeline is reattached, all data that entered the dataset while the pipeline was detached will be processed through the reattached pipeline.