Ingesting data features
  • 21 Dec 2022


Data features are ingested into Data Explorer using the adectl run command. The most common usage is shown below.

adectl run -n <dataset-name> -i <data-directory>
  1. <dataset-name> is the name provided when the dataset was created in the web portal UI.
  2. <data-directory> is an optional parameter. When used, it must be set as follows:
    1. For ingestion from the local file system, it must be the path to the subdirectory to be ingested, within the directory configured with the -i option of adectl config.
    2. For S3/Azure/GCP, it is a subdirectory within the URL configured for the container. For example, if s3://bucket1/ is configured as the container URL and s3://bucket1/photos1 should be ingested, this option must be set to photos1.
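
As an illustration of the S3 example above, the following shell sketch derives the value for -i by stripping the container URL from the full location to be ingested. The function name subdir_for_ingest is hypothetical and not part of adectl; it only performs string manipulation and assumes the container URL is written with its trailing slash, as in the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of adectl): derive the -i value by removing
# the configured container URL prefix from the full URL to be ingested.
# Assumption: the container URL is given with its trailing slash.
subdir_for_ingest() {
    container_url=$1   # e.g. s3://bucket1/
    full_url=$2        # e.g. s3://bucket1/photos1
    case "$full_url" in
        "$container_url"*)
            # Strip the container prefix, then any leftover leading slash.
            rel=${full_url#"$container_url"}
            printf '%s\n' "${rel#/}"
            ;;
        *)
            echo "error: $full_url is not inside $container_url" >&2
            return 1
            ;;
    esac
}
```

With the values from the example, subdir_for_ingest s3://bucket1/ s3://bucket1/photos1 prints photos1, which is the value to pass as -i.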

The following variations of the above command exist:

  1. If the dataset name is not unique, -n will fail and the -d option must be used instead to run the command with a dataset ID. The dataset ID is available on the web UI.
  2. The -w option runs the command in wait mode, in which the command waits until all processing is complete.
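
To make the variations concrete, here is a minimal shell sketch that prints the command line for either form. The function name ingest_command is hypothetical and not part of adectl, and the sketch assumes that -w is a plain switch that takes no value, which this article does not state explicitly. It prints the command rather than running it, so the result can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper (not part of adectl): print the adectl run command for
# a dataset identified by name (-n) or by ID (-d), optionally in wait mode.
# Assumption: -w takes no argument.
ingest_command() {
    id_flag=$1      # "-n" for a dataset name, "-d" for a dataset ID
    dataset=$2      # the name or ID itself
    wait_mode=$3    # "wait" to append -w, anything else to omit it
    case "$id_flag" in
        -n|-d) ;;
        *) echo "error: first argument must be -n or -d" >&2; return 1 ;;
    esac
    cmd="adectl run $id_flag $dataset"
    if [ "$wait_mode" = "wait" ]; then
        cmd="$cmd -w"
    fi
    printf '%s\n' "$cmd"
}
```

For instance, ingest_command -d <dataset-id> wait prints the -d form with -w appended, where <dataset-id> stands for the ID copied from the web UI.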

Ingesting features for a dataset with External featurizer

If the Featurizer Type is set to External when creating the dataset, a CSV file containing the features must be passed to the adectl run command with the -f option. Refer to External features preparation for details on the structure of the featurizer CSV file.

Ingesting features for a WMTS dataset

If the WMTS option is selected when creating the dataset, the input container (such as an S3 bucket), or the data directory if the -i option is used, is expected to contain JSON files that follow the structure described in WMTS JSON file.

Monitor Ingest Progress

To monitor the progress of the background process, run the following command:

adectl show



The output has the following information:

  1. Dataset name and other details
  2. Status:
    1. RUNNING: In progress
    2. COMPLETED: Completed successfully
    3. FAILED: An error occurred
  3. Progress: Shows the percentage progress based on the number of partitions processed. Processing divides the source data into multiple partitions so that they can be processed in parallel. Note that the total number of partitions is updated progressively, so the percentage may appear to decrease between two successive snapshots of the output. For example, it is expected behavior to see Progress=(2/5 partitions) at time T0 and Progress=(3/8 partitions) at time T0+N seconds.
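
The arithmetic behind that example can be checked with a short shell sketch. The function name progress_percent is hypothetical and not an adectl feature; it takes the processed and total partition counts as plain integers and does not parse the output of adectl show.

```shell
#!/bin/sh
# Hypothetical helper (not part of adectl): integer percentage of partitions
# processed, rounded down. Returns 0 when the total is not yet known.
progress_percent() {
    processed=$1
    total=$2
    if [ "$total" -le 0 ]; then
        echo 0
        return 0
    fi
    echo $(( $processed * 100 / $total ))
}

progress_percent 2 5   # snapshot at T0: prints 40
progress_percent 3 8   # snapshot at T0+N: prints 37
```

More partitions were processed in the second snapshot (3 versus 2), yet the percentage is lower because the known total grew from 5 to 8.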

In case of a FAILED status, run the following command to get more information on the error.

adectl show -e
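
A script that decides when to ask for error details could classify the status first. The sketch below is a minimal illustration: the function name ingest_state is hypothetical, and it assumes that the state names listed above (RUNNING, COMPLETED, FAILED) appear verbatim in the text it reads, which may not match the exact layout of the adectl show output.

```shell
#!/bin/sh
# Hypothetical helper (not part of adectl): read status text on standard
# input and report which documented state it mentions.
# Assumption: the state name appears verbatim in the text. FAILED is checked
# first so that an error is never masked by another state name.
ingest_state() {
    text=$(cat)
    case "$text" in
        *FAILED*)    echo FAILED ;;
        *COMPLETED*) echo COMPLETED ;;
        *RUNNING*)   echo RUNNING ;;
        *)           echo UNKNOWN ;;
    esac
}
```

One possible use is to pipe the output of adectl show into ingest_state and run adectl show -e only when it prints FAILED.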

You can cancel the data ingestion process using the following command:

adectl abort


