Register dataset

A dataset is a collection of image or video files hosted in a cloud bucket (S3, Azure, or Google Cloud) or present on your local machine. Registering a dataset records the URL (or path) to the data and the credentials needed to access it. Data processing on a dataset is encapsulated as a pipeline attached to the dataset. Dataset registration and pipeline attachment come in the following variants.

  1. Fully managed mode (recommended): The data is in a cloud bucket connected to Data Explorer, and pipelines are executed on compute resources that are provisioned automatically.
  2. adectl command line tool: If the data is present on your machine or security policies disallow data access over the Internet, the data can be processed using the adectl command line tool. adectl is supported on Linux and Mac systems and requires a Docker runtime. Please see adectl for more details.
  3. Python SDK: For reasons and environments similar to those that call for adectl, data can be registered from a Python program using the Data Explorer Python SDK. This mode supports a basic subset of pipelines. Please see SDK for more details. A purely illustrative sketch follows this list.
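
The sketch below is only meant to convey the shape of SDK-based registration. The module, client class, method names, and parameters (dataexplorer_sdk, DataExplorerClient, register_dataset, and so on) are hypothetical placeholders, not the actual Data Explorer SDK API; refer to the SDK documentation for the real interface.

    # Hypothetical sketch only: module, class, and method names below are
    # placeholders, NOT the real Data Explorer SDK API. Consult the SDK docs.
    from dataexplorer_sdk import DataExplorerClient  # placeholder import

    client = DataExplorerClient(api_key="YOUR_API_KEY")  # placeholder authentication

    dataset = client.register_dataset(
        name="factory-camera-frames",         # identifiable dataset name
        data_type="image",                     # "image" or "video"
        source_path="/data/factory/cameras",   # local directory containing the files
        glob="*.jpg",                          # pattern selecting files to ingest
    )

    client.process_dataset(dataset.id)         # attach and run a basic pipeline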

The video below demonstrates the steps for registering a fully managed dataset (mode one above). The flow assumes an OrganizationAdmin user who can create containers and secrets inline. If you don't have OrganizationAdmin privileges, your organization admin must register a container (data bucket) before you can register a dataset.


Detailed steps

  1. To register a dataset, click the '+Add Dataset' button on the top right of the Datasets page.
  2. A form like the one below opens.
  3. Enter the fields described below.
    1. Dataset Name: An identifiable name for the dataset.
    2. Data Type
      1. Select Image or Video as the type of data in the dataset. This prefills the 'Glob' field with a pattern covering the common file extensions for the selected type. Edit the 'Glob' field if you have files with an extension not covered by the default value (see the glob example after these steps).
    3. Select Source Container: This container represents the source data location, such as an S3 bucket, Azure Blob Storage bucket, etc. Registered containers are presented in a drop-down so you can select an existing container.
      1. If you have OrganizationAdmin privileges, you can register a new container inline using the '+Add Container' option from the dropdown.
        1. Enter an identifiable name for the container.
        2. Select store type from the drop-down list.
        3. Enter the URI to the top-level data directory in the bucket. All files that match the glob pattern under this data directory will become part of the dataset.
          1. For S3, the URI format is s3://<bucket-name>/<optional-directory-name>
          2. For Azure Blob Storage, the URI format is wasbs://<container-name>@<storage-account>.blob.core.windows.net/<optional-directory-name>, for example wasbs://akridemoedge@storagebuckets.blob.core.windows.net/<optional-directory-name>
          3. For Google Cloud Storage, the URI format is gs://<bucket-name>/<sub-directory>
        4. Select a secret that holds the access credentials. If you have OrganizationAdmin privileges, you can add a secret inline. Please refer to the Secrets page for the fields expected for each cloud type.
      2. For data that is present on the local file system, use the '+Add Local Container' option from the 'Select Container' dropdown.
  4. The other fields are specific to the data type and use cases described below.
    1. Sampling rate
      1. For Video datasets, enter a frames-per-second value in 'Sampling rate (fps)' to be used for sampling frames from the video. By default, all frames are sampled. A lower sampling rate lets a longer video fit within the system limits, such as the maximum number of frames that can be registered and the maximum number of frames in a visualization job. For example, a 10-minute video recorded at 30 fps contains 18,000 frames, but sampled at 3 fps it registers only 1,800. A sampling rate of <=5 fps is sufficient for a typical use case.
    2. WMTS: The Web Map Tile Service (WMTS) option is available when the selected Data Type is Image. It allows the input to be specified as a JSON file containing URLs hosted by a WMTS service. For WMTS, additional query parameters to be appended to each image URL can be provided; this is typically used to pass an API key to the WMTS service as a query parameter after the URL. For example, suppose a WMTS service expects tile URLs of the form https://mytile123&key=myapikey. In this case (see the query-parameter example after these steps):
      1. Key: the value specified in this field should be 'key'.
      2. Value: the value specified in this field should be 'myapikey'.
  5. Preview functionality is available for datasets defined on containers that point to cloud-hosted buckets. Click the 'Show Preview' button shown below.

  6. Review the sample list of files. By default, a pre-defined set of pipelines is recommended, as shown below. Click 'Customize' to modify the pipelines to be attached.
  7. On clicking 'Customize', the following screen is presented.
    Select one or more pipelines from the available list.
    1. The 'starred' pipelines are the recommended pipelines.
    2. Use the 'Preprocessor' or 'Featurizer' filter to narrow down to a specific pipeline.
    3. The 'Patch Featurizer' and 'Full Image Featurizer' badges indicate whether the pipeline produces features that support patch search.
    4. Once the pipelines are selected, select the policy for ingestion.
      1. Schedule policy (BETA): Ingestion is triggered according to the schedule provided for the selected cluster. Currently, only one pre-provisioned cluster, 'AkridataEdgeCluster', is available for selection; this list will be extended with user-registered clusters in the future. The schedule is specified using a cron string (see the cron example after these steps).
      2. On-demand policy (BETA): In this mode, the user triggers ingestion as needed on the selected cluster.
      3. Manual adectl run: In this mode, the compute resource for ingestion is provisioned by the user, and ingestion must be triggered using the adectl command line utility.
        Note on changing ingestion modes: If the ingestion policy is set to Schedule or On-demand, the selected pipeline attachments cannot be changed to 'Manual adectl run', and vice versa.
  8. Review and accept license terms and conditions.
  9. On successful registration of the dataset, the following popup is presented.
  10. To trigger data processing, click the 'Process Data' button.
  11. This triggers the execution of pipelines on the files in the dataset. The status of the pipelines is available on the Dataset details page, which can be accessed by clicking on the name of the dataset on the 'Datasets' page.


  12. Explore the dataset by clicking the 'Explore job' icon on the dataset card.
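
Worked example for the 'Glob' field (step 3): the short sketch below uses Python's standard fnmatch module to show how a glob pattern selects files. The pattern and file names are made up for illustration; Data Explorer applies the dataset's 'Glob' value to the files found under the container's data directory.

    # Illustration only: how a glob pattern selects files under a data directory.
    from fnmatch import fnmatch

    glob_pattern = "*.jpg"  # example pattern; edit the Glob field for other extensions

    files = [
        "cameras/line1/frame_0001.jpg",
        "cameras/line1/frame_0002.png",
        "cameras/line2/notes.txt",
    ]

    # Match against the file name portion of each path.
    selected = [path for path in files
                if fnmatch(path.rsplit("/", 1)[-1], glob_pattern)]

    print(selected)  # ['cameras/line1/frame_0001.jpg']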
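
Worked example for the WMTS Key/Value fields (step 4): the sketch below shows how a single query parameter ends up appended to an image URL, using Python's standard urllib. The base URL and API key are placeholders; Data Explorer performs this concatenation when it fetches the images.

    # Illustration only: appending a WMTS API key as a query parameter.
    from urllib.parse import urlencode

    base_url = "https://tiles.example.com/wmts/tile/1.0.0/layer/0/3/5.png"  # placeholder URL
    key, value = "key", "myapikey"  # the WMTS 'Key' and 'Value' fields

    separator = "&" if "?" in base_url else "?"
    request_url = base_url + separator + urlencode({key: value})

    print(request_url)
    # https://tiles.example.com/wmts/tile/1.0.0/layer/0/3/5.png?key=myapikey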
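
Cron string example for the Schedule policy (step 7): assuming the common five-field cron format (minute, hour, day of month, month, day of week) is accepted, the lines below illustrate two schedules; confirm the exact cron dialect accepted by the scheduler in the product UI.

    # minute  hour  day-of-month  month  day-of-week
    0 2 * * *      ingest every day at 02:00
    30 1 * * 1     ingest every Monday at 01:30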

