Select and Refine
  • 15 Mar 2023

The 'Visualize' operation shows the visualization view with the following elements and controls.

Plot View and Sampling Modes

The plot view shows the distribution of points, with colors representing the clusters and a separate outlier category for points that don't belong to any cluster. Each point in the plot view is clickable; clicking a point populates the 'selection' panel on the right with points sampled in relation to the clicked point. The plot view supports zooming in and out and panning, so you can focus on the area containing the data points of interest. Points can be sampled using the following modes:

  • KNN - Nearest neighbours around the clicked point are sampled.
  • Group - Uniform distribution of points from the group to which the clicked point belongs is sampled. The grouping criteria are per the 'Group by' option in the bottom left selection.
  • Random - A random set of points is sampled.
  • Manual - The clicked point is sampled.
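
The four modes above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the `(x, y, group)` point layout, the `sample_points` function, and the choice to exclude the clicked point from its own KNN result are all assumptions made for the example.

```python
import math
import random

def sample_points(points, clicked, mode, sample_size, seed=0):
    """Return the indices sampled for a click on points[clicked].

    points is a hypothetical list of (x, y, group) tuples, where group
    stands in for whatever attribute 'Group by' currently selects.
    """
    rng = random.Random(seed)
    cx, cy, cgroup = points[clicked]
    if mode == "manual":
        # Only the clicked point itself.
        return [clicked]
    if mode == "knn":
        # Assumption: the clicked point is not counted among its neighbours.
        others = [i for i in range(len(points)) if i != clicked]
        others.sort(key=lambda i: math.hypot(points[i][0] - cx, points[i][1] - cy))
        return others[:sample_size]
    if mode == "group":
        # Uniform draw from the group that contains the clicked point.
        members = [i for i, p in enumerate(points) if p[2] == cgroup]
        return rng.sample(members, min(sample_size, len(members)))
    if mode == "random":
        # Uniform draw from all points, ignoring the click position.
        return rng.sample(range(len(points)), min(sample_size, len(points)))
    raise ValueError(f"unknown mode: {mode}")
```

Here `sample_size` plays the role of the sample size setting in the selection panel.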

Selection (thumbnail) Panel

The right panel shows the sampled thumbnails and has the following controls:

  • Sample size setting - Controls the number of points sampled in response to a click in plot view.
  • Clear points - Clear points from the selection panel.
  • Highlight points - Highlight points in the plot view corresponding to thumbnails in the selection panel.

 

Each thumbnail has an action bar with the following actions (from left to right):

  • Remove this point from the selection.
  • Show catalog tags (only available for Analyze-type jobs).
  • Add to similarity search as a positive sample.
  • Add to similarity search as a negative sample.
  • Show the full-resolution image for this thumbnail.
  • Add to a resultset.

Detailed View

The detailed view shows a full-resolution image. In addition, for Analyze-type jobs, catalog information is presented alongside each image, as shown below.

Group-by (Color-by)

The bottom left of the view has a Group-by option that colors the points in the plot view based on one of the following attributes:

  • partition-id - Colors the plot view by the partition to which each point belongs (refer to partition). This is useful when ingested data is sorted by timestamp or another attribute, so that each partition holds neighbouring points in the sorted order. Colors spread across the plot view indicate that each partition holds a wide variety of objects, whereas colors that group together indicate that the objects in a partition are very similar.
  • weight - The data explorer uses 'coresets' to keep a subset of points as representatives, and each representative is assigned a 'weight' based on how many other points it represents.
  • Cluster (HDBSCAN) - The ID of the cluster to which the point belongs.
  • confidence - The algorithm's confidence in the cluster it assigned to the point.
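
To make the idea of a representative's 'weight' concrete, here is a minimal sketch. It assumes that every point is represented by its nearest representative, which this article does not state; how the data explorer actually builds its coresets may differ, and `representative_weights` is a hypothetical helper.

```python
import math
from collections import Counter

def representative_weights(points, representatives):
    """Count, for each representative, how many points lie closest to it.

    Assumption: a point is 'represented' by its nearest representative.
    If a representative is itself in points, it is counted too.
    """
    counts = Counter()
    for p in points:
        nearest = min(range(len(representatives)),
                      key=lambda r: math.dist(p, representatives[r]))
        counts[nearest] += 1
    return [counts[r] for r in range(len(representatives))]
```

Under this assumption, a representative with a large weight stands in for a dense region of the data, while a weight of 1 suggests an isolated point.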

Tunables

The tunables button provides the following controls to filter the points in the plot view:

  • Number of clusters - Change this number depending on whether you want a fine-grained or coarse-grained grouping of points.
  • Sampling modes - The following probabilistic sampling modes are available:
    1. Inlier - Points that are strong inliers to some cluster.
    2. Outlier - Points that don't belong to any cluster.
    3. Bimodal - Points that are either strong inliers or outliers.
    4. Normal - Sample points using a normal distribution.
    5. Uniform - Sample one out of every N points, with N decided by the sampling fraction. This is most useful for sampling every Nth point of time-ordered data.
    6. Coreset - Sample points in a way that preserves the clustering structure.
  • Sampling fraction - Fraction of the total points to be sampled.
  • Sampling weight - How strong a preference is given to the selected sampling mode. The higher the number, the stronger the preference toward the selected sampling mode.
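
The interplay of sampling mode, sampling fraction, and sampling weight can be sketched as follows. This is a rough illustration under stated assumptions, not the product's algorithm: it assumes each point carries a clustering confidence in [0, 1], that the preference for a mode is raised to the power of the sampling weight, and that points are drawn without replacement using Efraimidis-Spirakis keys. The names `preference`, `probabilistic_sample`, and `uniform_sample` are hypothetical, and the Normal and Coreset modes are left out.

```python
import random

def preference(confidence, mode):
    """Map a confidence in [0, 1] to a mode-specific preference in [0, 1]."""
    if mode == "inlier":
        return confidence
    if mode == "outlier":
        return 1.0 - confidence
    if mode == "bimodal":
        # High for both very confident and very unconfident points.
        return abs(2.0 * confidence - 1.0)
    raise ValueError(f"unsupported mode: {mode}")

def probabilistic_sample(confidences, mode, fraction, sampling_weight, seed=0):
    """Draw round(fraction * n) indices, favouring the selected mode."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(confidences)))
    keys = []
    for i, c in enumerate(confidences):
        # A larger sampling_weight sharpens the preference for the mode.
        w = preference(c, mode) ** sampling_weight
        # Efraimidis-Spirakis key: larger weights tend to yield larger keys.
        key = rng.random() ** (1.0 / w) if w > 0 else 0.0
        keys.append((key, i))
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])

def uniform_sample(n_points, fraction):
    """Take one out of every N points, with N derived from the fraction."""
    step = max(1, round(1.0 / fraction))
    return list(range(0, n_points, step))
```

With a sampling weight of 0 every point gets the same weight, so the draw reduces to plain random sampling; as the weight grows, the sample concentrates on the points the mode prefers.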

The following picture shows sampling with the Group-by option set to color points by confidence score. Because the sampling weight was set to the highest allowed value (indicating a strong preference for the selected sampling mode), inlier sampling chooses points with high confidence scores, and outlier sampling chooses points with low confidence scores.

Ksegmentation-Specific Sampling Modes

  • Edge - Points that are at cluster boundaries representing transitions.
  • Line - A regression line (trendline) for each cluster is drawn, and this sampling mode prefers points close to this line.
  • Core - Points that are away from the cluster edges.
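
As an illustration of the Line mode, the sketch below fits an ordinary least-squares trendline to one cluster's 2D points and ranks the points by their perpendicular distance to it. The fitting method, the distance measure, and the deterministic "take the k closest" rule are all assumptions made for this example; the article only states that points close to the line are preferred.

```python
import math

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # assumes the x values are not all identical
    return slope, mean_y - slope * mean_x

def distance_to_line(point, slope, intercept):
    """Perpendicular distance from point to the line y = slope * x + intercept."""
    x, y = point
    return abs(slope * x - y + intercept) / math.hypot(slope, 1.0)

def closest_to_trendline(points, k):
    """Indices of the k points of one cluster nearest to its trendline."""
    slope, intercept = fit_line(points)
    order = sorted(range(len(points)),
                   key=lambda i: distance_to_line(points[i], slope, intercept))
    return order[:k]
```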

Filters

The 'Filter' button provides the following filtering criteria:

  • Partition ID - Choose the partition ID whose clusters are to be viewed.
  • Cluster - Choose a subset of clusters to be displayed.
  • Confidence - Choose only those points with clustering confidence within the selected range.
  • Weight - Filter points by weight. The data explorer uses 'coresets' to keep a subset of points as representatives, and each representative is assigned a 'weight' based on how many other points it represents.
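
Conceptually, these filters combine as a conjunction: a point stays visible only if it passes every criterion that has been set. A minimal sketch, assuming a hypothetical `PlotPoint` record and `apply_filters` helper (neither is part of the product), and leaving out the weight filter because the article does not describe how it is specified:

```python
from dataclasses import dataclass

@dataclass
class PlotPoint:
    # Hypothetical record; field names are chosen for this example only.
    partition_id: int
    cluster: int
    confidence: float

def apply_filters(points, partition_id=None, clusters=None, confidence_range=None):
    """Keep the points that satisfy every filter that is not None."""
    kept = []
    for p in points:
        if partition_id is not None and p.partition_id != partition_id:
            continue
        if clusters is not None and p.cluster not in clusters:
            continue
        if confidence_range is not None:
            low, high = confidence_range
            if not (low <= p.confidence <= high):
                continue
        kept.append(p)
    return kept
```

For example, `apply_filters(points, clusters={1, 2}, confidence_range=(0.8, 1.0))` would keep only confidently assigned members of clusters 1 and 2.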

Catalog Property Filters   

The Catalog Property Filters option allows you to filter the visualization results by the selected columns of the dataset table.

Splitting and Merging Clusters

If a cluster is large and contains many points, it might help to split it into sub-clusters for sampling and refinement. The above graphic shows the steps to split a cluster. The reverse operation, merging a cluster with the rest of the clusters, is also supported.

Adding sampled points to a resultset

A resultset represents a curated set of points. Points can be added to a resultset from the selection panel using the controls highlighted in the picture below.


