Skip to content

Scoring

The advisor assigns each candidate column a composite score from 0 to 100 by combining four weighted factors and applying a cardinality penalty.

Factor Breakdown

The default weights (all configurable):

Factor Max Points Config Parameter What it measures
Table size 30 score_weight_table_size How large the table is — clustering benefits scale with table size
Predicate frequency 30 score_weight_predicate_freq How often the column appears in WHERE predicates
Cardinality 25 score_weight_cardinality Whether the column has enough distinct values for effective clustering
Data type support 15 score_weight_data_type Whether the data type is fully/partially/not supported

Note: Weights must sum to 100. This is validated at runtime.

Table Size Scoring

Points are assigned based on row count relative to large_table_rows (default: 50M):

Condition Points (% of weight)
≥ 10× large_table_rows (500M+) 100%
≥ 1× large_table_rows (50M+) 75%
≥ 0.1× large_table_rows (5M+) 40%
≥ 0.01× large_table_rows (500K+) 15%
Below 0.01× 0%

Predicate Frequency Scoring

Points based on weighted predicate hit count (weighted by number_of_runs per query):

Predicate Hits Points (% of weight)
≥ 20 100%
≥ 10 80%
≥ 5 55%
≥ 2 30%
1 15%
0 0%

Cardinality Scoring

Points based on cardinality classification:

Cardinality Level Points (% of weight)
High 100%
Medium 65%
Low 10%
Unknown 0%

See Configuration for how levels are determined.

Data Type Scoring

Condition Points (% of weight)
Fully supported, no warnings 100%
Supported with warnings (e.g., decimal(>18), varchar(>32)) 65%
Not supported 0%

See Data Type Reference for the full support matrix.

Cardinality Penalty Multiplier

After computing the raw composite score (sum of the four factors), a penalty multiplier is applied based on cardinality classification:

Cardinality Multiplier Effect
High 1.0× No penalty — ideal for clustering
Medium 1.0× No penalty — suitable for clustering
Low 0.35× Heavy penalty — too few distinct values
Unknown 0.70× Moderate penalty — could not be estimated

Why a penalty multiplier?

Without it, a Low-cardinality column on a huge, frequently-queried table could score 77/100 — because cardinality is only 25% of the total weight. Clustering such a column would be ineffective and wasteful.

With the penalty:

Raw score:  77  (30 table-size + 30 predicate + 2 cardinality + 15 data-type)
Penalty:    × 0.35
Final:      26  → well below the default threshold of 40

This ensures that cardinality — the most fundamental requirement for effective clustering — has a decisive impact on the final recommendation.

Recommendation Labels

After scoring, each column receives a label:

Label Condition
RECOMMENDED Score ≥ min_recommendation_score + 20 (default: 60+)
Consider Score ≥ min_recommendation_score (default: 40–59)
Not recommended (low score) Score < min_recommendation_score
Not recommended (low cardinality) Cardinality classified as Low
Not recommended (unsupported data type) Data type cannot be clustered
Already clustered Column is in an existing CLUSTER BY and scores well
Already clustered — NOT RECOMMENDED Column is clustered but has low cardinality, unsupported type, or low score

Already-Clustered Column Validation

Columns that are already part of a CLUSTER BY are not excluded from scoring. They go through the same evaluation pipeline and receive the same score/label as any other column. This allows the advisor to:

  • Confirm that existing clustering choices are effective
  • Flag clustered columns that have low cardinality — suggesting they should be removed from CLUSTER BY
  • Flag clustered columns with unsupported data types or low scores
  • Generate explicit warnings in the report

CTAS DDL Generation

When generate_ctas=True, the advisor generates one CREATE TABLE ... AS SELECT statement per recommended column:

CREATE TABLE [schema].[table_clustered]
WITH (CLUSTER BY ([column_name]))
AS SELECT * FROM [schema].[table];

Only columns meeting all of these criteria get DDL:

  • Not already clustered
  • Supported data type
  • Score ≥ min_recommendation_score
  • Cardinality classified as Medium or High

The DDL appears in each per-table section and in a consolidated "All DDL" section at the end of every report format.

Worked Example

Consider a column OrderDate on a 200M-row table that appears in WHERE predicates 12 times, has High cardinality, and is a date type:

Factor Calculation Points
Table size 200M ≥ 10 × 50M → 100% of 30 30
Predicate frequency 12 hits ≥ 10 → 80% of 30 24
Cardinality High → 100% of 25 25
Data type date fully supported → 100% of 15 15
Raw composite 94
Penalty High cardinality → 1.0× 94
Label 94 ≥ 60 RECOMMENDED

Now compare with a StatusCode column on the same table — 8 predicate hits, only 5 distinct values, int type:

Factor Calculation Points
Table size 200M → 100% of 30 30
Predicate frequency 8 hits ≥ 5 → 55% of 30 16
Cardinality Low (5 ≤ 50 abs max) → 10% of 25 2
Data type int fully supported → 100% of 15 15
Raw composite 63
Penalty Low cardinality → 0.35× 22
Label 22 < 40 Not recommended (low cardinality)