Scoring¶
The advisor assigns each candidate column a composite score from 0 to 100 by combining four weighted factors and applying a cardinality penalty.
Factor Breakdown¶
The default weights (all configurable):
| Factor | Max Points | Config Parameter | What it measures |
|---|---|---|---|
| Table size | 30 | score_weight_table_size | Table row count; clustering benefits scale with table size |
| Predicate frequency | 30 | score_weight_predicate_freq | How often the column appears in WHERE predicates |
| Cardinality | 25 | score_weight_cardinality | Whether the column has enough distinct values for effective clustering |
| Data type support | 15 | score_weight_data_type | Whether the data type is fully, partially, or not supported |
Note: Weights must sum to 100. This is validated at runtime.
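As a minimal sketch, the default weights and the runtime sum check might look like this; the dictionary keys reuse the config parameter names from the table above, while the validation function name is a hypothetical illustration, not the advisor's actual API:

```python
# Default factor weights, keyed by their config parameter names.
DEFAULT_WEIGHTS = {
    "score_weight_table_size": 30,
    "score_weight_predicate_freq": 30,
    "score_weight_cardinality": 25,
    "score_weight_data_type": 15,
}

def validate_weights(weights: dict[str, int]) -> None:
    """Raise if the configured weights do not sum to exactly 100."""
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"score weights must sum to 100, got {total}")
```

Overriding any one weight therefore requires adjusting at least one other so the total stays at 100.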
Table Size Scoring¶
Points are assigned based on row count relative to large_table_rows
(default: 50M):
| Condition | Points (% of weight) |
|---|---|
| ≥ 10× large_table_rows (500M+) | 100% |
| ≥ 1× large_table_rows (50M+) | 75% |
| ≥ 0.1× large_table_rows (5M+) | 40% |
| ≥ 0.01× large_table_rows (500K+) | 15% |
| Below 0.01× | 0% |
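The tier table above is a straightforward threshold cascade. A sketch of the lookup, with the function name as an assumption:

```python
def table_size_pct(row_count: int, large_table_rows: int = 50_000_000) -> float:
    """Return the fraction of the table-size weight earned for a row count."""
    if row_count >= 10 * large_table_rows:    # 500M+ at the default
        return 1.00
    if row_count >= large_table_rows:         # 50M+
        return 0.75
    if row_count >= 0.1 * large_table_rows:   # 5M+
        return 0.40
    if row_count >= 0.01 * large_table_rows:  # 500K+
        return 0.15
    return 0.0
```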
Predicate Frequency Scoring¶
Points based on weighted predicate hit count (weighted by
number_of_runs per query):
| Predicate Hits | Points (% of weight) |
|---|---|
| ≥ 20 | 100% |
| ≥ 10 | 80% |
| ≥ 5 | 55% |
| ≥ 2 | 30% |
| 1 | 15% |
| 0 | 0% |
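The predicate-hit tiers can be sketched the same way; again the function name is hypothetical:

```python
def predicate_freq_pct(hits: int) -> float:
    """Return the fraction of the predicate-frequency weight for a hit count."""
    tiers = [(20, 1.00), (10, 0.80), (5, 0.55), (2, 0.30), (1, 0.15)]
    for threshold, pct in tiers:
        if hits >= threshold:
            return pct
    return 0.0
```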
Cardinality Scoring¶
Points based on cardinality classification:
| Cardinality Level | Points (% of weight) |
|---|---|
| High | 100% |
| Medium | 65% |
| Low | 10% |
| Unknown | 0% |
See Configuration for how levels are determined.
Data Type Scoring¶
| Condition | Points (% of weight) |
|---|---|
| Fully supported, no warnings | 100% |
| Supported with warnings (e.g., decimal(>18), varchar(>32)) | 65% |
| Not supported | 0% |
See Data Type Reference for the full support matrix.
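Both the cardinality and data-type factors are plain lookups, and a factor's points are its weight times the tier percentage. In this sketch, the dictionary names and keys are assumptions about internal naming, and truncating points to an integer is an assumption consistent with the worked examples later in this page:

```python
# Tier percentages from the two factor tables above.
CARDINALITY_PCT = {"High": 1.00, "Medium": 0.65, "Low": 0.10, "Unknown": 0.00}
DATA_TYPE_PCT = {"supported": 1.00, "supported_with_warnings": 0.65, "unsupported": 0.00}

def factor_points(weight: int, pct: float) -> int:
    # Points earned = factor weight x tier percentage, truncated to int.
    return int(weight * pct)
```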
Cardinality Penalty Multiplier¶
After computing the raw composite score (sum of the four factors), a penalty multiplier is applied based on cardinality classification:
| Cardinality | Multiplier | Effect |
|---|---|---|
| High | 1.0× | No penalty — ideal for clustering |
| Medium | 1.0× | No penalty — suitable for clustering |
| Low | 0.35× | Heavy penalty — too few distinct values |
| Unknown | 0.70× | Moderate penalty — could not be estimated |
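A sketch of the penalty step; the names are hypothetical, the multipliers come from the table above, and integer truncation of the final score is an assumption that matches the 77 → 26 arithmetic shown in this section:

```python
CARDINALITY_PENALTY = {"High": 1.0, "Medium": 1.0, "Low": 0.35, "Unknown": 0.70}

def apply_penalty(raw_score: int, cardinality: str) -> int:
    """Scale the raw composite score by the cardinality penalty multiplier."""
    return int(raw_score * CARDINALITY_PENALTY[cardinality])
```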
Why a penalty multiplier?¶
Without it, a Low-cardinality column on a huge, frequently-queried table could score 77/100 — because cardinality is only 25% of the total weight. Clustering such a column would be ineffective and wasteful.
With the penalty:
- Raw score: 77 (30 table-size + 30 predicate + 2 cardinality + 15 data-type)
- Penalty: × 0.35
- Final: 26, well below the default threshold of 40
This ensures that cardinality — the most fundamental requirement for effective clustering — has a decisive impact on the final recommendation.
Recommendation Labels¶
After scoring, each column receives a label:
| Label | Condition |
|---|---|
| RECOMMENDED | Score ≥ min_recommendation_score + 20 (default: 60+) |
| Consider | Score ≥ min_recommendation_score (default: 40–59) |
| Not recommended (low score) | Score < min_recommendation_score |
| Not recommended (low cardinality) | Cardinality classified as Low |
| Not recommended (unsupported data type) | Data type cannot be clustered |
| Already clustered | Column is in an existing CLUSTER BY and scores well |
| Already clustered — NOT RECOMMENDED | Column is clustered but has low cardinality, unsupported type, or low score |
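For not-yet-clustered columns, the label logic can be sketched as a precedence cascade. The ordering of checks here is an assumption inferred from the table above (type and cardinality disqualifiers take precedence over the score thresholds), and the function name is hypothetical; already-clustered columns go through the same checks but receive the "Already clustered" variants:

```python
def recommendation_label(score: int, cardinality: str, type_supported: bool,
                         min_recommendation_score: int = 40) -> str:
    """Map a final score plus disqualifiers to a recommendation label."""
    if not type_supported:
        return "Not recommended (unsupported data type)"
    if cardinality == "Low":
        return "Not recommended (low cardinality)"
    if score >= min_recommendation_score + 20:   # default: 60+
        return "RECOMMENDED"
    if score >= min_recommendation_score:        # default: 40-59
        return "Consider"
    return "Not recommended (low score)"
```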
Already-Clustered Column Validation¶
Columns that are already part of a CLUSTER BY are not excluded from
scoring. They go through the same evaluation pipeline and receive the
same score/label as any other column. This allows the advisor to:
- Confirm that existing clustering choices are effective
- Flag clustered columns that have low cardinality, suggesting they should be removed from CLUSTER BY
- Flag clustered columns with unsupported data types or low scores
- Generate explicit warnings in the report
CTAS DDL Generation¶
When generate_ctas=True, the advisor generates one CREATE TABLE ...
AS SELECT statement per recommended column:
```sql
CREATE TABLE [schema].[table_clustered]
WITH (CLUSTER BY ([column_name]))
AS SELECT * FROM [schema].[table];
```
Only columns meeting all of these criteria get DDL:
- Not already clustered
- Supported data type
- Score ≥ min_recommendation_score
- Cardinality classified as Medium or High
The DDL appears in each per-table section and in a consolidated "All DDL" section at the end of every report format.
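Filling the template is simple string substitution. A hypothetical helper, following the bracket-quoting style of the template verbatim:

```python
def ctas_statement(schema: str, table: str, column: str) -> str:
    """Render the CTAS template for one recommended clustering column."""
    return (
        f"CREATE TABLE [{schema}].[{table}_clustered]\n"
        f"WITH (CLUSTER BY ([{column}]))\n"
        f"AS SELECT * FROM [{schema}].[{table}];"
    )
```

Note the generated statement targets a new `_clustered` table; swapping it for the original (rename or drop-and-rename) is left to the operator.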
Worked Example¶
Consider a column OrderDate on a 200M-row table that appears in WHERE
predicates 12 times, has High cardinality, and is a date type:
| Factor | Calculation | Points |
|---|---|---|
| Table size | 200M ≥ 10 × 50M → 100% of 30 | 30 |
| Predicate frequency | 12 hits ≥ 10 → 80% of 30 | 24 |
| Cardinality | High → 100% of 25 | 25 |
| Data type | date fully supported → 100% of 15 | 15 |
| Raw composite | 30 + 24 + 25 + 15 | 94 |
| Penalty | High cardinality → 1.0× | 94 |
| Label | 94 ≥ 60 | RECOMMENDED |
Now compare with a StatusCode column on the same table — 8 predicate
hits, only 5 distinct values, int type:
| Factor | Calculation | Points |
|---|---|---|
| Table size | 200M → 100% of 30 | 30 |
| Predicate frequency | 8 hits ≥ 5 → 55% of 30 | 16 |
| Cardinality | Low (5 ≤ 50 abs max) → 10% of 25 | 2 |
| Data type | int fully supported → 100% of 15 | 15 |
| Raw composite | 30 + 16 + 2 + 15 | 63 |
| Penalty | Low cardinality → 0.35× | 22 |
| Label | 22 < 40 | Not recommended (low cardinality) |
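Both worked examples can be reproduced end to end. The tier percentages below come from the factor tables earlier in this page; truncating each factor's points and the penalized total to integers is an assumption chosen because it matches the published numbers (e.g., 55% of 30 = 16.5 → 16, and 63 × 0.35 = 22.05 → 22):

```python
def composite(size_pct: float, pred_pct: float, card_pct: float,
              type_pct: float, penalty: float) -> tuple[int, int]:
    """Return (raw composite, final score after the cardinality penalty)."""
    raw = (int(30 * size_pct) + int(30 * pred_pct)
           + int(25 * card_pct) + int(15 * type_pct))
    return raw, int(raw * penalty)

# OrderDate: 200M rows, 12 hits, High cardinality, date type.
order_date = composite(1.00, 0.80, 1.00, 1.00, penalty=1.00)
# StatusCode: 200M rows, 8 hits, Low cardinality, int type.
status_code = composite(1.00, 0.55, 0.10, 1.00, penalty=0.35)
# order_date == (94, 94), status_code == (63, 22)
```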