How It Works¶

The advisor runs a 7-phase pipeline that collects metadata, analyses query patterns, estimates cardinality, and produces scored recommendations. Everything runs inside a single Fabric notebook session.

Architecture Overview¶

┌───────────────────────────────────────────────────────────────┐
│                     Fabric Notebook                           │
│                                                               │
│  DataClusteringAdvisor.run()                                  │
│  │                                                            │
│  ├─ Phase 1: Metadata        → sys.tables/columns/types       │
│  ├─ Phase 2: Clustering      → sys.indexes/index_columns      │
│  ├─ Phase 3: Row Counts      → COUNT_BIG(*) per table         │
│  ├─ Phase 4: Query Patterns  → queryinsights.*                │
│  ├─ Phase 5: Predicates      → regex parser                   │
│  ├─ Phase 6: Cardinality     → APPROX_COUNT_DISTINCT          │
│  └─ Phase 7: Scoring         → composite score + reports      │
│                                                               │
│  All SQL runs via T-SQL passthrough (no data transfer to      │
│  Spark)                                                       │
└───────────────────────────────────────────────────────────────┘

Phase 1: Metadata Collection¶

Reads the warehouse's system catalog views via T-SQL passthrough:

sys.tables — user tables
sys.schemas — schema names
sys.columns — column metadata (name, type, length, precision)
sys.types — data type names

Produces a DataFrame with one row per column across all user tables.

If table_names is specified in the config, the metadata is filtered immediately to include only the requested tables.

Phase 2: Current Clustering Configuration¶

Reads the current CLUSTER BY configuration:

sys.indexes — index definitions
sys.index_columns — columns in each index, filtered by data_clustering_ordinal > 0

This phase also emits early warnings for potentially sub-optimal clustering choices:

char/varchar columns with max_length > 32 (only the first 32 characters produce column statistics)
decimal/numeric with precision > 18 (predicates won't push down to storage)
Unsupported data types (bit, varbinary, uniqueidentifier)
Tables exceeding max_clustering_columns clustered columns

Phase 3: Row Counts¶

Counts rows per table using T-SQL passthrough:

SELECT COUNT_BIG(*) AS cnt FROM [schema].[table]

This runs inside the SQL engine — one query per table. No user data is transferred to Spark, only the resulting count (~KB per table).

Tables below min_row_count are excluded from further analysis.

Phase 4: Query Pattern Analysis¶

Reads queryinsights.frequently_run_queries (enabled by default on every Fabric Warehouse) and categorises queries into:

WHERE queries — contain a WHERE clause and reference large tables
Full-scan queries — reference large tables but have no WHERE clause

Both types are valuable signals:

WHERE queries identify specific columns used for filtering
Full-scan queries indicate tables that would benefit from any clustering (reducing I/O even without a specific predicate)

Queries are filtered by min_query_runs and internal system queries (like COUNT_BIG(*) from Phase 3) are excluded automatically.

Phase 5: Predicate Extraction¶

Parses the SQL text of WHERE queries using a regex-based heuristic to identify which columns appear in predicates.

The parser:

Extracts WHERE clauses (everything between WHERE and ORDER BY/GROUP BY/HAVING/UNION/;/end of the query text)
Finds identifiers in the clause text
Matches them against known columns from Phase 1
Detects comparison operators (=, >, <, BETWEEN, IN, LIKE, etc.)

The parser handles:

Bracket-quoted identifiers: [schema].[table].[column]
Two-part and three-part names
Multiple WHERE clauses in a single query (e.g., subqueries)

Results are aggregated into a dictionary of (schema, table, column) → total_weighted_hits, weighted by each query's number_of_runs.

Phase 6: Cardinality Estimation¶

Estimates distinct value counts for candidate columns using T-SQL passthrough:

SELECT COUNT_BIG(*) AS total,
       APPROX_COUNT_DISTINCT([column]) AS distinct_cnt
FROM [schema].[table]

APPROX_COUNT_DISTINCT runs inside the SQL engine using HyperLogLog — it's fast and accurate enough for classification without transferring any data to Spark.

Cardinality is estimated for:

All columns that appeared in WHERE predicates (Phase 5)
All columns currently in a CLUSTER BY (Phase 2)
All supported-type columns on full-scan tables, using a batch query that estimates multiple columns in a single T-SQL call

If T-SQL passthrough fails for any reason, the advisor falls back to reading through the Spark connector (slower, but reliable).

Phase 7: Scoring & Recommendations¶

Combines all signals into a composite score per column and groups results into per-table recommendations. See Scoring for the detailed formulas.

This phase produces:

A sorted list of ColumnScore objects
A list of TableRecommendation objects (grouped by table)
Three report formats: text, Markdown, HTML
An optional per-column CTAS DDL (when generate_ctas=True)

Data Flow¶

Phase 1 (metadata) ──────────────┐
Phase 2 (clustering) ────────────┤
Phase 3 (row counts) ────────────┼──► Phase 7 (scoring)
Phase 4 (queries) ──► Phase 5 ───┤        │
Phase 6 (cardinality) ───────────┘        ▼
                                    DataClusteringResult
                                    ├── all_scores
                                    ├── recommendations
                                    ├── scores_df
                                    ├── text_report
                                    ├── markdown_report
                                    └── html_report

Performance Characteristics¶

Phase	Method	Data Transfer	Speed
1. Metadata	T-SQL passthrough	Metadata only (~KB)	Instant
2. Clustering	T-SQL passthrough	Metadata only (~KB)	Instant
3. Row Counts	T-SQL `COUNT_BIG(*)`	Count per table (~KB)	Fast
4. Query Patterns	T-SQL passthrough	Query text only (~KB-MB)	Fast
5. Predicates	Local regex	None (in-memory)	Instant
6. Cardinality	T-SQL `APPROX_COUNT_DISTINCT`	None (computed server-side)	Fast
7. Scoring	Local computation	None (in-memory)	Instant

No user data is ever transferred to Spark — only metadata, counts, and aggregates.

Overall execution time depends primarily on the number of tables and the number of columns per table, since Phases 3 and 6 issue one or more T-SQL queries per table. Warehouses with a small number of tables will complete in seconds; larger environments with hundreds of tables and many columns will take longer, particularly during cardinality estimation.