Configuration Reference

All parameters are fields of the DataClusteringConfig dataclass. Create an instance, override the defaults you need, and pass it to DataClusteringAdvisor.

from fabric_warehouse_advisor import DataClusteringAdvisor, DataClusteringConfig

config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    min_row_count=500_000,
    verbose=True,
)

advisor = DataClusteringAdvisor(spark, config)
result = advisor.run()

Connection Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| warehouse_name | str | "" | Required. The Fabric Warehouse name used in the Spark connector's three-part naming. |
| workspace_id | str | "" | Workspace GUID. Only needed for cross-workspace access. |
| warehouse_id | str | "" | Warehouse item GUID. Only needed for cross-workspace access. |
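For cross-workspace access, both GUIDs are supplied alongside the warehouse name. A minimal sketch (the GUID values below are placeholders, not real identifiers):

```python
from fabric_warehouse_advisor import DataClusteringConfig

# Placeholder GUIDs for illustration only — substitute your own.
config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    workspace_id="00000000-0000-0000-0000-000000000000",
    warehouse_id="00000000-0000-0000-0000-000000000000",
)
```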

Threshold Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| min_row_count | int | 1_000_000 | Minimum rows for a table to be analysed. Tables below this are skipped entirely. |
| large_table_rows | int | 50_000_000 | Row-count threshold for the maximum table-size score. Tables at 10× this value get full points. |
| min_predicate_hits | int | 2 | Minimum number of times a column must appear in WHERE predicates to be considered a candidate. |
| min_query_runs | int | 2 | Minimum number of executions for a query to be treated as "frequently run" in Query Insights. |
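On a smaller development warehouse you might lower the thresholds so that more tables qualify. A sketch (the values here are illustrative, not tuning guidance):

```python
from fabric_warehouse_advisor import DataClusteringConfig

# Illustrative values for a small dev warehouse.
config = DataClusteringConfig(
    warehouse_name="DevWarehouse",
    min_row_count=100_000,       # analyse smaller tables than the default 1M
    large_table_rows=5_000_000,  # scale the table-size score to match
)
```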

Cardinality Classification

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| low_cardinality_upper | float | 0.001 | Cardinality ratio below which a column is classified as Low. |
| high_cardinality_lower | float | 0.05 | Cardinality ratio at or above which a column is classified as High. |
| low_cardinality_abs_max | int | 50 | Absolute distinct-count ceiling: any column with ≤ this many distinct values is always classified as Low, regardless of the ratio. |
| cardinality_sample_fraction | float | 1.0 | Fraction of the table to sample when the Spark fallback path is used. Ignored when T-SQL passthrough succeeds (the default). Must be in (0, 1.0]. |

How classification works

if approx_distinct <= low_cardinality_abs_max:
    → "Low"
elif ratio < low_cardinality_upper:
    → "Low"
elif ratio >= high_cardinality_lower:
    → "High"
else:
    → "Medium"

Where ratio = approx_distinct / total_rows.
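The rule above can be restated as a self-contained Python function (a sketch of the documented logic, not the library's internal implementation):

```python
def classify_cardinality(approx_distinct: int, total_rows: int,
                         low_upper: float = 0.001, high_lower: float = 0.05,
                         abs_max: int = 50) -> str:
    # The absolute ceiling wins first: very few distinct values is always Low.
    if approx_distinct <= abs_max:
        return "Low"
    ratio = approx_distinct / total_rows
    if ratio < low_upper:
        return "Low"
    if ratio >= high_lower:
        return "High"
    return "Medium"

# 30 distinct values in 10M rows: tiny ratio, but the absolute rule fires first.
classify_cardinality(30, 10_000_000)       # "Low"
classify_cardinality(5_000, 10_000_000)    # ratio 0.0005 < 0.001 → "Low"
classify_cardinality(100_000, 10_000_000)  # ratio 0.01 → "Medium"
classify_cardinality(600_000, 10_000_000)  # ratio 0.06 ≥ 0.05 → "High"
```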

Scoring Weights

The composite score is the sum of four weighted factors. The weights must sum to 100 — the config validates this at runtime and raises ValueError if they don't.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_weight_table_size | int | 30 | Maximum points for the table-size factor. |
| score_weight_predicate_freq | int | 30 | Maximum points for predicate frequency. |
| score_weight_cardinality | int | 25 | Maximum points for column cardinality. |
| score_weight_data_type | int | 15 | Maximum points for data-type support. |

See Scoring for the detailed formulas.
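As a rough sketch of how the weights combine, each factor contributes up to its weight and the composite is their sum. The factor fractions below are made up for illustration; the real per-factor formulas are documented on the Scoring page:

```python
# Default weights — must sum to 100.
weights = {"table_size": 30, "predicate_freq": 30, "cardinality": 25, "data_type": 15}

# Hypothetical per-factor fractions (0.0–1.0) for one candidate column.
factor_fraction = {"table_size": 1.0, "predicate_freq": 0.5,
                   "cardinality": 0.8, "data_type": 1.0}

# Composite score on a 0–100 scale: 30 + 15 + 20 + 15 = 80.
composite = sum(weights[k] * factor_fraction[k] for k in weights)
```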

Recommendation Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_clustering_columns | int | 3 | Warn when a table already has more clustered columns than this. Does not limit CTAS output. |
| min_recommendation_score | int | 40 | Minimum composite score to surface a recommendation. Columns below this are labelled "Not recommended". |
| generate_ctas | bool | False | When True, generate one CREATE TABLE ... AS SELECT DDL statement per recommended column. Set this to include ready-to-run DDL in the report. |

Parallelism

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_parallel_tables | int | 4 | Maximum number of tables to estimate cardinality for in parallel during Phase 6. Each table gets a single batched APPROX_COUNT_DISTINCT query covering all its candidate columns. Higher values reduce wall-clock time but increase concurrent SQL sessions on the warehouse. Set to 1 to disable parallelism. |
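The behaviour can be sketched with a thread pool, where estimate_cardinality is a stand-in for the real batched query (hypothetical names, not the library's API):

```python
from concurrent.futures import ThreadPoolExecutor

def estimate_cardinality(table: str) -> dict:
    # Stand-in for the real batched APPROX_COUNT_DISTINCT query per table.
    return {"table": table, "estimated": True}

max_parallel_tables = 4
tables = ["dbo.Orders", "dbo.FactSales", "sales.FactReturns"]

# At most max_parallel_tables queries run concurrently; 1 worker = sequential.
with ThreadPoolExecutor(max_workers=max_parallel_tables) as pool:
    results = list(pool.map(estimate_cardinality, tables))
```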

Scope Filtering

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| schema_names | list[str] | [] | Restrict analysis to specific schemas. Empty = all user schemas. |
| table_names | list[str] | [] | Restrict analysis to specific tables. Each entry can be "table_name" (matches any schema) or "schema.table_name" (exact match). Empty list = all tables. |

Examples:

# Only analyse tables in the 'sales' schema
config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    schema_names=["sales"],
)

# Analyse only these two tables
config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    table_names=["dbo.Orders", "FactSales"],
)

# Analyse only tables in the 'sales' schema (by listing them explicitly)
config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    table_names=["sales.FactOrders", "sales.FactReturns", "sales.FactLineItems"],
)

The filter applies to all phases — metadata collection, row counting, query pattern matching, cardinality estimation, and scoring.
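The table_names matching rule can be restated as a small function (a hypothetical re-implementation for illustration, not the library's code):

```python
def matches(schema: str, table: str, table_names: list[str]) -> bool:
    # Empty list means no filter: every table matches.
    if not table_names:
        return True
    for entry in table_names:
        if "." in entry:
            # "schema.table_name" — exact match on both parts.
            s, t = entry.split(".", 1)
            if s == schema and t == table:
                return True
        elif entry == table:
            # Bare "table_name" — matches in any schema.
            return True
    return False

filter_ = ["dbo.Orders", "FactSales"]
matches("dbo", "Orders", filter_)     # True  (exact schema.table match)
matches("sales", "FactSales", filter_)  # True  (bare name, any schema)
matches("dbo", "Customers", filter_)  # False
```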

Output Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| verbose | bool | False | When True, prints structured debug output for each phase including intermediate DataFrames, row counts, and predicate breakdowns. Useful for understanding what the advisor is doing. |

Throttle Protection

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| phase_delay | float | 1.0 | Seconds to pause between phases to reduce HTTP 429 throttling from the Fabric control-plane API. Set to 0 to disable the delay. |

Validation

The config is validated automatically when advisor.run() is called. The following checks are performed:

  • warehouse_name must be set to a non-empty value (not the placeholder "<your_warehouse_name>")
  • cardinality_sample_fraction must be in (0, 1.0]
  • Score weights must sum to exactly 100

If any check fails, a ValueError is raised with a descriptive message.
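The three checks can be restated as a stand-alone sketch. The validate function and DEFAULTS mapping below are hypothetical; the real validation runs inside advisor.run():

```python
DEFAULTS = {
    "warehouse_name": "",
    "cardinality_sample_fraction": 1.0,
    "score_weight_table_size": 30,
    "score_weight_predicate_freq": 30,
    "score_weight_cardinality": 25,
    "score_weight_data_type": 15,
}

def validate(overrides: dict) -> None:
    # Hypothetical re-statement of the documented checks over a dict of overrides.
    cfg = {**DEFAULTS, **overrides}
    name = cfg["warehouse_name"]
    if not name or name == "<your_warehouse_name>":
        raise ValueError("warehouse_name must be set to a real warehouse name")
    frac = cfg["cardinality_sample_fraction"]
    if not 0 < frac <= 1.0:
        raise ValueError("cardinality_sample_fraction must be in (0, 1.0]")
    total = (cfg["score_weight_table_size"] + cfg["score_weight_predicate_freq"]
             + cfg["score_weight_cardinality"] + cfg["score_weight_data_type"])
    if total != 100:
        raise ValueError(f"score weights must sum to 100, got {total}")

validate({"warehouse_name": "MyWarehouse"})  # passes
# validate({}) would raise: warehouse_name is empty
# validate({"warehouse_name": "W", "score_weight_table_size": 40}) would raise: weights sum to 110
```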