Skip to content

Configuration Reference

All parameters are fields of the DataClusteringConfig dataclass. Create an instance, override the defaults you need, and pass it to DataClusteringAdvisor.

from fabric_warehouse_advisor import DataClusteringAdvisor, DataClusteringConfig

config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    min_row_count=500_000,
    verbose=True,
)

advisor = DataClusteringAdvisor(spark, config)
result = advisor.run()

Connection Parameters

Parameter Type Default Description
warehouse_name str "" Required. The Fabric Warehouse name used in the Spark connector's three-part naming.
workspace_id str "" Workspace GUID. Only needed for cross-workspace access.
warehouse_id str "" Warehouse item GUID. Only needed for cross-workspace access.

Threshold Parameters

Parameter Type Default Description
min_row_count int 1_000_000 Minimum rows for a table to be analysed. Tables below this are skipped entirely.
large_table_rows int 50_000_000 Row-count threshold for the maximum table-size score. Tables at 10× this value get full points.
min_predicate_hits int 2 Minimum number of times a column must appear in WHERE predicates to be considered a candidate.
min_query_runs int 2 Minimum number of executions for a query to be treated as "frequently run" in Query Insights.

Cardinality Classification

Parameter Type Default Description
low_cardinality_upper float 0.001 Cardinality ratio below which a column is classified as Low.
high_cardinality_lower float 0.05 Cardinality ratio at or above which a column is classified as High.
low_cardinality_abs_max int 50 Absolute distinct-count ceiling — any column with ≤ this many distinct values is always classified as Low, regardless of the ratio.
cardinality_sample_fraction float 1.0 Fraction of the table to sample when the Spark fallback path is used. Ignored when T-SQL passthrough succeeds (which is the default). Must be in (0, 1.0].

How classification works

if approx_distinct <= low_cardinality_abs_max:
    → "Low"
elif ratio < low_cardinality_upper:
    → "Low"
elif ratio >= high_cardinality_lower:
    → "High"
else:
    → "Medium"

Where ratio = approx_distinct / total_rows.

Scoring Weights

The composite score is the sum of four weighted factors. The weights must sum to 100 — the config validates this at runtime and raises ValueError if they don't.

Parameter Type Default Description
score_weight_table_size int 30 Maximum points for the table-size factor.
score_weight_predicate_freq int 30 Maximum points for predicate frequency.
score_weight_cardinality int 25 Maximum points for column cardinality.
score_weight_data_type int 15 Maximum points for data-type support.

See Scoring for the detailed formulas.

Recommendation Parameters

Parameter Type Default Description
max_clustering_columns int 3 Warn when a table already has more clustered columns than this. Does not limit CTAS output.
min_recommendation_score int 40 Minimum composite score to surface a recommendation. Columns below this are labelled "Not recommended".
generate_ctas bool False When True, generate one CREATE TABLE ... AS SELECT DDL statement per recommended column. Set this to include ready-to-run DDL in the report.

Filtering

Parameter Type Default Description
table_names list[str] [] Restrict analysis to specific tables. Each entry can be "table_name" (matches any schema) or "schema.table_name" (exact match). Empty list = all tables.

Examples:

# Analyse only these two tables
config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    table_names=["dbo.Orders", "FactSales"],
)

# Analyse only tables in the 'sales' schema (by listing them explicitly)
config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    table_names=["sales.FactOrders", "sales.FactReturns", "sales.FactLineItems"],
)

The filter applies to all phases — metadata collection, row counting, query pattern matching, cardinality estimation, and scoring.

Output Parameters

Parameter Type Default Description
verbose bool False When True, prints structured debug output for each phase including intermediate DataFrames, row counts, and predicate breakdowns. Useful for understanding what the advisor is doing.

Validation

The config is validated automatically when advisor.run() is called. The following checks are performed:

  • warehouse_name must be set to a non-empty value (not the placeholder "<your_warehouse_name>")
  • cardinality_sample_fraction must be in (0, 1.0]
  • Score weights must sum to exactly 100

If any check fails, a ValueError is raised with a descriptive message.