Configuration Reference

All parameters are fields of the DataClusteringConfig dataclass. Create an instance, override the defaults you need, and pass it to DataClusteringAdvisor.

from fabric_warehouse_advisor import DataClusteringAdvisor, DataClusteringConfig

config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    min_row_count=500_000,
    verbose=True,
)

advisor = DataClusteringAdvisor(spark, config)
result = advisor.run()

Connection Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| warehouse_name | str | "" | Required. The Fabric Warehouse name used in the Spark connector's three-part naming. |
| workspace_id | str | "" | Workspace GUID. Only needed for cross-workspace access. |
| warehouse_id | str | "" | Warehouse item GUID. Only needed for cross-workspace access. |
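For cross-workspace access, both GUIDs are supplied alongside the warehouse name. A minimal sketch (the GUID values below are placeholders, not real identifiers):

```python
from fabric_warehouse_advisor import DataClusteringConfig

# Placeholder GUIDs for illustration only — substitute your own.
config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    workspace_id="00000000-0000-0000-0000-000000000000",
    warehouse_id="00000000-0000-0000-0000-000000000000",
)
```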

Threshold Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| min_row_count | int | 1_000_000 | Minimum rows for a table to be analysed. Tables below this are skipped entirely. |
| large_table_rows | int | 50_000_000 | Row-count threshold for the maximum table-size score. Tables at 10× this value get full points. |
| min_predicate_hits | int | 2 | Minimum number of times a column must appear in WHERE predicates to be considered a candidate. |
| min_query_runs | int | 2 | Minimum number of executions for a query to be treated as "frequently run" in Query Insights. |
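On a smaller development warehouse you might lower the thresholds so that more tables qualify. A sketch (the values here are illustrative, not tuning guidance):

```python
from fabric_warehouse_advisor import DataClusteringConfig

# Illustrative values for a small dev warehouse.
config = DataClusteringConfig(
    warehouse_name="DevWarehouse",
    min_row_count=100_000,       # analyse smaller tables than the default 1M
    large_table_rows=5_000_000,  # scale the table-size score to match
)
```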

Cardinality Classification

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| low_cardinality_upper | float | 0.001 | Cardinality ratio below which a column is classified as Low. |
| high_cardinality_lower | float | 0.05 | Cardinality ratio at or above which a column is classified as High. |
| low_cardinality_abs_max | int | 50 | Absolute distinct-count ceiling: any column with ≤ this many distinct values is always classified as Low, regardless of the ratio. |
| cardinality_sample_fraction | float | 1.0 | Fraction of the table to sample when the Spark fallback path is used. Ignored when T-SQL passthrough succeeds (the default). Must be in (0, 1.0]. |

How classification works

if approx_distinct <= low_cardinality_abs_max:
    → "Low"
elif ratio < low_cardinality_upper:
    → "Low"
elif ratio >= high_cardinality_lower:
    → "High"
else:
    → "Medium"

Where ratio = approx_distinct / total_rows.
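The rule above can be restated as a self-contained Python function (a sketch of the documented logic, not the library's internal implementation):

```python
def classify_cardinality(approx_distinct: int, total_rows: int,
                         low_upper: float = 0.001, high_lower: float = 0.05,
                         abs_max: int = 50) -> str:
    # The absolute ceiling wins first: very few distinct values is always Low.
    if approx_distinct <= abs_max:
        return "Low"
    ratio = approx_distinct / total_rows
    if ratio < low_upper:
        return "Low"
    if ratio >= high_lower:
        return "High"
    return "Medium"

# 30 distinct values in 10M rows: tiny ratio, but the absolute rule fires first.
classify_cardinality(30, 10_000_000)       # "Low"
classify_cardinality(5_000, 10_000_000)    # ratio 0.0005 < 0.001 → "Low"
classify_cardinality(100_000, 10_000_000)  # ratio 0.01 → "Medium"
classify_cardinality(600_000, 10_000_000)  # ratio 0.06 ≥ 0.05 → "High"
```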

Scoring Weights

The composite score is the sum of four weighted factors. The weights must sum to 100 — the config validates this at runtime and raises ValueError if they don't.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| score_weight_table_size | int | 30 | Maximum points for the table-size factor. |
| score_weight_predicate_freq | int | 30 | Maximum points for predicate frequency. |
| score_weight_cardinality | int | 25 | Maximum points for column cardinality. |
| score_weight_data_type | int | 15 | Maximum points for data-type support. |

See Scoring for the detailed formulas.
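As a rough sketch of how the weights combine, each factor contributes up to its weight and the composite is their sum. The factor fractions below are made up for illustration; the real per-factor formulas are documented on the Scoring page:

```python
# Default weights — must sum to 100.
weights = {"table_size": 30, "predicate_freq": 30, "cardinality": 25, "data_type": 15}

# Hypothetical per-factor fractions (0.0–1.0) for one candidate column.
factor_fraction = {"table_size": 1.0, "predicate_freq": 0.5,
                   "cardinality": 0.8, "data_type": 1.0}

# Composite score on a 0–100 scale: 30 + 15 + 20 + 15 = 80.
composite = sum(weights[k] * factor_fraction[k] for k in weights)
```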

Recommendation Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_clustering_columns | int | 3 | Warn when a table already has more clustered columns than this. Does not limit CTAS output. |
| min_recommendation_score | int | 40 | Minimum composite score to surface a recommendation. Columns below this are labelled "Not recommended". |
| generate_ctas | bool | False | When True, generate one CREATE TABLE ... AS SELECT DDL statement per recommended column. Set this to include ready-to-run DDL in the report. |

Parallelism

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_parallel_tables | int | 4 | Maximum number of tables to estimate cardinality for in parallel during Phase 6. Each table gets a single batched APPROX_COUNT_DISTINCT query covering all its candidate columns. Higher values reduce wall-clock time but increase concurrent SQL sessions on the warehouse. Set to 1 to disable parallelism. |
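The behaviour can be sketched with a thread pool, where estimate_cardinality is a stand-in for the real batched query (hypothetical names, not the library's API):

```python
from concurrent.futures import ThreadPoolExecutor

def estimate_cardinality(table: str) -> dict:
    # Stand-in for the real batched APPROX_COUNT_DISTINCT query per table.
    return {"table": table, "estimated": True}

max_parallel_tables = 4
tables = ["dbo.Orders", "dbo.FactSales", "sales.FactReturns"]

# At most max_parallel_tables queries run concurrently; 1 worker = sequential.
with ThreadPoolExecutor(max_workers=max_parallel_tables) as pool:
    results = list(pool.map(estimate_cardinality, tables))
```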

Scope Filtering

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| schema_names | list[str] | [] | Restrict analysis to specific schemas. Empty = all user schemas. |
| table_names | list[str] | [] | Restrict analysis to specific tables. Each entry can be "table_name" (matches any schema) or "schema.table_name" (exact match). Empty list = all tables. |

Examples:

# Only analyse tables in the 'sales' schema
config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    schema_names=["sales"],
)

# Analyse only these two tables
config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    table_names=["dbo.Orders", "FactSales"],
)

# Analyse only tables in the 'sales' schema (by listing them explicitly)
config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
    table_names=["sales.FactOrders", "sales.FactReturns", "sales.FactLineItems"],
)

The filter applies to all phases — metadata collection, row counting, query pattern matching, cardinality estimation, and scoring.
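The table_names matching rule can be restated as a small function (a hypothetical re-implementation for illustration, not the library's code):

```python
def matches(schema: str, table: str, table_names: list[str]) -> bool:
    # Empty list means no filter: every table matches.
    if not table_names:
        return True
    for entry in table_names:
        if "." in entry:
            # "schema.table_name" — exact match on both parts.
            s, t = entry.split(".", 1)
            if s == schema and t == table:
                return True
        elif entry == table:
            # Bare "table_name" — matches in any schema.
            return True
    return False

filter_ = ["dbo.Orders", "FactSales"]
matches("dbo", "Orders", filter_)     # True  (exact schema.table match)
matches("sales", "FactSales", filter_)  # True  (bare name, any schema)
matches("dbo", "Customers", filter_)  # False
```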

Output Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| verbose | bool | False | When True, prints structured debug output for each phase including intermediate DataFrames, row counts, and predicate breakdowns. Useful for understanding what the advisor is doing. |

Throttle Protection

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| phase_delay | float | 1.0 | Seconds to pause between phases to reduce HTTP 429 throttling from the Fabric control-plane API. Set to 0 to disable the delay. |

Validation

The config is validated automatically when advisor.run() is called. The following checks are performed:

  • warehouse_name must be set to a non-empty value (not the placeholder "<your_warehouse_name>")
  • cardinality_sample_fraction must be in (0, 1.0]
  • Score weights must sum to exactly 100

If any check fails, a ValueError is raised with a descriptive message.
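The three checks can be restated as a stand-alone sketch. The validate function and DEFAULTS mapping below are hypothetical; the real validation runs inside advisor.run():

```python
DEFAULTS = {
    "warehouse_name": "",
    "cardinality_sample_fraction": 1.0,
    "score_weight_table_size": 30,
    "score_weight_predicate_freq": 30,
    "score_weight_cardinality": 25,
    "score_weight_data_type": 15,
}

def validate(overrides: dict) -> None:
    # Hypothetical re-statement of the documented checks over a dict of overrides.
    cfg = {**DEFAULTS, **overrides}
    name = cfg["warehouse_name"]
    if not name or name == "<your_warehouse_name>":
        raise ValueError("warehouse_name must be set to a real warehouse name")
    frac = cfg["cardinality_sample_fraction"]
    if not 0 < frac <= 1.0:
        raise ValueError("cardinality_sample_fraction must be in (0, 1.0]")
    total = (cfg["score_weight_table_size"] + cfg["score_weight_predicate_freq"]
             + cfg["score_weight_cardinality"] + cfg["score_weight_data_type"])
    if total != 100:
        raise ValueError(f"score weights must sum to 100, got {total}")

validate({"warehouse_name": "MyWarehouse"})  # passes
# validate({}) would raise: warehouse_name is empty
# validate({"warehouse_name": "W", "score_weight_table_size": 40}) would raise: weights sum to 110
```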