
Data Clustering Advisor

The Data Clustering Advisor helps you answer a deceptively simple question: which columns in my Fabric Warehouse should I cluster?

Data Clustering is one of the most impactful performance levers in Microsoft Fabric Warehouse: it controls how data is physically organized on OneLake, which directly affects query speed and resource consumption. But choosing the right columns to cluster on isn't always obvious. That's where this advisor comes in.

What it does

The advisor analyses your actual query patterns (via Query Insights), combines them with table metadata and column cardinality estimates, and scores every candidate column from 0 to 100. You get a clear report telling you exactly what to cluster and why.

Quick Start

from fabric_warehouse_advisor import DataClusteringAdvisor, DataClusteringConfig

config = DataClusteringConfig(
    warehouse_name="MyWarehouse",
)

advisor = DataClusteringAdvisor(spark, config)
result = advisor.run()

# Save the report and open it in a web browser for the full interactive experience
result.save("/lakehouse/default/Files/reports/report.html")
# Or render the rich HTML report inline in the notebook
displayHTML(result.html_report)

Warning

Execution time and CU consumption of the Data Clustering Advisor vary with data volume and column count. For optimal performance and minimal impact, we recommend running the advisor during low-usage periods and outside peak concurrency windows.

Working with Results

Web Browser is recommended

The best way to view the report is to save it as HTML and open it in a web browser, which provides the full experience with rich features and interactivity.

Exploring Scores

# Spark DataFrame with per-column scores
display(result.scores_df)
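
Because scores_df is a Spark DataFrame, the usual DataFrame operations apply to it. The sketch below is a hypothetical helper (not part of the advisor) that returns the highest-scoring rows. The name of the score column is not documented on this page, so it is passed in as a parameter and checked against scores_df.columns first.

```python
def top_candidates(scores_df, score_col, n=10):
    """Return the n highest-scoring rows of the advisor's scores DataFrame.

    score_col is an assumption supplied by the caller: inspect
    scores_df.columns to find the real name of the score column.
    """
    if score_col not in scores_df.columns:
        raise ValueError(
            f"Column {score_col!r} not found; available columns: {scores_df.columns}"
        )
    # DataFrame.orderBy accepts column names plus an `ascending` flag,
    # so no pyspark import is needed here.
    return scores_df.orderBy(score_col, ascending=False).limit(n)
```

In a notebook this could be used as display(top_candidates(result.scores_df, "score")), where "score" is a placeholder for whatever column name print(result.scores_df.columns) reveals.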

Saving Reports

result.save("/lakehouse/default/Files/reports/dataclustering_report.html")
result.save("/lakehouse/default/Files/reports/dataclustering_report.md", "md")
result.save("/lakehouse/default/Files/reports/dataclustering_report.txt", "txt")
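
If all three formats are wanted on every run, the calls above can be wrapped in a small loop. This is a minimal sketch: save_all_formats is a hypothetical helper, and it relies only on the two call shapes shown above, result.save(path) for HTML and result.save(path, fmt) for "md" and "txt".

```python
import posixpath

# File extension -> format argument. "md" and "txt" come from the examples
# above. Whether "html" may be passed explicitly is not shown, so HTML uses
# the one-argument form.
FORMATS = {"html": None, "md": "md", "txt": "txt"}


def save_all_formats(result, directory, stem="dataclustering_report"):
    """Save the report once per format and return the paths written."""
    paths = []
    for ext, fmt in FORMATS.items():
        path = posixpath.join(directory, f"{stem}.{ext}")
        if fmt is None:
            result.save(path)
        else:
            result.save(path, fmt)
        paths.append(path)
    return paths
```

For example, save_all_formats(result, "/lakehouse/default/Files/reports") writes the same three files as the individual calls.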

Persisting data to a Delta table

result.scores_df.write.mode("overwrite").format("delta").saveAsTable(
    "yourschema.data_clustering_advisor_scores"
)

Documentation

Document | Description
How It Works | Detailed analysis of each phase in the pipeline execution lifecycle
Configuration | Full parameter reference with defaults
Scoring | Scoring formula, cardinality penalties, worked examples
Reports | HTML, Text and Markdown report formats
Data Type Reference | Supported types and limitations