Usually I start my blog posts with an interesting question I got, this time we will do the reverse. I want to talk about a question that isn’t asked often enough: “Should I use DirectQuery?”.
DirectQuery is a connectivity method in Power BI and SSAS where the modeler defines the model but doesn’t import any data. As no data is stored in the model any data needed for visualizations will be retrieved from the data source directly. Power BI will translate its internal queries into queries that get send to the underlying data sources when populating the data for visuals. This seems a great idea because it solves a lot of “problems” the BI industry has struggled with for years, it offers a couple of major benefits:
- No need to move any data, data stays at the source where it is created offering real time updates whenever changes are made.
- Ability to leverage the security as it is defined in the data source when using Single Sign On (SSO).
- You will not hit any memory limits in Power BI or SSAS so you can see all the raw data (all transactions)
Thus DirectQuery is often, at first glance, considered the holy grail for any BI project. Unfortunately some projects that get started with DirectQuery have to go back to the drawing board because they fail to recognize that many things normal in BI projects are now missing:
- The data source is a a transactional system and thus not a data warehouse, it doesn’t use a star schema. Creating a decent model then becomes a challenge, resulting in additional views that introduces more manageability and query complexity (which could lead to perf issues).
- The data source is build for OLTP scenario’s, meaning it is fast for the daily workload but now it also has to support analytical query patterns (e.g. dimensional joins). This might not perform as expected, especially when compared to data that is imported into Power BI.
- The data source might not be as clean as expected (it never is, is it?) resulting in data quality issues.
- The data source might be a MPP \ Big Data solution where high concurrency might be an issue (the reports are consumed by hundreds of users at the peak time). Remember a Power BI report consist of many visuals that all need to be updated resulting in multiple queries.
- The security as set up in the transactional data source might not be the same security as you expect for reporting (e.g. you cannot see individual transactions from your sales peers but you can see the grand total)
There are probably more pitfalls but these are the most common I have seen.
Now these can be tackled with enough investment, for example:
- Optimizing your SQL Server data source with in memory indexes designed for the query load the reporting will put on the system. You will need to know what you are doing here and leverage a experienced DBA.
- Instead of views you create new analytical tables to shape the data and add specific performance features. But remember you are moving the data, now just inside the same system.
- Use clusters specialized for interactive query performance (like serverless pools). (you are moving the data again, now just inside the same system)
- Leverage a fast data solution like HANA, SQL DW, Netezza, Spark, etc. In each case you will have to consider and test the load together with the amount of concurrency to determine if it is good enough for your scenario. For example when using SPARK concurrency is an issue, when using HANA and want to connect to BW you will have to use the BW OLAP BAPI’s where you get hit with the OLAP BAPI performance penalty.
This goes to show that using DirectQuery should not be an automatic for your project, it needs to be a well thought out consideration up front. You can make it work with the right planning. A great help, that you should read before any Direct Query project, is this very detailed whitepaper on DirectQuery.
One important aspect to take into consideration at the start of the project is the ROI for the work that you need to do to to make Direct Query work to the levels the business users accept it. Especially considering how “cheap” the alternative of loading the data into Power BI or SSAS is. The Power BI or SSAS data model is build to support analytical query patterns at the speed of thought. Ofcourse it does mean you have to make some trade offs:
- You will have to load data into Power BI or SSAS. This means the data is not “real time, but with smart partioning in SSAS you can keep your data up to date very frequently.
- By loading data you are limited to how much data you can load into the model, Power BI supports 1 GB files at the time of writing (soon to be increased) where SSAS can grow to multiple terabytes if you have enough memory. All data loaded into the model is compressed so you can expect to load between 3 to 10 times more raw data. Maybe for some projects you will not have enough memory, this will be a good discussion with your end users, they will want to see all the data but is this really necessary for your reporting? Alternatively a new interesting feature is cooking that will allow you to combine in-memory data with Direct Query to solve this problem (more on this later when it releases, it will change some of this blog post).
- You will have to replicate the security model, this might be a pain but it often turns out the data source security settings are much more complicated then needed then just for the reporting.
- You need a BI expert, but changes are good you already know him since the same Power BI desktop is used for your reports.
So I hope the take away from this blog post is to make sure you challenge your business users \ customers on their DirectQuery requirement and exercise due diligence at the start of the project to investigate all angels and look at the ROI. This blog post is based upon part of a presentation I regularly do that you can see here: https://www.youtube.com/watch?v=dDkPnZkqot0 (57 min).
Let me know what you think in the comments below, I would love to hear your experience.