Usually I start my blog posts with an interesting question I got; this time we will do the reverse. I want to talk about a question that isn’t asked often enough: “Should I use DirectQuery?”.
DirectQuery is a connectivity method in Power BI and SSAS where the modeler defines the model but doesn’t import any data. As no data is stored in the model, any data needed for visualizations is retrieved from the data source directly. Power BI translates its internal queries into queries that get sent to the underlying data sources when populating the data for visuals (sketched after the list below). This seems like a great idea because it solves a lot of “problems” the BI industry has struggled with for years, and it offers a couple of major benefits:
- No need to move any data; data stays at the source where it is created, offering real-time updates whenever changes are made.
- Ability to leverage the security as it is defined in the data source when using Single Sign-On (SSO).
- You will not hit any memory limits in Power BI or SSAS, so you can see all the raw data (all transactions).
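To make that translation step concrete, here is a rough sketch of the kind of SQL a single “sales by year” visual could fire at a SQL Server source in DirectQuery mode (the table and column names are made up for illustration; the actual SQL Power BI generates will look different):

```sql
-- Hypothetical query behind a "sales amount by year" visual in DirectQuery mode;
-- every visual on the page fires its own query like this against the source.
SELECT
    d.[CalendarYear],
    SUM(f.[SalesAmount]) AS [SumSalesAmount]
FROM [dbo].[FactSales] AS f
INNER JOIN [dbo].[DimDate] AS d
    ON d.[DateKey] = f.[OrderDateKey]
GROUP BY d.[CalendarYear];
```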
Thus DirectQuery is often, at first glance, considered the holy grail for any BI project. Unfortunately, some projects that start with DirectQuery have to go back to the drawing board because they fail to recognize that many things that are normal in BI projects are now missing:
- The data source is a transactional system and thus not a data warehouse; it doesn’t use a star schema. Creating a decent model then becomes a challenge, resulting in additional views that introduce more manageability overhead and query complexity, which can lead to performance issues (see the view sketch after this list).
- The data source is built for OLTP scenarios, meaning it is fast for the daily workload, but now it also has to support analytical query patterns (e.g. dimensional joins). This might not perform as expected, especially when compared to data that is imported into Power BI.
- The data source might not be as clean as expected (it never is, is it?), resulting in data quality issues.
- The data source might be an MPP \ Big Data solution where high concurrency can be an issue (the reports are consumed by hundreds of users at peak times). Remember, a Power BI report consists of many visuals that all need to be updated, resulting in multiple queries.
- The security as set up in the transactional data source might not be the security you expect for reporting (e.g. you cannot see individual transactions from your sales peers, but you can see the grand total).
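To illustrate that first pitfall, here is a minimal sketch of the kind of view you end up writing to give an OLTP schema a star-like shape (the table and column names are hypothetical). Every report interaction then pays for these joins at query time, on the live transactional system:

```sql
-- Hypothetical view flattening normalized order tables into a fact-shaped object.
-- In DirectQuery mode, every report query executes this join against the OLTP system.
CREATE VIEW [dbo].[vFactSales]
AS
SELECT
    oh.[OrderDate],
    oh.[CustomerID],
    ol.[ProductID],
    ol.[Quantity],
    ol.[Quantity] * ol.[UnitPrice] AS [LineAmount]
FROM [dbo].[OrderHeader] AS oh
INNER JOIN [dbo].[OrderLine] AS ol
    ON ol.[OrderID] = oh.[OrderID];
```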
There are probably more pitfalls, but these are the most common ones I have seen.
Now these can be tackled with enough investment, for example:
- Optimizing your SQL Server data source with in-memory indexes designed for the query load the reporting will put on the system (see the index sketch after this list). You will need to know what you are doing here, so leverage an experienced DBA.
- Instead of views, you create new analytical tables to shape the data and add specific performance features. But remember: you are moving the data, now just inside the same system.
- Use clusters specialized for interactive query performance (like serverless pools). (You are moving the data again, now just inside the same system.)
- Leverage a fast data solution like HANA, SQL DW, Netezza, Spark, etc. In each case you will have to consider and test the load together with the amount of concurrency to determine if it is good enough for your scenario. For example, when using Spark, concurrency is an issue; when using HANA and you want to connect to BW, you will have to use the BW OLAP BAPIs, where you get hit with the OLAP BAPI performance penalty.
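As a minimal sketch of that first investment, one common option (my assumption here; the right index depends on your workload) is an updatable nonclustered columnstore index, available in SQL Server 2016 and later, which serves the scan-and-aggregate patterns DirectQuery generates while the existing rowstore indexes keep serving the OLTP workload (table and column names are hypothetical):

```sql
-- Hypothetical: add a nonclustered columnstore index covering the columns
-- the reporting queries aggregate and group by, leaving OLTP indexes untouched.
CREATE NONCLUSTERED COLUMNSTORE INDEX [NCCI_FactSales_Reporting]
ON [dbo].[FactSales] ([OrderDateKey], [CustomerKey], [ProductKey], [SalesAmount], [Quantity]);
```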
This goes to show that using DirectQuery should not be an automatic choice for your project; it needs to be a well-thought-out consideration up front. You can make it work with the right planning. A great help, which you should read before starting any DirectQuery project, is this very detailed whitepaper on DirectQuery.
One important aspect to take into consideration at the start of the project is the ROI of the work you need to do to make DirectQuery perform at a level the business users will accept, especially considering how “cheap” the alternative of loading the data into Power BI or SSAS is. The Power BI or SSAS data model is built to support analytical query patterns at the speed of thought. Of course, it does mean you have to make some trade-offs:
- You will have to load data into Power BI or SSAS. This means the data is not “real time”, but with smart partitioning in SSAS you can keep your data up to date very frequently.
- By loading data you are limited in how much data you can load into the model; Power BI supports 1 GB files at the time of writing (soon to be increased), whereas SSAS can grow to multiple terabytes if you have enough memory. All data loaded into the model is compressed, so you can expect to load between 3 and 10 times more raw data. Maybe for some projects you will not have enough memory; this will be a good discussion with your end users. They will want to see all the data, but is that really necessary for your reporting? Alternatively, an interesting new feature is cooking that will allow you to combine in-memory data with DirectQuery to solve this problem (more on this later when it releases; it will change some of this blog post).
- You will have to replicate the security model. This might be a pain, but it often turns out the data source security settings are much more complicated than needed for just the reporting.
- You need a BI expert, but chances are good you already know them, since the same Power BI Desktop is used for your reports.
So I hope the takeaway from this blog post is to make sure you challenge your business users \ customers on their DirectQuery requirement and exercise due diligence at the start of the project to investigate all angles and look at the ROI. This blog post is based on part of a presentation I regularly do, which you can see here: https://www.youtube.com/watch?v=dDkPnZkqot0 (57 min).
Let me know what you think in the comments below, I would love to hear your experience.
Great post, totally concur with all the assessments here. Concurrency in particular is a major issue; I would love to see some caching mechanisms built into DW to remediate things like choosing a slicer and having 12 visuals update in Power BI.
Did you happen to see the SQL PASS keynote link that I linked above: https://blog.crossjoin.co.uk/2017/11/01/pass-summit-2017-day-1-bi-news/? Using this you will be able to do exactly what you said, directly in Power BI.
Hi Kasper,
Interesting assessment. How about leveraging all DAX statements? I thought that in DirectQuery mode not all DAX formulas can be used, as they cannot be translated to SQL statements. Can you confirm that?
That is a good point; it is one of the items I do mention in the YouTube video, and again this will be an interesting discussion when designing the project. Most DAX formulas can be translated, but the question is rather: is this smart? For example, you can add a calc column, but that means it will be translated to SQL on every query. Do you want that? Or you add some business logic like a YoY% calculation using DAX; if you think about what data is needed to get the results, it has to retrieve data from 2 years and aggregate it. Maybe you want to pre-calculate it? Questions, questions 🙂
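To sketch the calc column point (with made-up names, and the real generated SQL looks different): a calculated column such as Margin = Amount - Cost is not materialized in DirectQuery mode, so its expression gets inlined into every query the visuals fire, roughly like this:

```sql
-- The calculated column's expression is re-evaluated inside every generated query,
-- instead of being computed once at refresh time as it would be in import mode.
SELECT
    d.[CalendarYear],
    SUM(f.[Amount] - f.[Cost]) AS [SumMargin]  -- inlined calc column expression
FROM [dbo].[Sales] AS f
INNER JOIN [dbo].[DimDate] AS d
    ON d.[DateKey] = f.[OrderDateKey]
GROUP BY d.[CalendarYear];
```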
A second issue that I find interesting is the use of Live Connection to SSAS. This seems to be the way to go, as loading the data (again) into Power BI seems redundant. However, like DirectQuery, Live Connection to SSAS is said to be something to use with care.
Yes, Live Connection uses the same engine as Power BI, so you should only have to import once.
Thanks for the informative blog Kasper.
One thing I wonder about DirectQuery is whether there may be some “bugs”, or maybe more accurately inefficiencies, here and there that one is going to run into.
An example of what I’m talking about is something I’m seeing, where I have one or more visuals seemingly outputting a massive amount of redundancy into the SQL expression. In my case it is related to date filters and outputs every possible value into a huge IN() statement. Rather than simply saying IS NULL or YEAR = 2017 (according to the filter you have specified), it seems to be making a list of all the distinct values.
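Roughly the shape of what I’m seeing (simplified, with made-up table and column names):

```sql
-- The filter shape I would expect for a "Year is 2017" filter:
SELECT COUNT(*) FROM [dbo].[FactSales]
WHERE YEAR([OrderDate]) = 2017;

-- The shape that appears to be generated instead: every distinct date enumerated.
SELECT COUNT(*) FROM [dbo].[FactSales]
WHERE [OrderDate] IN ('2017-01-01', '2017-01-02', '2017-01-03' /* ...one entry per day... */);
```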
More details here:
http://community.powerbi.com/t5/Desktop/DirectQuery-bug-all-possible-Date-filter-values-are-being-output/m-p/320455
I forwarded your link to some of our developers and it looks like they followed up with you on the thread.
Great post Kasper. I’ve recently been testing SSAS Tabular 2017 with Live Connection to Power BI, where Power BI is located in Australia and the source data is located elsewhere. And it is incredibly fast; I would say 1 second waiting time at most.
I’m really enjoying SSAS 2017 because I’m able to leverage all my Power BI experience.
Great to hear!
Great post Kasper. In several projects I’ve seen that the team interested in DirectQuery underestimates the complexity and performance impact of the logic that has been built into the SQL views or partitions supporting the import model. In many cases, this layer implements transformations and joins on top of multiple views and this of course doesn’t translate well to report query performance with DirectQuery.
Two thoughts:
1) I’m a bit surprised you mention Azure SQL DW as a source, even though you recommend testing load and concurrency. The SQL CAT team listed DirectQuery as an anti-pattern given the lack of plan caching and the data movement needed to satisfy some queries (http://bit.ly/2zGIS8i). With a maximum of 32 concurrent queries, I assume Azure SQL DW via DirectQuery would be a very rare implementation.
2) I would think the roadmap for SSAS would be a consideration too. For example, with DirectQuery a strategic priority, a project could go ahead and invest in DirectQuery with the expectation that performance will improve with future versions of SSAS. (Of course, the current state would have to be ‘acceptable’).
Happy New Year
I have seen SQL DW be successful with DirectQuery, but only when used by a few business users at a time. And yes, SSAS and Power BI get constant DirectQuery updates; you might have seen the 1 trillion row demo that makes use of cache and DQ as a mixed scenario. That will be very interesting.
Good article, however I feel it just skips the main point. DirectQuery is live data; it updates throughout the day. Import only has a maximum of 8 data refreshes per 24 hours. For any sort of business that needs real-time information, it has to be DirectQuery.
Hi Eliot, I believe that is why I mentioned in the article that if you really need real time you can do DirectQuery and pay the price (the investment). But honestly, often this requirement is not so strict once you tell them/show them the difference in cost and “experience”.
Besides, when using SSAS you can refresh as often as you want using partitions, and Power BI Premium allows you to refresh 48 times per day.
Actually, I hate to use DirectQuery. First, it is more difficult to develop, since we cannot see the data while using DirectQuery mode. Second, the performance of a report using DirectQuery mode is out of our control; it all depends on the backend server. Using the import model, we can utilize the state-of-the-art column-based in-memory database engine. Are you kidding me? In 2017, you still use database technology that was developed 30 years ago? Last, do we need to refresh datasets more than 8 times a day? Based on my experience, 80% of my projects are based on daily counts.