Improve performance with DirectQuery by using Inner Joins

If you have ever looked at the queries generated by DirectQuery you will see that a query that retrieves data from multiple tables gets joined by a left outer join. This might not be what you would expect but this is a actually a solid design choice. The reason is that if your data would have any keys that would be missing due to referential integrity issues on either side you would never notice and data would get dropped. When using the SSAS storage engine instead of DirectQuery data with referential integrity will always be shown, an empty row in the dimension will automatically be added to the result set. Translating this to SQL a left outer join will make sure missing values will also be shown. Unfortunately a left outer join is usually not as performant as an inner join.

Luckily we give you an option to make a conscious decision and allow you to use inner joins instead. This makes a huge performance impact but again you have to make sure that you are ok with data being dropped if you have referential integrity problems. This usually is not a problem when you are using an actual curated data warehouse, it may be an issue when you are joining a less curated data source.

Let’s look at an example.  I created model in DirectQuery mode with a few tables into SSAS running on my laptop. I assume that on the average actual server the numbers you will see in this blog post will be blown out of the water Smile so don’t take them as a performance benchmark.  The FactOnlineSales table contains 2 billion rows to make it interesting:

image

Let’s create a PivotTable in Excel that joins 3 tables: DiscountAmount from FactOnlinesSales with RegionCountryname from DimGeography through the Dimcustomer table

image

The data is retrieved by using the following SQL Query:

SELECT
TOP (1000001) [t1].[RegionCountryName],SUM([t2].[DiscountAmount])
AS [a0]
FROM
(
(( SELECT [dbo].[FactOnlineSalesVBig].* FROM [dbo].[FactOnlineSalesVBig] ) AS [t2]

left outer join

( SELECT [dbo].[DimCustomer].* FROM [dbo].[DimCustomer] ) AS [t3] on
(
[t2].[CustomerKey] = [t3].[CustomerKey]
)
)
left outer join

( SELECT [dbo].[DimGeography].* FROM [dbo].[DimGeography] ) AS [t1] on
(
[t3].[GeographyKey] = [t1].[GeographyKey]
)
)

GROUP BY [t1].[RegionCountryName]

 

The duration is 4244 ms to execute the query and retrieve the results on my laptop:

image

And remember the fact table contains 2 billion rows.

Now lets switch to Inner joins. At the moment of writing this there is no UI in SSDT to do this yet but that will be added soon. We need to go the script file to change it (right mouse click on model.bim and view code)

and add a property on every relationship we want to use Inner join. The property is called “relyOnReferentialIntegrity” and needs to be set to true. The default is false.

This looks like this:

image

Now after saving this and reopening the designer (this reapplies the changes of my script to my workspace database). I then refresh my PivotTable in Excel.

Now we see the query sends inner joins instead:

SELECT
TOP (1000001) [t1].[RegionCountryName],SUM([t2].[DiscountAmount])
AS [a0]
FROM
(
(( SELECT [dbo].[FactOnlineSalesVBig].* FROM [dbo].[FactOnlineSalesVBig] ) AS [t2]

inner join

( SELECT [dbo].[DimCustomer].* FROM [dbo].[DimCustomer] ) AS [t3] on
(
[t2].[CustomerKey] = [t3].[CustomerKey]
)
)

inner join

( SELECT [dbo].[DimGeography].* FROM [dbo].[DimGeography] ) AS [t1] on
(
[t3].[GeographyKey] = [t1].[GeographyKey]
)
)

GROUP BY [t1].[RegionCountryName]

We immediately see a huge performance gain, we went from 4244 to 1144 ms.

image

Almost 4 times as fast…

Of course your millage might vary but setting this property should be done as often as possible, when your data permits it.

 

By the way this same feature is also available in Power BI Desktop:

image

  • Brett Powell

    This is great, thanks Kasper! I remember when it was added in the March update to PBI Desktop and have observed the performance benefit of inner join SQL queries (from PBI models) but haven’t tested with SSAS Tabular yet, simply because of the added scripting you describe. Quick question: Can you advise how or where you obtained a 2 billion row fact table? (I’d like to test design scenarios with different sizes of models – 50M, 500M, 2B, etc) Thanks, Brett

    • Simon Nuss

      Ditto. I remember reading a white paper from Marco a couple years ago where he demo’d ~4 billion. It would be nice to get a script/data source for this!

  • Steven Neumersky

    He could have inserted sample data himself. Here are some sample enlargement scripts from Hugo Kornelius:

    https://www.simple-talk.com/sql/t-sql-programming/generating-test-data-in-tsql/

  • What I did is just generate more data. I used the MecDataMart from the old BI VM and wrote a few sql scripts to duplicate fact data. This works pretty good.

    • Brett Powell

      Thanks Kasper. I’m going to use the free trial from Redgate for generating test data: http://tinyurl.com/huygpjv but may well go with SQL scripts as opposed to the $370/user license. I need to test with larger dimensions (e.g. 1-2M+ row customer and product tables) and want the dataset to be as realistic as possible for real world large client scenarios.