Improve performance with DirectQuery by using Inner Joins

By | September 6, 2016

If you have ever looked at the queries generated by DirectQuery you will see that a query that retrieves data from multiple tables gets joined by a left outer join. This might not be what you would expect but this is a actually a solid design choice. The reason is that if your data would have any keys that would be missing due to referential integrity issues on either side you would never notice and data would get dropped. When using the SSAS storage engine instead of DirectQuery data with referential integrity will always be shown, an empty row in the dimension will automatically be added to the result set. Translating this to SQL a left outer join will make sure missing values will also be shown. Unfortunately a left outer join is usually not as performant as an inner join.

Luckily we give you an option to make a conscious decision and allow you to use inner joins instead. This makes a huge performance impact but again you have to make sure that you are ok with data being dropped if you have referential integrity problems. This usually is not a problem when you are using an actual curated data warehouse, it may be an issue when you are joining a less curated data source.

Let’s look at an example.  I created model in DirectQuery mode with a few tables into SSAS running on my laptop. I assume that on the average actual server the numbers you will see in this blog post will be blown out of the water Smile so don’t take them as a performance benchmark.  The FactOnlineSales table contains 2 billion rows to make it interesting:

image

Let’s create a PivotTable in Excel that joins 3 tables: DiscountAmount from FactOnlinesSales with RegionCountryname from DimGeography through the Dimcustomer table

image

The data is retrieved by using the following SQL Query:

SELECT
TOP (1000001) [t1].[RegionCountryName],SUM([t2].[DiscountAmount])
AS [a0]
FROM
(
(( SELECT [dbo].[FactOnlineSalesVBig].* FROM [dbo].[FactOnlineSalesVBig] ) AS [t2]

left outer join

( SELECT [dbo].[DimCustomer].* FROM [dbo].[DimCustomer] ) AS [t3] on
(
[t2].[CustomerKey] = [t3].[CustomerKey]
)
)
left outer join

( SELECT [dbo].[DimGeography].* FROM [dbo].[DimGeography] ) AS [t1] on
(
[t3].[GeographyKey] = [t1].[GeographyKey]
)
)

GROUP BY [t1].[RegionCountryName]

 

The duration is 4244 ms to execute the query and retrieve the results on my laptop:

image

And remember the fact table contains 2 billion rows.

Now lets switch to Inner joins. At the moment of writing this there is no UI in SSDT to do this yet but that will be added soon. We need to go the script file to change it (right mouse click on model.bim and view code)

and add a property on every relationship we want to use Inner join. The property is called “relyOnReferentialIntegrity” and needs to be set to true. The default is false.

This looks like this:

image

Now after saving this and reopening the designer (this reapplies the changes of my script to my workspace database). I then refresh my PivotTable in Excel.

Now we see the query sends inner joins instead:

SELECT
TOP (1000001) [t1].[RegionCountryName],SUM([t2].[DiscountAmount])
AS [a0]
FROM
(
(( SELECT [dbo].[FactOnlineSalesVBig].* FROM [dbo].[FactOnlineSalesVBig] ) AS [t2]

inner join

( SELECT [dbo].[DimCustomer].* FROM [dbo].[DimCustomer] ) AS [t3] on
(
[t2].[CustomerKey] = [t3].[CustomerKey]
)
)

inner join

( SELECT [dbo].[DimGeography].* FROM [dbo].[DimGeography] ) AS [t1] on
(
[t3].[GeographyKey] = [t1].[GeographyKey]
)
)

GROUP BY [t1].[RegionCountryName]

We immediately see a huge performance gain, we went from 4244 to 1144 ms.

image

Almost 4 times as fast…

Of course your millage might vary but setting this property should be done as often as possible, when your data permits it.

 

By the way this same feature is also available in Power BI Desktop:

image