I get asked this question a lot: “for my company xyz we want to set up an enterprise gateway, what machines or hardware should we buy?”. I always reply with the famous consulting answer: “it depends”. Unfortunately there is no silver bullet here, but there is some information I can provide to help you make an informed decision.
But before we go and look at some recommendations let’s take a look at the different types of load the gateway can have.
- Cached data: Any data that gets imported into Power BI from an on-premises source needs to be moved to the Power BI cloud service. This happens in a few (rough) steps:
- Data needs to be retrieved from the data source. The gateway connects to the data source and collects all the data it needs to populate the model, so a good network connection to the data source is important. A lot of hops between the two machines might slow down fetching the data.
- All data that gets imported into Power BI goes through the Power Query engine, and any transformations defined during development of the model also need to be applied to the data at each refresh. Power Query gives you incredible flexibility and power, but this is where some of the wildcards come into play. If you use Power Query functions that can be delegated to the data source, not many resources are used on the gateway; for example, with SQL Server as the data source many operations can be translated into a SQL query, and SQL Server will do most of the heavy lifting. But when you use Power Query to import data from a JSON file, all operations need to happen in the Power Query engine itself, which needs much more CPU and memory on the gateway machine.
- Once the data is in the right shape, it gets transferred from the gateway to Power BI. This again requires a good network connection, this time between the gateway and the Power BI service, which of course also means a good internet connection.
- Live connect or Direct Query: Here only query results get moved to Power BI: Power BI sends a query to the data source to request just the data it needs to render a particular visual. The gateway is mostly in pass-along mode, where Power BI sends the query to the gateway, which in turn sends it to the data source and then returns just the results to Power BI. Some CPU is needed for the routing, but the most important thing here is a good connection to both the Power BI service and the data source. As these queries are “real time”, you want the connection to be as fast as possible, so reducing any additional hops in the network might be beneficial.
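The delegation point above is worth making concrete. The sketch below is a hypothetical illustration (the function names are mine, not a Power Query API): when a transformation can be “folded” into the source query, the gateway only ships SQL text and the database does the work; when the source is a flat JSON file, every row must be parsed and filtered in the gateway’s own process.

```python
import json

# Hypothetical illustration of query folding vs. local processing.
def fold_filter_to_sql(table, column, threshold):
    # The transformation becomes part of the query the source executes;
    # the gateway itself does almost no work.
    return f"SELECT * FROM {table} WHERE {column} > {threshold}"

def filter_json_locally(raw_json, column, threshold):
    # No folding possible: the full payload is loaded and filtered in
    # the gateway process, costing local CPU and memory.
    rows = json.loads(raw_json)
    return [r for r in rows if r[column] > threshold]

if __name__ == "__main__":
    print(fold_filter_to_sql("Sales", "Amount", 100))
    data = '[{"Amount": 50}, {"Amount": 150}, {"Amount": 200}]'
    print(filter_json_locally(data, "Amount", 100))
```

The same logical filter lands in two very different places: the first variant pushes the work to SQL Server, the second keeps it all on the gateway machine.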
So this is the basis, but here is where “it depends” comes into play. As you can see, there are many variables in play already, and things get a bit more complicated when you start factoring in more users. Usually many users will hit the gateway all at once with a combination of all of the above. So to determine the load you need to determine:
- How many data models are refreshing concurrently?
- What types of data sources are used for the refresh: a straightforward connection to SQL Server, or complicated parsing of JSON files using Power Query? This determines the CPU needed on the machine; the more cores the better. It could also affect memory utilization, as data may need to be read into memory to be parsed.
- How much data is moved: a 1 GB model or just a few MBs? The higher the bandwidth the better, but this also affects the CPU load of the gateway machine. It is also important to account for the amount of disk space needed: the gateway buffers the data on disk before submitting it to Power BI, so if you are refreshing ten 100 MB workbooks at the same time, you need 1,000 MB of free disk space.
- How many users are hitting the data source directly through either Direct Query or Live connect?
- Does the model (either in Power BI or in SSAS) use RLS? If so, a query needs to be sent to the data source for every user to make sure each user sees the right data, which means more network load on the gateway and on the underlying data source.
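The disk-space arithmetic above can be sketched as a back-of-the-envelope estimator (the helper name is mine; the buffering behavior is the point from the list, the numbers are just illustrative):

```python
# Back-of-the-envelope sizing sketch: the gateway spools refresh data
# to disk before sending it to the service, so concurrent refreshes
# need roughly the sum of their model sizes in free disk space.
def required_buffer_mb(model_sizes_mb):
    """Free disk space (MB) needed to buffer concurrent refreshes."""
    return sum(model_sizes_mb)

if __name__ == "__main__":
    # Ten 100 MB workbooks refreshing at the same time -> 1000 MB.
    print(required_buffer_mb([100] * 10))
```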
As you can see, there are many variables that determine the load on the gateway machine. My personal recommendation is to start out with a reasonably sized machine (8 cores, 8 GB of RAM and multiple-gigabit network cards) and then start measuring using the built-in performance counters, as described in the documentation. This allows you to measure the load on the machine and get a better understanding of what is going on; it is probably impossible to know up front what your users will do. In my personal experience the setup mentioned above will take you a long way, even for the largest enterprise, but again your mileage may vary. You will only know when you try it, measure it, and react to it.
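To make “measure and react” concrete, here is a minimal monitoring sketch. It assumes you have already exported counter samples (e.g. CPU percentage) from the gateway machine; the function and threshold are my own illustration, not part of the gateway tooling. The idea is to alert on sustained pressure rather than a single spike.

```python
from statistics import mean

def summarize(samples, threshold, window=5):
    """Return (average, sustained_overload) for a series of counter samples.

    sustained_overload is True only if some run of `window` consecutive
    samples all stay above `threshold` - a single spike is ignored.
    """
    sustained = any(
        min(samples[i:i + window]) > threshold
        for i in range(len(samples) - window + 1)
    )
    return mean(samples), sustained

if __name__ == "__main__":
    # Illustrative CPU% samples taken during a refresh window.
    cpu_percent = [35, 40, 90, 92, 95, 91, 94, 38, 30, 25]
    avg, overloaded = summarize(cpu_percent, threshold=85)
    print(avg, overloaded)
```

If `overloaded` keeps coming back `True` during refresh windows, that is your signal to add cores, add memory, or split the load across more gateways.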
Now there are some other things you can do early on if you want to build in a bit more “safety” and make sure your enterprise reports don’t get pushed aside by a noisy neighbor. For example, you could set up a dedicated gateway (or several) for your corporate-wide reports, supporting the solutions built by your BI team and used by your CFO; you know what type of model this is and how many users will connect. You can then have a separate gateway (or several) that your business users can use to refresh their ad-hoc reports. This way you isolate the “known” enterprise reports from the more agile, ad-hoc workloads, but again, without measurements you will keep guessing.
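The isolation idea above boils down to a simple routing rule. The sketch below is purely hypothetical (the cluster names and ownership rule are mine, and in practice this mapping is configured in the Power BI service, not in code), but it shows the decision you are making:

```python
# Hypothetical routing rule: known corporate reports land on a
# dedicated gateway cluster; everything else shares the ad-hoc one.
ENTERPRISE_OWNERS = {"bi-team", "finance"}

def choose_cluster(owner):
    if owner in ENTERPRISE_OWNERS:
        return "enterprise-gateway-cluster"
    return "adhoc-gateway-cluster"

if __name__ == "__main__":
    print(choose_cluster("bi-team"))
    print(choose_cluster("sales-analyst"))
```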
I hope this helps you make a better decision on how to set up your enterprise gateway.