Best practices for data modelling in Power BI

When building reports, we should not only check whether our numbers are correct and whether our visuals are formatted so that they support decision making; we should also take a good look at whether the report follows technical best practices. This has a big influence on how fast visuals load, on the overall responsiveness of the report, and on the dataset size, which can hit volume limits depending on your licensing. For example, a Power BI Pro dataset is limited to 1 GB, storage to 10 GB, and a scheduled refresh has to complete within 2 hours. What happens if we have to load tens or hundreds of millions of records?

Why is data modelling so important?

Data modelling is work on the foundations. The better it is done, the easier it is to manage changes and the smoother every later step becomes. A well-thought-out data model lets us build the dataset quickly and easily, and it also shortens the time needed to bring in new analysts who will take the report over. Beyond the conceptual work, there are a few practical steps we should take when building a data model.

Limit the number of columns in tables.

This is the most important and simplest question to ask at the very beginning: are we really going to use all the columns that exist in the source data? If a column turns out to be useful later, we can always add it. Many end users want every column 'just in case they need it', but when working with huge datasets, where we are not sure whether the model is already too heavy, we should leave such columns out and stick to a simple rule: columns not used in the report should not be in the model.

Define column data types properly.

When cleaning the data, remember to define data types properly. Whenever possible, use numbers and IDs. Another example is storing a SalesDate column as a date instead of text. Also ask yourself whether you really need 10 decimal places when you are measuring sales in USD.

Separate facts from dimensions.

When separating facts from dimensions we are building so-called 'dictionaries' and implementing the star schema, the data model used in Power BI. A star schema relies on relationships between fact and dimension tables. As an example, take a sales table as your fact table and a product table as your dimension. In the fact table you only need the product ID, which has its counterpart in the product dimension table. The product table holds all the descriptive columns such as category, subcategory, description, size, package and so on. This information is unique per product and does not need to be repeated in the fact table, where it would artificially inflate the dataset size. It is clearly cheaper to keep a table of 10 thousand products with their descriptions than to repeat those descriptions across 20 million records.

Besides reusing existing dimension tables, try to create your own when you cannot identify them in the database. Take sales channel as an example: delivery or physical. Instead of storing the text 'delivery' 20 million times, it is lighter to store '1' for delivery and '0' for physical and to create a separate table that translates these two IDs, as in the DAX sketch below. Customer tables can be treated just like products; fact tables should not hold any contact or address data.
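A minimal DAX sketch of such a hand-made channel dimension, built with DATATABLE; the table and column names (DimSalesChannel, SalesChannelID, SalesChannelName) are illustrative assumptions, not names from any specific source:

DimSalesChannel =
DATATABLE (
    "SalesChannelID", INTEGER,
    "SalesChannelName", STRING,
    { { 0, "Physical" }, { 1, "Delivery" } }
)

Relating DimSalesChannel[SalesChannelID] to the matching whole-number ID column in the fact table lets every visual show the two channel names while the 20 million fact rows store nothing more than a tiny whole number.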

Sort the input data.

However weird it may sound, there is no single rule here: you should simply try sorting the input data before loading it into Power BI. Many times, by comparing different sort orders on columns, I managed to reduce the dataset size, since sorted columns usually compress better. It is a good idea to take a smaller sample for testing and check the impact of sorting on the size of your .pbix file after the data load.

Split decimal amounts into two whole-number columns.

Power BI treats whole numbers as the lightest data type, so we should use them wherever we want to make the dataset smaller. Interestingly, if we split an amount column containing decimals into two whole-number columns, such as 'Sales_USD' and 'Sales_Cents', the dataset size drops. Merging them back in the report is not a problem, because for aggregation you use the SUMX function instead of SUM:
SUMX (
    FactSales,
    -- whole dollars plus the cents column converted back to a fractional amount
    FactSales[Sales_USD] + FactSales[Sales_Cents] / 100
)

In the end the dataset becomes lighter, while the calculations remain fast.

Move your data sources to Power BI dataflows.

If we plan our tables in advance, then instead of loading them directly from the data source (e.g. a database) we can move them to dataflows in the Power BI service, which can additionally reduce the dataset size. If we do that, we need to bear in mind that the refresh has to happen in two stages, first the dataflow and then the dataset, instead of the dataset alone.

Turn off the automatic date/time feature in Power BI.

When loading data, Power BI Desktop by default creates hidden date tables in the background for the date columns it finds; this is why date hierarchies are available automatically. It is good practice to turn this feature off (the Auto date/time option in the Data Load settings), because it can noticeably inflate the dataset size (sometimes by as much as 20%), and to use a single calendar table related to all fact tables instead, as sketched below.
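A minimal sketch of such a calendar table written as a DAX calculated table; the date range and the names (DimDate, Year, MonthNumber, Month) are illustrative assumptions:

DimDate =
ADDCOLUMNS (
    -- pick a range that covers every date appearing in the fact tables
    CALENDAR ( DATE ( 2015, 1, 1 ), DATE ( 2025, 12, 31 ) ),
    "Year", YEAR ( [Date] ),
    "MonthNumber", MONTH ( [Date] ),
    "Month", FORMAT ( [Date], "MMM" )
)

After loading it, mark it as a date table in Power BI Desktop and relate DimDate[Date] to the date column of each fact table.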

Avoid creating calculated columns in DAX.

Calculated columns in DAX can be very heavy, especially when applied to fact tables. Always try to create them either in the data source or in Power Query; that way you move the calculation to a different engine (the source or the query engine). And when a calculated column would only feed an aggregation, a measure is often a lighter alternative, as in the sketch below.
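As an illustration of that last point, a measure can replace the calculated column entirely; the measure name and FactSales column names below are hypothetical:

-- a calculated column such as NetSales = FactSales[SalesAmount] - FactSales[RebateAmount]
-- would be stored for every row; this measure computes the same total only at query time
Net Sales =
SUMX (
    FactSales,
    FactSales[SalesAmount] - FactSales[RebateAmount]
)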

Summary

The techniques above have helped me reduce dataset size by as much as 90%; in my case a fact table with roughly 200 million records weighed no more than 400 MB (the fact table contained only 3 IDs and around 15 numeric columns for sales amount, quantity and rebate amount). I highly recommend applying these tricks, because ultimately they influence not only the dataset size but also the speed of calculations, and they make change management easier.
