Power BI/Python/JSON

More power in Power BI with Python — Part II

Using Python scripts in Power BI queries, processing JSON with Python, querying data from the web

Henrik Massow

--

Why use Python in queries?

Power BI and Power Query provide a variety of query options “out of the box”.

a picture is worth a thousand words — query options in Power BI

So why use a Python script when Power BI was invented for queries of all kinds?

Stupid answer: Because we can!

💥 Stop! 💥

I’ve read this answer too often lately to find it funny. In any case, if we want to leave the path provided by the program and turn towards individual programming, we should have a more convincing answer, right?

In Power BI there are some scenarios in which the use of Python as a query can be useful.

  • There may already be a (complex) script that queries, sorts, groups, merges or converts data from certain sources. In this case, it makes sense to use this script directly in Power BI instead of reinventing and recreating it with M and/or DAX.
  • Python can help us with queries where Power BI or Power Query doesn’t really shine!

And yes, there are queries where the last point applies. Power BI struggles when

  • the data has to be fetched from the web,
  • the data is hierarchical, e.g. in JSON format…

So, let’s start with Python, JSON and Power BI!

The Prologue — Fund data from the web

For various reasons I needed the German tax data for the iShares EURO STOXX 50 ex-Financials UCITS ETF fund.

Disclaimer: I cannot and will not comment on the performance of the fund or the iShares company. The fund only serves as a hook for the possibilities of Power BI and Python.

This data for each iShares fund can be accessed via a link on the fund provider’s website.

The data then appears in a modal window. Understanding the meaning of this data does not matter here at all. It is simply a set of percentages that the fund must calculate and publish on each trading day.

Data from web — Power BI fails … (I)

Anyone who has the idea of using the Power BI functionality “data from web” will quickly be disappointed.

Power BI finds all possible (layout) tables of the website, but nothing that has anything to do with our data.

The cause of this failure can be found easily with the help of the Chrome browser tools. See recording below.

It is easy to see that the page downloads the tax data asynchronously, and only when the link is clicked. Only then is a table with the fetched data rendered in the modal window.

So Power BI hardly stands a chance of finding our data.

JSON data — Power BI fails … (II)

Now that we are already examining the source code of the page, we can continue and verify our theory about the failure of Power BI above.

Here we can see that the page has downloaded our tax data in JSON format. So, theory confirmed and new challenge accepted.

If the data is available in JSON format in our browser, damn it, then it must be possible to get this data into our Power BI.

And yes, a solution is right in front of us. We have the url here.

This gives us at least two options. Either we simply download the JSON file and then load it into Power BI …

… or we let Power BI retrieve the data for us. As we can see below, this also works!

The result aaData:List is exactly what we saw in the browser console: an object key called “aaData” with an array/list as its value …

So how is the JSON file structured?

{"aaData": [
    ["0,0000", "0,4564", "17,3647", "0,45646",
     {"display": "29.05.2017", "raw": 20171229}
    ],
    ...
]}

Bad design, I know, but we have to take what we can get.

Our target format is roughly the following.

0,0000   0,4564    17,3647    0,45646    29.05.2017
...
...
...

And this is where it quickly becomes cumbersome with Power BI …

First of all, we can convert our result into a table.

… which does not help much …

… but allows us to drill down …

… but this also only shows us a list of lists that we can convert into a table …

By clicking on “Extract values …” we unfortunately run directly into …

…an error.

The problem is the date object in our array / list.

So back to the list and try another way …

The result is not an error this time, but each value is now in a separate row (in one column) and our date object is kept as a record … We could hardly be further away from our target structure …

In fact, there are many ways to solve the problem here. Either we write one or more functions in M in the Power Query Editor with which we split our list row by row, or we gradually create new columns in Power Query and split the original list that way.

We could extract the date as follows:

We would then extract the rest of the list in the following way.

We could concatenate this new list into a string …

… in order to then split this character string to columns.

Boahhh! 💣 💣

💥Stop! 💥

Yes, with Power BI we have come very close to our goal, but there is still a lot to do. Column 1 has to be removed, the columns need headings, the numbers have to be converted into real numbers instead of strings, the string representation of the date into a real date…

… but that’s all a total mess, isn’t it?

From my point of view, we are clearly at a point where Python could be the better choice.

Query with Python

Let’s assume for the moment that we already have the JSON file available by manually downloading it.

In this case we can just click “Get Data” and select “Python script”.

Now a window appears and we can enter our code.

Here is the code as text.

Yeah: a few lines of code! And we’re done!

In line 1 we import the Pandas library and in line 3 we define the path to our JSON file.

The work is ultimately done in line 5. Here our JSON file is read: that’s what pd.read_json(path) does. Furthermore, we immediately grab the data from the “column” named “aaData” and split it via a list comprehension. We also grab the right part of our date object.

Finally, in line 7, we define the headers.
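Reconstructed as a sketch, the script might look like the following. An inline sample stands in for the downloaded file so the snippet is self-contained (which shifts the line numbers slightly), and the column names are placeholders, not the real headers:

```python
import io
import pandas as pd

# Minimal inline sample in the same shape as the downloaded file
# (stands in for the real JSON file on disk; values taken from the snippet above).
sample = '{"aaData": [["0,0000", "0,4564", "17,3647", "0,45646", {"display": "29.05.2017", "raw": 20171229}]]}'

raw = pd.read_json(io.StringIO(sample))

# Split each inner list into columns and keep only the "display" part of the date object.
df = pd.DataFrame([row[:4] + [row[4]["display"]] for row in raw["aaData"]])

df.columns = ["Wert1", "Wert2", "Wert3", "Wert4", "Datum"]  # hypothetical headers
```

When working with the real file, pd.read_json(path) takes the file path directly instead of the StringIO wrapper.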

👏 Effectively 3–4 lines of code, and the result?

Everything is wonderfully organized, with headings, just beautiful, right? Just some lines of Python code. 👏

Excursus: Occasional problems

Well, sometimes problems can arise in several places here.

What sometimes leads to strange error messages are errors while reading the JSON file with pd.read_json(). If such errors occur, the following points should be checked:

  • Is the file path correct?
  • Is the JSON in that file syntactically valid? Check it via jsonlint.com.
  • If you still get strange error messages, it is possible that the file did not “arrive” correctly during the download. The only thing that helps in such a situation is to copy the content of the file and paste it into a new, empty JSON file.

It can also happen that Power BI does not recognize our numerical values as numbers but as strings, or recognizes them incorrectly. This often depends on the regional settings. Power BI does this step correctly for us in most cases, but non-English formats occasionally need a little “nudge”.
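If we prefer, this conversion can also be handled in the Python script itself, before Power BI ever sees the values. A sketch with illustrative data and a hypothetical column name:

```python
import pandas as pd

# Illustrative values with German decimal commas, as delivered by the JSON.
df = pd.DataFrame({"Wert1": ["0,0000", "0,4564", "17,3647"]})

# Swap the decimal comma for a dot, then cast the column to float.
df["Wert1"] = df["Wert1"].str.replace(",", ".", regex=False).astype(float)
```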

In the Power Query Editor, one of the ways to help ourselves in such cases is shown below.

Another mistake can occur with dates. Here we can see that dates in March were unfortunately not recognized as dates.

Sometimes you can get such problems with foreign special characters (here the “ä”). Python works with Unicode, but what we need here is UTF-8. Fortunately, this can be cured quite easily in our Python code.

We can fix such problems by passing an additional parameter (encoding='utf8') to our read_json function.
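As a sketch (writing a small sample file first so the snippet is self-contained; the real script would simply point at the downloaded file):

```python
import json
import pandas as pd

# Write a tiny sample file containing an umlaut, so the snippet is self-contained.
sample = {"aaData": [["0,0000", {"display": "01. Mär 2018", "raw": 20180301}]]}
with open("sample_utf8.json", "w", encoding="utf8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Passing encoding='utf8' makes pandas read the umlauts correctly.
raw = pd.read_json("sample_utf8.json", encoding="utf8")
```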

And sometimes the data itself can cause trouble. Power BI is unable to work with our March here.

The cause of this problem has nothing to do with Python, but with the original data. While March is usually abbreviated as “Mrz” in Germany, and Excel, Power Query and other tools would recognize “Mrz” as March, iShares obviously thought “Mär” would also be a suitable abbreviation. 😝

At this point, too, I would not make the correction in Python, but simply insert a step in Power BI and replace “Mär” with “Mrz”.
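For completeness: if we did want to do the replacement in Python instead, a one-liner on the date column would suffice (a sketch; the column name Datum and the sample values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"Datum": ["01. Mär 2018", "02. Feb 2018"]})

# Replace the non-standard "Mär" abbreviation with "Mrz" before the date conversion.
df["Datum"] = df["Datum"].str.replace("Mär", "Mrz", regex=False)
```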

I love it when a plan comes together.
-Movie quote for connoisseurs

Queries from the web directly with Python

Of course it’s also possible to download the data directly from the network using Python. So far I had assumed that the JSON file had already been downloaded.

There are several ways to fetch data from the web with Python. You could use the urllib module(s). However, I decided to use the Requests library, which we have to install at this point.

We are replacing our previous code as follows.

And below is the version of the code to copy and paste …
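Reconstructed as a sketch: the JSON endpoint URL is an assumption here and has to be copied from the browser dev tools, so the network call is wrapped in a hypothetical helper and the transformation is kept separate so it can be followed on its own:

```python
import pandas as pd
import requests

def to_dataframe(data):
    """Split each inner list into columns; keep only the date's 'display' part."""
    df = pd.DataFrame([row[:4] + [row[4]["display"]] for row in data["aaData"]])
    df.columns = ["Wert1", "Wert2", "Wert3", "Wert4", "Datum"]  # hypothetical headers
    return df

def fetch_taxdata(url):
    """Download the JSON from the given URL and return it as a tidy table."""
    data = requests.get(url).json()
    return to_dataframe(data)

# df = fetch_taxdata("https://www.ishares.com/...")  # real endpoint from the dev tools
```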

That’s all!

The rest stays the same.

Load more funds and make it less hacky

So far we could only load one fund, and we had to get its URL beforehand in a rather special way (browser dev tools).

Of course we don’t want to do this for each fund and we don’t want to write a script for each fund.
The good thing is that the URL is always the same for the funds and we can read all further information directly from the browser URL bar.

https://www.ishares.com/de/privatanleger/de/produkte/251861/ishares-msci-europe-ucits-etf-acc-fund/

The first part of the URL (up to and including /produkte) is identical for all funds. Only the last part (here /251861/ishares-msci-europe-ucits-etf-acc-fund/) is different for each fund. We can simply copy this part for the funds that we need (Ctrl + C; Ctrl + V) and create a list of tuples from the fund name and the individual part of the URL.

Liste = [
    ("Fondsname 1", "/xxxxx/fonds-bezeichnung"),
    ("Fondsname 2", "/xxxxx/fonds-bezeichnung"),
    ...
]

In this way, we can dynamically generate the respective URL for each fund in the list, fetch the data and finally combine everything. I also added a column with the fund name.
With the following query we receive our tax data for three funds in this way.
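A sketch of such a query: the exact URL composition and the column names are assumptions (the real JSON endpoint again comes from the dev tools), and the network part is wrapped in a function so the transformation can be tested on its own:

```python
import pandas as pd
import requests

BASE_URL = "https://www.ishares.com/de/privatanleger/de/produkte"  # identical for every fund

funds = [
    ("Fondsname 1", "/xxxxx/fonds-bezeichnung"),
    ("Fondsname 2", "/xxxxx/fonds-bezeichnung"),
]

def to_dataframe(name, data):
    """Turn one fund's JSON payload into a table and prepend the fund name."""
    df = pd.DataFrame([row[:4] + [row[4]["display"]] for row in data["aaData"]])
    df.columns = ["Wert1", "Wert2", "Wert3", "Wert4", "Datum"]  # hypothetical headers
    df.insert(0, "Fonds", name)
    return df

def fetch_all(fund_list):
    """Build each fund's URL, download its JSON and combine all tables into one."""
    frames = [to_dataframe(name, requests.get(BASE_URL + part).json())
              for name, part in fund_list]
    return pd.concat(frames, ignore_index=True)

# df = fetch_all(funds)  # needs network access to ishares.com
```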

Perfect! Over 3,500 rows loaded from the web with a script just a few lines long.

Potential for improvement: Of course I don’t want to hide the fact that there is room for improvement in our script as well as in the design of the table. We could read the list of funds and URL parts from a separate file. This would enable us to add further funds in the future without having to adjust our script.

We should also think about error handling. What if the data cannot be accessed or an error occurs …

We could also consider normalizing our table of values …

But I think one thing has become clear: what we can use Python for as a query tool, and how.

--
