
Wednesday, May 2, 2018

Loading OECD Data With Python

One of the projects I have been juggling is building a package to import data from the OECD. The OECD helpfully provides an API that allows for custom queries into their large data sets (for free!): https://data.oecd.org/api/. Unless you really like working with raw XML or JSON (which I do not), you want to find a wrapper for the query and download protocols. Since I work in Python, the solution that worked best for me was the pandaSDMX library: https://pandasdmx.readthedocs.io/en/latest/

The package is highly object-oriented, which means that I probably should have read the documentation. However, I was able to grab the data I wanted with a bit of experimentation.

This was my initial stab at grabbing the LEI data and then dumping it to a tab-delimited text file (based on code from a question on Stack Exchange). (The 'df' variable is a pandas DataFrame.)

import pandasdmx

# Query the Composite Leading Indicators subset ('MEI_CLI') from 2018 onwards.
oecd = pandasdmx.Request('OECD')
data_response = oecd.data(resource_id='MEI_CLI', key='all?startTime=2018')
df = data_response.write(data_response.data.series, parse_time=False)
df.to_csv('c:\\Temp\\test_lei.txt', sep='\t')
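
If I remember the behaviour of that version of pandaSDMX correctly, the write() call returns a pandas DataFrame with one column per series, with the columns keyed by the dataset dimensions (country, subject code, and so on for MEI_CLI). The easiest way to see what came back is just to inspect it:

print(type(df))
print(df.columns[:5])
print(df.head())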


Working with the OECD data poses a couple of challenges.

The first issue is that the OECD does not provide some of the metadata navigation tools offered by the other public databases that pandaSDMX connects to. The data navigation calls shown in the pandaSDMX documentation will not work on the OECD data (which is another reason I largely skipped over the documentation).

The next challenge with the OECD data is finding out where the data you want live. The leading indicators are found within the "Main Economic Indicators" (MEI) database, but that database is big, and the communication protocol used (SDMX-ML) is bloated. In order to keep the data transmission reasonable, you need to query a smaller subset of the database, which has its own database identifier (such as 'MEI_CLI'). Of course, finding that database identifier takes a bit of work on the web site: you need to find the database in question, and then extract the database code.

For example, the Composite Leading Indicators page is: https://www.oecd-ilibrary.org/economics/data/main-economic-indicators/composite-leading-indicators_data-00042-en

If you then go to "Data", you get sent to http://stats.oecd.org/viewhtml.aspx?datasetcode=MEI_CLI&lang=en, where you can read off the 'MEI_CLI' code (or get the query for the data by choosing the appropriate API export option). This is somewhat painful to do if you want data scattered across a variety of data sets.
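
Once you have dug a code out of the web site, the request pattern from above can be reused by swapping in the new resource_id. A minimal sketch follows; the lookup table is purely illustrative and only contains the code discussed in this article, and any other dataset code would need to be hunted down the same way.

import pandasdmx

# Illustrative lookup of dataset codes found via the OECD web site.
DATASET_CODES = {
    'Composite Leading Indicators': 'MEI_CLI',
}

oecd = pandasdmx.Request('OECD')
data_response = oecd.data(resource_id=DATASET_CODES['Composite Leading Indicators'],
                          key='all?startTime=2018')
df = data_response.write(data_response.data.series, parse_time=False)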

I have not tested it yet, but the OECD API documentation shows an extremely useful option: you can query for data that have changed since a particular date. For financial data that are not revised, one can just query for data after the last data points you have locally archived. However, this is not sufficient for economic data, as the back history can be revised. According to the documentation, those revised data points will also be returned. This greatly reduces the burden on both sides of an automated update of data sets.
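
I have not tried this, but presumably the query just adds another parameter to the key string. In the sketch below, the 'updatedAfter' parameter name and the date format are assumptions based on the SDMX REST documentation; I have not verified that the OECD endpoint accepts the key in this form.

import pandasdmx

# Untested sketch: ask for observations added or revised since 1 April 2018.
oecd = pandasdmx.Request('OECD')
data_response = oecd.data(resource_id='MEI_CLI',
                          key='all?startTime=2018&updatedAfter=2018-04-01')
df = data_response.write(data_response.data.series, parse_time=False)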

The major challenge of such work is not the raw query, but rather marshaling the data so that they can be imported into a database. You need to map the series metadata to whatever retrieval mechanism is used in your database (typically some internal ticker system). The reason why you want to use a single package for interfacing with all the data sources is that it is easier to set up the mapping code, which will (hopefully!) work for all the supported data providers.
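
As an illustration of what that mapping step could look like: the ticker convention below is made up, and I am assuming that the DataFrame columns come back keyed by the dataset dimension values (as they did for my MEI_CLI query).

# Hypothetical marshaling step: turn each column of the pandaSDMX DataFrame
# into a (ticker, series) pair for insertion into a local database.
def make_ticker(column_key):
    # The 'OECD.MEI_CLI.<dimension values>' format is just an example convention.
    if not isinstance(column_key, tuple):
        column_key = (column_key,)
    return 'OECD.MEI_CLI.' + '.'.join(str(part) for part in column_key)

series_by_ticker = {make_ticker(col): df[col].dropna() for col in df.columns}

for ticker, series in series_by_ticker.items():
    # Replace print() with whatever the database insertion call is.
    print(ticker, len(series))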

One reason I put this up is that finding this package took some time. Before I found pandaSDMX, I looked at a few other options, and all of them had defects. Since I did not want to waste my time seeing whether there were fixes for the problems I ran into, it may be that my decision to reject them was unfair. As a result, I will not name names regarding the other options.

(c) Brian Romanchuk 2018

Comments:

  1. Hey Brian,

    just a short comment and a question. First, thanks for this post; you saved my day. Second, I am using pandaSDMX for CLI data of specific countries, but I am not sure how to get the last known data... well, I managed to get the data, I only have problems with extracting it into a new dataframe.

    Any suggestions?

    Best regards, David

    Replies
    1. Hello, I’m not sure I follow. If the issue is working with the pandas data objects, I’m not the person to ask. I still have not read the documentation; I just figured out how to extract the data I need, but I am mystified by their data structure format. I found that Stack Exchange had the recipes I needed for particular tasks, so I left it at that.

  2. Hi
    Do you know if there is any way to make it work with the standard pandas_datareader library?

    Replies
    1. I never tried. I think pandaSDMX is a pretty thin wrapper on top of pandas, so it’s just getting the query formatted for you. I got the queries to work, so I never dug deeper.

