
Wednesday, May 1, 2019

Adding DBNomics To A Python Research Platform

I bought a new laptop, and I am now migrating my platform to the new computer. Rather than copy as-is, my plan is to clean up the code base. As a result, I have created a new GitHub repository that will contain the package: https://github.com/brianr747/platform. My initial plan is to do the international charts for my recessions book using the new system, and I will probably get the bulk of the data from DB.nomics (I discussed DB.nomics earlier here). This article explains what this platform aims to do, and what it took to integrate the DB.nomics interface. (Since I spent the day programming, this article is brief.)

What is the Platform?

The platform is a fairly thin wrapper on top of existing libraries. The final package will be built around a MySQL database, which will allow for analysis tools from a number of languages (for me, Python and R) to interact with the same data. The environment is designed for high analyst productivity.

I am unsure whether there is a comparable open-source package. (Commercially, there are some options, but they tend to be expensive.) The amount of code I need should be relatively small; my objective is to rely as much as possible on existing packages, just with a friendly front end.

Since I will be using the package for my own analysis, I will put some development time into it. However, I will be doing the absolute minimum until the recession book is done. One possibility is that I will write a book on the development and maintenance of such systems (or alternatively, launch a Patreon for the project).

At a minimum, other people may be able to use my code as examples, or to replicate any of my work. It may be that others want to extend the package into a more serious option. Although that is possible, I think it would take another experienced developer to refactor the code to make it more stable across platforms. (I will only attempt to get it to work on Windows.)

DB.nomics Example

I have two examples for DB.nomics; one that uses the dbnomics module to load a series, and the other uses the first pass of the platform module. 
The dbnomics Python module is easy to use; you just need to find the series code on the DB.nomics website, and call dbnomics.fetch_series(). (You can also use more advanced queries to load tables, or selected subsets of tables.) Probably most of the time is spent searching the website for the series.

However, the actual data returned from fetch_series needs to be manipulated.
  • The data come as a pandas DataFrame, which can be thought of as a spreadsheet. It contains a lot of series metadata, which is nice, but not needed most of the time. I have not worked with pandas, but my assumption is that it will be easiest to work with the pandas Series object.
  • Queries that return more than one series will have them all inhabiting the same DataFrame. Although it might be possible to work with the DataFrame directly, in my experience it will be easier to work with distinct time series objects.
  • The fetch_series function returns the string 'NA' instead of the numpy NaN (not a number) for missing values, which needs to be cleansed from the data.
The usual way to handle this is to wrap the call to fetch_series into a front end function that does the clean up. 
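As a sketch, such a front end might look like the following. (The function names are mine, not part of any published API; the 'value' and 'period' column names are what fetch_series returned when I tried it, so check your version of the dbnomics module.)

```python
import numpy as np
import pandas as pd

def extract_series(df):
    """Reduce a DB.nomics DataFrame to a clean pandas Series.

    Sketch of the clean-up step: ignore the metadata columns, turn the
    'NA' strings into NaN, and index the values by period.
    """
    values = df['value'].replace('NA', np.nan).astype(float)
    values.index = df['period']
    return values

def fetch_clean(series_id):
    """Hypothetical front-end wrapper around dbnomics.fetch_series()."""
    import dbnomics  # third-party client: pip install dbnomics
    return extract_series(dbnomics.fetch_series(series_id))
```

The point of splitting out extract_series() is that the clean-up logic can be reused (and tested) without hitting the network.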

Although this is the path of least resistance, your code base gets cluttered with a variety of front end functions as you add data sources -- including your own proprietary data.

What my platform aims to do is unify all the front ends into a single interface. This allows for greater analyst efficiency.

Platform Example

[Code screenshot: an example myplatform session.]
The code above demonstrates the target workflow. The "myplatform" package is imported, and logging is then enabled with the start_log() function. A time series is loaded by its identifier, which is the DB.nomics identifier string with 'D@' prepended. To fetch a series from FRED (once I add FRED support), it would be the same fetch function, but with 'F@' in front of the FRED ticker.
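The prefix routing can be sketched as follows. (This is a self-contained, hypothetical illustration; the names fetch and PROVIDERS are mine, and the DB.nomics handler is a stub standing in for the real query.)

```python
def _fetch_dbnomics(series_id):
    # Stub standing in for the real DB.nomics query
    # (which would call dbnomics.fetch_series and clean the result).
    return f'<DB.nomics data for {series_id}>'

# 'D' routes to DB.nomics; 'F' would map to a FRED handler once added.
PROVIDERS = {'D': _fetch_dbnomics}

def fetch(ticker):
    """Route a ticker like 'D@Eurostat/...' to the matching provider."""
    provider, sep, series_id = ticker.partition('@')
    if not sep or provider not in PROVIDERS:
        raise ValueError(f'unknown provider prefix in {ticker!r}')
    return PROVIDERS[provider](series_id)
```

Adding a data source then amounts to registering one more handler, rather than sprinkling a new front-end function through the code base.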

As can be seen, there is a minimal amount of code; the idea is to bury the sophisticated logic behind the fetch call.

In the code, I fetch the same series 10 times in a loop. This is obviously pointless here, but in practice we often re-run the same program repeatedly, which means that the same series gets loaded multiple times.

In order not to bombard DB.nomics with repeated queries, we save the downloaded data. Eventually, the data will go into a MySQL database, but for now, each series is saved as an individual tab-delimited text file. The target behaviour works as follows.
  1. The user calls the fetch function. 
  2. If the series does not exist on the database, the platform goes to the provider (DB.nomics, in this case), and downloads and creates the series on the database.
  3. If the series already exists and is considered to be up-to-date, the series on the database is returned. If it is not up-to-date, the series will be first refreshed by querying the provider before returning to the user.
(At present, there are no checks for data freshness; if the series exists in the "database" (a directory of text files), the text file is always used.)
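A minimal sketch of that fetch logic, using a per-series text file as the "database" (hypothetical names; provider_fetch stands in for the DB.nomics query, and there is no freshness check, matching the current behaviour):

```python
import os

def fetch_series_cached(ticker, provider_fetch, data_dir='texts'):
    """Return series text, querying the provider only on a cache miss.

    provider_fetch is a callable that downloads the series (e.g. via
    the dbnomics module) and returns its text representation.
    """
    os.makedirs(data_dir, exist_ok=True)
    # Mangle the ticker into a safe file name (non-alphanumerics -> '_'),
    # which reproduces the file names seen in the log below.
    fname = ''.join(c if c.isalnum() else '_' for c in ticker) + '.txt'
    path = os.path.join(data_dir, fname)
    if os.path.exists(path):
        # Series already on the "database": no provider query (step 3).
        with open(path) as f:
            return f.read()
    # First request: query the provider and store the result (step 2).
    text = provider_fetch(ticker)
    with open(path, 'w') as f:
        f.write(text)
    return text
```

Moving to MySQL later would mean swapping the file read/write for database queries, without changing the calling code.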

We can see this behaviour in the log (found in a logs sub-directory).

Log File Contents (with some formatting fixes):

2019-05-01 19:41:36    INFO   Starting
2019-05-01 19:41:36    INFO   Fetching D@Eurostat/namq_10_gdp/Q.CP_MEUR.SCA.B1GQ.EL
2019-05-01 19:41:36    DEBUG  Starting new HTTPS connection (1): api.db.nomics.world:443
2019-05-01 19:41:37    DEBUG  https://api.db.nomics.world:443 
"GET /v22/series?observations=1&series_ids=Eurostat/namq_10_gdp/Q.CP_MEUR.SCA.B1GQ.EL&offset=0 HTTP/1.1" 200 None
2019-05-01 19:41:37    INFO   Writing to {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37    INFO   Load loop
2019-05-01 19:41:37    INFO   loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37    INFO   loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37    INFO   loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37    INFO   loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37    INFO   loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37    INFO   loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37    INFO   loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37    INFO   loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37    INFO   loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37    INFO   loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37    DEBUG  update_title_pos
2019-05-01 19:41:38    DEBUG  update_title_pos
2019-05-01 19:42:12    DEBUG  update_title_pos

We see only a single query to DB.nomics; the rest are queries to the text file.

(In case anyone wondered what the series was, it is quarterly Greek nominal GDP.)

Concluding Remarks

At this point, the platform code does not do a lot, which is not surprising since it is only a few hours of development. It will probably not get interesting until I add some more providers. Given the scope of the coverage of DB.nomics, it is unclear how pressing that need is.

(c) Brian Romanchuk 2019

