What is the Platform?
The platform is a fairly thin wrapper on top of existing libraries. The final package will be built around a MySQL database, which will allow for analysis tools from a number of languages (for me, Python and R) to interact with the same data. The environment is designed for high analyst productivity.
I am unsure whether there is a comparable open source package. (Commercially, there are some options, but they tend to be expensive.) The amount of code I need should be relatively small; my objective is to use as much as possible of existing packages, but just with a friendly front end.
Since I will be using the package for my own analysis, I will put some development time into it. However, I will be doing the absolute minimum until the recession book is done. One possibility is that I will write a book on the development and maintenance of such systems (or alternatively, launch a Patreon for the project).
At the minimum, other people may be able to use my code as examples, and also if they wish to replicate any of my work. It may be that others want to extend the package to make it a more serious option. Although that is possible, I think it would need another experienced developer to refactor the code to make it more stable across platforms. (I will only attempt to get it to work on Windows.)
DB.nomics Example
I have two examples for DB.nomics: one uses the dbnomics module to load a series directly, and the other uses the first pass of my platform module.
- The file that uses the dbnomics API directly: https://github.com/brianr747/platform/blob/master/examples/ex20190501_dbnmomics.py
- The file that uses my platform code: https://github.com/brianr747/platform/blob/master/examples/ex20190501_first_fetch.py
The dbnomics Python module is easy to use; you just need to find the series code on the DB.nomics website, and call dbnomics.fetch_series(). (You can also use more advanced queries to load tables, or selected subsets of tables.) Probably most of the time is spent searching the website for the series.
However, the actual data returned from fetch_series needs to be manipulated.
- The data come as a pandas DataFrame, which can be thought of as a spreadsheet. It contains a lot of series metadata, which is nice, but not needed most of the time. I have not worked with pandas, but my assumption is that it will be easiest to work with the pandas Series object.
- Queries that return more than one series will have them all inhabiting the same DataFrame. Although it might be possible to work with the DataFrame, in my experience, it will be easier to work with distinct time series objects.
- The fetch_series function returns the string 'NA' instead of the numpy NaN (not a number) for missing observations, which needs to be cleaned out of the data.
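To illustrate the multi-series point, here is a sketch of splitting one DataFrame into distinct pandas Series objects. The frame is mocked up; the 'series_code', 'period', and 'value' column names are assumptions based on dbnomics output, not code from the platform.

```python
import pandas as pd

# Mocked-up stand-in for a dbnomics fetch result holding two series;
# real results carry many more metadata columns.
df = pd.DataFrame({
    'series_code': ['A', 'A', 'B', 'B'],
    'period': ['2019-Q1', '2019-Q2', '2019-Q1', '2019-Q2'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# One pandas Series per series code, indexed by period.
split = {code: pd.Series(group['value'].values,
                         index=group['period'].values, name=code)
         for code, group in df.groupby('series_code')}
```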
The usual way to handle this is to wrap the call to fetch_series into a front end function that does the clean up.
Although this is the path of least resistance, your code base gets cluttered with a variety of front end functions as you add data sources -- including your own proprietary data.
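As a sketch of such a front-end function (the DataFrame is mocked up; the 'period'/'value' column names and the 'NA' convention are assumptions taken from the discussion above):

```python
import numpy as np
import pandas as pd

def clean_fetch_result(df, name):
    # Replace the 'NA' strings with NaN and coerce values to floats,
    # then drop the metadata by keeping only a period-indexed Series.
    values = pd.to_numeric(df['value'].replace('NA', np.nan))
    return pd.Series(values.values, index=df['period'].values, name=name)

# Mocked-up fetch result with one missing observation.
mock = pd.DataFrame({'period': ['2019-Q1', '2019-Q2'],
                     'value': [45000.0, 'NA']})
ser = clean_fetch_result(mock, 'example_series')
```

In a real program, the wrapper would call dbnomics.fetch_series() itself and then apply this clean-up before returning.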
What my platform aims to do is unify all the front ends into a single interface. This allows for greater analyst efficiency.
Platform Example
The second example file demonstrates the target workflow. The "myplatform" package is imported, and then logging is enabled with the start_log() function. A time series is loaded by its identifier, which is the DB.nomics identifier string with a 'D@' prepended. To fetch a series from FRED (when I add FRED support), it would be the same fetch function, but with an 'F@' in front of the FRED ticker.
As can be seen, there is a minimal amount of code, and the idea is to bury some sophisticated logic behind the fetch call.
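A minimal sketch of how such a prefix-based dispatch could look (the function name and the provider table are illustrative assumptions; 'F' for FRED is the planned extension, not something implemented):

```python
# Map single-letter provider codes to providers; 'F' is the planned
# FRED extension, not yet implemented.
PROVIDERS = {'D': 'DB.nomics', 'F': 'FRED'}

def parse_ticker(full_ticker):
    """Split a platform ticker like 'D@<dbnomics id>' into (provider, code)."""
    provider_code, series_code = full_ticker.split('@', 1)
    if provider_code not in PROVIDERS:
        raise KeyError('Unknown provider code: ' + provider_code)
    return PROVIDERS[provider_code], series_code
```

A single fetch() front end can then route each request to the appropriate provider module based on the parsed prefix.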
In the code, I fetch the same series ten times in a loop. This is obviously pointless here, but in practice we will often re-run the same program repeatedly, which means that the same series gets loaded multiple times.
In order not to bombard DB.nomics with repeated queries, we save the downloaded data. Eventually, it will go into a MySQL database, but for now, the series are saved as individual tab-delimited text files. The target behaviour will work as follows.
- The user calls the fetch function.
- If the series does not exist on the database, the platform goes to the provider (DB.nomics, in this case), downloads the data, and creates the series on the database.
- If the series already exists and is considered to be up-to-date, the series on the database is returned. If it is not up-to-date, the series will first be refreshed by querying the provider before being returned to the user.
(At present, there are no checks for data freshness; if the series exists in the "database" (a directory of text files), the text file is always used.)
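The steps above can be sketched as follows, minus the freshness check, matching the current behaviour. The helper names and the file format are illustrative assumptions, not the platform's actual code; the underscore naming mirrors the file names visible in the log.

```python
import os

def fetch(ticker, provider_query, cache_dir):
    # Cache-or-fetch, with no freshness check yet: if the series file
    # exists, it is always used; otherwise query the provider and save.
    # 'provider_query' is a hypothetical callable wrapping e.g. dbnomics.
    safe = ''.join(c if c.isalnum() else '_' for c in ticker)
    fname = os.path.join(cache_dir, safe + '.txt')
    if os.path.exists(fname):
        with open(fname) as f:
            return [tuple(line.rstrip('\n').split('\t')) for line in f]
    series = provider_query(ticker)
    with open(fname, 'w') as f:
        for period, value in series:
            f.write('{0}\t{1}\n'.format(period, value))
    return series
```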
We can see this behaviour in the log (found in a logs sub-directory).
Log File Contents (with some formatting fixes):
2019-05-01 19:41:36 INFO Starting
2019-05-01 19:41:36 INFO Fetching D@Eurostat/namq_10_gdp/Q.CP_MEUR.SCA.B1GQ.EL
2019-05-01 19:41:36 DEBUG Starting new HTTPS connection (1): api.db.nomics.world:443
2019-05-01 19:41:37 DEBUG https://api.db.nomics.world:443
"GET /v22/series?observations=1&series_ids=Eurostat/namq_10_gdp/Q.CP_MEUR.SCA.B1GQ.EL&offset=0 HTTP/1.1" 200 None
2019-05-01 19:41:37 INFO Writing to {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37 INFO Load loop
2019-05-01 19:41:37 INFO loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37 INFO loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37 INFO loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37 INFO loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37 INFO loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37 INFO loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37 INFO loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37 INFO loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37 INFO loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37 INFO loading from {directory}\D_Eurostat_namq_10_gdp_Q_CP_MEUR_SCA_B1GQ_EL.txt
2019-05-01 19:41:37 DEBUG update_title_pos
2019-05-01 19:41:38 DEBUG update_title_pos
2019-05-01 19:42:12 DEBUG update_title_pos
We see only a single query to DB.nomics; the rest of the loads come from the text file. (In case anyone wondered what the series was, it is quarterly Greek nominal GDP.)
Concluding Remarks
At this point, the platform code does not do a lot, which is not surprising since it represents only a few hours of development. It will probably not get interesting until I add some more providers. Given the scope of DB.nomics' coverage, it is unclear how pressing that need is.

(c) Brian Romanchuk 2019