R in Python

The good and the bad in Python

Like most other finance researchers who are not professional programmers, my use cases for Python are limited to the following three tasks (with the most commonly used Python packages for each task in parentheses):

  1. Clean/manipulate data (numpy/pandas)
  2. Do statistical analysis (statsmodels/scipy)
  3. Make graphs (matplotlib/seaborn)

Thanks to pandas and matplotlib, tasks (1) and (3) are a breeze. Both packages are well documented and versatile. The real pain is task (2). The go-to package in Python for regressions is statsmodels, which, to be honest, is not great. While it is sufficient for simple OLS, it is quite disappointing for slightly more advanced models. For example, I recently ran into strange convergence issues when trying to estimate a structural VAR model.

Use R in Python: the rpy2 package

You might say, well, let’s simply outsource task (2) to another language or package that specializes in statistics, such as Stata or R. But coding in two different languages is cumbersome: differences in data formats require tedious input/output operations between the two. Is it possible to stay in Python and, at the same time, benefit from the better-implemented statistical functions in, say, R?

Yes! rpy2 comes to the rescue. It is a Python package that allows you to call R functions right in your Python script. Even better, it allows R functions to accept Python objects such as pandas DataFrames and numpy arrays on the fly, so you don’t need to convert manually between Python objects and R objects. It all sounds too good to be true!

To give you a taste, below are the essential steps I went through to use the vars package in R to estimate a structural VAR model:

  • First, we need to install the required R packages, which can be done directly through rpy2 in Python. Take the vars package (and its dependencies) as an example below:[1]
    import rpy2.robjects.packages as rpackages
    from rpy2.robjects.vectors import StrVector
    # R's utils package provides install.packages (exposed here as install_packages)
    utils = rpackages.importr('utils')
    packnames = ('nlme', 'lattice', 'zoo', 'MASS', 'strucchange', 'urca', 'lmtest', 'vars', 'sandwich')
    # Keep only the packages that are not installed yet
    names_to_install = [x for x in packnames if not rpackages.isinstalled(x)]
    if len(names_to_install) > 0:
        utils.install_packages(StrVector(names_to_install))
  • Then we need to activate the automatic conversion of pandas DataFrame objects to R objects with the following code:
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()  # switch on automatic pandas-to-R conversion
  • Now we are basically all set! Just as with Python packages, we can use the usual dot notation to call functions from the R package we have imported. Note that data is a pandas DataFrame and Amat and Bmat are numpy arrays, but we don’t need to do anything to them. The results are returned as rpy2’s ListVector, which we can easily convert to a pandas DataFrame or other Python objects ourselves.
    from rpy2.robjects.packages import importr
    vars = importr('vars')
    res_var = vars.VAR(data, p=5, type='const')
    res_svar = vars.SVAR(x=res_var, estmethod='direct', Amat=Amat, Bmat=Bmat)
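
The calls above take data, Amat and Bmat as given. For readers who want something to experiment with, here is a minimal sketch of what such inputs might look like. Everything in it is an assumption made for illustration: the simulated series, the column names y1 to y3, and the recursive identification scheme are placeholders, not the inputs used in this post. It also assumes that NaN entries in the numpy arrays reach R as missing values, which vars::SVAR reads as the elements to be estimated.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three simulated series standing in for real macro data.
rng = np.random.default_rng(0)
n_obs, k = 200, 3
data = pd.DataFrame(
    rng.standard_normal((n_obs, k)),
    columns=["y1", "y2", "y3"],  # placeholder variable names
)

# Assumed recursive (lower-triangular) identification for an AB-model:
# Amat has ones on the diagonal, zeros above it, and free elements below it.
Amat = np.eye(k)
Amat[np.tril_indices(k, k=-1)] = np.nan  # NaN marks an element to estimate

# Bmat is diagonal, with every diagonal element left free.
Bmat = np.zeros((k, k))
Bmat[np.diag_indices(k)] = np.nan

print(Amat)
print(Bmat)
```

With objects like these in place, and automatic conversion switched on, the vars.VAR and vars.SVAR calls should be able to run as written; your own restrictions on Amat and Bmat will of course depend on the model you have in mind.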

As you can see, it is pretty much a hassle-free process. Only getting the setup right takes a bit of time (but I have saved you from that!). Can we finally start to seriously consider ditching statsmodels in Python?

  1. Just one caveat, though: we need to run the Python code in a terminal, not in the IDE console.