dbbinsreg: summarize big data quickly
Author: James Brand
Date: 2026-01-06
Grant McDermott and I just
made some new additions to the R package dbreg,
which I’ve written about once already here.
Things are getting pretty exciting in that package – once you have a
(reasonably) robust regression tool, there is a lot you can layer on top
of that. Our latest additions now try to mimic a lot of the
functionality of the popular binsreg package. For
those who aren’t familiar, the motivating idea of binsreg
is simply that we often want to visualize the high-level relationship
between fields of the data by binning the data and creating a scatter
plots and/or smooth lines showing various conditional expectations.
Without diving into the details, it turns out that if you want to make
that plot while conditioning on other covariates, including fixed
effects, you have to do so carefully.
binsreg is the cutting edge implementation of this type
of analysis, based on this
paper. The idea is to bin the data and regress one variable on a
flexible piecewise polynomial in the other, along with whatever controls
you want to condition on. Their package gives users options to control
the flexibility of that polynomial and how continuous/smooth it is
between bins. This produces plots like this, which shows conditional
means and CIs for the relationship between trip distance and fare amount
in the NYC taxi dataset (a large dataset with many millions of taxi
rides, which we use in our tests and examples). The CIs are
imperceptibly small for short trips, but on the right end you can see
the uncertainty grow as there are far fewer long trips.

Two exciting things about our implementation, quickly.
Accuracy
First, we can match the binsreg package pretty well. The
graph below shows a lot of the simplest specifications in binsreg on
some toy data; the fit and line numbers in the
graph titles are the differences between binsreg and
dbbinsreg estimates, and the numbers c(p,s)
are what control the degree (p) and smoothness
(s) of the approximating polynomials. So, for example, the
top right graph (c(3,2)) is a cubic polynomial with
continuous first and second derivatives at the bin edges. 
Note: So far, I think the only gaps between our packages’ results here come from minor differences between algorithms for calculating quantiles (we’re a little more limited due to writing the logic in SQL).
Speed
Second, our speed is pretty good too! On small datasets with 4,000
rows, we are faster than binsreg here (though don’t take this too
seriously, this test is far from extensive). We have a few different
algorithms under the hood, which causes some but not all of that
variation.
On slightly
(100x) larger data, things still look good, and times are only about
double the times on the small dataset
Admittedly, our times are a
lot more variable (across speciications) than binsreg, but we stay in
the same ballpark most of the time and are sometimes faster. Finally, on
the NYC taxi dataset (a large dataset used in our tests and docs,
subsetted to January here), we can see that further scaling continues to
look totally reasonable. About 15 seconds to handle 15M rows!

That all said, this package is very much in its early days and there are likely bugs still hiding around corners we haven’t tested yet! We welcome any feedback/issues on the GitHub repo.