Vote for quantile plots! New planks in an old campaign
Nicholas J. Cox
Department of Geography

Quantile plots
Quantile plots show ordered values (raw data, estimates, residuals, whatever) against rank or cumulative probability or a one-to-one function of the same.
Tied values are assigned distinct ranks or probabilities.

Example with auto dataset
[Figure: quantiles of Mileage (mpg) against fraction of the data]

quantile default
In this default from the official command quantile, ordered values are plotted on the y axis and the fraction of the data (cumulative probability) on the x axis.
Quantiles (order statistics) are plotted against plotting position (i − 0.5)/n for rank i and sample size n.
Syntax was
    sysuse auto, clear
    quantile mpg, aspect(1)

Quantile plots have a long history
Adolphe Quetelet 1796–1874
Sir Francis Galton 1822–1911
G. Udny Yule 1871–1951
Sir Ronald Fisher 1890–1962
all used quantile plots avant la lettre.
In geomorphology, hypsometric curves for showing altitude

Quantile plots named as such
Martin B. Wilk 1922–2013
Ramanathan Gnanadesikan 1932–2015
Wilk, M. B. and Gnanadesikan, R. 1968. Probability plotting methods for the analysis of data. Biometrika 55: 1–17.

A relatively long history in Stata
Stata/Graphics User's Guide (August 1985) included do-files quantile.do and qqplot.do.
Graph.Kit (February 1986) included commands quantile, qqplot and qnorm.
Thanks to Pat Branton of StataCorp for this history.

Related plots use the same information
Cumulative distribution plots show cumulative probability on the y axis.
Survival function plots show the complementary probability.
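The plotting position (i − 0.5)/n and the link to cumulative and survival plots can be sketched by hand with official commands only. This is a minimal illustration, not the code inside quantile; the variable names pp and surv and the graph names are made up for the example, and it relies on mpg having no missing values in the auto data.

```stata
* Sketch only: rebuild the default quantile plot by hand from the auto data.
sysuse auto, clear
sort mpg
local a = 0.5                                    // a = 0.5 reduces to (i - 0.5)/n
generate double pp = (_n - `a') / (_N - 2*`a' + 1)
label variable pp "Fraction of the data"

* Quantile plot: ordered values (y) against plotting position (x)
scatter mpg pp, aspect(1) name(qhand, replace)

* Same information with the axes exchanged: a cumulative distribution plot
scatter pp mpg, aspect(1) ytitle("Cumulative probability") name(chand, replace)

* Complementary probability: a survival function plot
generate double surv = 1 - pp
scatter surv mpg, aspect(1) ytitle("Survival probability") name(shand, replace)
```

Changing the local a gives other members of the same family of plotting positions.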

Clearly, axes can be exchanged or reflected. distplot (Stata Journal) supports both.
Many people will already know about sts graph.

So, why any fuss?
The presentation is built on a long-considered view that quantile plots are the best single plot for univariate distributions.
No other kind of plot shows so many features so well across a range of sample sizes with so few arbitrary decisions.
Example: Histograms require binning choices.
Example: Density plots require kernel choices.
Example: Box plots often leave out too much.

What's in a name? QQ-plots
Talk of quantile-quantile (Q-Q or QQ-) plots is also common.
As discussed here, all quantile plots are also QQ-plots.
The default quantile plot is just a plot of values against the quantiles of a standard uniform or rectangular distribution.

User-written commands

NJC commands
The main commands I have introduced in this territory are
quantil2 (Stata Technical Bulletin)
qplot (Stata Journal)
stripplot (SSC)
Others will be mentioned later.

quantil2
This command published in Stata Technical Bulletin 51: 16–18 (1999) generalized quantile:
One or more variables may be plotted.
Sort order may be reversed.
by() option is supported.
Plotting position is generalized to (i − a)/(n − 2a + 1): compare a = 0.5 or (i − 0.5)/n wired into quantile.

A now bizarre detail
The truncated name quantil2 was enforced by the 8.3 filename.ext restriction of MS-DOS so that no Stata command defined by an .ado could have a name longer than 8 characters.

qplot
The command quantil2 was renamed qplot and further revised in Stata Journal 5: 442–460 and 471 (2005), with later updates:
over() option is also supported.
Ranks may be plotted as well as plotting positions.
The x axis scale may be transformed on the fly.
recast() to other twoway types is supported.

stripplot
The command stripplot on SSC started under Stata 6 as onewayplot in 1999 as an alternative to graph, oneway and has morphed into (roughly) a superset of the official command dotplot.
It is mentioned here because of its general support for quantile plots as one style and its specific support for quantile-box plots, on which more shortly.

Standard uses

Comparing two groups is basic
[Figure, two panels. Superimposed: quantiles of Mileage (mpg) against fraction of the data, Domestic and Foreign. Juxtaposed: Mileage (mpg) by Car type, Domestic and Foreign.]

Syntax was
    stripplot mpg, over(foreign) cumulative centre vertical aspect(1)
    qplot mpg, over(foreign) aspect(1)


Quantiles and transformations commute
In essence, transformed quantiles and quantiles of transformed data are one and the same, with easy exceptions such as reciprocals reversing order.
So, quantile plots mesh easily with transformations, such as thinking on logarithmic scale. For the latter, we just add simple syntax such as ysc(log).
Note that this is not true of (e.g.) histograms, box plots or density plots, which need re-drawing.

The shift is multiplicative, not additive?
[Figure, two panels: quantiles of Mileage (mpg) against fraction of the data, and Mileage (mpg) by Car type, Domestic and Foreign]
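
The statement that quantiles and monotone transformations commute can be checked on the auto data with official commands. This is a sketch under simple assumptions: the variable names lnmpg, gpm and r_* are invented for the example, and midranks from egen stand in for the ordering that a quantile plot uses.

```stata
* Sketch only: ranks survive an increasing transformation and reverse under reciprocals.
sysuse auto, clear
generate double lnmpg = ln(mpg)        // increasing transformation
generate double gpm   = 1 / mpg        // reciprocal: decreasing for positive values

egen double r_mpg   = rank(mpg)        // midranks, so ties are treated symmetrically
egen double r_lnmpg = rank(lnmpg)
egen double r_gpm   = rank(gpm)

assert r_lnmpg == r_mpg                // logarithms leave the order untouched
assert r_gpm == _N + 1 - r_mpg         // reciprocals reverse it

* So a log-scale quantile plot needs only an axis option, not a new variable
quantile mpg, yscale(log) ylabel(12 15 20 25 30 41) aspect(1)
```

Because the order is unchanged, only the axis has to be redrawn, which is the point made above about ysc(log).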


multqplot (Stata Journal)
multqplot is a convenience command to plot several quantile plots at once.
It has uses in data screening and reporting.
It might prove more illuminating than the tables of descriptive statistics ritual in various professions.
We use here the Chapman data from Dixon, W. J. and Massey, F. J. 1983. Introduction to Statistical Analysis. 4th ed. New York: McGraw-Hill.

[Figure: multqplot of the Chapman data, one quantile plot per variable against fraction of the data (0, .25, .5, .75, 1): age (years), systolic blood pressure (mm Hg), diastolic blood pressure (mm Hg), cholesterol (mg/dl), height (in), weight (lb)]

multqplot details
By default the minimum, lower quartile, median, upper quartile and maximum are labelled on the y axis so we are half-way to showing a box plot too.
By default also variable labels (or names) appear at the top.
More at Stata Journal 12: 549–561 (2012) and 13: 640–666 (2013).
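
The five-number labelling can be imitated for a single variable with official commands. This is only a sketch of the idea, assuming the quartiles reported by summarize are acceptable; multqplot itself handles several variables at once and its internal calculations may differ. The local macro name five is made up for the example.

```stata
* Sketch only: label minimum, quartiles and maximum on the y axis of a quantile plot.
sysuse auto, clear
summarize mpg, detail
local five `r(min)' `r(p25)' `r(p50)' `r(p75)' `r(max)'
display "`five'"

quantile mpg, ylabel(`five', angle(horizontal)) xlabel(0(0.25)1) aspect(1)
```

The xlabel(0(0.25)1) option flags the matching cumulative probabilities on the x axis.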

Quantile smoothing

Raw or smoothed?
Quantile plots show the data as they come: we get to see outliers, grouping, gaps and other quirks of the data, as well as location, scale and general shape.
But sometimes the details are just noise or fine structure we do not care about. Once you register that values of mpg in the auto data are all reported as integers, you want to set that aside.
You can smooth quantiles, notably using the Harrell and Davis method, which turns out to be bootstrapping in disguise. hdquantile (SSC) offers the calculation.

The reference
Harrell, F. E. and Davis, C. E. 1982. A new distribution-free quantile estimator. Biometrika 69: 635–640.

Some results
[Figure: H-D quantiles of mpg against fraction of the data, Domestic and Foreign]

More could be said
Some questions on the mpg example:
Is the shift closer to multiplicative than additive?
Would reciprocals be better, e.g. gallons per mile?
Either way, quantile plots offer tools to precede or advise modelling of the data.

Distribution fitting

Fitting or testing named distributions
Using quantile plots to compare data with named distributions is common.
The leading example is using the normal (Gaussian) as reference distribution.
Indeed, many statistical people first meet quantile plots as such normal probability plots.

Normal QQ-plots are a reasonable default
Yudi Pawitan in his 2001 book In All Likelihood (Oxford University Press) advocates normal QQ-plots as making sense even when comparison with normal distributions is not the goal.


qnorm available but limited
qnorm is already available as an official command but it is limited to the plotting of just one set of values.

Named distributions with qplot
qplot has a general trscale() option to transform the x axis scale that otherwise would show plotting positions or ranks.
For normal distributions, the syntax is just to add trscale(invnormal(@)) to scale plotting positions.
@ is a placeholder for what would otherwise be plotted.
invnormal() is Stata's name for the normal quantile function (as an inverse cumulative distribution function).

[Figure: quantiles of Mileage (mpg) against invnormal(P), Domestic and Foreign]
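
What trscale(invnormal(@)) does to the x axis can be sketched by hand with official commands. The example assumes plotting positions (i − 0.5)/n computed separately within each group; the variable names P and z are invented here, and this is not a reproduction of qplot's own code.

```stata
* Sketch only: normal scores computed by hand, separately for each group.
sysuse auto, clear
bysort foreign (mpg): generate double P = (_n - 0.5) / _N
generate double z = invnormal(P)
label variable z "invnormal(P)"

twoway (connected mpg z if foreign == 0)              ///
       (connected mpg z if foreign == 1),             ///
       legend(order(1 "Domestic" 2 "Foreign"))        ///
       ytitle("quantiles of Mileage (mpg)") aspect(1)
```

Replacing invnormal() with another quantile function changes the reference distribution without touching the rest of the code.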


A standard plot in support of t tests?
This plot is suggested as a standard for two-group comparisons:
We see all the data, including outliers or other problems.
Use of a normal probability scale shows how far that assumption (read: ideal condition) is satisfied.
The vertical position of each group tells us about location, specifically means.
The slope or tilt of each group tells us about scale, specifically standard deviations.
It is helpful even if we eventually use Wilcoxon–Mann–Whitney or something else.

What if you had paired values?
Plot the differences, naturally.
Nothing stops you plotting the original values too, but at some point the graphics should respect the pairing.
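
For paired values, a minimal sketch is to form the differences and show them on a normal probability scale. The bpwide dataset shipped with Stata is used here purely as a stand-in for paired data; it is not part of the talk, and the variable name diff is made up.

```stata
* Sketch only: bpwide stands in for any dataset with paired measurements.
sysuse bpwide, clear
generate double diff = bp_after - bp_before
label variable diff "Blood pressure: after - before"

* One set of values, so the official qnorm is enough here
qnorm diff, aspect(1) yline(0)
```

The reference line at zero makes it easy to see how many differences fall on each side.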


Different axis labelling?
The last plot used a scale of standard normal deviates or z scores. Some might prefer different labelling, e.g. % points.
mylabels (SSC) is a helper command, which puts the mapping in a local macro for your main command:
    mylabels 1 2 5 10(20)90 95 98 99, myscale(invnormal(@/100)) local(plabels)

[Figure: quantiles of Mileage (mpg) against exceedance probability (%), x axis labelled 1 2 5 10 30 50 70 90 95 98 99, Domestic and Foreign]

Syntax for that example
    sysuse auto, clear
    mylabels 1 2 5 10(20)90 95 98 99, myscale(invnormal(@/100)) local(plabels)
    qplot mpg, over(foreign) trscale(invnormal(@)) aspect(1) xla(`plabels') xtitle(exceedance probability (%)) xsc(titlegap(*5)) legend(pos(11) ring(0) order(2 1) col(1))
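
The mapping that ends up in the local macro can be sketched with a loop in official Stata. This is an illustration of the idea only, not the mylabels source code; the loop variable p and the macro where are invented names.

```stata
* Sketch only: build by hand the kind of label list that goes into xla().
local plabels
foreach p of numlist 1 2 5 10(20)90 95 98 99 {
    local where = invnormal(`p' / 100)      // position on the z scale
    local plabels `plabels' `where' "`p'"   // position followed by its label text
}
display `"`plabels'"'
```

Each pair is a position on the transformed scale followed by the text to show there, which is the form that axis label options expect.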


Other named distributions?
There are many, many named distributions for which customised QQ-plot commands could be written.
I am guilty of programs for beta, Dagum, Dirichlet, exponential, gamma, generalized beta (second kind), Gumbel, inverse gamma, inverse Gaussian, lognormal, Singh-Maddala and Weibull distributions.
But a better approach when feasible is to allow a distribution to be specified on the fly.

Harold Jeffreys suggested that error distributions are more like t distributions with 7 df than like Gaussians.
1939/1948/1961. Theory of probability. Oxford University Press. Ch.5.7
1938. The law of error and the combination of observations. Philosophical Transactions of the Royal Society, Series A 237: 231–271
Sir Harold Jeffreys 1891–1989
County Durham man
established that the Earth's core is liquid
pioneer Bayesian

[Figure: quantiles for t with 7 df (kurtosis 5) and normal (kurtosis 3) against normal quantiles, plotted for probability in [0.001, 0.999]]

How to explore?
Simulate with rt(7) and samples of desired size.
trscale(invt(7, @)) sets up x axis scale on the fly.

[Figure: nine panels (1 to 9), quantiles of t7 against normal deviates]
[Figure: nine panels (1 to 9), quantiles of t7 against t with 7 df]
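
One simulated sample of this kind can be sketched with official commands. The seed and the sample size of 100 are arbitrary assumptions, and the variable names are invented; the two reference scales are computed by hand rather than through trscale().

```stata
* Sketch only: one sample from t with 7 df, plotted against two reference scales.
clear
set seed 2016
set obs 100
generate double y = rt(7)                  // simulated sample from t with 7 df
sort y
generate double P = (_n - 0.5) / _N        // plotting positions
generate double znorm = invnormal(P)       // normal reference scale
generate double zt7 = invt(7, P)           // t reference scale, as in trscale(invt(7, @))

scatter y znorm, xtitle("normal deviates") ytitle("quantiles of t7") name(vs_normal, replace)
scatter y zt7, xtitle("t with 7 df") ytitle("quantiles of t7") name(vs_t7, replace)
```

Repeating this with different seeds gives a feel for how much such plots vary from sample to sample.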


Box plot hybrids

Adding a box plot flavour
Earlier we saw how extremes and quartiles could be made explicit on the y axis of a quantile plot. They are the minimal ingredients for a box plot.
Clearly we can also flag cumulative probabilities 0(0.25)1 on the corresponding x axis scale.
[Figure: multqplot of the Chapman data: age (years), systolic blood pressure (mm Hg), diastolic blood pressure (mm Hg), cholesterol (mg/dl), height (in), weight (lb)]

Tracing the box
In multqplot by default the box is shown as part of a double set of grid lines.
This helps underline that half of the points on a box plot are inside the box and half outside, a basic fact often missed in interpreting these plots, even by experienced researchers.
[Figure: quantile plot of age (years), y axis labelled 23, 33, 42, 52, 70, x axis labelled 0, .25, .5, .75, 1]


Quantile-box plots
Emanuel Parzen introduced quantile-box plots in 1979. Nonparametric statistical data modeling. Journal of the American Statistical Association 74: 105–131.
His original examples were not especially impressive, perhaps one reason they have not been more widely emulated.
Emanuel Parzen 1929–2016

Boston housing data
Here for quantile-box plots we use data from Harrison, D. and Rubinfeld, D. L. 1978. Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management 5: 81–102.
https://archive.ics.uci.edu/ml/datasets/Housing
Number of Figures in original paper: 1
Number of Figures showing raw data: 0

Broad contrast and fine structure
[Figure: quantile-box plots of Median value of owner-occupied homes in $1000, by whether tract bounds Charles River (0, 1)]
    stripplot MEDV, over(CHAS) vertical cumulative centre box cumprob aspect(1)


Some quirks in that dataset
[Figure: quantile plots against fraction of the data for % residential land zoned lots >25000 sq. ft, % non-retail business acres per town, accessibility to radial highways, and full-value property-tax rate per $10,000]

Bits and pieces

Ordinal (graded) data
Ordinal (graded) data can be shown with quantile plots too.
Such data might alternatively be plotted against the midpoints of the corresponding probability intervals.
The updated qplot code for this midpoint option will be available with Stata Journal 16(3) (2016).
Statistical discussion was given in Stata Journal 4: 190–215 (2004), Section 5.

[Figure, two panels: quantiles of Repair Record 1978 against fraction of the data, and against logit(P), Foreign and Domestic]
    qplot rep78, aspect(1) over(foreign) midpoint recast(connect) trscale(logit(@)) xsc(titlegap(*5)) legend(pos(11) ring(0) col(1) order(2 1))
As mentioned, this is not yet supported publicly, as of July 2016.

Differences of quantiles
Plotting differences of quantiles versus their mean or versus plotting position is often a good idea.
cquantile (SSC) is a helper program.
Much more was said on this at Stata Journal 7: 275–279 (2007).
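
A rough version of a difference-versus-mean plot can be put together with official commands. In this sketch the official pctile command stands in for cquantile, which may define corresponding quantiles differently; the use of deciles and the variable names q_dom, q_for, qdiff and qmean are assumptions made for the example.

```stata
* Sketch only: deciles from pctile stand in for the quantiles cquantile would supply.
sysuse auto, clear
pctile double q_dom = mpg if foreign == 0, nquantiles(10)
pctile double q_for = mpg if foreign == 1, nquantiles(10)

generate double qdiff = q_for - q_dom        // difference of corresponding quantiles
generate double qmean = (q_for + q_dom) / 2  // their mean

scatter qdiff qmean, yline(0) ytitle("Foreign - Domestic (mpg)") xtitle("mean of quantiles (mpg)")
```

A flat pattern would suggest an additive shift between groups, while a trend with the mean points towards something closer to a multiplicative shift.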


Words from the wise
Graphs force us to note the unexpected; nothing could be more important.
John Wilder Tukey 1915–2000
Using the data to guide the data analysis is almost as dangerous as not doing so.
Frank E. Harrell Jr

Questions?

All graphs use Stata scheme s1color, which I strongly recommend as a lazy but good default.
This font is Georgia.
This font is Lucida Console.