Oddbean

> you need to normalize data to its underlying scaling pattern, or you can't see the patterns in the visualisation

Correct. In a data visualization, there is a projection from the domain of the data (time, dollars, etc.) onto the range of values supported by the visual medium (pixels, centimeters, etc.). The choice of function to apply (log/linear) and of its parameters (base for log, offset, scale) is arbitrary and made for aesthetic reasons.

> patterns appear in a cluster of data when you normalize it

Agreed that the pattern you see depends on the function applied and the parameters you choose. It is my claim that the choice of base and the choice of linear scaling parameter are functionally equivalent. The remainder of this post explains how.

Focusing on the Y axis, and assuming you have chosen log scale, here are the parameters you can choose:

- B - base for log function
- m - scale factor for linear projection
- c - offset for linear projection

The linear projection here is from log price to screen pixels. So the total function from price p to Y coordinate is:

f(p) = m * logB(p) + c

Let’s consider the algebraic impact of choosing a different base, B’, for a different price projection function, f’:

f’(p) = m * logB’(p) + c

What is the relationship between f and f’, visually? Let’s find out.
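To make the two projections concrete before the algebra, here is a minimal Python sketch of f as defined above. The function name `project`, the sample prices, and the values of m and c are placeholders picked only for illustration.

```python
import math

def project(price, base, m, c):
    """Map a price to a Y coordinate: f(p) = m * log_base(p) + c."""
    return m * math.log(price, base) + c

# Illustrative prices, one per decade.
prices = [1, 10, 100, 1000]

# Same m and c, two different bases (arbitrary example values).
y_base10 = [project(p, 10, m=100.0, c=50.0) for p in prices]
y_base2 = [project(p, 2, m=100.0, c=50.0) for p in prices]

# In both lists the decades are evenly spaced; only the size of the
# spacing differs between the two bases.
step_base10 = y_base10[1] - y_base10[0]
step_base2 = y_base2[1] - y_base2[0]
```

With these inputs, each decade is 100 units apart in base 10 and roughly 332 units apart in base 2, which is the vertical stretch the rest of the post accounts for.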
It is a rule of logarithms that one can compute a value in a new base according to this formula:

logB(x) = logA(x) / logA(B)

So for us, that means that:

logB’(p) = logB(p) / logB(B’)

Substituting this into our new price projection function f’:

f’(p) = m * logB(p) / logB(B’) + c

Refactoring:

f’(p) = m * logB(p) / logB(B’) + c * logB(B’) / logB(B’)

f’(p) = (m * logB(p) + c * logB(B’)) / logB(B’)

f’(p) = (m * logB(p) + c + c * (logB(B’) - 1)) / logB(B’)

Substituting our original f definition:

f’(p) = (f(p) + c * (logB(B’) - 1)) / logB(B’)

Refactoring:

f’(p) = f(p) / logB(B’) + c * (logB(B’) - 1) / logB(B’)

Since B, B’ and c are all constants, what this last formula shows is that f’ is a linear projection of f. That is, it fits the form y = a*x + b, with slope a = 1 / logB(B’) and intercept b = c * (logB(B’) - 1) / logB(B’).

I hope none of the above is controversial (unless I’ve made a mistake in the math).

What does this mean for us, in a data visualization context?

As noted earlier, visualizing a function requires projecting from the domain of the data into the range of pixel values. Putting log aside, this means, at minimum, picking a scaling factor and an offset. These values are arbitrary and chosen for aesthetic effect.

So irrespective of whether we use f or f’, we’ll end up linearly scaling the values to project them into pixel space using arbitrary, aesthetically chosen parameters. If we have the same aesthetic intent in both cases, we will select projection parameters that yield identical graphs. The parameter values we pick will be different, but the pixel values will be the same (by definition, since we have the same aesthetic intent for both bases).

This is what I mean when I say the graphs are identical. The only way in which they differ is by a linear scaling function, and we control arbitrary linear scaling parameters.

I hope this is clear. The choice of log base and the choice of linear scaling factor are in the same category of arbitrary visualization parameters. Moreover, they have the same effect.
If you squish vertically by choosing a higher base, you can stretch vertically to counterbalance that choice by choosing a larger scale factor.
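
For anyone who would rather check this numerically than follow the algebra, here is a rough Python sketch. It assumes that "same aesthetic intent" means fitting the plotted values into the same pixel band (0 to 400 here). The helper names `project` and `fit_to_pixels`, the sample prices, and the values of m and c are all made up for the example.

```python
import math

def project(price, base, m, c):
    """f(p) = m * log_base(p) + c"""
    return m * math.log(price, base) + c

def fit_to_pixels(values, lo, hi):
    """Linearly rescale values so the minimum lands on lo and the maximum on hi."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

# Hypothetical inputs, chosen only for illustration.
prices = [3.0, 47.0, 910.0, 28000.0]
B, B_prime = 10.0, 2.0
m, c = 120.0, 30.0

f = [project(p, B, m, c) for p in prices]
f_prime = [project(p, B_prime, m, c) for p in prices]

# 1) f' is a linear function of f, with the slope and intercept derived above.
k = math.log(B_prime, B)  # logB(B')
predicted = [y / k + c * (k - 1) / k for y in f]
identity_holds = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
    for a, b in zip(f_prime, predicted)
)

# 2) Fitting either series into the same pixel band gives the same pixels.
pixels = fit_to_pixels(f, 0.0, 400.0)
pixels_prime = fit_to_pixels(f_prime, 0.0, 400.0)
same_pixels = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
    for a, b in zip(pixels, pixels_prime)
)
```

Both `identity_holds` and `same_pixels` should come out true up to floating-point tolerance. Min/max fitting is only one way to pick the final scale and offset; any rule that depends on where the data should land on screen, rather than on the raw log values, should behave the same way.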