Friday, June 29, 2012

Lies, Damn Lies, And Now Statistics

We've been at the game at least since Democritus, and god knows I've made it a pet subject on the blog over the years. At what point, we wonder, does a pile of stuff instead become a collection of tiny things? When is it appropriate to break something up and consider it in terms of little constituent parts, and when are the small-parts contributions better thought of as a collective whole? Once you get past the philosophical wankery (and let's face it, I'm unlikely to), then I suppose there are really only two practical answers to that: when you can't avoid the quantized nature of things, and when the math is easier. I think it's a little bit funny that a few centuries of science based on analog mathematics (with continuous, nicely differentiable functions) finally coughed up everything digital (breaking it up into approximate chunks), and I have often found it fascinating how the same systems can be usefully described as discrete, continuous, and then discrete again, depending on what scale you dial in at, or what you're trying to prove. Electronics, for example: you start with quanta (electrons), which average out to make analog structures (semiconductor devices, let's say), and then put those together to make a logic network that'll do it all bitwise, allowing only ones or zeroes (I'm using one to write). And sometimes your semiconductor theory gives you localized states to deal with; sometimes the analog nature of a transistor or diode is important. I think one reason that things like macroeconomics and evolution appeal to me is that they're large-scale ensemble effects that are logical extensions of the things that make them up (well, evolution is anyway), but seemingly independent phenomena from them, and in those cases the things that make them up are our very lives.

Maybe you'll forgive me for dipping into this well yet again. I had to sit through a weeklong industrial statistics class earlier this month, and this is the sort of thing that I was daydreaming about (well, once I got tired of thinking up wiseass comments and imagining people naked). It was an effort to fuzz over the whole mind-crushing boredom of it all.

Most people loathe stats for the terminally dull math it throws at you, and that's a reputation that's probably deserved, but at least digging through the justifications and proofs has a way of adding a kind of legitimacy of knowledge. Getting through it makes you feel like a smart person. That's not the class I took the other week. There we were training with a computer program to run through all the equations behind the scenes--elegantly enough if you stick to the problems it was designed for (but what kind of engineer would I be if I did that?)--and the practical application got taught without drumming up even the mathematical gravitas you'd need to count back change. It's a well-oiled teaching method that got across how to use a mathematical tool without an underlying idea of how the math might work, and okay, knowing how to use it is the take-home you'd want even if you did take the time to watch the gears turning, and the instructor did a good job of getting across what he tried to. But it's a special kind of tedium to spend a 40-hour week absorbing the knowledgeable huckster routine from someone you're pretty sure isn't as smart as you. Christ, it reminded me of those long-ago nights of sitting through driver's ed.

(Full disclosure: I had a stats class back in college that taught nothing of perceivable relevance whatsoever. It taught some math, but I didn't learn any of that either, or at least none of it stuck in my head beyond the final. I didn't feel the least bit smart, but still got an A. Not sure how that happened.)

Anyway, the dorky daydreams. It struck me that when you hit that border between chunky and creamy, where you can't really decide whether to count things up or do clean math on some variable, that is exactly where you get "statistics." Invoking a distribution function is exactly the point when you know damn well the data consists of tallied events but you're going to call it a smooth curve anyway, and statistical analysis is supposed to be what tells you whether that's worth doing and how legal it really is, when things go one way or the other. In the manufacturing world, one primary concern is sampling and measurement: it's an important question whether you can compare results from measurements that will vary, that is, whether the data are really telling you anything. We're all used to thinking like this, but most scientists I've known aren't terribly rigorous about considering error in the experimentation and data-gathering, although then again, we are usually more about understanding relationships that come from somewhere. More curve fitting, fewer t-tests.
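
The class's whole week, boiled down to a toy (these numbers are mine, not theirs): two batches of noisy measurements, and the question of whether the difference between them is real or just the noise talking. In Python, assuming numpy and scipy:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    before = rng.normal(10.0, 0.5, size=20)  # made-up process measurements
    after = rng.normal(10.3, 0.5, size=20)   # same process, after a tweak

    # Two-sample t-test: is the shift bigger than the scatter can explain?
    t, p = stats.ttest_ind(before, after)
    print(f"t = {t:.2f}, p = {p:.3f}")

The p-value is the "how legal is it" number: the chance of seeing a gap that big if nothing had actually changed.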

Statistical understanding gets buried under a lot of science and engineering anyway, without our always thinking about it as such.  One advantage of spending a decade and a half as a technical whore is that I got exposed to a variety of interesting fields and thinking (a disadvantage is that I got to be a whole lot better at bullshitting my way around ideas than studying and implementing them).  The very basis of band theory, which is used liberally to design solid state devices, is a smooth approximation of entities that are known to be discrete: assigning effective properties, imagining a continuous density of electrical states, generating a smooth probability function to populate them.  When you can't quite get away with things so easily, when you have to admit you're counting electrons or photons, or doing signal processing in general, then you have to fall back to the statistics.  It's interesting to consider how shot noise will plague you in low-signal collection (when just a few electrons are passing through, they are less likely to be representative), or derivations of signal to noise on a larger scale, and any kind of diagnostics will also require a statistically-derived decision based on the quantity of signal.  I spent a few inadequate efforts in past years thinking about the implications of size distributions of small particles, and I'm getting lately into something like that again.  It's a case where it's not just the size of the little guys that governs properties: the shape of the distribution affects what you measure too, and if it's spread out, the sample will behave much differently than if they're all the same size.  You might call this a property of your sample, or you might call it the properties of differently-behaving individuals.
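
The shot-noise bit makes a nice toy simulation too (my numbers again, nothing more): counts of discrete arrivals come out Poisson-distributed, so the spread goes as the square root of the mean, and the signal-to-noise only climbs as sqrt(N). A handful of electrons is a lousy sample; a million is a smooth curve. A sketch, assuming numpy:

    import numpy as np

    rng = np.random.default_rng(1)
    for mean_count in (5, 500, 50000):
        # Simulated counting experiments: Poisson arrivals at each mean level.
        counts = rng.poisson(mean_count, size=10000)
        snr = counts.mean() / counts.std()
        print(f"mean {mean_count:>6}: SNR ~ {snr:6.1f}  (sqrt says {mean_count**0.5:6.1f})")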

But it's important to remember that science and statistics aren't really the same thing. One remarkable part of that course (and I think this is the hollow heart of certain fields that rely on statistics, some of which happen to control the world) is that you can sometimes get to feel you are evaluating things astutely while knowing fuck-all about what's going on down there. I think that makes certain kinds of people feel very smart indeed, but it scares the shit out of me. When you can study something in detail while remaining relatively ignorant of it, you have a good opportunity to lie to yourself and others.

I'll leave the economist-bashing aside today and note that as researchers, if we're chasing something like the scientific method, then we have some working assumptions and models going in. We have some prior experience, sometimes whole fields of it, of how things tend to relate, might relate, or fucking well better relate. One of the most annoying things that got pushed in the class, and I know is used in industrial research, is the development of "models" through statistical design of experiments.  The idea of that is to throw a bunch of ingredients together in a way to best infer dependencies, which is a neat scientific tool, and sometimes exactly the right one, but the problem is that it also offers no real understanding.  It is meant to address the what, but utterly leaves off the why.  [I feel better about things like evolutionary algorithms, where a solution is chased down through randomly mutated generations, and maybe you don't know the intimate details inside there either, but it's a really clever approach at that higher level of granularity.]  If it's formulation work you're doing, then you end up doing chemistry with a completely optional understanding of, well, chemistry, and this just annoys me on some level. You really should have some fundamental understanding of how materials are known to interact. The instructor referred to these sorts of insights, a little dismissively, as "local knowledge," but if it's science, the local knowledge is what you are getting at.
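
For flavor, here's about what that DOE machinery reduces to, with invented ingredients and made-up responses: a two-level factorial in two factors, fit to a linear model. You get effect sizes for A, B, and their interaction, and precisely zero chemistry. A sketch, assuming numpy:

    import numpy as np

    # Coded factor levels (-1/+1) for a full two-factor factorial design.
    A = np.array([-1, +1, -1, +1])
    B = np.array([-1, -1, +1, +1])
    y = np.array([7.2, 9.1, 6.8, 11.4])  # invented measured responses

    # Fit mean + main effects + interaction; four runs, four terms, exact solve.
    X = np.column_stack([np.ones(4), A, B, A * B])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(dict(zip(["mean", "A", "B", "AB"], np.round(coef, 2))))

The coefficients tell you what moved the response; the why is left entirely as an exercise.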
