Big Data's Recency Bias Problem (Part 1) 大数据的近因偏差烦恼(上)


You may be familiar with the statistic that 90% of the world’s data was created in the last few years. It’s true. One of the first mentions of this particular formulation I can find dates back to May 2013, but the trend remains remarkably constant. Indeed, every two years for about the last three decades the amount of data in the world has increased by about 10 times – a rate that puts even Moore’s law of doubling processor power to shame.


全世界90%的数据都是最近几年生成的,人们对这个结论可能已经耳熟能详。尽管我能找到的这个说法的最早出处是在2013年5月,但是,这种趋势却始终未曾发生变化。事实上,过去30年间,每隔两年,全球总数据量就会增长大约10倍——这让计算机行业的摩尔定律相形见绌。

One of the problems with such a rate of information increase is that the present moment will always loom far larger than even the recent past. Imagine looking back over a photo album representing the first 18 years of your life, from birth to adulthood. Let’s say that you have two photos for your first two years. Assuming a rate of information increase matching that of the world’s data, you will have an impressive 2,000 photos representing the years six to eight; 200,000 for the years 10 to 12; and a staggering 200,000,000 for the years 16 to 18. That’s more than three photographs for every single second of those final two years.

信息爆炸所带来的问题之一在于,即便和不久之前相比,当前的信息量规模都会大到不可思议的程度。假如有一本信息影集代表了你从婴儿到成年的前18年人生,并且照片数量的增长速度和全球数据量保持一致,如果头两年你只有两张照片,那么从6岁到8岁的两年间你就会有两千张照片,从10岁到12岁有20万张,从16岁到18岁则有惊人的2亿张,这意味着在16-18岁期间你每秒钟就会拍3张照片。
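
The photo-album arithmetic can be checked with a short sketch. It assumes exactly what the example states: two photos for the first two years and a tenfold increase in every following two-year window. The function name `photos_per_window` and its parameters are illustrative only, and a year is taken as 365 days.

```python
# Illustrative sketch of the photo-album example (names are hypothetical).
# Assumption: 2 photos for ages 0-2, then ten times more in each
# subsequent two-year window, up to age 18.

def photos_per_window(first_window=2, growth=10, years=18, window=2):
    """Return the photo count for each consecutive two-year window."""
    return [first_window * growth ** i for i in range(years // window)]

counts = photos_per_window()
for i, n in enumerate(counts):
    print(f"ages {2 * i}-{2 * i + 2}: {n:,} photos")

# Photos per second during the final window (ages 16-18),
# treating a year as 365 days.
seconds_in_two_years = 2 * 365 * 24 * 60 * 60
photos_per_second = counts[-1] / seconds_in_two_years
print(f"photos per second, ages 16-18: {photos_per_second:.2f}")
```

Under these assumptions the final window works out to a little over three photos per second, matching the figure in the text.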

This isn’t a perfect analogy with global data, of course. For a start, much of the world’s data increase is due to more sources of information being created by more people, along with far larger and more detailed formats. But the point about proportionality stands. If you were to look back over a record like the one above, or try to analyse it, the more distant past would shrivel into meaningless insignificance. How could it not, with so many times less information available?

当然,这个例子与全球数据的情况并不完全吻合。首先,全球数据的增长在很大程度上是因为更多的人创造了更多的信息源,同时数据格式也变得更大、更详细。但有关比例的道理依然成立:如果回顾或分析与上述影集类似的记录,越久远的过去就越会萎缩到无足轻重的地步。可用的信息少了这么多倍,又怎能不是这样呢?

Here’s the problem with much of the big data currently being gathered and analysed. The moment you start looking backwards to seek the longer view, you have far too much of the recent stuff and far too little of the old. Short-sightedness is built into the structure, in the form of an overwhelming tendency to over-estimate short-term trends at the expense of history.

这就是目前大数据采集分析中存在的一项弊端。无论你在哪一个时间点开始回顾历史,都会遇到同一个麻烦:近期数据的数量远远超过远期历史数据,由此,这个分析系统会过度重视短期趋势而忽略长期趋势,从而受到短视的困扰。
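
To see how lopsided such a record becomes, the sketch below computes what fraction of all accumulated data falls in the most recent two-year window. It assumes the idealised tenfold growth per window described earlier; the function name `recent_share` is hypothetical, and real data growth is of course less regular.

```python
# Illustrative sketch: share of the total record held by the newest window,
# assuming each two-year window holds `growth` times more data than the
# previous one. Names and defaults are assumptions for illustration.

def recent_share(windows, growth=10):
    """Fraction of all data that belongs to the most recent window."""
    sizes = [growth ** i for i in range(windows)]
    return sizes[-1] / sum(sizes)

for w in (1, 2, 5, 15):
    print(f"{w} window(s) of history: newest window holds {recent_share(w):.4%}")
```

With tenfold growth the newest window settles at roughly nine tenths of everything ever recorded, however long the history. That is consistent with the 90% statistic quoted at the start, and it is why any analysis weighted by raw volume leans so heavily on the recent past.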

To understand why this matters, consider the findings from social science about ‘recency bias’, which describes the tendency to assume that future events will closely resemble recent experience. It’s a version of what is also known as the availability heuristic: the tendency to base your thinking disproportionately on whatever comes most easily to mind. It’s also a universal psychological attribute. If the last few years have seen exceptionally cold summers where you live, for example, you might be tempted to state that summers are getting colder – or that your local climate may be cooling. In fact, you shouldn’t read anything whatsoever into the data. You would need to take a far, far longer view to learn anything meaningful about climate trends. In the short term, you’d be best not speculating at all – but who among us can manage that?

为了理解这个问题的重要性,需要考虑社会科学中有关“近因偏差”(recency bias,又称近因效应)的研究发现。近因偏差是指:人们在判断事物发展趋势时,会认为未来事件将会和近期体验高度类似。这可以说是某种“可利用性法则”(availability heuristic)——不恰当地以最容易认知的信息来作为思考的基础。这还是一种普遍的心理学特征。举例来说,如果在你居住的地方,过去几年的夏季气温都很低,那么你可能会认为夏季气候正在变得更冷——或者说你当地的气候正在变冷。但是,你不应该只根据少量数据分析长期趋势。你需要有一个长远视角,才能认识真正有意义的气候趋势。短时期内,最好不进行任何猜测。不过,我们之中又有谁能真正做到这点呢?