In the middle of every scientific practice is a little thing called "Data".
Data, Measurement, Readings...Proof. Really, these are all just numbers and strings. Information Theory focuses on how information can be created and carried through the configuration of matter and energy; how information is an abstraction that attaches itself to other structures. Information doesn’t exist by itself; it has to exist as an emergent property of something else. That thing must be something with enough complexity to retain that information without loss of meaning.
Many decades ago, I was a boy scout. One of the things we were taught was how to leave tracking messages behind us by arranging stones on the trail in a particular pattern. To anyone not in the know, those stones had no pattern. But for those of us who knew the code, those stones communicated the path forward, the answer we were looking for.
Information is not a static thing. Information is an informer, a piece of guidance about where to go. All the data in the world is worthless to you if you can't close the loop and pull information from that data. If your data isn't showing you how to do things, how to solve problems, then it's just more mass you’re taking responsibility for. Instead of discovering the meaning of those stones, you put them in your pocket in the hope of figuring it out later.
We live in a world now flooded with data. Storage prices have plummeted, throughput rates have accelerated near-exponentially, and it's now more pragmatic to record everything and determine what is useful later than to take the risk of never having recorded it at all.
No matter how much data we record, whatever meaning there is to be found remains untapped until someone comes along with the right questions to ask of it.
Just like those carefully arranged rocks by the side of the road, you can move bits and bytes around the universe, creating new data. But pulling the information out of it requires a little more nuance. True, we've done a lot with Big Data over the last decade, and many organizations are assembling truly massive data lakes, urged on by plummeting storage costs, regulatory requirements to store events enterprise-wide, and the ability to mine new insights from these broad collections of data. In the space of a few years we transformed these data repositories from a burden to a benefit – the new wisdom quickly became that if you weren't continually going back to leverage what was waiting to be discovered in all that data, it was just the digital equivalent of a vast warehouse of filing cabinets taking up space and gathering dust.
It's now been a decade since Big Data captured the imagination of business. In that decade things have evolved, and capabilities that were previously almost theoretical have lowered their cost of entry to the point where machine-learning software, and the requisite hardware, are available off the shelf to individuals as well as organizations. In this evolution of accessibility lies the opportunity to adjust how we think of these vast stores of data. In short, improvements in CPU/GPU power and algorithmic implementation are opening up the world to treat our data as a river, not a lake: we no longer need to wait for it to be collected before we can analyze it in the context of the whole. Incoming data can be compared to what has already been collected and treated more like telemetry. Instead of new records being compared meticulously against existing data, that existing data acts as a training set for machine learning to make decisions about new data according to what has been encountered before. By varying the sample size with which we examine that incoming stream, we approach being able to examine things in near real time against existing observed patterns.
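As a rough illustration of treating incoming data like telemetry, the sketch below scores each new reading against a sliding window of recent ones. It uses a plain z-score as a stand-in for a trained model; the class name `StreamScorer`, the `window_size` and `threshold` parameters, and the sample readings are all hypothetical, chosen only to show how the sample size of the window shapes what counts as unusual.

```python
from collections import deque
from statistics import mean, pstdev


class StreamScorer:
    """Scores each incoming reading against a sliding window of recent ones.

    Hypothetical helper for illustration; a z-score stands in for a real model.
    """

    def __init__(self, window_size=50, threshold=3.0):
        self.window = deque(maxlen=window_size)  # oldest readings fall off
        self.threshold = threshold

    def observe(self, value):
        """Return (z_score, is_unusual) for value, then add it to the window."""
        if len(self.window) < self.window.maxlen:
            z = 0.0  # not enough history yet to judge anything
        else:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            z = 0.0 if sigma == 0 else (value - mu) / sigma
        self.window.append(value)
        return z, abs(z) > self.threshold


# Made-up readings: a steady signal with one spike.
scorer = StreamScorer(window_size=5)
readings = [10, 11, 10, 12, 11, 10, 40, 11]
flags = [scorer.observe(r)[1] for r in readings]
```

Only the spike is flagged here. A smaller window reacts faster but is noisier, while a larger one is steadier but slower to adapt, which is the trade-off behind varying the sample size.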
At a large enough scale, the discrete data we collect stops behaving like a set of databases and behaves far more like a set of signals. The density and variance of the data start to lend themselves to analytics processes more suited to analog data than digital. Individual records cease to be as important as the change over time between them.
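One simple way to picture "change over time between records" is to turn timestamped readings into rates of change. This is a minimal sketch under assumed inputs: the function name `rates_of_change` and the `(timestamp_seconds, value)` pair format are hypothetical, not something prescribed by any particular tool.

```python
def rates_of_change(samples):
    """Given (timestamp_seconds, value) pairs in time order, return the
    per-second change between each consecutive pair."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((v1 - v0) / dt)
    return rates


# Made-up counter readings taken every 10 seconds.
samples = [(0, 100), (10, 110), (20, 120), (30, 220)]
rates = rates_of_change(samples)
```

None of the four values is remarkable on its own; the jump in the last rate is where the signal is.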
And that differentiation, where "discrete data is counted, but continuous data is measured," is the potential for us to start swinging the pendulum back. We can take a step away from the status quo of data lakes that remain largely unexamined, except in the most agile and capable of organizations, and use machine-learning systems to place emphasis on the flow of our 'data rivers' as opposed to retrospective analysis of monolithic data lakes. We can take a cue from Continuous-Data Analysis and actually do Continuous Data-Analysis.
Don't believe me? Let's just look at that big data mainstay: plain old system log data. Even the simplest operation doesn't exist as a single log entry now, but as several at differing levels of verbosity, usually all aggregated into single channels for ingestion. There's a good reason we still use the radio terminology "signal-to-noise ratio" when examining our data – extracting the signal from incoming data, based on past context, is becoming ever more vital. What was useful yesterday informs what may become useful tomorrow.
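To make the signal-to-noise idea concrete for log data, here is a small sketch that scores an incoming log line by how rarely its shape has been seen before. Everything in it is an assumption for illustration: the helper names `template_of` and `novelty`, the masking rules, and the sample log lines are invented, and real log pipelines use far richer parsing.

```python
import re
from collections import Counter


def template_of(line):
    """Collapse variable parts (hex ids, numbers) so similar lines share a template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    return re.sub(r"\d+", "<N>", line)


def novelty(line, history):
    """1.0 for a never-seen template, approaching 0.0 for very common ones."""
    return 1.0 / (1 + history[template_of(line)])


# Past context: counts of templates seen so far (made-up log lines).
history = Counter(template_of(line) for line in [
    "user 17 logged in",
    "user 42 logged in",
    "user 8 logged in",
    "disk 2 at 91 percent",
])

routine = novelty("user 99 logged in", history)       # template seen 3 times
unusual = novelty("kernel panic on cpu 3", history)   # template never seen
```

The routine login scores low because yesterday's data has seen its shape many times; the unfamiliar line scores high and stands out from the noise.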
Gone are the days when closed-world information models could suffice for the analytics use cases we envision and encounter today. Every data source brings its own perspective and source-specific readings. Open-world models embrace the ingestion of unplanned, unexpected data structures containing previously unseen types of readings. Comparing them to existing stored data patterns through ML can open up data ingestion to the same kind of capabilities that humans have. For instance, have you ever noticed how you can hear a sentence that uses a word entirely new to you, yet still understand the meaning of that word from the context of the rest of the sentence and your experience with similarly structured sentences? You don't treat the word as a discrete thing, but as part of a continuous stream of information. Similarly, when reasoning over the meaning of a spoken sentence, one does not wait until the sentence is finished to start processing the meaning of what is being said.
Big-data analysis is intrinsically limited by its emphasis on statistical analysis as a core driver: counting things and then comparing them to other counts. It does not identify complex patterns, often only trends. "Correlation is not causation," and more information lies in vector values than in scalar values. A common mockery of simple analytics in the security space calls out the difficulty of "finding the APT in a bar chart."
Perhaps "any piece of data is far less interesting than its neighbors"? We're on the threshold now of a movement towards stronger applications of machine learning, machine reasoning, and artificial intelligence. They can now start challenging the human brain's greatest strength: our superior real-time pattern matching. Human pattern matching compares things against a rapidly chosen, selective set of memories to find a matching classification for a broad swathe of audio, visual, semantic and semiotic elements. This on-the-fly classification of things allows us to make on-the-fly decisions and adjustments to said classifiers as new information is added.
We're continually analyzing things against what we know. We're not just collecting new memories, but re-classifying those memories based on new signals we encounter. The feedback loop between collector and collected is continuous, and it is where the 'meta-game' of comprehensively measured systems comes into play. For example, what is the 'shape' of the data being collected over time, and what sources are producing more interesting data at present? We've spent a lot of time investing in infrastructure and effort to record what our systems are telling us. But until recently, we haven't had the cognitive subtlety in those systems to take account of how they are saying it. The difference between analyzing pre-recorded data after the fact and analyzing it as it arrives is the difference between reading a book and holding a conversation. If big data is reading the script, Continuous Data Analysis is directing the film.