To Go From Big Data to Big Insight, Start With a Visual
I think so. In my role as the Scholar-in-Residence at The New York Times R&D Lab, I am collaborating with one of the world's most advanced digital R&D teams to figure out how we can draw actionable insights from big data.
How big? Massive: We are documenting every tweet, retweet, and click on every shortened URL from Twitter and Facebook that points back to New York Times content, and then combining that with the browsing logs of what those users do when they land at the Times. This project is a relative of the widely noted Cascade project. Think of it as Cascade 2.0.
We're doing this to understand and predict when an online cascade or conversation will result in a tidal wave of content consumption on the Times, and also when it won't. More importantly, we are interested in how the word-of-mouth conversation drives readership, subscriptions, and ad revenue; how the Times can improve its own participation in the conversation to drive engagement; how we can identify truly influential readers who themselves drive engagement; and how the Times can then engage these influential users in a way that complements the users' own needs and interests. Done well, that statistical analysis can be turned, as you'll see below, into elegant, artistic real-time data streams.
Handling the streams, archiving the sessions and storing and manipulating the information are in themselves herculean tasks. But the even bigger challenge is transforming beautiful, big data into actionable, meaningful, decision-relevant knowledge. We've found that visualization is one of the most important guideposts in this search for knowledge, essential to understanding where we should look and what we should look for in our statistical analysis.
For example, here are three visualizations that have helped us gain knowledge. They show cascades of tweets and retweets about three different Times articles as lines and dots over time, with the click-through volume on each article synced in time and displayed as a black graph under each cascade. Each panel tells a different story about engagement with the content.
For the first article, there is a sizable Twitter conversation and several large spikes in traffic. But the click-through volume seems independent of the Twitter conversation: The largest spike in traffic, highlighted in blue, occurs when there is very little Twitter activity. In this case, a prominent link on a blog or a news story that referred to the story, rather than the Twitter conversation itself, is probably driving the traffic.
On the second article, the Twitter conversation is intense. There are many tweets and retweets of the article, yet the article itself gets very little traffic. People are talking about the article on Twitter, but not reading it. This sometimes happens when the main message of an article sparks a debate that can carry on without the content of the article mattering much: for example, when a timely piece of news contains little analysis or editorial content, or when the conversation gets away from the article and evolves its own independent content.
In the third and final article, an intense Twitter conversation moves in lockstep with engagement. As people tweet and retweet the article, their followers are clicking through and engaging with the content itself. This tight relationship between the online conversation and the website traffic is most pronounced when the three "influencers" tagged in the figure inspire the two largest spikes in traffic over the engagement lifecycle of the article.
With just these three data visualizations, we've gained an understanding of important nuances about so-called virality. The relationship between online word-of-mouth conversations and engagement isn't as simple as something just "going viral." Different patterns emerge with different types of content.
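As a rough illustration of how the three patterns might be told apart numerically, here is a minimal Python sketch. It is not the method used in the project: the time-binned counts are invented toy numbers, and the `describe` function, its labels, and its thresholds (`corr_cut`, `low_traffic`) are hypothetical names chosen only for this example. It compares tweet volume and click-through volume per time bin using a plain Pearson correlation and a clicks-per-tweet ratio.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat series carries no co-movement signal
    return cov / sqrt(vx * vy)


def describe(tweets, clicks, corr_cut=0.7, low_traffic=0.5):
    """Label a panel with one of the three patterns.

    corr_cut and low_traffic are hypothetical thresholds for this sketch,
    not values taken from the project.
    """
    clicks_per_tweet = sum(clicks) / max(sum(tweets), 1)
    if clicks_per_tweet < low_traffic:
        return "talked about, not read"
    if pearson(tweets, clicks) >= corr_cut:
        return "lockstep"
    return "traffic independent of conversation"


# Toy per-bin counts (invented for illustration only).
first = ([5, 40, 30, 10, 2, 1], [20, 30, 25, 20, 300, 40])
second = ([50, 80, 120, 90, 60, 30], [3, 4, 6, 5, 3, 2])
third = ([5, 60, 20, 90, 30, 10], [50, 610, 220, 880, 310, 90])

for name, (tweets, clicks) in [("first", first), ("second", second), ("third", third)]:
    print(name, "->", describe(tweets, clicks))
```

A summary like this only flags co-movement between the two series. It says nothing about whether the conversation caused the traffic, which is a question for more careful statistical modeling.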
Still, the visuals cannot tell the whole story. We see some clear correlations here, but complex conditional dependencies and temporal and network autocorrelation make it necessary to build more sophisticated causal statistical models that will generate true, reliable insights about word-of-mouth influence.
What these visuals do help with is showing us where to look and what questions to ask of the data. That is, we can't build the more complex models until we know the most suitable places for building them. These visuals give us some of that insight.
Cascade 2.0 will be built on sophisticated analytics, and it will require data visualization. Asking important questions and avoiding unnecessary ones is essential to moving forward effectively and efficiently with big data. Without visualization, we are much less efficient in getting to the questions whose answers teach us something. That's why visualizing data must be one of the most important tools for data scientists. It is our torch in a thick, dark forest.