When we design visual information, we have many tools at our disposal to increase the likelihood that our audience receives the message we want them to receive. Gestalt theory helps us accomplish this by explaining key aspects of how we perceive images, and specifically how we identify the figure (the message part of the image) and separate it from the ground (the background or context on which the figure rests).
Strictly speaking, it’s more correct to say that the figure is what emerges from the whole image, and that it is not truly separate from the ground. Saying that “the whole is greater than the sum of its parts” is a familiar way of expressing this concept, and one goal of learning information design is to understand how to create that whole from many smaller parts. In gestalt psychology, this whole is referred to as totality: when we perceive an image, we examine all of its components (subconsciously first, then consciously), weigh them against each other, identify the important ones, and decide how the parts, working together, convey a message.
Clearly, understanding gestalt theory is important for anyone who wants to master information design, not to mention visual design disciplines such as instructional design and user-interface design. But this understanding is important even when all we want to do is create a simple graphic such as a bar chart or an annotated screenshot. Four main principles of gestalt-based design provide this understanding and the tools we need to use it in practice:
- Emergence
- Reification
- Multistability
- Invariance
Don’t let the names scare you. In this article, I’ll show how these principles are simple, yet highly relevant in technical communication—but I’ll also critique them and provide guidance on how you can use the theory in your daily work.
Emergence in Visual Design
Emergence is how we perceive images as a whole rather than assembling them by consciously interpreting their component parts. For example, when we recognize the face of a friend, we don’t consciously examine each of their facial features and compare the resulting list against a mental checklist. Similarly, we can rely on viewers to subconsciously recognize that something is a screenshot of a menu rather than one of a dialog box. That being said, viewers must make some conscious effort to go beyond this recognition: they must determine which specific menu they’re examining (e.g., File versus Edit), and must figure out why that is relevant.
When we design a graphic, several thought exercises let us help viewers move from recognition (“this is a screenshot of a menu”) to understanding (“this shows me the location of the Track Changes feature”). To support this understanding, we must implement the details of the image in a way that facilitates emergence. For example, consider the task of presenting changes in a company’s income over time using a graph. Five simple steps help us choose an appropriate design:
- Identify information that provides context, and label it clearly. “Income” and “time” provide the context, and traditionally, we place them on the graph’s vertical and horizontal axes, respectively.
- Identify the data we will present in that context. Here, the data is a series of income values, each associated with a specific date.
- Clearly label the axes of the graph and follow standard conventions to help viewers determine the meaning. For example, in English time moves from left to right along the horizontal axis.
- Identify the minimum amount of visual information required to communicate the meaning of the data. A bar graph works, but a line graph is more efficient because it eliminates unnecessary information (e.g., the width and color of the bars) that competes with the message (i.e., the incomes) for the viewer’s attention.
- To focus attention on the part of the image that contains the message, point out the important things. We can do this using arrows to point at specific incomes or we can use subtler tools such as using black to indicate profits and red to indicate losses.
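The five steps above can be sketched in code. The following minimal Python example uses only the standard library to build a small SVG line graph: labeled axes provide the context, a single thin line carries the data, time runs from left to right, and black and red dots separate profits from losses. The function name `income_line_chart_svg`, the sample incomes, and the pixel dimensions are all hypothetical values chosen for illustration, not part of any real charting tool.

```python
def income_line_chart_svg(points, width=400, height=240, margin=50):
    """Return an SVG line graph of (year, income) pairs as a string.

    Hypothetical helper for illustration; the layout values are arbitrary.
    """
    years = [year for year, _ in points]
    incomes = [income for _, income in points]
    lo, hi = min(incomes + [0]), max(incomes + [0])  # always include zero
    span_x = (max(years) - min(years)) or 1
    span_y = (hi - lo) or 1

    def x(year):  # time runs from left to right
        return margin + (year - min(years)) / span_x * (width - 2 * margin)

    def y(income):  # larger incomes sit higher on the page
        return height - margin - (income - lo) / span_y * (height - 2 * margin)

    zero_y = y(0)  # losses fall below this baseline
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    # Context: two labeled axes.
    parts.append(f'<line x1="{margin}" y1="{zero_y:.1f}" x2="{width - margin}" y2="{zero_y:.1f}" stroke="black"/>')
    parts.append(f'<line x1="{margin}" y1="{margin}" x2="{margin}" y2="{height - margin}" stroke="black"/>')
    parts.append(f'<text x="{width / 2}" y="{height - 10}" text-anchor="middle">Year</text>')
    parts.append(f'<text x="15" y="{height / 2}" text-anchor="middle" transform="rotate(-90 15 {height / 2})">Income</text>')
    # Data: one thin line instead of bars.
    coords = " ".join(f"{x(yr):.1f},{y(inc):.1f}" for yr, inc in points)
    parts.append(f'<polyline points="{coords}" fill="none" stroke="black"/>')
    # Emphasis: black dots mark profits, red dots mark losses.
    for yr, inc in points:
        color = "red" if inc < 0 else "black"
        parts.append(f'<circle cx="{x(yr):.1f}" cy="{y(inc):.1f}" r="3" fill="{color}"/>')
    parts.append("</svg>")
    return "\n".join(parts)


# Made-up incomes, purely for demonstration.
sample = [(2008, 120), (2009, -40), (2010, 60), (2011, 150)]
svg = income_line_chart_svg(sample)
```

Saving `svg` to a file with an `.svg` extension and opening it in a browser shows the result. Everything else a charting tool might add by default (gridlines, fills, a legend) is deliberately left out.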
We can use the same approach in other technical graphics, such as illustrations of a digital camera that we’re documenting. A drawing can be more effective than a photograph because it eliminates detail that obscures the context and focuses attention on the remaining details, which are what convey our message. For example, the color of the camera and the texture of its surface are unimportant; the locations of the power button and the focus knob are. Progressively eliminating textures and colors and lines from the image until all that remains are the lines that are essential to recognize “this is a camera” (i.e., until that recognition emerges) is a useful approach. With practice and experience, we can learn to start closer to that minimum and waste less time deleting unnecessary details. This approach to visual design is often described as minimalism. For more information and copious examples of how it works, there’s no better resource than Edward Tufte’s books The Visual Display of Quantitative Information and Envisioning Information.
Reification
Figure 1: An example of how reification causes an image to emerge
Reification is what happens when we perceive an abstract concept such as “digital camera” based on the sensory cues received by our eyes (light reflected by the object); these cues magically transform into an abstract concept (“something I can use to take photos”). That image in the mind’s eye is clearly not the physical object we’re looking at; when we see the camera, it doesn’t suddenly appear inside our head. (Ouch!) To make this more concrete, consider how we can make viewers recognize a rectangle without actually drawing a rectangle (Fig. 1):
Figure 2: Reification of a drop shadow to create a three-dimensional effect.
There is no rectangle in the image, but the sensory cues provided by the image almost force us to infer its existence. Does that seem a bit abstract? Consider this: In technical communication, creating such rectangles without actually drawing them is a common task. Whenever we lay out a page, we must convince readers there are one or more columns of text, each of which exists as a separate image. We could certainly draw a thick black rectangle around each block of text to identify it, but we almost never do. That’s because reification does the work for us. Returning for a moment to the concepts of figure and ground, the white space between the columns of text functions as the ground, and the text itself functions as the figure. We have not painted the white space on the page using white paint or its pixel equivalent, yet it nonetheless creates an imagined barrier that separates the two blocks of text. Similarly, when we use a “drop shadow” to create the impression of a 3D image (Fig. 2), the black box that we use to create the shadow is real, but the 3D image is not: the page remains as flat as any other page. Yet viewers reify that box to interpret it as a shadow, and use that knowledge to understand that we want them to imagine that the image is floating above the page.
To create an effective image, we should follow two reification-related steps to understand how to create the image:
- We must understand the explicit signals that will cause reification to occur. Actual shapes, created with clearly visible lines, are an obvious example.
- We must also take advantage of implicit signals. White space is an obvious example. We should preserve white space when it conveys a message, and eliminate it when it obscures that message.
Avoid Multistability in Visual Design
Figure 3: Examples of (top) how a line drawing can be multistable and (bottom) how to clearly indicate which of two possible stable images we want the viewer to see.
Multistability is the concept that some graphics may appear to alternate between two stable images when the sensory clues we provide are ambiguous. A line drawing of a cube is inherently multistable because the crossing lines make it difficult to understand which lines represent foreground and which represent background (Fig. 3):
Multistability generally isn’t a good thing, so we should avoid it wherever possible because it makes our message ambiguous. In the example of the cube, eliminating lines in the background (i.e., lines that we could not see from our current position in front of the cube) is a good solution. If the presence and position of those background lines is important, we can indicate their existence using lighter tones or a gradient fill that makes the lines fade into the background.
Invariance
A visual image is invariant if we continue to interpret it as the same object even if we change its orientation (by rotation), its size (by magnification), or its position (by relocation). However, images are not equally recognizable in all orientations, sizes, and positions. For example, text is easily recognizable as text, no matter its orientation; this is why we recognize that sideways text on the vertical axis of a graph is still text. But if we flip a figure caption to create its mirror image, reduce the type size to 4 points, and move the caption far from the figure it describes, we make it unnecessarily difficult for readers to read and understand that text.
Designing a visual image based on the assumption that invariance is sufficient for clarity leads to communications failures. For instance, a page number won’t be instantly recognized as a page number if it is vertically centered on the page, between two columns of text. The number does not change just because we changed its position. But because it does not appear at the top or bottom of the page, outside the main text area, we won’t recognize it as a page number until we notice that it appears on every page and increases moving from left to right. Invariance is not sufficient because our audience uses learned conventions (e.g., that page numbers appear at the top or bottom of a page) to understand visual information. Unless we have a good reason to ignore or subvert those expectations, we should stick with what they already know.
If you’re thinking that invariance relates to consistency, you’re right: visual information must be externally and internally consistent. External consistency means that the information follows the conventions the rest of the world uses to present a given type of information, such as in my example of the location of page numbers. Internal consistency means that within a document or graphic or Web page, each type of information must be consistently (invariantly!) labeled: all headings of a given level must be boldfaced (or not), centered (or not), and use the same font (typeface, size, and emphasis); bulleted lists should use the same bullets, and numbered lists should use the same numbering scheme. Compare, for example, the two following lists:
Good:
- The first item.
- The second item.
- The third item.
Bad:
1. The first item.
- 2—The second item.
3) The third item.
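Internal consistency of this kind is easy to check mechanically. The following is a minimal sketch, assuming each list item sits on its own line; the style names and the helper functions `marker_style` and `is_consistent` are hypothetical, invented here for illustration.

```python
import re

# Marker styles this sketch recognizes; the style names are made up for illustration.
MARKER_PATTERNS = {
    "bullet": re.compile(r"^[-*\u2022]\s+"),
    "number-period": re.compile(r"^\d+\.\s+"),
    "number-paren": re.compile(r"^\d+\)\s+"),
    # A number followed by an em dash, en dash, or hyphen.
    "number-dash": re.compile(r"^\d+\s*[\u2014\u2013-]\s*"),
}


def marker_style(line):
    """Return the name of the marker style that starts the line, or "none"."""
    text = line.strip()
    for name, pattern in MARKER_PATTERNS.items():
        if pattern.match(text):
            return name
    return "none"


def is_consistent(lines):
    """True when every item in the list uses the same marker style."""
    return len({marker_style(line) for line in lines}) <= 1


good = ["- The first item.", "- The second item.", "- The third item."]
bad = ["1. The first item.", "2\u2014The second item.", "3) The third item."]
```

Here `is_consistent(good)` is true because all three items share the bullet style, whereas `is_consistent(bad)` is false because each item uses a different numbering scheme.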
Grouping related things supports gestalt
Many aspects of visual design are implicit in gestalt theory, and it’s worth making them explicit. The four gestalt principles I’ve described help us create more effective visuals by helping us choose the most effective clues for a message and an audience. One of the most important applications of these principles involves grouping to indicate which parts of an image work together to serve a single function, and which parts serve a different function. For a printed page, white space separates words from each other, separates one paragraph from the next, and separates the columns of text into groups. Appropriate use of white space (i.e., good typography) lets readers read without paying attention to anything other than the meaning of words; bad use of white space forces readers to consciously separate adjacent letters, adjacent words, or adjacent lines, and makes reading difficult.
Although there are various “laws” of how grouping works in gestalt psychology, it’s better to think of them as proven “best practices” or guidelines for effective visual design. The gestalt laws of grouping include the following:
- Closure: We intuitively fill in gaps in the visual information and draw inferences about what the gaps mean. Despite the gaps between dashes, a dashed line remains instantly recognizable as a line. Despite the gaps between letters within words and between words within sentences, we recognize words as words and sentences as sentences.
- Similarity: We intuitively group things that are similar in size, shape, color, or intensity. As I noted earlier, this is why consistency of heading styles is important.
- Proximity: Viewers create groups from things that are close together and consider group members to be related; conversely, they separate things that are farther apart into separate groups.
- Symmetry: Symmetrical designs group things in ways that lead readers to believe that the groups operate as a single unit. This is why lines or columns of text on a page are typically balanced, parallel, and of equal width. It’s also why symbols such as arrows used to point at several features of a diagram should all be the same size (symmetrical in thickness and length): arrows that are more visually prominent suggest that the features they point to are more important.
- Continuity: We tend to extrapolate from patterns to see where they lead. We learn to look for page numbers at a specific location on a page, and changing that position disrupts that learned skill. If we change the meaning of a symbol between two images, readers who learned the first meaning will apply that knowledge to the next image, even if we warn them about the change of meaning. I frequently catch scientist authors making this mistake when they interpret their own inconsistently designed graphs.
- Common fate: Things moving in the same direction are perceived as being part of the same group. For example, a graph that compares two trends will show both trends moving from left (earlier) to right (later) instead of reversing that order for one trend. Dialog boxes group tabs on only one side (typically at the left or top) to emphasize that they all have the same fate (i.e., clicking on them will always display options beside that tab).
Figure 4: An example of grouping based on physical position and visual characteristics to separate radio buttons from checkboxes.
Combinations of these principles let us establish visually and functionally consistent designs. For instance, the similarity principle is why checkboxes in a dialog box all look alike but differ visually from radio buttons, whereas the proximity principle is why we place the checkboxes together but separate them from the group of radio buttons (Fig. 4). Symmetry and continuity explain why we align the members of each group to reinforce the message that they work together. When such design features are consistent, viewers can quickly recognize patterns and understand components of the design (e.g., checkbox = choose more than one option, radio button = choose only one option) and can use the visual information efficiently.
Putting the theory to the test
Gestalt theory provides a robust way to understand why certain visual design conventions have evolved and why certain design strategies have become best practices. But as the exceptions I’ve described demonstrate, we must use this theory as a tool for understanding how to design, not as a list of rules to follow blindly. Any theory, no matter how powerful it seems, should always be subjected to a reality check. Theory can lead us astray when we use it as a substitute for thinking through a problem, and doubly so when we use it as a substitute for taking enough time to understand how our audience sees, reads, and thinks.
References
Anon. 2011. Gestalt psychology. <http://en.wikipedia.org/wiki/Gestalt_psychology>
Hart, G. 2008. Much ado about nothing, part 1: the importance of white space. Intercom January 2008:36–37. <http://www.geoff-hart.com/resources/2008/white-space-1.htm>
Hart, G. 2008. Much ado about nothing, part 2: deconstructing a page. Intercom May 2008:38–39. <http://www.geoff-hart.com/resources/2008/white-space-2.htm>
Hart, G. 2008. Typography 101A: the role of white space in making lines of text readable. Intercom July/August 2008:30–31. <http://www.geoff-hart.com/articles/2008/typography-101A.htm>
Hart, G. 2008. Typography 101B: the role of white space in making words readable. Intercom December 2008:29–30. <http://www.geoff-hart.com/articles/2008/typography-101B.htm>
Hart, G. 2009. Typography 101C: the role of typeface choice in making text readable. Intercom February 2009:48–49. <http://www.geoff-hart.com/articles/2009/typography-101c.htm>
Hart, G. 2010. Subjecting theory to a reality check. Intercom Sept./Oct. 2010:27–28. <http://www.geoff-hart.com/articles/2010/reality-check.htm>
Sorflaten, J. 2011. Designing naturally with gestalt in mind. UI Design Newsletter, September 2011. <http://www.humanfactors.com/downloads/sep11.asp#research1>
Tufte, E. 1983. The visual display of quantitative information. Graphics Press, Cheshire, Conn. 197 p.
Tufte, E. 1990. Envisioning information. Graphics Press, Cheshire, Conn. 126 p.
Big data! If you don’t have it, you better get yourself some. Your competition has it, after all. Bottom line: If your data is little, your rivals are going to kick sand in your face and steal your girlfriend.
There are many problems with the assumptions behind the “big data” narrative (above, in a reductive form) being pushed, primarily, by consultants and IT firms that want to sell businesses the next big thing. Fortunately, honest practitioners of big data (aka data scientists) are by nature highly skeptical, and they’ve provided us with a litany of reasons to be wary of many of the claims made for this field. Here they are:
Even web giants like Facebook and Yahoo generally aren’t dealing with big data, and the application of Google-style tools is inappropriate.
Facebook and Yahoo run their own giant, in-house “clusters”—collections of powerful servers—for crunching data. The necessity of these clusters is one of the hallmarks of big data. After all, data isn’t all that “big” if you could chew through it on your PC at home. The necessity of breaking problems into many small parts, and processing each on a large array of computers, characterizes classic big data problems like Google’s need to compute the rank of every single web page on the planet.
But it appears that for both Facebook and Yahoo, those same clusters are unnecessary for many of the tasks which they’re handed. In the case of Facebook, most of the jobs engineers ask their clusters to perform are in the “megabyte to gigabyte” range (pdf), which means they could easily be handled on a single computer—even a laptop.
The story is similar at Yahoo, where it appears the median task size handed to Yahoo’s cluster is 12.5 gigabytes. (pdf) That’s bigger than what the average desktop PC could handle, but it’s no problem for a single powerful server.
All of this is outlined in a paper from Microsoft Research, aptly titled “Nobody ever got fired for buying a cluster,” which points out that a lot of the problems solved by engineers at even the most data-hungry firms don’t need to be run on clusters. And why is that an issue? Because there are vast classes of problems for which clusters are a relatively inefficient—or even totally inappropriate—solution.
Big data has become a synonym for “data analysis,” which is confusing and counter-productive.
Analyzing data is as old as tabulating a record of all the Pharaoh’s bags in the royal granary, but now that you can’t say data without putting “big” in front of it, the—very necessary—practice of data analysis has been swept up in a larger and less helpful fad. Here, for example, is a post exhorting readers to “Incorporate Big Data Into Your Small Business” that is about a quantity of data that probably wouldn’t strain Google Docs, much less Excel on a single laptop.
Which is to say, most businesses are in fact dealing with what Rufus Pollock, of the Open Knowledge Foundation, calls small data. It’s very important stuff—a “revolution,” according to Pollock. But it has little connection to the big kind.
Supersizing your data is going to cost you and may yield very little.
Is more data always better? Hardly. In fact, if you’re looking for correlations—is thing X connected to thing Y, in a way that will give me information I can act on?—gathering more data could actually hurt you.
“The information you can extract from any big data asymptotically diminishes as your data volume increases,” wrote Michael Wu, the “principal scientist of data analytics” at social media analysis firm Lithium. For those of you who don’t normally think in data, what that means is that past a certain point, your return on adding more data diminishes to the point that you’re only wasting time gathering more.
One reason: The “bigger” your data, the more false positives will turn up in it, when you’re looking for correlations. As data scientist Vincent Granville wrote in “The curse of big data,” it’s not hard, even with a data set that includes just 1,000 items, to get into a situation in which “we are dealing with many, many millions of correlations.” And that means, “out of all these correlations, a few will be extremely high just by chance: if you use such a correlation for predictive modeling, you will lose.”
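To see how quickly chance correlations pile up, here is a minimal sketch in Python. It is not Granville's own calculation: it simply fills a handful of columns with random noise, so that no real relationship exists, and reports the strongest correlation found among all pairs. The function names, the column counts, and the seed are assumptions chosen for illustration.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_variables=40, n_observations=10, seed=0):
    """Correlate every pair of pure-noise columns; return (pair count, max |r|)."""
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(n_observations)]
        for _ in range(n_variables)
    ]
    pairs = list(itertools.combinations(range(n_variables), 2))
    best = max(abs(pearson(columns[i], columns[j])) for i, j in pairs)
    return len(pairs), best


n_pairs, best = strongest_chance_correlation()
print(f"{n_pairs} pairs tested; strongest correlation by chance: {best:.2f}")
```

With k variables there are k(k-1)/2 pairs to compare, so the number of chances for a fluke grows much faster than the number of variables. The strongest correlation in the noise looks impressive even though, by construction, it means nothing.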
This problem crops up all the time in one of the original applications of big data—genetics. The endless “fishing expeditions” conducted by scientists who are content to sequence whole genomes and go diving into them looking for correlations can turn up all sorts of unhelpful results.
In some cases, big data is as likely to confuse as it is to enlighten.
When companies start using big data, they are wading into the deep end of a number of tough disciplines—statistics, data quality, and everything else that comprises “data science.” Just as in the kind of science that is published every day—and as often, ignored, revised, or never verified—the pitfalls are many.
Biases in how data are collected, a lack of context, gaps in what’s gathered, artifacts of how data are processed and the overall cognitive biases that lead even the best researchers to see patterns where there are none mean that “we may be getting drawn into particular kinds of algorithmic illusions,” said MIT Media Lab visiting scholar Kate Crawford. In other words, even if you have big data, it’s not something that Joe in the IT department can tackle—it may require someone with a PhD, or the equivalent amount of experience. And when they’re done, their answer to your problem might be that you don’t need “big data” at all.
So what’s better—big data or small?
Does your business need data? Of course. But buying into something as faddish as the supposed importance of the size of one’s data is the kind of thing only pointy-haired Dilbert bosses would do. The same issues that have plagued science since its inception—data quality, overall goals and the importance of context and intuition—are inherent in the way that businesses use data to make decisions. Remember: Gregor Mendel uncovered the secrets of genetic inheritance with just enough data to fill a notebook. The important thing is gathering the right data, not gathering some arbitrary quantity of it.
This week I’m attending STC Summit 2013, the annual conference of the Society for Technical Communication. I’ll blog about the sessions I attend, and give you some links to other news I hear about too. You’ll find my posts under the tag stc13 on this blog.
Michael Opsteegh is about to present a session called Planning and Creating Engaging Infographics. I’m delighted to be here, having survived the Atlanta Ghost Tour last night and just two hours’ sleep.
Introduction
Michael started by discussing the graph on the front page of the Wall Street Journal this morning. Like most of us, he looked at the chart but didn’t read the article. So the only information he got was from the infographic on the side of the page.
Infographics are a powerful way of making information accessible and showing the relationships between pieces of information. You can weave a story consisting of graphs, images and more.
This presentation will focus mostly on the presentation of data, rather than the maths. The focus is on planning and building charts, graphs and larger infographics.
Examples of infographics
We saw a number of examples, and Michael talked us through the plus and minus points.
Infographics can be very persuasive, and can convey a lot of information. A graph, for example, is easier to digest and remember than a lot of text.
Sometimes they are overused. As a result, some people don’t like them. Still, they’re overall very popular.
Infographics can also be fun. Michael showed us one based on a Batman theme.
There’s also a lot of room for misrepresentation.
Uses other than selling products and services
You could use an infographic for your resume. A website called visualize.me will produce an infographic based on your LinkedIn profile. But Michael recommends that you do the infographic yourself, rather than ending up with one based on a template.
There’s an infographic showing the wealth gap in America. It incorporates videos and charts, showing what people think the income difference is versus the actual situation. Unfortunately, it’s not easy to see who created the infographic. If someone isn’t prepared to acknowledge they created an infographic, then it may be difficult to trust it.
Skills required to create infographics
The creation of an infographic involves several disciplines. Michael has combined them into three areas:
- Liberal arts: Your infographic needs to tell a story, and it needs to be interesting. Companies are looking more and more to creative people to differentiate their products and services.
- Social sciences: You need some knowledge of human behaviour and cognitive sciences, so that you understand how your infographic will be received and how to convey the information.
- Mathematics: You need to recognise if you’ve misrepresented your data, and understand the basics.
What about graphic design? If you have the skills, that’s great. Otherwise, hire someone to do the design. You give them the information and the specification for what the chart should look like.
Tools
You need to be able to record your thoughts and ideas, and also questions you have. Michael finds Evernote very useful, because he can jot down notes wherever he is. Evernote syncs the notes from his phone, tablet, PC. You can also include photos, links, videos.
Excel is ubiquitous and powerful. Use it to sort your data and produce preliminary graphs, to help see what your information will look like. Use pivot tables to sort and filter data. Michael demonstrated how you can drill down into data via pivot tables, then generate a graph.
Illustrator or Photoshop are useful, if you are going to design your own infographic. Michael recommends Illustrator, because it’s great for vector tools and also includes a graph tool.
Visualising data
Bar charts, which can be vertical or horizontal. These are good for comparing figures side by side.
Pie charts are OK for representing data as a whole, and the different percentages within them. But research shows that people aren’t capable of seeing the distinctions well. A doughnut chart is just like a pie chart, with the centre missing. This is even less useful than a pie chart, because you lose the angles at the centre. Bar charts are usually better.
Scatter charts are good for finding patterns in the data.
Line graphs are a little like scatter charts, except that you’re dropping the points at regular intervals.
An area chart is basically a line graph filled in. Good for demonstrating changes over time. The Wall Street Journal chart this morning is an example.
Venn diagrams show relationships between discrete objects. The overlap shows the shared parts.
Flow charts (pedigree charts) show hierarchy or workflows.
Pictograms or iconographs show set numbers. Michael showed a page with a number of figures of people. Each figure might represent 1 million, for example.
There are many other types, like radial charts and maps. See the Wall Street Journal’s guide to designing infographics. Also the Napkin Sketch Workbook by Don Moyer.
Research
This is a critical stage. You need reliable and accurate data before you can move forward.
Identify your sources: must be current, reliable, non-biased.
Get permission to use the data. If a company conducted the research, for example.
Editing
This is where you decide what story you’re going to tell, and how you will tell it. Be aware, as you’re editing, that people will call you out if they find an anomaly or if they want to view it in a different way. So, play with different ways of viewing the data. See if there’s another way to tell your story.
Look for outliers in your data, and see how they affect the message.
What about rounding your numbers? Make sure you round at the end, after you’ve plotted the data. If you do it before, it will skew the graph.
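Here’s a small sketch of my own (not from Michael’s slides) showing how early rounding can skew a figure. The numbers are made up, and `share_percent` is a hypothetical helper.

```python
def share_percent(values, index):
    """Percentage of the total contributed by values[index], rounded at the end."""
    return round(values[index] / sum(values) * 100)


raw = [1.4, 1.4, 1.4, 1.8]               # made-up figures for illustration
rounded_early = [round(v) for v in raw]  # becomes [1, 1, 1, 2]

late = share_percent(raw, 3)             # work from the raw data, round last
early = share_percent(rounded_early, 3)  # work from data that was already rounded
print(late, early)
```

The last item accounts for 30% of the raw total, but rounding the inputs first inflates it to 40%.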
If you’re going to place charts side by side, make sure you’re not comparing apples and pears. Make sure you’re using the right figures to illustrate a point. For bar charts, always start the axis at zero. For other graphs, if you need to start elsewhere make it very clear.
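To get a feel for why bar charts should start at zero, here’s a quick sketch (my own, with made-up values) that compares how tall two bars are drawn relative to each other. The helper `drawn_height_ratio` is hypothetical.

```python
def drawn_height_ratio(small, large, axis_start=0.0):
    """Ratio of drawn bar heights (large / small) when the value axis starts at axis_start."""
    if axis_start >= small:
        raise ValueError("axis_start must be below the smaller value")
    return (large - axis_start) / (small - axis_start)


honest = drawn_height_ratio(95, 100)                    # axis starts at zero
truncated = drawn_height_ratio(95, 100, axis_start=90)  # axis starts at 90
print(f"{honest:.2f} vs {truncated:.2f}")
```

With the axis at zero, the taller bar is drawn about 5% taller, which matches the data. Start the axis at 90 and the same bar is drawn twice as tall.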
If you’re missing data, you may still be able to create the infographic. If you’re missing more than 2 points out of 10, then your infographic will not be reliable. Look at the data that’s missing and decide if it affects the perception of your story.
Plotting
This is the most fun part. The point where you actually draw the infographic.
Make sure you’re staying true to the data. Remain aware of the maths involved.
If you’re plotting several graphs for the same infographic, you’ll need to wireframe them. A wireframe is basically a set of boxes or circles (in Illustrator) to represent where the bits of data will go. The advantage is that you can move the sections around, before actually drawing them. Look at where the infographic will appear, to decide whether it needs to be tall and thin, or wide and short. Make sure your dimensions are correct.
Reviewing
Make sure your infographic visually represents the data that it ought to. Get a couple of colleagues to take a look and give you feedback. Ask them if there’s anything that worries them.
Ethical considerations
Throughout the process, make sure you don’t misrepresent the data.
Remember: Correlation is not causation. Michael showed us two line graphs that could show that ice cream consumption leads to murder.
Make sure the story you are trying to tell needs telling, and that it will benefit the audience.
Accessibility
There was a lively discussion around accessibility. Michael recommends you put a textual description on the page, near the infographic. An alternative is the new “longdesc” attribute. Don’t use the “alt” attribute, as it’s intended for a short description.
Thanks Michael
Thank you for an informative introduction to infographics. I’m keen to get my hands dirty creating one!
The Hidden Biases in Big Data
by Kate Crawford | 2:00 PM April 1, 2013
This looks to be the year that we reach peak big data hype. From wildly popular big data conferences to columns in major newspapers, the business and science worlds are focused on how large datasets can give insight on previously intractable challenges. The hype becomes problematic when it leads to what I call "data fundamentalism," the notion that correlation always indicates causation, and that massive data sets and predictive analytics always reflect objective truth. Former Wired editor-in-chief Chris Anderson embraced this idea in his comment, "with enough data, the numbers speak for themselves." But can big data really deliver on that promise? Can numbers actually speak for themselves?
Sadly, they can't. Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves.
For example, consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don't represent the whole picture. The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city's high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster. Very few messages originated from more severely affected locations, such as Breezy Point, Coney Island and Rockaway. As extended power blackouts drained batteries and limited cellular access, even fewer tweets came from the worst hit areas. In fact, there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate. We can think of this as a "signal problem": Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.
While massive datasets may feel very abstract, they are intricately linked to physical place and human culture. And places, like people, have their own individual character and grain. For example, Boston has a problem with potholes, patching approximately 20,000 every year. To help allocate its resources efficiently, the City of Boston released the excellent StreetBump smartphone app, which draws on accelerometer and GPS data to help passively detect potholes, instantly reporting them to the city. While certainly a clever approach, StreetBump has a signal problem. People in lower income groups in the US are less likely to have smartphones, and this is particularly true of older residents, where smartphone penetration can be as low as 16%. For cities like Boston, this means that smartphone data sets are missing inputs from significant parts of the population — often those who have the fewest resources.
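The signal problem can be made concrete with a small simulation. This is a rough sketch under stated assumptions, not a model of StreetBump itself: it supposes two hypothetical neighborhoods with the same number of potholes, assumes each pothole is reported with a probability equal to local smartphone penetration, and uses 0.16 from the figure quoted above alongside an invented 0.60 for a better-connected area.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

TRUE_POTHOLES = 2000  # hypothetical, identical in both neighborhoods

# Share of residents with smartphones. 0.16 is the low figure quoted in the
# text; 0.60 is an assumed value for a better-connected neighborhood.
penetration = {
    "older, lower-income area": 0.16,
    "well-connected area": 0.60,
}

def simulate_reports(true_count, p):
    """Count reported potholes when each one is detected with probability p."""
    return sum(1 for _ in range(true_count) if rng.random() < p)

reports = {name: simulate_reports(TRUE_POTHOLES, p)
           for name, p in penetration.items()}

# Inverse-probability weighting: scale each raw count by 1 / penetration.
adjusted = {name: reports[name] / penetration[name] for name in reports}

for name in reports:
    print(f"{name}: {reports[name]} reports, "
          f"adjusted estimate {adjusted[name]:.0f}")
```

The raw counts make the well-connected area look far worse even though the roads are identical. The reweighting step only helps when penetration is known and not close to zero, so it is no substitute for asking who is missing from the data.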
Fortunately Boston's Office of New Urban Mechanics is aware of this problem, and works with a range of academics to take into account issues of equitable access and digital divides. But as we increasingly rely on big data's numbers to speak for themselves, we risk misunderstanding the results and in turn misallocating important public resources. This could well have been the case had public health officials relied exclusively on Google Flu Trends, which mistakenly estimated that peak flu levels reached 11% of the US public this flu season, almost double the CDC's estimate of about 6%. While Google will not comment on the reason for the overestimation, it seems likely that it was caused by the extensive media coverage of the flu season, creating a spike in search queries. Similarly, we can imagine the substantial problems if FEMA had relied solely upon tweets about Sandy to allocate disaster relief aid.
Big data's signal problems won't disappear as the use of smartphones and other digital technologies increases. As the geographers Michael Crutcher and Matthew Zook noted after Hurricane Katrina, technologies are always differentially adopted, and "any divide in accessing digital technology is not a one-time event but a constantly moving target as new devices, software and cultural practices emerge." As we move into an era in which personal devices are seen as proxies for public needs, we run the risk that already existing inequities will be further entrenched. Thus, with every big data set, we need to ask which people are excluded. Which places are less visible? What happens if you live in the shadow of big data sets?
This points to the next frontier: how to address these weaknesses in big data science. In the near term, data scientists should take a page from social scientists, who have a long history of asking where the data they're working with comes from, what methods were used to gather and analyze it, and what cognitive biases they might bring to its interpretation (for more, see "Raw Data is an Oxymoron"). Longer term, we must ask how we can bring together big data approaches with small data studies: computational social science with traditional qualitative methods. We know that data insights can be found at multiple levels of granularity, and by combining methods such as ethnography with analytics, or conducting semi-structured interviews paired with information retrieval techniques, we can add depth to the data we collect. We get a much richer sense of the world when we ask people the why and the how, not just the "how many". This goes beyond merely conducting focus groups to confirm what you already want to see in a big data set. It means complementing data sources with rigorous qualitative research. Social science methodologies may make the challenge of understanding big data more complex, but they also bring context-awareness to our research to address serious signal problems. Then we can move from the focus on merely "big" data towards something more three-dimensional: data with depth.
In the 1960s, mainframe computers posed a significant technological challenge to common notions of privacy. That's when the federal government started putting tax returns into those giant machines, and consumer credit bureaus began building databases containing the personal financial information of millions of Americans.
STORY HIGHLIGHTS
- Bruce Schneier: Whether we like it or not, we're being tracked all the time on the Internet
- Schneier: Our surveillance state is efficient beyond the wildest dreams of George Orwell
- He says governments and corporations are working together to keep things that way
- Schneier: Slap-on-the-wrist fines notwithstanding, no one is agitating for better privacy laws
Editor's note: Bruce Schneier is a security technologist and author of "Liars and Outliers: Enabling the Trust Society Needs to Survive."
(CNN) -- I'm going to start with three data points.
One: Some of the Chinese military hackers who were implicated in a broad set of attacks against the U.S. government and corporations were identified because they accessed Facebook from the same network infrastructure they used to carry out their attacks.
Two: Hector Monsegur, one of the leaders of the LulzSec hacker movement, was identified and arrested last year by the FBI. Although he practiced good computer security and used an anonymous relay service to protect his identity, he slipped up.
And three: Paula Broadwell, who had an affair with CIA director David Petraeus, similarly took extensive precautions to hide her identity. She never logged in to her anonymous e-mail service from her home network. Instead, she used hotel and other public networks when she e-mailed him. The FBI correlated hotel registration data from several different hotels -- and hers was the common name.
The Internet is a surveillance state. Whether we admit it to ourselves or not, and whether we like it or not, we're being tracked all the time. Google tracks us, both on its pages and on other pages it has access to. Facebook does the same; it even tracks non-Facebook users. Apple tracks us on our iPhones and iPads. One reporter used a tool called Collusion to track who was tracking him; 105 companies tracked his Internet use during one 36-hour period.
Increasingly, what we do on the Internet is being combined with other data about us. Unmasking Broadwell's identity involved correlating her Internet activity with her hotel stays. Everything we do now involves computers, and computers produce data as a natural by-product. Everything is now being saved and correlated, and many big-data companies make money by building up intimate profiles of our lives from a variety of sources.
Facebook, for example, correlates your online behavior with your purchasing habits offline. And there's more. There's location data from your cell phone, there's a record of your movements from closed-circuit TVs.
This is ubiquitous surveillance: All of us being watched, all the time, and that data being stored forever. This is what a surveillance state looks like, and it's efficient beyond the wildest dreams of George Orwell.
Sure, we can take measures to prevent this. We can limit what we search on Google from our iPhones, and instead use computer web browsers that allow us to delete cookies. We can use an alias on Facebook. We can turn our cell phones off and spend cash. But increasingly, none of it matters.
There are simply too many ways to be tracked. The Internet, e-mail, cell phones, web browsers, social networking sites, search engines: these have become necessities, and it's fanciful to expect people to simply refuse to use them just because they don't like the spying, especially since the full extent of such spying is deliberately hidden from us and there are few alternatives being marketed by companies that don't spy.
This isn't something the free market can fix. We consumers have no choice in the matter. All the major companies that provide us with Internet services are interested in tracking us. Visit a website and it will almost certainly know who you are; there are lots of ways to be tracked without cookies. Cellphone companies routinely undo the web's privacy protection. One experiment at Carnegie Mellon took real-time videos of students on campus and was able to identify one-third of them by comparing their photos with publicly available tagged Facebook photos.
Maintaining privacy on the Internet is nearly impossible. Forget even once to enable your protections, or click on the wrong link, or type the wrong thing, and you've permanently attached your name to whatever anonymous service you're using. Monsegur slipped up once, and the FBI got him. If the director of the CIA can't maintain his privacy on the Internet, we've got no hope.
In today's world, governments and corporations are working together to keep things that way. Governments are happy to use the data corporations collect -- occasionally demanding that they collect more and save it longer -- to spy on us. And corporations are happy to buy data from governments. Together the powerful spy on the powerless, and they're not going to give up their positions of power, despite what the people want.
Fixing this requires strong government will, but they're just as punch-drunk on data as the corporations. Slap-on-the-wrist fines notwithstanding, no one is agitating for better privacy laws.
So, we're done. Welcome to a world where Google knows exactly what sort of porn you all like, and more about your interests than your spouse does. Welcome to a world where your cell phone company knows exactly where you are all the time. Welcome to the end of private conversations, because increasingly your conversations are conducted by e-mail, text, or social networking sites.
And welcome to a world where all of this, and everything else that you do or is done on a computer, is saved, correlated, studied, passed around from company to company without your knowledge or consent; and where the government accesses it at will without a warrant.
Welcome to an Internet without privacy, and we've ended up here with hardly a fight.
The opinions expressed in this commentary are solely those of Bruce Schneier.
Data companies are scooping up enormous amounts of information about almost every American. They sell information about whether you're pregnant or divorced or trying to lose weight, about how rich you are and what kinds of cars you have.
Regulators and some in Congress have been taking a closer look at these so-called data brokers — and are beginning to push the companies to give consumers more information and control over what happens to their data.
But many people still don't even know that data brokers exist.
Here's a look at what we know about the consumer data industry.
How much do these companies know about individual people?
They start with the basics, like names, addresses and contact information, and add on demographics, like age, race, occupation and "education level," according to consumer data firm Acxiom's overview of its various categories.
But that's just the beginning: The companies collect lists of people experiencing "life-event triggers" like getting married, buying a home, sending a kid to college — or even getting divorced.
Credit reporting giant Experian has a separate marketing services division, which sells lists of "names of expectant parents and families with newborns" that are "updated weekly."
The companies also collect data about your hobbies and many of the purchases you make. Want to buy a list of people who read romance novels? Epsilon can sell you that, as well as a list of people who donate to international aid charities.
A subsidiary of credit reporting company Equifax even collects detailed salary and paystub information for roughly 38 percent of employed Americans, as NBC news reported. As part of handling employee verification requests, the company gets the information directly from employers.
Equifax said in a statement that the information is only sold to customers "who have been verified through a detailed credentialing process." It added that if a mortgage company or other lender wants to access information about your salary, they must obtain your permission to do so.
Of course, data companies typically don't have all of this information on any one person. As Acxiom notes in its overview, "No individual record ever contains all the possible data." And some of the data these companies sell is really just a guess about your background or preferences, based on the characteristics of your neighborhood, or other people in a similar age or demographic group.
Where are they getting all this info?
The stores where you shop sell it to them.
Datalogix, for instance, which collects information from store loyalty cards, says it has information on more than $1 trillion in consumer spending "across 1400+ leading brands." It doesn't say which ones. (Datalogix did not respond to our requests for comment.)
Data companies usually refuse to say exactly what companies sell them information, citing competitive reasons. And retailers also don't make it easy for you to find out whether they're selling your information.
But thanks to California's "Shine the Light" law, researchers at U.C. Berkeley were able to get a small glimpse of how companies sell or share your data. The study recruited volunteers to ask more than 80 companies how the volunteers' information was being shared.
Only two companies actually responded with details about how volunteers' information had been shared. Upscale furniture store Restoration Hardware said that it had sent "your name, address and what you purchased" to seven other companies, including a data "cooperative" that allows retailers to pool data about customer transactions, and another company that later became part of Datalogix. (Restoration Hardware hasn't responded to our request for comment.)
Walt Disney also responded and described sharing even more information: not just a person's name and address and what they purchased, but their age, occupation, and the number, age and gender of their children. It listed companies that received data, among them companies owned by Disney, like ABC and ESPN, as well as others, including Honda, HarperCollins Publishing, Almay cosmetics, and yogurt company Dannon.
But Disney spokeswoman Zenia Mucha said that Disney's letter, sent in 2007, "wasn't clear" about how the data was actually shared with different companies on the list. Outside companies like Honda only received personal information as part of a contest, sweepstakes, or other joint promotion that they had done with Disney, Mucha said. The data was shared "for the fulfillment of that contest prize, not for their own marketing purposes."
Where else do data brokers get information about me?
Government records and other publicly available information, including some sources that may surprise you. Your state Department of Motor Vehicles, for instance, may sell personal information, such as your name, address, and the type of vehicles you own, to data companies, although only for certain permitted purposes, including identity verification.
Public voting records, which include information about your party registration and how often you vote, can also be bought and sold for commercial purposes in some states.
Are there limits to the kinds of data these companies can buy and sell?
Yes, certain kinds of sensitive data are protected — but much of your information can be bought and sold without any input from you.
Federal law protects the confidentiality of your medical records and your conversations with your doctor. There are also strict rules regarding the sale of information used to determine your credit-worthiness, or your eligibility for employment, insurance and housing. For instance, consumers have the right to view and correct their own credit reports, and potential employers have to ask for your consent before they buy a credit report about you.
Other than certain kinds of protected data — including medical records and data used for credit reports — consumers have no legal right to control or even monitor how information about them is bought and sold. As the FTC notes, "There are no current laws requiring data brokers to maintain the privacy of consumer data unless they use that data for credit, employment, insurance, housing, or other similar purposes."
So they don't sell information about my health?
Actually, they do.
Data companies can capture information about your "interests" in certain health conditions based on what you buy — or what you search for online. Datalogix has lists of people classified as "allergy sufferers" and "dieters." Acxiom sells data on whether an individual has an "online search propensity" for a certain "ailment or prescription."
Consumer data is also beginning to be used to evaluate whether you're making healthy choices.
One health insurance company recently bought data on more than three million people's consumer purchases in order to flag health-related actions, like purchasing plus-sized clothing, the Wall Street Journal reported. (The company bought purchasing information for current plan members, not as part of screening people for potential coverage.)
Spokeswoman Michelle Douglas said that Blue Cross and Blue Shield of North Carolina would use the data to target free programming offers to their customers.
Douglas suggested that it might be more valuable for companies to use consumer data "to determine ways to help me improve my health" rather than "to buy my data to send me pre-paid credit card applications or catalogs full of stuff they want me to buy."
Do companies collect information about my social media profiles and what I do online?
Yes.
As we highlighted last year, some data companies record — and then resell — all kinds of information you post online, including your screen names, website addresses, interests, hometown and professional history, and how many friends or followers you have.
Acxiom said it collects information about which social media sites individual people use, and "whether they are a heavy or a light user," but that they do not collect information about "individual postings" or your "lists of friends."
More traditional consumer data can also be connected with information about what you do online. Datalogix, the company that collects loyalty card data, has partnered with Facebook to track whether Facebook users who see ads for certain products actually end up buying them at local stores, as the Financial Times reported last year.
Is there a way to find out exactly what these data companies know about me?
Not really.
You have the right to review and correct your credit report. But with marketing data, there's often no way to know exactly what information is attached to your name — or whether it's accurate.
Most companies offer, at best, a partial picture.
While Acxiom lets consumers review some of the information the company sells about them, New York Times reporter Natasha Singer discovered this summer that only a sliver of information is shared, including whether you have a prison record or bankruptcy filings.
When Singer finally received her report, all it included was a record of her residential addresses.
Some companies do offer more access. A spokeswoman for Epsilon said it allows consumers to review "high level information" about their data — like whether or not you're listed as making a purchase in the "home furnishings" category. (Requests to review this information cost $5 and can only be made by postal mail.)
Rapleaf, a company that advertises that it has "real-time data" on 80 percent of U.S. email addresses, says that it gives customers "total control over the data we have on you," and allows them to review and edit the categories (like "estimated household income" and "Likely Political Contributor to Republicans") that Rapleaf has connected with their email addresses.
How do I know when someone has purchased data about me?
Most of the time, you don't.
When you're checking out at a store and a cashier asks you for your Zip code, the store isn't just getting that single piece of information. Acxiom and other data companies offer services that allow stores to use your Zip code and the name on your credit card to pinpoint your home address — without asking you for it directly.
Is there any way to stop the companies from collecting and sharing information about me?
Yes, but it would require a whole lot of work.
Many data brokers offer consumers the chance to "opt out" of being included in their databases, or at least from receiving advertising enabled by that company. Rapleaf, for instance, has a "Permanent opt-out" that "deletes information associated with your email address from the Rapleaf database."
But to actually opt-out effectively, you need to know about all the different data brokers and where to find their opt-outs. Most consumers, of course, don't have that information.
In their privacy report last year, the FTC suggested that data brokers should create a centralized website that would make it easier for consumers to learn about the existence of these companies and their rights regarding the data they collect.
How many people do these companies have information on?
Basically everyone in the U.S. and many beyond it. Acxiom, recently profiled by the New York Times, says it has information on 500 million people worldwide, including "nearly every U.S. consumer."
After the 9/11 attacks, CNN reported, Acxiom was able to locate 11 of the 19 hijackers in its database.
How is all of this data actually used?
Mostly to sell you stuff. Companies want to buy lists of people who might be interested in what they're selling — and also want to learn more about their current customers.
They also sell their information for other purposes, including identity verification, fraud prevention and background checks.
If new privacy laws are passed, will they include the right to see what data these companies have collected about me?
Unlikely.
In a report on privacy last year, the Federal Trade Commission recommended that Congress pass legislation "that would provide consumers with access to information about them held by a data broker." President Barack Obama has also proposed a Consumer Privacy Bill of Rights that would give consumers the right to access and correct certain information about them.
But this probably won't include access to marketing data, which the Federal Trade Commission considers less sensitive than data used for credit reports or identity verification.
In terms of marketing data, "we think at the very least consumers should have access to the general categories of data the companies have about consumers," said Maneesha Mithal of the FTC's Division of Privacy and Identity Protection.
Data companies have also pushed back against the idea of opening up marketing profiles for individual consumers' inspection.
Even if there were errors in your marketing data profile, "the worst thing that could happen is that you get an advertising offer that isn't relevant to you," said Rachel Thomas, the vice president of government affairs at the Direct Marketing Association.
"The fraud and security risks that you run by opening up those files is higher than any potential harm that could happen to the consumer," Thomas said.
Microsoft principal researcher Kate Crawford (@katecrawford) gave a strong talk at last week’s Strata Conference in Santa Clara, Calif., about the limits of big data. She pointed out potential biases in data collection, questioned who may be excluded from it, and hammered home the constant need for context in conclusions.
Crawford explored many of these same topics in our interview, which follows.
What research are you working on now, following up on your paper on big data?
Kate Crawford: I’m currently researching how big data practices are affecting different industries, from news to crisis recovery to urban design. This talk was based on that upcoming work, touching on questions of smartphones as sensors, on dealing with disasters (like Hurricane Sandy), and new epistemologies — or ways we understand knowledge — in an era of big data.
When “Six Provocations for Big Data” came out in 2011, we were critiquing the very early stages of big data and social media. In the two years since, the issues we raised are even more prominent.
I’m now looking beyond social media to a range of other areas where big data is raising questions of social justice and privacy. I’m also editing a special issue on critiques of big data, which will be coming out later this year in the International Journal of Communication.
As more nonprofits and governments look to data analysis in governing or services, what do they need to think about and avoid?
Kate Crawford: Governments have a responsibility to serve all citizens, so it’s important that big data doesn’t become a proxy for “data about everyone.” There are two problems here: first is the question of who is visible and who isn’t represented; the second is privacy, or what I call “privacy practices” — because privacy means different things depending on where and who you are.
For example, the Streetbump app is brilliant. What city wouldn’t want to passively draw on data from all those smartphones out there, a constantly moving network of sensors? But, as we know, there are significant percentages of Americans who don’t have smartphones, particularly older citizens and those with lower disposable incomes. What happens to their neighborhoods if they generate no data? They fall off the map. To be invisible when governments make resource decisions is dangerous.
Then, of course, there’s the whole issue of people signing up to be passively tracked wherever they go. People may happily opt into it, but we’d want to be very careful about who gets that data, and how it is protected over the long term — not just five years, but 50 years and beyond. Governments might be tempted to use that data for other purposes, even civic ones, and this has significant implications for privacy and the expectations citizens have for the use of their data.
Where else could such biases apply?
Kate Crawford: There are many areas where big data bias is a problem from a social equity perspective. One of the key ones at the moment is law enforcement. I’m concerned by some of the work that seeks to “profile” areas, and even people, as likely to be involved in crime. It’s called “predictive policing.” We’ve already seen some problematic outcomes when profiling was introduced for plane travel. Now, imagine what happens if you or your neighborhood falls on the wrong side of a predictive model. How do you even begin to correct the record? Which algorithm do you appeal to?
What are the things, as David Brooks listed recently, that big data can’t do?
Kate Crawford: There are lots of things that big data can’t do. It’s useful to consider the history of knowledge, and then imagine what it would look like if we only used one set of tools, one methodology for getting answers.
This is why I find people like Gabriel Tarde so interesting — he was grappling with ideas of method, big data and small data, back in the late 1800s.
He reminds us of what we can lose sight of when we go up orders of magnitude and try to leave small-scale data behind — like interviewing people, or observing communities, or running limited experiments. Context is key, and it is much easier to be attentive to context when we are surrounded by it. When context is dissolved into so many aggregated datasets, we can start getting mistaken impressions.
Google Flu Trends mistakenly predicted that 11% of the US had flu this year, which shows how relying on a big data signal alone may give us an exaggerated or distorted result (in that case, more than double the actual figure, which was between 4.5% and 4.8%). Now, imagine how much worse it would be if that data was all that health agencies had to work with.
I’m really interested in how we might best combine computational social science with traditional qualitative and ethnographic methods. With a range of tools and perspectives, we’re much more likely to get a three-dimensional view of a problem and be less prone to serious error. This goes beyond tacking a few focus groups onto big datasets; it means conjoining deep, ethnographically-informed research with rich data sources.
What can the history of statistics in social science tell us about correlation vs causation? Does big data change that dynamic?
Kate Crawford: This is a gigantic question, and one that could be its own talk! With big datasets, it’s very tempting for researchers to engage in apophenia — seeing patterns where none actually exist — because massive quantities of data can point to a range of correlative possibilities.
For example, David Leinweber showed back in 2007 that data mining techniques could show a strong but spurious correlation between the changes in the S&P 500 stock index and butter production in Bangladesh. There’s another great correlation between the use of Facebook and the rise of the Greek debt crisis.
With big data techniques, some people argue you can get much closer to being able to predict causal relations. But even here, big data tends to need several steps of preparation (data “cleaning” and pre-processing) and several steps in interpretation (deciding which of many analyses shows a positive result versus a null result).
Basically, humans are still in the mix, and thus it’s very hard to escape false positives, strained correlations and cognitive bias.
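To make the apophenia point concrete, here is a minimal sketch of how searching through enough unrelated series almost always turns up a "strong" correlation. Everything in it is an illustrative assumption: the series are pure random noise, and the function names, the ten data points, and the thousand candidates are hypothetical choices, not figures from Leinweber's study.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def best_spurious_match(n_points=10, n_candidates=1000, seed=0):
    """Return the strongest correlation found between one random 'target'
    series and many random, unrelated candidate series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    # No series here has any real relationship to the target, yet the best
    # match among many candidates will usually look impressively strong.
    print(f"strongest correlation found: {best_spurious_match():.2f}")
```

The more candidate series you test, and the fewer data points each one has, the stronger the best accidental match tends to be, which is why a correlation found by searching needs independent confirmation.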
When I first started in the technology field some 35 years ago, we had a common acronym, KISS, which stood for “Keep It Simple, Stupid.”
The purpose of the phrase was to remind IT professionals that they needed to speak with people using basic business language. One’s ability to lose the IT jargon was critical, especially when presenting to business executives and managers. The message remains true today—most certainly for those CIOs who need to articulate the strategic value of their IT organizations to executives who want higher returns and lower costs.
But it’s not so simple.
We are being overwhelmed with the use of digital data to make decisions. You can’t read much these days without immediately seeing buzzwords like “Big Data,” “Business Analytics,” or “Business Intelligence.” Discussing process issues is passé; how to deal with data is the “in” thing for business discussions, especially at board meetings.
In a recent column, I cited some examples where data only adds complexity to decision making. More and more, I’m seeing other ways in which we’ve become increasingly reliant on data testing. For example, you can’t even get a basic call-center job without being asked to take a battery of online assessments that determine if you are a match.
These tests are really about identifying the existence of an outlier—a word that no analytics person wants to use. Human resources folks usually call it data matching, but from where I stand, they are simply eliminating people that differ from those that they hire—they eliminate those that deviate from the norm.
I recently saw an article on CareerBuilder.com that addressed the issue from the recruiting side. Rob Sentz, vice president of marketing for EMSI, highlighted the difficulties that Big Data can bring—and how it could lead to bad decisions.
“The biggest limit to big data is our ability to interpret it. People need to understand why they are using data. What is the end goal?” Sentz said. “Data is also like an assembly of facts, which aren’t necessarily the same thing as truth. If facts are poorly interpreted, it could lead to the wrong conclusions.”
Consider some people who could be classified as outliers: Albert Einstein could not get an appointment as a professor, so he got a job in the Swiss patent office, where his deviancy from the norm led him to a miracle year of inventions that changed the scientific world. Winston Churchill, at age 65, was originally considered much too old and difficult to work with to be a prime minister; he went on to save England during World War II.
Outliers can change the world, yet in the world of Big Data, they would never have a chance—quants would cancel them out for diverging from the trend. Seems to me we could have lost some wars depending on such Big Data people.
Don’t get me wrong: Data is critical. But history suggests that it plays tricks on our ability to objectively understand all of the variables that are at play in the world. So be careful: Although many professionals tell you that the data is only one of many decision points, I have found that too many people rely too heavily on its information. But as we have seen, the data can lie!
Dr. Arthur Langer sits on three faculties at Columbia University and oversees executive masters programs in IT management. He is also founder and chairman of Workforce Opportunity Services, a nonprofit that helps companies build stronger talent pipelines by training underserved young adults and military veterans.
Not long ago, I was at a dinner with the chief executive of a large bank. He had just had to decide whether to pull out of Italy, given the weak economy and the prospect of a future euro crisis. The C.E.O.
by Drew Skau, 1 year ago
Data Visualization is a relatively new field and as such, it has a lot of maturing to do. And part of that process is determining what is acceptable practice. At Visual.ly, we’ve decided that it is important to have a visible code of ethics, because it establishes a standard of quality, helps us garner trust from clients, users and viewers, and gives our team a sense of confidence and pride in their work.
But how do you develop a visualization-specific code of ethics? In many ways, visualization is similar to journalism. In fact, many – if not most – large newspapers have created dedicated visualization departments, which produce some of the highest-quality data visualizations we see today. That’s hardly coincidental. Much like journalists, data visualization professionals have to collect data and information and then represent it to the public in the most truthful way possible.
Such similarities make codes of ethics created for journalism very appropriate for the data visualization community. The Society of Professional Journalists’ code of ethics is a perfect fit for the general ideas behind ethical visualization.
But there are still some specifics that need to be covered. The visualization process involves several complex steps, and ethical procedures need to be practiced throughout, so that the final result is pure. We’ve outlined the three basic steps below.
1. Data collection
Data collection is pretty easy: data sources must be reliable and verifiable, attribution should be given whenever possible, dates should be included, etc. For more on finding reliable and verifiable sources, read our blog post on researching and sourcing infographics.
2. Data Analysis
This is where you find the “story” that goes into your visualization, and depending on what you are creating, the steps you take in your analysis can vary greatly. Sometimes, the data source is very simple and there isn’t much analysis necessary. Other times, the data has multiple complex stories in it, and the analysis must be done carefully to only find truths.
It is important to leave out assumptions and only look at what the source data actually shows. If you have to make some basic assumptions, and if these assumptions aren’t obviously visible in the finished product, you need to make them known with annotations. Because the data analysis happens behind closed doors, so to speak — a viewer can’t see what exactly it is that you did — this is the stage where the viewer needs to trust the presenter to have done their job well.
3. Design
The final stage is actually creating the visuals. Since the cognitive processes that make visualization work are still being researched, creating a comprehensive guide to ethical visualization is difficult. Still, we have plenty to work with to create a solid base of ethics requirements.
- When designing, try to accurately portray the data and analysis, using the visuals you choose.
- Be aware of things like the hierarchy of importance of visual properties and best labeling practices. Colors alone have a huge range of issues, from cultural meaning to isoluminance to colorblindness.
- To really do visualization responsibly, immerse yourself in the world of visualization. Do lots of reading on the subject, examine any visualization you see with a critical eye, and be open to criticism yourself.
At VisWeek 2011, Jason Moore suggested a Hippocratic oath for visualization. It is shown below as it appears on Robert Kosara’s blog. It is intended to be succinct and easy to remember, while still containing the essence of responsible visualization:
I shall not use visualization to intentionally hide or confuse the truth which it is intended to portray. I will respect the great power visualization has in garnering wisdom and misleading the uninformed. I accept this responsibility willfully and without reservation, and promise to defend this oath against all enemies, both domestic and foreign.
Drew Skau is a PhD Computer Science Visualization student at UNCC, with an undergraduate degree in Architecture.
Infauxgraphics, Beware!
Like most people in the design community, I love infographics. Instead of leaving your eyes to glaze over a set of numbers, they report and clarify data through a visualized narrative. As proof of their popularity, infographics have started to appear everywhere. Magazines such as Fast Company frequently feature infographics from other parts of the Web on their Twitter feed. Their formula seems to work: show an infographic; take a position; and write about it.
I agree that it’s tempting to want to write about an infographic that looks interesting, but it’s also really important to consider the way the data is presented to your readers.
In a recent post by Fast Company, Facebook is Winning Silicon Valley’s Talent War, the writer assumes that Facebook is “stealing the most talent away from others” in the Valley. Briefly looking over the infographic that’s shown in the article (below), I can see why he would make such an assumption:
Facebook has the most arrows pointing to it. Facebook’s notoriety may have also played a role in the author’s conclusion.
When I first saw this, I was a bit skeptical since it didn’t convey the data that it was trying to show. For example, the lines are all equal height even though their magnitudes differed from one another. According to Edward Tufte’s The Visual Display of Quantitative Information, this is known as “a distortion in a data graphic.”
A graphic does not distort if the visual representation of the data is consistent with the numerical representation. —Edward Tufte
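In the same book, Tufte proposes a way to put a number on this kind of distortion: the "Lie Factor," the size of the effect shown in the graphic divided by the size of the effect in the data. A rough sketch of that calculation follows. The function names and the sample numbers are hypothetical, chosen only for illustration; they are not the figures from the Fast Company graphic.

```python
def effect_size(first, second):
    """Relative change from the first value to the second."""
    if first == 0:
        raise ValueError("first value must be non-zero")
    return abs(second - first) / abs(first)


def lie_factor(data_first, data_second, drawn_first, drawn_second):
    """Tufte's Lie Factor: effect shown in the graphic / effect in the data.

    A value near 1 means the drawing is faithful; values well above or
    below 1 mean the drawing exaggerates or understates the data.
    """
    data_effect = effect_size(data_first, data_second)
    if data_effect == 0:
        raise ValueError("the data values must differ")
    return effect_size(drawn_first, drawn_second) / data_effect


if __name__ == "__main__":
    # Hypothetical: data values 10 and 40, both drawn as lines 5 units long.
    print(lie_factor(10, 40, 5, 5))    # drawing shows no difference at all
    # Hypothetical: the same data drawn at lengths 5 and 20.
    print(lie_factor(10, 40, 5, 20))   # drawing matches the data
```

Lines of equal size for unequal quantities give a factor of zero: the graphic shows none of the difference that exists in the numbers.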
Another problem was the “1 to 1” connection between LinkedIn and Apple. Technically, a green arrow should be pointing towards LinkedIn from Apple and a cyan arrow towards Apple from LinkedIn since both of these companies have the same ratio. This would have upped LinkedIn from three inward facing arrows to four (one short of Facebook).
As mentioned in Fast Company’s post (in the comment section), I wanted to take a stab at redoing this infographic based on the points mentioned above.
For starters, I pulled the small set of numbers into a Google spreadsheet and right off the bat, the numbers didn’t support the article’s assumption that Facebook was winning the talent war. It turns out that LinkedIn, and NOT Facebook, was the top drawer of talent, even if it was only by a small margin.
I then pulled these numbers into a bar graph, which shows the breakdown of hires for each company.
According to the article,
Google is drawing most of its talent from the stodgy halls of Microsoft; Facebook, meanwhile, is drawing from other hot startups.
Based on the bar graph above, it is apparent that Facebook has drawn more employees “from the stodgy halls of Microsoft”: 25 more than Google, to be exact. My assumption here is that in the original infographic, the red arrow that points from Microsoft to Google, along with its close proximity, might have visually deceived the writer.
The next step was to bring all this information together into the final infographic, which is very similar to the original graphic except that the lines are now proportionate to their respective magnitudes.
In the updated infographic, there is no doubt that LinkedIn and Facebook are head-to-head in acquiring employees from other companies in the Valley. To better illustrate this, I removed all the other paths except for the ones associated with LinkedIn and Facebook.
Conclusion
As infographics become more prominent on the web, we should be mindful about what is being presented to us. No matter how pretty an infographic is, always question the way it positions the data for our consumption.
Yesterday, under the headline, "The saddest graph you'll see today," Dylan Matthews at the Washington Post published this infographic created by the Enliven Project to put the legal issues around rape, its prosecutions, and concerns about false accusations into perspective. The graphic quickly made the rounds on Twitter and Facebook, but unfortunately, while well-intentioned, it is also misleading in significant ways that can be used to undercut its basic message, which is sound: that false rape accusations are rare.
The persistent myth that false accusations are common makes it incredibly difficult for victims to get justice—the overwhelming threat of being accused of making it all up to cover up for one's slutty ways (see recently: Steubenville, Notre Dame, Cleveland) is enough to make women simply not report. Those who do report run a very high chance of never seeing a conviction, some because police drop the case on the slut-and-liar grounds and some because juries buy the defense attorney's claim that the victim bizarrely preferred being publicly accused of being a slut and liar to quietly forgetting about a night of forced sex.
Sadly, the graphic meant to set the record straight on false accusations only confuses matters. Three major problems jump out:
The graphic assumes one-rape-per-rapist. Looking at the above picture, one might start to get the impression that every other man you meet is a rapist. Nearly one in five women have been raped, according to the latest substantive government numbers, and infographics like this might make people conclude therefore that one in five men is a rapist. In reality, a much smaller (though still troubling) number—an estimated 6 percent of men—are rapists. Your average rapist stacks up six victims. That's hard to capture in an infographic, but could be clearer by just labeling the little dudes "rapes" instead of "rapists." After all, the fact that most rapists are repeat offenders drives home how troubling it is that victims can't find justice. If more rapists saw a jail cell the first time they raped someone, the number of victims would decline dramatically.
The graphic overestimates the number of unreported rapes. It's hard to measure how many rapes go unreported, because, duh, unreported. Making it even harder to get an accurate count, a lot of rape victims don't identify as rape victims, because it's so stigmatized. Still, improved public education has made it easier for rape victims to report. RAINN (the Rape, Abuse and Incest National Network), using government numbers, estimates that 54 percent of rapes go unreported. Tweaking the infographic to reflect this more conservative number wouldn't make the image less convincing, but it would make it more accurate.
The graphic overestimates the number of false accusations. This infographic is intended to drive home how rare false accusations are, and yet, because of a simple error, it overestimates how many actually occur. The problem is that the Enliven Project conflates "false reports," which only require the claim that a crime has happened, with "false accusations," which require fingering a supposed perpetrator. This might seem like a small thing, but this report from the National Center for the Prosecution of Violence Against Women, which focuses in part on teaching law enforcement to understand and root out false reports of rape, is very careful to warn against conflating the two. In its list of potential indicators of a false report, the Center specifically singles out the lack of a named perpetrator as something to look out for:
To summarize material developed by McDowell and Hibler (1987), realistic indicators of a false report could potentially include:
• A perpetrator who is either a stranger or a vaguely described acquaintance who is not identified by name. As previously discussed, most sexual assault perpetrators are actually known to their victims. Identifying the suspect is therefore not typically a problem. However, victims who fabricate a sexual assault report may not want anyone to actually be arrested for the fictional crime. Therefore, they may say that they were sexually assaulted by a stranger or an acquaintance who is only vaguely described and not identified by name.
Emphasis mine. According to the document, 2-8 percent of reported rapes are false, but the number that are false accusations is smaller. Women who make false reports want sympathy, and as victims of real rapes can tell you, accusing a real man usually gets you very little.
As I said above, the Enliven Project has the best intentions and they're on the right path. It is true that most rapes go unreported, that the public believes false accusations are exponentially more common than they actually are, and that a man's chances of being falsely accused of rape are incredibly small. All these things are important to convey, and an infographic is a great way to do it. Just fix the graphic, and the public will learn a lot.
This post includes a list of recommended (non-mandatory) readings for all my Introduction to Infographics and Data Visualization courses.
The main reading of those classes is my own book: The Functional Art: An Introduction to Information Graphics and Visualization.
A good example of how to criticize and redesign a flawed chart
Robert Kosara/EagerEyes
Visual encoding
Michael Dubakov
On the differences between information-data-scientific visualization(s)
Sheila Pontis
Data visualization for human perception
Stephen Few
The 8 hats of data visualization design
Andy Kirk
How to become a data visualization expert: A recipe
Enrico Bertini
How to choose your graphic: Graphic Cheat Sheet
Billion Dollar Graphics
How to choose the right chart
Carla Uriona
A survey of powerful visualization techniques, from the obvious to the obscure
Reif Larsen
Infographics and visualizations as tools for the mind
Alberto Cairo
Bringing infographics and visualization to the mainstream
Alberto Cairo
Ending the infographic plague
Megan McArdle
A Quick Illustrated History of Visualization
DataArt
Interaction design for data visualizations
Lars Grammel
The Science of Information Visualization
Robert Kosara
How Much Data Do You Really Need?
Robert Kosara
The Explanatory Power of Data Points
Robert Kosara
The Three Types of Chart Junk
Robert Kosara
Using data visualization to find insights in data
Gregor Aisch
About Nigel Holmes
Robert Kosara
Interviews with Edward Tufte: one, and two
A history of dishonest Fox Charts
MediaMatters
Visualizing Social Facts: Otto Neurath's ISOTYPE project
Frank Hartmann
When maps shouldn't be maps
Matt Ericson
Droughts on deadline
Kevin Quealy
Using indexed charts to represent change
Chandoo
The future of data visualization
Drew Skau
10 things you can learn from The New York Times' data visualizations
Andy Kirk
How Governments can better use data visualization
Jon Schwabish
Fast thinking and slow thinking visualization
Spatial Analysis
Tower Graphics
Lulu Pinney
The process of creating data visualizations
Jan Willem Tulp
Data art vs. data visualization: Why does a distinction matter?
Stephen Few
Data visualization: Clarity of Aesthetics (part 1, part 2, part 3)
Ben Jones
Stacked area chart vs. Line chart – The great debate
Andy Kriebel
Word Clouds considered harmful
Jacob Harris
How to display headlines and intros in graphics
Storytelling with data
The case for horizontal bar graphs
Storytelling with data
(Now, a few about data journalism)
Computational Journalism reading list
Jonathan Stray
Open data journalism
Simon Rogers
IT professionals in the newsroom
George Wright
Speaking of Graphics: An Essay on Graphicacy in Science, Technology and Business
Paul J. Lewi
PRESENTATIONS/VIDEOS/PODCASTS
What makes a good data visualization? With Manuel Lima, Kaiser Fung, Jonathan Stray, and others
Me, one and two (listen to all the other ones, by the way)
Noah Iliinsky (who is author of a nice intro to visualization)
RECOMMENDED BOOKS
Besides my own The Functional Art, see List 1 and list 2
SOME BLOGS TO FOLLOW
In no particular order. I've just copied them from my RSS reader in the way they are (un)organized in there. Copy and paste in your browser:
http://visualisingdata.com/
http://eagereyes.org
http://flowingdata.com/
http://www.guardian.co.uk/news/datablog
http://infosthetics.com/
http://www.storytellingwithdata.com/
http://blog.visual.ly/
http://michaelbabwahsingh.com/
http://feltron.tumblr.com/
http://thedailyviz.com/
http://thewhyaxis.info/
http://www.perceptualedge.com/blog/
http://junkcharts.typepad.com/junk_charts/
http://www.excelcharts.com/blog/
http://blogs.forbes.com/naomirobbins/
http://www.visualcomplexity.com/vc/
http://fellinlovewithdata.com/
http://well-formed-data.net/
http://chartsnthings.tumblr.com/
http://lulupinney.co.uk
http://dataremixed.com/
http://storiesthroughdata.blogs.lincoln.ac.uk/
http://www.interactive-infographics.com/
16 April 2012
By Fiona Graham, Technology of business reporter, BBC News
Sitting at your desk in the middle of the day, yet another email notification pops up in the corner of the screen, covering the figures you're trying to digest in the complicated spreadsheet in front of you.
Your laptop is open on the desk next to you with another set of figures you need - meanwhile you're frantically tabbing through different documents on the main screen.
You have a meeting in 20 minutes and you suddenly feel as if you're swimming in a sea of impenetrable data, and you're starting to sink.
Welcome to the 21st Century workplace, and "data overload".
Under siege
You're not alone.
Dr Lynda Shaw is a neuroscience and psychology lecturer at Brunel University in the west of London.
"I've been interviewing a lot of senior businesspeople lately, and they're actually hiding... because they're frightened they're going to be asked questions they can't answer, so they're delaying making really quite important decisions," she says.
"When we're inundated with emails, Twitter, Facebook, social media, search engines like Google, it's as if we're expected to know more than we actually do, and we can't retain that level of information, that bombardment.
"When we feel overwhelmed we start to delay making decisions."
Dr Shaw says this is a symptom of the computer age.
"We've really seen this incredible amount of information flooding us constantly. The problem with information overload is really new to the human brain."
She says this ultimately has huge implications for us both personally, and in terms of business - with obvious implications for productivity.
"When we're in a stressful situation, cortisol, the stress hormone, rises. One of the jobs of cortisol is to work with the neurotransmitters. So when it is up we experience memory loss, depression, high blood pressure."
And the rate at which we are bombarded with data on a daily basis is increasing exponentially.
According to Cisco's Visual Networking Index, average global IP traffic in 2015 will reach 245 terabytes per second, equivalent to 200m people streaming an HD movie at the same time every day.
Within the next three years, there will be nearly 15bn network connections via devices and nearly 3bn internet users - more than 40% of the world's population.
So short of switching off the PC and going out and doing something more interesting instead, what can we do about it?
Pretty as a picture
One answer may lie with the way data is presented to us.
Data visualisation v text
- Individuals working with visual mapping techniques used on average 19% less cognitive resources
- They were 17% more productive and 4.5% better able to recall details than when using the equivalent traditional software
- Groups working together on a project used on average 10% less cognitive resources
- They were 8% more productive and recalled 6.5% more data when using visual mapping compared with traditional techniques
In a lab in Sussex a group of people have had their brainwaves scanned while completing a series of tasks, individually and in groups, to see if data visualisation - presenting information visually, in this case a series of mind maps - can help.
The results showed that when tasks were presented visually rather than using traditional text-based software applications, individuals used around 20% less cognitive resources. In other words, their brains were working a lot less hard.
As a result, they performed more efficiently, and could remember more of the information when asked later. Working in groups, they used 10% less mental resources.
The research was carried out by Mindlab International, an independent research company that specialises in neurometrics - the science of measuring patterns of brain activity through EEG, eye tracking and skin conductivity, which tracks emotions.
"The key reason we do the work that we do is that most of our decision making, yours and mine, goes on in the subconscious, or auto pilot or whatever we call it. Our cognitive brain can't actually deal with the bombardment of messages that are streamed to our bodies constantly all the time," says Duncan Smith, Mindlab International's managing director.
The research was commissioned by work management software specialists Mindjet, and used their MindManager software. All participants were familiar with both this and traditional text based word-processing software, email etc.
"We did expect that visual mapping would perform better purely and simply because this is the way the brain is wired up. We don't work as a filing cabinet, we don't work in a linear fashion," says Mr Smith.
"If you present data visually it has much more impact and the brain finds it much easier to process."
San Francisco-based Mindjet specialises in mind maps - diagrams that present ideas, words and any other form of data grouped round a central key theme. The company says 83% of Fortune 100 companies are using its products.
Mindjet's Chris Harman says the research was commissioned following a survey the company did at the end of 2011 which found two-thirds of people felt they were "drowning" in data.
"We thought we know the problem, what difference can we actually make?"
Visually stimulating
Data visualisation is not limited to mind maps - the current vogue for infographics is another way to present information in a non-linear visual fashion.
Data visualisation expert David McCandless's Information is Beautiful website showcases good examples of data design.
Visual.ly gives designers a platform to upload and showcase work as well as providing tools to create your own. Google Fusion and d3.js create simple visual representations of data, Quantum GIS and OpenHeatMap use maps and data together. And there are many more.
Phillipa Cardinal is post-production manager at Discovery Europe.
A large part of her job involves consolidating and analysing data. To do this she uses Tableau Software, which lets her create data visualisations accessible from a central dashboard.
"To me it's quite obvious when I'm exploring the data, things just pop out at you that you might not see if it was in a text-based environment," she says.
"Being able to consolidate all these different bits of data onto a dashboard that we use for reporting upwards to our senior management team, you're able to really tell a story with it.
"You want to make sure that the time they spend looking at the data is used effectively."
Francois Ajenstat is director of product management at Tableau Software. He says the benefits of data visualisation are obvious.
"The first is it can help you make sense of data - I think that's actually quite fundamental especially as the amount of data that is collected every single day is growing exponentially. I think we're collecting more data in the last year than has ever been created in history.
"How do you make sense of that? It's more than just getting a report, it's about being able to see it, and seeing it with your eyes and the visual element of your brain is actually very, very powerful.
"Seeing a number of bars and a line you can infer very quickly what is going on versus if you just look at numbers."
Brain function specialist Dr Lynda Shaw says by using these tools and others to minimise the overload, the growth in data can be a positive thing for all of us.
"The visual brain is this incredibly flexible and adaptable design to help us see and remember and make sense of everything around us."
"If we can stop feeling overwhelmed ... we can actually start enjoying this information, and by enjoying it we might be able to increase our brain capacity because we're using it better."
Bad design does irreversible harm to an infographic. It trivializes good content (for the rare visitor who chooses to wade through the ugliness and confusion instead of fleeing in horror at the sight of it) and reflects badly on you and your business.
There's no substitute for working with a good designer, but if that isn't an option, here are seven tips to keep your design on track:
1) Edit your content first: People can fall in love with their content. There's a tendency by some, including your superior, to want to include everything they researched or wrote for a graphic. This can create horribly cluttered and unsuccessful visualizations, like trying to cram five pounds of garbage into a two-pound bag.
One reason this congestion occurs is that the graphic is written before it's designed, and then all the text is crammed into a layout. A good way to keep the text to a reasonable amount is to lay out the graphic first (using tips 2 through 5 below) and then write the text to fit the allocated space.
2) Break your content down into sections: Every subject can be broken down into about three to five component parts, making your graphic much easier to follow and digest. What would the sections be? What would the subsections be? Make an outline and write out each subsection's title (you can edit it later if it doesn't fit in the layout). Another cool way to break your content down is visually, using a mind map. I particularly like to use this free one, Bubbl.us.
3) Organize your layout with a column grid: Most layout programs like Microsoft Publisher, Adobe InDesign and even Word allow you to make some sort of a lined grid that will help give structure to your graphic. Don't make a grid with too many lines (like graph paper) because it will be overwhelming to design with, but don't make too few lines, either, like only three. You want enough to give you some options for carving up the space.
(Notice how the boxes lock into the grid.)
4) Lay out your graphic in a logical way, and keep it simple. The goal of your design should be to take the reader by the hand and lead them easily through the information. Too many people try to show off their fun, creative side in a graphic, and that can often be fatal for easy navigation. In Western cultures, we tend to read from the top left of a graphic (where headlines often live) and work our way down to the bottom right (where tiny source lines live).
5) Make the topic's main point the largest element: Think about what a poster's job is. It's to grab someone's attention and give them an idea about what the content is about. This can be true for an infographic. It's a good idea to make the main point you're trying to get across the dominant element in your graphic, but make sure you have enough content in this section to justify its size.
(One of my Newsweek graphics that uses a large U.S. map as its dominant image.)
6) Word art is not high art! Keep your type simple. A lot of the 'art' type that is available in some programs like Word and PowerPoint can cause howls of derisive laughter among your readers. And avoid cliché fonts like Comic Sans and Papyrus. All they really do is erode a graphic's credibility. When in doubt, go with Helvetica or Arial and use a hierarchy of sizes and weights. The headline can be the largest because it's where you want people's eyes to go first, then smaller section heads and then the body text. And keep text black. Leave color for the graphic elements.
(The type sizes are a suggestion only, and won't work for all graphics.)
7) Use color for a good reason: Just because there are a billion colors available in these graphics programs doesn't mean you should use them all. A good philosophy with color is to keep it minimal and to use it for guiding the reader to important information. Color the less important elements with muted colors, like grays and tans, and use one or two saturated (bright) colors as "accent colors" where you want the reader's eye to go. Consider your company's branding when choosing colors. Also, you want the colors in your palette to go together like a good outfit you'd wear, and you can find color families by searching terms like this on the web.
(A graphic I made that uses a red accent color surrounded by muted colors.)
PLEASE SHARE YOUR LATEST VISUALIZATION CHALLENGE WITH US IN THE COMMENTS SECTION. WHAT'S THE STORY BEHIND IT? Add a link if you like.
Follow Karl Gude on Twitter: www.twitter.com/karlgude
This week we looked at how to determine if what you think you’re seeing in your data is actually there. It was a warp speed introduction to some of the major truth-finding methods. Most of the ideas behind the methods are centuries or occasionally millennia old, but they were very much fleshed out in the 20th century.
“Figuring out what is true from what we can see” is called inference, and begins with a strong feel for how probability works, and what randomness looks like. Take a look at this picture (from the paper Graphical Inference for Infovis), which shows how well 500 students did on each of nine questions, each of which is scored from 0-100% correct.
Is there a pattern here? It looks like the answers on question 7 cluster around 75% and then drop off sharply, while the answers for question 6 show a bimodal distribution — students either got it or they didn’t.
Except that this is actually completely random synthetic data, drawn from a uniform distribution (equal chance of every score). It’s very easy to make up narratives and see patterns that aren’t there, a human tendency called apophenia. To avoid fooling yourself, the first step is to get a feel for what randomness actually looks like. It tends to have a lot more structure, purely by chance, than most people imagine.
Here’s a real-world example from the same paper. Suppose you’re interested in knowing whether the pollution from the Texas oil industry causes cancer. Your hypothesis is that if refineries or drilling release carcinogens, you’ll see higher cancer rates around specific areas. Here are plots of the cancer rates for each county (darker is more cancer). One of these plots is real data; the rest are randomly generated by switching the counties around.
Can you tell which one is the real data? If you can’t tell the real data from the random data, well then, you don’t have any evidence that there is a pattern to the cancer rates.
In fact, if you show these pictures to people (look at the big version), they will stare at them for a minute or two, and then most folks will pick out plot #3 as the real data, and it is. This is evidence (but not proof) that there is a pattern there that isn’t random — because it looked different enough from the random patterns that you could tell which plot was real.
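The lineup idea described above can be sketched in a few lines of Python. This is only an illustration, not the paper's code: the rates are made-up placeholder values, and `make_lineup` and `n_plots` are hypothetical names. The sketch hides the real data among decoys produced by shuffling the same values across counties.

```python
import random

def make_lineup(real_values, n_plots=6, seed=None):
    """Return (panels, real_index): the real data hidden among shuffled decoys.

    Each decoy keeps the same set of values but assigns them to
    positions (counties) at random, which destroys any spatial pattern.
    """
    rng = random.Random(seed)
    real_index = rng.randrange(n_plots)
    panels = []
    for i in range(n_plots):
        if i == real_index:
            panels.append(list(real_values))
        else:
            decoy = list(real_values)
            rng.shuffle(decoy)
            panels.append(decoy)
    return panels, real_index

# Hypothetical cancer rates for ten counties, listed in a fixed geographic order.
rates = [3.1, 3.4, 3.3, 5.9, 6.2, 6.0, 3.0, 3.2, 2.9, 3.5]
panels, real_index = make_lineup(rates, n_plots=6, seed=1)
```

To use it, draw every panel as an unlabeled map and ask viewers to pick the odd one out. If they choose `real_index` much more often than one time in `n_plots`, that is evidence the real data has structure the shuffled versions lack.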
It’s part of the job of the journalist to understand the odds. In 1976, there was a huge flu vaccination program in the U.S. In early October, 14 elderly people died shortly after receiving the vaccine, three of them in one day. The New York Times wrote in an editorial,
It is conceivable that the 14 elderly people who are reported to have died soon after receiving the vaccination died of other causes. Government officials in charge of the program claim that it is all a coincidence, and point out that old people drop dead every day. The American people have even become familiar with a new statistic: Among every 100,000 people 65 to 75 years old, there will be nine or ten deaths in every 24-hour period under most normal circumstances.
Even using the official statistic, it is disconcerting that three elderly people in one clinic in Pittsburgh, all vaccinated within the same hour, should die within a few hours thereafter. This tragedy could occur by chance, but the fact remains that it is extremely improbable that such a group of deaths should take place in such a peculiar cluster by pure coincidence.
Except that it’s not actually extremely improbable. Nate Silver addresses this issue in his book by explicitly calculating the odds:
Assuming that about 40 percent of elderly Americans were vaccinated within the first 11 days of the program, then about 9 million people aged 65 and older would have received the vaccine in early October 1976. Assuming that there were 5,000 clinics nationwide, this would have been 164 vaccinations per clinic per day. A person aged 65 or older has about a 1-in-7,000 chance of dying on any particular day; the odds of at least three such people dying on the same day from among a group of 164 patients are indeed very long, about 480,000 to one against. However, under our assumptions, there were 55,000 opportunities for this “extremely improbable” event to occur—5,000 clinics, multiplied by 11 days. The odds of this coincidence occurring somewhere in America, therefore, were much shorter—only about 8 to 1.
Silver is pointing out that the editorial falls prey to what might be called the “lottery fallacy.” It’s vanishingly unlikely that any particular person will win the lottery next week. But it’s nearly certain that someone will win. If there are very many opportunities for a coincidence to happen, and you don’t care which coincidence happens, then you’re going to see a lot of coincidences. You can see this effect numerically with even the rough estimation of the odds that Silver has done here.
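Silver's arithmetic is easy to check. The sketch below is one way to do it, and not necessarily how he computed it: it treats deaths as independent and models the number of deaths among a clinic's 164 daily patients as a binomial variable. The helper name `prob_at_least` is made up for this example; the input numbers come from the quoted passage.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summing the upper tail directly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_death = 1 / 7000          # daily chance of death for one person aged 65+
patients = 164              # vaccinations per clinic per day
opportunities = 5000 * 11   # clinics multiplied by days

# Chance of three or more deaths at one particular clinic on one particular day.
p_one_clinic = prob_at_least(3, patients, p_death)
# Chance that it happens at least once across all clinic-days.
p_somewhere = 1 - (1 - p_one_clinic) ** opportunities

odds_one_clinic = (1 - p_one_clinic) / p_one_clinic
odds_somewhere = (1 - p_somewhere) / p_somewhere
print(f"one clinic, one day: about {odds_one_clinic:,.0f} to 1 against")
print(f"somewhere in the country: about {odds_somewhere:.1f} to 1 against")
```

Both results should land close to the figures in the quote. The jump from the first number to the second is the lottery fallacy in numerical form: a tiny per-trial probability multiplied across 55,000 trials is no longer tiny.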
Another place where probabilities are often misunderstood is polling. During the election I saw a report that Romney had pulled ahead of Obama in Florida, 49% to 47% with a 5.5% margin of error. I argued at the time that this wasn’t actually a story, because it was just too likely that Obama was actually still leading and the error in the poll was just that, error. In class we worked the numbers on this example and concluded that there was a 36% chance — so, 1 in 3 odds — that Obama was actually ahead (full writeup here.)
In fact, 5.5% is an unusually high error for a poll, so this particular poll was less informative than many. But until you actually run the numbers on poll errors a few times, you may not have a gut feel for when a poll result is definitive and when it’s very likely to be just noise. As a rough guide, a difference between two numbers of twice the margin of error is almost certain to indicate that the lead is real.
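For readers who want to run the numbers on the Florida poll themselves, here is a rough sketch. It may not be the exact method used in class or in the writeup. It assumes a simple random sample, a margin of error quoted at 95% confidence for a 50% share, and a normal approximation; the function names are made up for this example.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def prob_trailing_candidate_leads(p_lead, p_trail, margin, z_conf=1.96):
    """Rough chance that the candidate who trails in the poll is really ahead."""
    # Sample size implied by the quoted margin of error.
    n = (z_conf * 0.5 / margin) ** 2
    diff = p_lead - p_trail
    # Variance of the difference between two shares from the same sample.
    var_diff = (p_lead + p_trail - diff ** 2) / n
    return normal_cdf(-diff / sqrt(var_diff))

p_obama_ahead = prob_trailing_candidate_leads(0.49, 0.47, 0.055)
print(f"{p_obama_ahead:.0%}")
```

Under these assumptions the result comes out close to the 36% figure above. The key point is that the uncertainty on the gap between two candidates is roughly twice the margin of error quoted for each one, which is why a 2-point lead with a 5.5-point margin says so little.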
If you’re a journalist writing about the likelihood or unlikelihood of some event, I would argue that it is your job to get a numerical handle on the actual odds. It’s simply too easy to deceive yourself (and others!)
Next we looked at conditional probability — the probability that something happens given that something else has already happened. Conditional probabilities are important because they can be used to connect causally related events, but humans aren’t very good at thinking about them intuitively. The classic example of this is the very common base rate fallacy. It can lead you to vastly over-estimate the likelihood that someone has cancer when a mammogram is positive, or that they’re a terrorist if they appear on a watch list.
The correct way to handle conditional probabilities is with Bayes’ Theorem, which is easy to derive from the basic laws of probability. Perhaps the real value of Bayes’ theorem for this kind of problem is that it forces you to remember all of the information you need to come up with the correct answer. For example, if you’re trying to figure out P(cancer | positive mammogram) you really must first know the base rate of cancer in the general population, P(cancer). In this case it is very low because the example is about women under 50, where breast cancer is quite rare to begin with — but if you don’t know that you won’t realize that the small chance of false positives combined with the huge number of people who don’t have cancer will swamp the true positives with false positives.
Then we switched gears from all of this statistical math and talked about how humans come to conclusions. The answer is, badly if you’re not paying attention. You can’t just review all the information you have on a story, think about it carefully, and come to the right conclusion. Our minds are simply not built this way. Starting in the 1970s an amazing series of cognitive psychology experiments revealed a set of standard human cognitive biases, unconscious errors that most people make in reasoning. There are lots of these that are applicable to journalism.
The issue here is not that the journalist isn’t impartial, or acting fairly, or trying in good faith to get to the truth. Those are potential problems too, but this is a different issue: our minds don’t work perfectly, and in fact they fall short in predictable ways. While it’s true that people will see what they want to see, confirmation bias is mostly something else: you will see what you expect to see.
The fullest discussion of these startling cognitive biases (and also, conversely, of how often our intuitive machinery works beautifully) is Thinking, Fast and Slow, by Daniel Kahneman, one of the original researchers. I also know of one paper which talks about how cognitive biases apply to journalism.
So how does an honest journalist deal with these? We looked at the method of competing hypotheses, as described by Heuer. The core idea is ancient, and a core principle of science too, but it bears repetition in modern terms. Instead of coming up with a hypothesis (“maybe there is a cluster of cancer cases due to the oil refinery”) and going looking for information that confirms it, come up with lots of hypotheses, as many as you can think of that explain what you’ve seen so far. Typically, one of these will be “what we’re seeing happened by chance,” often known as the null hypothesis. But there might be many others, such as “this cluster of cancer is due to more ultraviolet radiation at the higher altitude in this part of the country” or many other things. It’s important to be creative in the hypothesis generation step: if you can’t imagine it, you can’t discover that it’s the truth.
Then, you need to go look for discriminating evidence. Don’t go looking for evidence that confirms a particular hypothesis, because that’s not very useful; with the massive amount of information in the world, plus sheer randomness, you can probably always find some data or information to confirm any hypothesis. Instead you want to figure out what sort of information would tell you that one hypothesis is more likely than another. Information that straight out contradicts a hypothesis (falsifies it) is great, but anything that supports one hypothesis more than the others is helpful.
This method of comparing the evidence for different hypotheses has a quantitative equivalent. It’s Bayes’ theorem again, but interpreted a little differently. This time the formula expresses a relationship between your confidence or degree of belief in a hypothesis, P(H), the likelihood of seeing any particular evidence if the hypothesis is true, P(E|H), and the likelihood of seeing any particular piece of evidence whether or not the hypothesis is true, P(E).
To take a concrete example, suppose the hypothesis H is that Alice has a cold, and the evidence E is that you saw her coughing today. But of course that’s not conclusive, so we want to know the probability that she really does have a cold (and isn’t coughing for some other reason). Bayes’ theorem tells us what we need in order to compute P(H|E), or rather P(cold|coughing): the probability of coughing given a cold, P(E|H); the overall probability of having a cold, P(H); and the overall probability of coughing, P(E). Suppose we estimate these as P(E|H) = 0.9, P(H) = 0.05, and P(E) = 0.1.
Under these assumptions, P(H|E) = P(E|H)P(H)/P(E) = 0.9 * 0.05 / 0.1 = 0.45. If you believe your initial estimates of all the probabilities here, then you should believe that there’s a 45% chance she has a cold.
But these are rough numbers. If we start with different estimates we get different answers. If we believe that only 2% of our friends have a cold at any moment then P(H) = 0.02 and P(H|E) = 18%. There is no magic to Bayesian inference; it can seem very precise but it all depends on the accuracy of your models, your picture of how the world works. In fact, examining the fit between models and reality is one of the main goals of modern statistics.
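The calculation is small enough to write down as code, which makes it easy to try different starting estimates. This is a minimal sketch using the numbers from the example above; `posterior` is just an illustrative name.

```python
def posterior(p_e_given_h, p_h, p_e):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return p_e_given_h * p_h / p_e

# 5% of friends have a cold at any moment.
print(round(posterior(0.9, 0.05, 0.1), 2))  # 0.45
# Only 2% of friends have a cold at any moment.
print(round(posterior(0.9, 0.02, 0.1), 2))  # 0.18
```

Like the text, the sketch holds P(E) fixed at 0.1 in both cases. Changing any of the three inputs changes the answer, which is the point: the output is only as good as the estimates that go in.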
There’s probably no need to apply Bayes’ theorem explicitly to every hypothesis you have about your story. Heuer gives a much simpler table-based method that just lists supporting and disproving evidence for each hypothesis. Really the point is just to make you think comparatively about multiple hypotheses, and consider more scenarios and more discriminating evidence than you would otherwise. And not be so excited about confirmatory evidence.
However, there are situations where your hypotheses and data are sufficiently quantitative that Bayesian inference can be applied directly, such as election prediction. Here’s a primer on quantitative Bayesian inference between multiple hypotheses. A vast chunk of modern statistics (most of it?) is built on top of Bayes’ theorem, so this is powerful stuff.
Our final topic was causality. What does it even mean to say that A causes B? This question is deeper than it seems, and a precise definition becomes critical when we’re doing inference from data. Often the problem that we face is that we see a pattern, a relationship between two things — say, dropping out of school and making less money in your life — and we want to know if one causes the other. Such relationships are called correlations, and probably everyone has heard by now that correlation is not causation.
In fact, if we see a correlation between two different variables X and Y, there are only a few real possibilities. Either X causes Y, or Y causes X, or Z causes both X and Y, or it’s just a random fluke.
Our job as journalists is to figure out which one of these cases we are seeing. You might consider them alternate hypotheses that we have to differentiate between.
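The "Z causes both X and Y" case is easy to simulate. The sketch below uses synthetic data and arbitrary, assumed noise levels: X and Y each depend only on a hidden variable Z, and never on each other, yet they end up strongly correlated.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(5000)]   # hidden common cause
x = [zi + rng.gauss(0, 0.5) for zi in z]     # X depends only on Z
y = [zi + rng.gauss(0, 0.5) for zi in z]     # Y depends only on Z

r_xy = pearson(x, y)
print(f"correlation between X and Y: {r_xy:.2f}")
```

Nothing in the code lets X influence Y or the reverse, so intervening on X here would leave Y unchanged. The data alone cannot tell you that; only knowledge of how the data was generated, or an experiment, can.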
But if you’re serious about determining causation, what you actually want is an experiment: change X and see if Y changes. If changing X changes Y then we can definitely say that X causes Y (though of course it may not be the only cause, and Y could cause X too!) This is the formal definition of causation as embodied in the causal calculus. In certain rare cases you can prove cause without doing an experiment, and the causal calculus tells you when you can get away with this.
Finally, we discussed a real-world example. Consider the NYPD stop and frisk data, which gives the date and location of each of the 600,000 stops that officers make on the street every year. You can plot these on a map. Let’s say that we get a list of mosque addresses, and discover that there are 15% more stops than average within 100 meters of New York City’s mosques. Given the NYPD history of spying on Muslims, do we conclude that the police are targeting mosque-goers?
Let’s call that H1. How many other hypotheses can you imagine that will also explain this fact? (We came up with eight in class.) What kind of information or data or tests would you need to decide which hypothesis is the strongest?
The readings for this week were:
What Are Infographics?
I found many definitions of infographics, as well as explanations of how they are useful in a variety of settings. Here are a few of the definitions I liked, each followed by its source:
Information graphics or infographics are graphic visual representations of information, data or knowledge. These graphics present complex information quickly and clearly, such as in signs, maps, journalism, technical writing, and education. With an information graphic, computer scientists, mathematicians, and statisticians develop and communicate concepts using a single symbol to process information. (Wikipedia)
An umbrella term for illustrations and charts that instruct people, which otherwise would be difficult or impossible with only text. Infographics are used worldwide in every discipline from road maps and street signs to the many technical drawings. (PC Magazine)
An easy-to-read illustration that helps tell a story and makes data points easier to understand. And it doesn’t hurt when infographics are not only clear and straightforward but also beautiful and engaging. The aesthetic design draws the viewer in; the information helps the viewer analyze and understand the data being presented. (Visual.ly)
And finally, my favorite: an infographic that explains what an infographic is, by Hot Butter Studio.
These three examples do a nice job of defining what infographics are, but what is the value of an infographic in education? I’m glad you asked because this video does an excellent job of demonstrating how valuable they can be if they are used effectively in an educational setting.
The Value of Visualization from Column Five on Vimeo.
The Science Behind Infographics
Now that we have a basic understanding of what infographics are, let’s look at the science behind them. Recall the part of the video where the narrator asks the viewer to count the number of 7s in the number set. The video explains that comprehension becomes almost instant due to preattentive attributes, or “visual clues” like size, color, and orientation that the brain processes in 250 milliseconds (msec). This chart shows a list of preattentive attributes that infographics use to convey their message clearly.
According to research by Mark Smiciklas, author of the book The Power of Infographics: Using Pictures to Communicate and Connect with Your Audiences, “one of the primary reasons infographics work well as a communication tool can be linked to eyesight and the neurological connection of our eyes and brain.” He goes on to discuss how our brains are hard wired for infographics because vision directly accounts for 50% of our brain’s real estate. Since we are already built to consume information visually, infographics might be easier to process than pure text.
Smiciklas also notes that “Robert Lane and Dr. Stephen Kosslyn offer an explanation for what the brain sees when it comes to pictures vs. words. Each letter in a word is essentially a symbol. To read text, the brain needs to act as a decoder first, matching those letters with shapes stored in memory. From there the brain must figure out how all the letters fit together to form words, how words form sentences, and how sentences form paragraphs.
Although all this comprehension takes place in only a split second, relatively speaking, when compared to how the brain deals with images, the process requires considerably more mental effort.”
Infographics and Education
One line from the Value of Visualization video stands out above all as the key to the value infographics add to instruction: “your message is only as good as your ability to share it.” Using infographics in instruction is an innovative and engaging way to ensure that the message you are sharing is visually appealing and easily digested by your students.
They allow students to comprehend, interpret, and analyze complex information in a quick and clear manner. This, combined with the brain research that supports why infographics are so effective, gives teachers a powerful new tool to use for teaching and learning. For more information on how to use infographics, check out these additional resources from the New York Times blog called The Learning Network. Clicking on this link will give you access to posts on The Learning Network that have been tagged with Infographics.
Wait, it gets better! They take it one step further and elaborate on the Who, What, Where, When, Why, and How each of the infographics could be used in your classroom. I also encourage you to do some research of your own by simply typing in search words like “infographics in education” into a search engine.
You’ll find more sources of infographics than you can shake a stick at, as well as websites where you or your students can make your own infographics. Happy hunting!
- If You Have Time: David McCandless, “The beauty of data visualization” (18:17).
- “Helping students interpret visual representations of information – NYTimes.com.” The Learning Network – The Learning Network Blog – NYTimes.com. N.p., n.d. Web. 10 Oct. 2012. <http://learning.blogs.nytimes.com/2010/08/23/teaching-with-infographics-places-to-start/>.
- “Preattentive processing – InfoVis:Wiki.” Main Page – InfoVis:Wiki. N.p., n.d. Web. 10 Oct. 2012. <http://www.infovis-wiki.net/index.php/Preattentive_processing>.
- Rosenthal Tolisano, Silvia. “Infographics- What? Why? How?.” Langwitches Blog. N.p., n.d. Web. 10 Oct. 2012. <langwitches.org/blog/2010/06/16/infographics-what-why-how/>.
- Smiciklas, Mark. “Infographics and the Science of Visual Communication.” Social Media Explorer. N.p., n.d. Web. 10 Oct. 2012. <www.socialmediaexplorer.com/digital-marketing/infographics-and-the-science-of-visual-communication/>.
- “The Value of Visualization on Vimeo.” Vimeo, Your Videos Belong Here. N.p., n.d. Web. 10 Oct. 2012. <http://vimeo.com/29684853>.
- “What is an Infographic?.” Column Five: Infographics, Data Visualization and Motion Graphics. N.p., n.d. Web. 10 Oct. 2012. <http://columnfivemedia.com/what-is-an-infographic/>.
An exclusive look inside Ground Truth, the secretive program to build the world's most accurate maps.
Behind every Google Map, there is a much more complex map that's the key to your queries but hidden from your view. The deep map contains the logic of places: their no-left-turns and freeway on-ramps, speed limits and traffic conditions. This is the data that you're drawing from when you ask Google to navigate you from point A to point B -- and last week, Google showed me the internal map and demonstrated how it was built. It's the first time the company has let anyone watch how the project it calls GT, or "Ground Truth," actually works.
The company opened up at a key moment in its evolution. The company began as an online search company that made money almost exclusively from selling ads based on what you were querying for. But then the mobile world exploded. Where you're searching has become almost as important as what you're searching for. Google responded by creating an operating system, brand, and ecosystem in Android that has become the only significant rival to Apple's iOS.
And for good reason. If Google's mission is to organize all the world's information, the most important challenge -- far larger than indexing the web -- is to take the world's physical information and make it accessible and useful.
"If you look at the offline world, the real world in which we live, that information is not entirely online," Manik Gupta, the senior product manager for Google Maps, told me. "Increasingly as we go about our lives, we are trying to bridge that gap between what we see in the real world and [the online world], and Maps really plays that part."
This is not just a theoretical concern. Mapping systems matter on phones precisely because they are the interface between the offline and online worlds. If you're at all like me, you use mapping more than any other application except for the communications suite (phone, email, social networks, and text messaging).
Google is locked in a battle with the world's largest company, Apple, about who will control the future of mobile phones. Whereas Apple's strengths are in product design, supply chain management, and retail marketing, Google's most obvious realm of competitive advantage is in information. Geo data -- and the apps built to use it -- are where Google can win just by being Google. That didn't matter on previous generations of iPhones because they used Google Maps, but now Apple's created its own service. How the two operating systems incorporate geo data and present it to users could become a key battleground in the phone wars.
But that would entail actually building a better map.
***
The office where Google has been building the best representation of the world is not a remarkable place. It has all the free food, ping pong, and Google Maps-inspired Christoph Niemann cartoons that you'd expect, but it's still a low-slung office building just off the 101 in Mountain View in the burbs.
I was slated to meet with Gupta and the engineering ringleader on his team, former NASA engineer Michael Weiss-Malik, who'd spent his 20 percent time working on Google Mars, and Nick Volmar, an "operator" who actually massages map data.
"So you want to make a map," Weiss-Malik tells me as we sit down in front of a massive monitor. "There are a couple of steps. You acquire data through partners. You do a bunch of engineering on that data to get it into the right format and conflate it with other sources of data, and then you do a bunch of operations, which is what this tool is about, to hand massage the data. And out the other end pops something that is higher quality than the sum of its parts."
This is what they started out with, the TIGER data from the US Census Bureau (though the base layer could and does come from a variety of sources in different countries).
On first inspection, this data looks great. The roads look like they are all there and you've got the freeways differentiated. This is a good map to the untrained eye. But let's look closer. There are issues where the digital data does not match the physical world. I've circled a few obvious ones below.
And that's just from comparing the map to the satellite imagery. But there are also a variety of other tools at Google's disposal. One is bringing in data from other sources, say the US Geological Survey. But Google's Ground Truthers can also bring another exclusive asset to bear on the maps problem: the Street View cars' tracks and imagery. In keeping with Google's more data is better data mantra, the maps team, largely driven by Street View, is publishing more imagery data every two weeks than Google possessed total in 2006.*
Let's step back a tiny bit to recall with wonderment the idea that a single company decided to drive cars with custom cameras over every road they could access. Google is up to five million miles driven now. Each drive generates two kinds of really useful data for mapping. One is the actual tracks the cars have taken; these are proof-positive that certain routes can be taken. The other is all the photos. And what's significant about the photographs in Street View is that Google can run algorithms that extract the traffic signs and can even paste them onto the deep map within their Atlas tool. So, for a particularly complicated intersection like this one in downtown San Francisco, that could look like this:
Google Street View wasn't built to create maps like this, but the geo team quickly realized that computer vision could get them incredible data for ground truthing their maps. Not to detour too much, but what you see above is just the beginning of how Google is going to use Street View imagery. Think of them as the early web crawlers (remember those?) going out in the world, looking for the words on pages. That's what Street View is doing. One of its first uses is finding street signs (and addresses) so that Google's maps can better understand the logic of human transportation systems. But as computer vision and OCR improve, any word that is visible from a road will become a part of Google's index of the physical world.
Later in the day, Google Maps VP Brian McClendon put it like this: "We can actually organize the world's physical written information if we can OCR it and place it," McClendon said. "We use that to create our maps right now by extracting street names and addresses, but there is a lot more there."
More like what? "We already have what we call 'view codes' for 6 million businesses and 20 million addresses, where we know exactly what we're looking at," McClendon continued. "We're able to use logo matching and find out where are the Kentucky Fried Chicken signs... We're able to identify and make a semantic understanding of all the pixels we've acquired. That's fundamental to what we do."
For now, though, computer vision transforming Street View images directly into geo-understanding remains in the future. The best way to figure out if you can make a left turn at a particular intersection is still to have a person look at a sign -- whether that's a human driving or a human looking at an image generated by a Street View car.
There is an analogy to be made to one of Google's other impressive projects: Google Translate. What looks like machine intelligence is actually only a recombination of human intelligence. Translate relies on massive bodies of text that have been translated into different languages by humans; it then is able to extract words and phrases that match up. The algorithms are not actually that complex, but they work because of the massive amounts of data (i.e. human intelligence) that go into the task on the front end.
Google Maps has executed a similar operation. Humans are coding every bit of the logic of the road onto a representation of the world so that computers can simply duplicate (infinitely, instantly) the judgments that a person already made.
This reality is incarnated in Nick Volmar, the operator who has been showing off Atlas while Weiss-Malik and Gupta explain it. He probably uses twenty-five keyboard shortcuts switching between types of data on the map and he shows the kind of twitchy speed that I associate with long-time designers working with Adobe products or professional Starcraft players. Volmar has clearly spent thousands of hours working with this data. Weiss-Malik told me that it takes hundreds of operators to map a country. (Rumor has it many of these people work in the Bangalore office, out of which Gupta was promoted.)
The sheer amount of human effort that goes into Google's maps is just mind-boggling. Every road that you see slightly askew in the top image has been hand-massaged by a human. The most telling moment for me came when we looked at a couple of the several thousand user reports of problems with Google Maps that come in every day. The Geo team tries to address the majority of fixable problems within minutes. One complaint reported that Google did not show a new roundabout that had been built in a rural part of the country. The satellite imagery did not show the change, but a Street View car had recently driven down the street and its tracks showed the new road perfectly.
Volmar began to fix the map, quickly drawing the new road and connecting it to the existing infrastructure. In his haste (and perhaps with the added pressure of three people watching his every move), he did not draw a perfect circle of points. Weiss-Malik and I detoured into another conversation for a couple of minutes. By the time I looked back at the screen, Volmar had redrawn the circle with perfect precision and upgraded a few other things while he was at it. The actions were impressively automatic. This is an operation that promotes perfectionism.
And that's how you get your maps to look like this:
Some details are worth pointing out. In the top center quadrant, trails have been mapped out and coded as places for walking. All the parking lots have been mapped out. All the little roads, say, to the left of the small dirt patch on the right, have also been coded. Several of the actual buildings have been outlined. Down at the bottom left, a road has been marked as a no-go. At each and every intersection, there are arrows that delineate precisely where cars can and cannot turn.
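To make the idea of "coding the logic of the road" concrete, here is a minimal sketch of how turn restrictions at an intersection might change a computed route. It is not Google's data model: the road graph `ROADS`, the `NO_TURNS` set, and the `route` function are all hypothetical, and the search is a plain breadth-first search over (previous node, current node) states.

```python
from collections import deque

# Hypothetical one-way road graph: each intersection lists where you can drive next.
ROADS = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["E"], "E": []}

# Hypothetical hand-coded restrictions: (came_from, at_intersection, going_to).
NO_TURNS = {("A", "B", "C")}


def route(start, goal, roads, no_turns):
    """Breadth-first search that refuses any turn listed in no_turns."""
    queue = deque([(None, start, [start])])
    seen = {(None, start)}
    while queue:
        prev, node, path = queue.popleft()
        if node == goal:
            return path
        for nxt in roads.get(node, []):
            if (prev, node, nxt) in no_turns:
                continue  # a human marked this turn as not allowed
            state = (node, nxt)
            if state not in seen:
                seen.add(state)
                queue.append((node, nxt, path + [nxt]))
    return None


unrestricted = route("A", "E", ROADS, set())
restricted = route("A", "E", ROADS, NO_TURNS)
```

With no restrictions the search goes through C. Once the turn from A through B to C is marked as forbidden, the same search goes through D instead. The judgment is made once by a person and then replayed by the machine on every query.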
Now imagine doing this for every tile on Google's map in the United States and 30 other countries over the last four years. Every roundabout perfectly circular, every intersection with the correct logic. Every new development. Every one-way street. This is a task of nearly unimaginable scale. This is not something you can put together with a few dozen smart engineers.
I came away convinced that the geographic data Google has assembled is not likely to be matched by any other company. The secret to this success isn't, as you might expect, Google's facility with data, but rather its willingness to commit humans to combining and cleaning data about the physical world. Google's map offerings build in the human intelligence on the front end, and that's what allows its computers to tell you the best route from San Francisco to Boston.***
It's probably better not to think of Google Maps as a thing like a paper map. Geographic information systems are a jump like the abacus to the computer. "I honestly think we're seeing a more profound change, for map-making, than the switch from manuscript to print in the Renaissance," University of London cartographic historian Jerry Brotton told the Sydney Morning Herald. "That was huge. But this is bigger."
The maps we used to keep folded in our glove compartments were a collection of lines and shapes that we overlaid with human intelligence. Now, as we've seen, a map is a collection of lines and shapes with Nick Volmar's (and hundreds of others') intelligence encoded within it.
It's common when we discuss the future of maps to reference the Borgesian dream of a 1:1 map of the entire world. It seems like a ridiculous notion that we would need a complete representation of the world when we already have the world itself. But to take scholar Nathan Jurgenson's conception of augmented reality seriously, we would have to believe that every physical space is, in his words, "interpenetrated" with information. All physical spaces already are also informational spaces. We humans all hold a Borgesian map in our heads of the places we know that we use to navigate and compute physical space. Google's strategy is to bring all our mental maps together and process them into accessible, useful forms.
Their MapMaker product makes that ambition clear. Project managed by Gupta during his time in India, it's the "bottom up" version of Ground Truth. It's a publicly accessible way to edit Google Maps by adding landmarks and data about your piece of the world. It's a way of sucking data out of human brains and onto the Internet. And it's a lot like Google's open competitor, OpenStreetMap, which has proven it, too, can harness the crowd's intelligence.
As we slip and slide into a world where our augmented reality is increasingly visible to us off and online, Google's geographic data may become its most valuable asset. Not solely because of the data itself, but because location data makes everything else Google does and knows more valuable.
Or as my friend and sci-fi novelist Robin Sloan put it to me, "I maintain that this is Google's core asset. In 50 years, Google will be the self-driving car company (powered by this deep map of the world) and, oh, P.S. they still have a search engine somewhere."
Of course, they will always need one more piece of geographic information to make all this effort worthwhile: You. Where you are, that is. Your location is the current that makes Google's giant geodata machine run. They've built this whole playground as an elaborate lure for you. As good and smart and useful as it is, good luck resisting taking the bait.
* Due to a transcription error, an earlier version of this story stated that Google published 20PB of imagery data every two weeks.
Data doesn’t invade people’s lives. Lack of control over how it’s used does.
What’s really driving so-called big data isn’t the volume of information. It turns out big data doesn’t have to be all that big. Rather, it’s about a reconsideration of the fundamental economics of analyzing data.
For decades, there’s been a fundamental tension between three attributes of databases. You can have the data fast; you can have it big; or you can have it varied. The catch is, you can’t have all three at once.
I’d first heard this as the “three V’s of data”: Volume, Variety, and Velocity. Traditionally, getting two was easy but getting three was very, very, very expensive.
The advent of clouds, platforms like Hadoop, and the inexorable march of Moore’s Law means that now, analyzing data is trivially inexpensive. And when things become so cheap that they’re practically free, big changes happen — just look at the advent of steam power, or the copying of digital music, or the rise of home printing. Abundance replaces scarcity, and we invent new business models.
In the old, data-is-scarce model, companies had to decide what to collect first, and then collect it. A traditional enterprise data warehouse might have tracked sales of widgets by color, region, and size. This act of deciding what to store and how to store it is called designing the schema, and in many ways, it’s the moment where someone decides what the data is about. It’s the instant of context.
That needs repeating:
You decide what data is about the moment you define its schema.
With the new, data-is-abundant model, we collect first and ask questions later. The schema comes after the collection. Indeed, big data success stories like Splunk, Palantir, and others are prized because of their ability to make sense of content well after it’s been collected — sometimes called a schema-less query. This means we collect information long before we decide what it’s for.
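A minimal sketch of "collect first, ask questions later" may help. The events, field names, and the `query` helper below are hypothetical; the point is only that the raw records are stored untouched and the schema (which fields matter) is chosen when the question is asked.

```python
import json

# Raw events are stored exactly as they arrived; no schema was chosen up front.
RAW_EVENTS = [
    '{"user": "a", "action": "play", "song": "Song 1", "genre": "jazz"}',
    '{"user": "b", "action": "purchase", "item": "widget", "color": "red"}',
    '{"user": "a", "action": "play", "song": "Song 2", "genre": "blues"}',
]


def query(raw_events, fields, where=None):
    """Apply a schema (the chosen fields) at read time, not at collection time."""
    rows = []
    for line in raw_events:
        event = json.loads(line)
        if where is not None and not where(event):
            continue
        # Missing fields become None: the schema is imposed by the question asked.
        rows.append({name: event.get(name) for name in fields})
    return rows


# The same stored events can answer a question nobody planned for at collection time.
plays = query(RAW_EVENTS, ["user", "genre"], where=lambda e: e.get("action") == "play")
```

Nothing in `RAW_EVENTS` says what the data is for. That decision is made inside each call to `query`, long after collection.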
And this is a dangerous thing.
When bank managers tried to restrict loans to residents of certain areas (known as redlining), Congress stepped in to stop it (with the Fair Housing Act of 1968). They were able to legislate against discrimination, making it illegal to change loan policy based on someone’s race.
Home Owners’ Loan Corporation map showing redlining of “hazardous” districts in 1936.
“Personalization” is another word for discrimination. We’re not discriminating if we tailor things to you based on what we know about you, right? That’s just better service.
In one case, American Express used purchase history to adjust credit limits based on where a customer shopped, despite his excellent credit:
Johnson says his jaw dropped when he read one of the reasons American Express gave for lowering his credit limit: “Other customers who have used their card at establishments where you recently shopped have a poor repayment history with American Express.”
We’re seeing the start of this slippery slope everywhere from tailored credit-card limits like this one to car insurance based on driver profiles. In this regard, big data is a civil rights issue, but it’s one that society in general is ill-equipped to deal with.
We’re great at using taste to predict things about people. OkCupid’s 2010 blog post “The Real Stuff White People Like” showed just how easily we can use information to guess at race. It’s a real eye-opener (and the guys who wrote it didn’t include everything they learned; some of it was a bit too controversial). They simply looked at the words one group used which others didn’t often use. The result was a list of “trigger” words for a particular race or gender.
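The word-comparison idea is simple enough to sketch in a few lines. The following is not the method from the blog post, only an illustrative approximation: the function name, the thresholds, and the toy texts for two neutral groups are all assumptions.

```python
from collections import Counter


def distinctive_words(group_texts, other_texts, min_count=2, ratio=2.0):
    """Return words used at a much higher rate by one group than by everyone else."""
    group = Counter(w for t in group_texts for w in t.lower().split())
    other = Counter(w for t in other_texts for w in t.lower().split())
    group_total = sum(group.values()) or 1
    other_total = sum(other.values()) or 1
    result = []
    for word, count in group.items():
        if count < min_count:
            continue  # too rare to say anything about
        group_rate = count / group_total
        # Add-one smoothing so words absent from the other group don't divide by zero.
        other_rate = (other[word] + 1) / (other_total + 1)
        if group_rate / other_rate >= ratio:
            result.append(word)
    return sorted(result)


# Toy, made-up profile texts for two hypothetical groups.
group_a = ["hiking camping hiking", "camping coffee"]
group_b = ["coffee movies", "movies coffee music"]
triggers = distinctive_words(group_a, group_b)
```

Even this crude ratio test singles out the words that mark one group. With real volumes of text the same approach becomes far more discriminating.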
Now run this backwards. If I know you like these things, or see you mention them in blog posts, on Facebook, or in tweets, then there’s a good chance I know your gender and your race, and maybe even your religion and your sexual orientation. And that I can personalize my marketing efforts towards you.
That makes it a civil rights issue.
If I collect information on the music you listen to, you might assume I will use that data in order to suggest new songs, or share it with your friends. But instead, I could use it to guess at your racial background. And then I could use that data to deny you a loan.
Want another example? Check out Private Data In Public Ways, something I wrote a few months ago after seeing a talk at Big Data London, which discusses how publicly available last name information can be used to generate racial boundary maps:
Screen from the Mapping London project.
This TED talk by Malte Spitz does a great job of explaining the challenges of tracking citizens today, and he speculates about whether the Berlin Wall would ever have come down if the Stasi had access to phone records in the way today’s governments do.
So how do we regulate the way data is used?
The only way to deal with this properly is to somehow link what the data is with how it can be used. I might, for example, say that my musical tastes should be used for song recommendation, but not for banking decisions.
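As a rough illustration of linking what the data is with how it can be used, the sketch below attaches a set of permitted purposes to a piece of data and checks it on every use. The `TaggedData` class, the `use` function, and the purpose names are hypothetical. A real system would also need enforcement that callers cannot bypass, which is exactly the hard part.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaggedData:
    """A value that carries the purposes its owner has agreed to."""
    value: object
    allowed_purposes: frozenset = field(default_factory=frozenset)


class PurposeError(PermissionError):
    """Raised when data is requested for a purpose the owner did not allow."""


def use(data, purpose):
    # Every read must name its purpose; anything not explicitly allowed is refused.
    if purpose not in data.allowed_purposes:
        raise PurposeError(f"data may not be used for {purpose!r}")
    return data.value


tastes = TaggedData(["jazz", "blues"], frozenset({"song_recommendation"}))
recommend_from = use(tastes, "song_recommendation")
```

Calling `use(tastes, "banking_decision")` raises `PurposeError`, because that purpose was never granted.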
Tying data to permissions can be done through encryption, which is slow, riddled with DRM, burdensome, hard to implement, and bad for innovation. Or it can be done through legislation, which has about as much chance of success as regulating spam: it feels great, but it’s damned hard to enforce.
There are brilliant examples of how a quantified society can improve the way we live, love, work, and play. Big data helps detect disease outbreaks, improve how students learn, reveal political partisanship, and save hundreds of millions of dollars for commuters — to pick just four examples. These are benefits we simply can’t ignore as we try to survive on a planet bursting with people and shaken by climate and energy crises.
But governments need to balance reliance on data with checks and balances about how this reliance erodes privacy and creates civil and moral issues we haven’t thought through. It’s something that most of the electorate isn’t thinking about, and yet it affects every purchase they make.
This should be fun.
This post originally appeared on Solve for Interesting. This version has been lightly edited.
First, here are a few, well, data points: Big Data was a featured topic this year at the World Economic Forum in Davos, Switzerland, with a report titled "Big Data, Big Impact." In March, the federal government announced $200 million in research programs for Big Data computing.
Big Data, The Moving Parts: Fast Data, Big Analytics, and Deep Insight (Photo credit: Dion Hinchcliffe)
For those not familiar with the phrase, “Big Data” is used to describe the acquisition, storage and analysis of large quantities of data. The search giant Google was one of the pioneers in this area, and it has developed into an industry worth billions of dollars. Big Data and its uses also raise ethical concerns.
One common use of Big Data is to analyse customer data so as to make predictions that would be useful in conducting targeted ad campaigns. Perhaps the most infamous example of this is Target’s pregnancy targeting. This Big Data adventure was a model of inductive reasoning. First, an analysis was conducted of Target customers who had signed up for Target’s new baby registry. The purchasing history of these women was analysed to find patterns of buying that corresponded to each stage of pregnancy. For example, pregnant women were found to often buy lots of unscented lotion at the start of the second trimester. Once the analysis revealed the buying patterns of pregnant women, Target then applied this information to the buying patterns of women customers. Oversimplifying things, they were essentially using an argument by analogy: inferring that women not known to be pregnant who had X, Y, and Z buying patterns were probably pregnant because women known to be pregnant had X, Y, and Z buying patterns. The women who were tagged as probably pregnant were then subject to targeted ads for baby products and this proved to be a winner for Target, other than some public relations issues.
One interesting aspect of this method is that it does not follow the usual model of predicting a person’s future buying behavior from his/her past buying behavior. An example of predicting future buying behavior based on past behavior would be predicting that I would buy Gatorade the next time I went grocery shopping because I have bought it consistently in the past. The analysis used by Target and other companies differs from this model by making inferences about the future behavior of customers based on their similarity to customers whose past buying behavior is known. For example, a store might see shifts in someone’s buying behavior that match other data from people starting to get into fitness and thus predict the person was getting into fitness. The store might then send the person (and others like her) targeted ads featuring Gatorade coupons because their models show that such people buy more Gatorade.
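The argument by analogy can be sketched as a nearest-neighbour lookup. This is not Target's actual model, which is not described here. The Jaccard similarity measure, the threshold, the function names, and the toy baskets are all assumptions chosen to keep the example small.

```python
def jaccard(a, b):
    """Share of items two baskets have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_label(basket, labeled_baskets, threshold=0.5):
    """Return the label of the most similar known customer, or None if nobody is similar enough."""
    best_label, best_score = None, 0.0
    for known_basket, label in labeled_baskets:
        score = jaccard(basket, known_basket)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Made-up customers whose situation is already known.
KNOWN = [
    ({"running shoes", "protein bars", "sports drink"}, "getting into fitness"),
    ({"garden hose", "potting soil", "seeds"}, "gardening"),
]

guess = predict_label({"running shoes", "sports drink", "socks"}, KNOWN)
no_guess = predict_label({"bread", "milk"}, KNOWN)
```

The new customer never said anything about fitness. The label is borrowed from a known customer whose purchases look similar, which is the inference-by-similarity step described above.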
This method also has an interesting Sherlock Holmes aspect to it. The fictional detective was able to use inductive logic (although he was presented as deducing) to make impressive inferences from seemingly innocuous bits of information. Big Data can do this in reality and make reliable inferences based on what appears to be irrelevant information. For example, likely voting behavior might be inferred from factors such as one’s preferred beverage.
Naturally, Big Data can be used to sell a wide variety of products, including politicians and ideology. It also has non-commercial applications, such as law enforcement and political uses. As such, it is hardly surprising that companies and agencies are busily gathering and analyzing data at a relentless and ever growing pace. This certainly is cause for concern.
One ethical concern is that the use of Big Data can impact the outcome of elections. For example, by analyzing massive amounts of data, a campaign can acquire information that allows ads to be effectively crafted and targeted. Given that Big Data is expensive, the data advantage would tend to go to the side with the most money, thus increasing the influence of money on the outcome of elections. Naturally, the influence of money on elections is already a moral concern. While more spending does not assure victory, there is a clear connection between spending and success. To use but one obvious example, Mitt Romney was able to beat his Republican competitors in part by being able to outlast them financially and outspend them.
In any case, Big Data adds yet another tool and expense to political campaigning, thus making it more costly for people to run for office. This, in turn, means that those running for office will need even more money than before, thus making money an even greater factor than in the past. This, obviously enough, increases the ability of those with more money to influence the candidates and the issues.
On the face of it, it would seem unreasonable to require that campaigns go without Big Data. After all, it could be argued that this would be tantamount to demanding that campaigns operate in ignorance. However, the concerns about big money buying Big Data to influence elections could be addressed by campaign finance reform, which would be another ethical issue.
Perhaps the biggest ethical concern about Big Data is the matter of privacy. First, there is the ethical worry that much of the data used in Big Data is gathered without people knowing how the data will be used (and perhaps that it is even being gathered). For example, the customers at Target seemed to be unaware that Target was gathering such data about them to be analyzed and used to target ads.
While people might know that information is being collected about them, knowing this and knowing that the data will be analyzed for various purposes are two different things. As such, it can be argued that private data is being gathered without proper informed consent and this is morally wrong.
The obvious solution is for data collectors to make clear what the data will be used for, thus allowing people to make an informed choice regarding their private information. Of course, one problem that will remain is that it is rather difficult to know what sort of inferences can be made from seemingly innocuous data. As such, people might think that they are not providing any private data when they are, in fact, handing over data that can be used to make inferences about private matters.
If a business claims that they would be harmed because people would not hand over such information if they knew what it would be used for, the obvious reply is that this hardly gives them the right to deceive to get what they want. However, I do not think that businesses have much to worry about—Facebook has shown that many people are quite willing to hand over private information for little or nothing in return.
A second and perhaps the most important moral concern is that Big Data provides companies and others with the means of making inferences about people that go beyond the available data and into what might be regarded as the private realm. While this sort of reasoning is classic induction, Big Data changes the game because of the massive amount of data and processing power available to make these inferences, such as whether women are pregnant or not. In short, the analysis of seemingly innocuous data can yield inferences about information that people would tend to regard as private—or at the very least, information they would not think would be appropriate for a company to know.
One obvious counter to this is to argue that privacy rights are not being violated. After all, as long as the data used does not violate the privacy of individuals, then the inferences made from this data cannot be regarded as violating people’s privacy, even if the inferences are about matters that people would regard as private (such as pregnancy). To use an analogy, if I were to spy on someone and learn from this that she was an alcoholic, then I would be violating her privacy. However, if I inferred that she is an alcoholic from publicly available information, then I might know something private about her, but I have not violated her privacy.
This counter is certainly appealing. After all, there does seem to be a meaningful and relevant distinction between directly getting private information by violating privacy and inferring private information using public (or at least legitimately provided) data. To use an analogy, if I get the secret ingredient in someone’s prize recipe by sneaking a look at the recipe, then I have acted wrongly. However, if I infer the secret ingredient by tasting the food when I am invited to dinner, then I have not acted wrongly.
A reasonable reply to this counter is that while there is a difference between making an inference that yields private data and getting the data directly, there is also the matter of intent. It is, for example, one thing to infer the secret ingredient simply by tasting it, but it is quite another to arrange to get invited to dinner specifically so I can get that secret ingredient by tasting the food. To use another example, it is one thing to infer that someone is an alcoholic, but quite another to systematically gather public data in order to determine whether or not she is an alcoholic. In the case of Big Data, there is clearly intent to infer data that customers have not already voluntarily provided. After all, if the data had been provided, there would be no need to undertake an analysis in order to get the desired information. Thus, while the means do not involve a direct violation of privacy rights, they do involve an indirect violation—at least in cases in which the data is private (or at least intended to be private).
The solution, which would probably be rather problematic to implement, would involve setting restrictions on what sort of inferences can be made from the data on the grounds that people have a right to keep that information private, even if the means used to acquire it did not involve any direct violations of privacy rights.
Infographics are all the rage these days. (There’s even an infographic to explain the phenomenon.) It makes sense. After all, we’re a visual species. Since our earliest days, images have captured our attention. They have been at the heart of storytelling, one of our first methods of expression and a fundamental tool for education.
Infographics, which are more detailed than photos and convey information more quickly than videos, tap into this visual learning style. They can prove especially powerful in press releases by extending the core message and highlighting the important components to bring the text to life.
Plus, they’re inviting.
According to a recent analysis of press releases by PR Newswire, the inclusion of multimedia assets significantly improves the number of views a message generates. In the age of social media, any advantage in grabbing a slice of your audience’s attention is worth seizing upon.
Infographics cut straight to the point, simplify complex information, and can wow the reader in an instant. As with any piece of content, an infographic must be relevant, interesting, and meaningful; it should not rely solely on eye-catching artwork, nor should the content be overwhelming. An effective infographic elicits an instant reaction and entices people to want to learn more.
What I enjoy most about an infographic is the creative flexibility it affords PR professionals. For example, the press release fits into a fairly standard format. When I’m given the task of writing one—and as a PR pro that happens often—I begin outlining the draft in my head, running through the checklist of elements.
Infographics, on the other hand, help us to detour from our usual template and color outside the lines. This doesn’t mean we have to be artists or designers, though it does require us to revisit our early days and think visually.
Although infographics have a place in almost any message, they are most useful when presenting:
• Survey results that may be cumbersome in a lengthy text format;
• Statistical data that can lose the fleeting interest of a reader;
• Comparison research that will have a more dramatic effect with visuals;
• Messages targeted to multilingual audiences (images are a universal language, right?);
• Any other information that just isn’t sexy without graphical elements.
Numerous companies are taking advantage of infographics and including them in press releases. Here are three great examples:
Hotels.com replaced lengthy lists of the top travel destinations by incorporating them into this graphic.
Has information about food waste ever been so appealing as it is with Emerson’s infographic about the life cycle of food waste using a garbage disposer?
And SC Johnson depicted the shift in consumer environment behaviors with a few simple images.
Meryl Serouya is a marketing and communications associate at PR Newswire.
The collective wisdom in press reports last week was that the USDA’s new “easy to understand” ChooseMyPlate image is “better” than the old pyramid. Well, that’s not saying much. But it’s also completely beside the point. Sure, it’s easy to poke fun at how bad the pyramid image was (and I had a ball doing so in my book), but just comparing images misses the larger issue: that the whole damn exercise of trying to educate the American public with a simple image is beyond pointless — it’s downright insulting.
But before I explain, allow me to get a few things about the new image off my chest. First, the website URL tells us a lot: ChooseMyPlate.gov. The words choose and choice — why are they ringing a bell? Oh yes, they’re favorites of the food industry, to remind us that it’s really all up to individuals to choose to eat a healthy diet, and that companies provide a wide range of choices for us each to choose from. Never mind that for too many Americans, the choices in their neighborhood range from McDonald’s to Burger King. That the government is using such a construction for dietary advice tells us that it doesn’t want to rub industry the wrong way by (God forbid) actually telling Americans how we should eat for optimum health.
Much has been made of how brave it was for USDA to depict half of the plate with fruits and vegetables. Yes, that does represent a significant departure from the past and I’m willing to give some credit here. But that victory is quickly overshadowed by two other, scientifically questionable recommendations: protein and dairy. As Marion Nestle pointed out, protein is not a food, it’s a nutrient, so the meat industry must be very happy to see it represented so prominently, as they have brainwashed the American public for decades into equating “meat” with “protein.” Most Americans eat way too much protein and certainly need no reminders.
But even more troubling is the placement of dairy as a circle image to the side, as if to say the government recommends that we all drink a glass of milk with every single meal, never mind those who are lactose-intolerant or simply choose not to consume dairy. It seems USDA could not make up its mind on whether to recommend food or nutrients on the plate. They recommend “protein” but then why is “dairy” and not “calcium” recommended? Ah, the politics of inconsistent messaging.
OK, now that my griping is out of the way, here’s why nothing that I just wrote even matters: Education alone will not improve dietary habits. The entire exercise of using an image (and a website, etc.) to educate the American public to get us to eat right is doomed to failure, as decades of history have already shown us. This concept is not specific to eating; it applies across the spectrum of public-health issues. To paraphrase my public-health colleague Harold Goldstein: There is not a single public-health crisis in history that has been solved with a brochure.
Name the health behavior you want to change: smoking, drinking, eating, wearing seat belts, wearing bike helmets, having safe sex, etc. — none of them can be changed with just education. Rather, policy is needed to change the environment that people live in to help them make healthier choices. Many articles and books have been written on this subject. Just ask any health educator how hard their job is, especially dietitians.
It’s going to take way more than a measly $2 million educational campaign to get Americans to fill up half their plates with fruits and vegetables. It’s going to take a massive overhaul of our agricultural policies, as is depicted in this handy pie chart from Physicians Committee for Responsible Medicine and as explained by writer Melanie Warner.
It’s also going to take addressing the billions of dollars the food industry spends on marketing each year to keep us from eating off of plates at all. (Perhaps a better image might have been a pizza box or a take-out carton?) It’s especially going to take massive political will to stop the food industry’s predatory marketing of junk food to children. Ironically, the federal government is currently asking for comments on proposed guidelines for food companies to change how they market to kids. Industry is up in arms over it, despite the fact that the guidelines will be completely voluntary. I could go on, but you get the idea.
So, I really don’t care if the new plate is “easier” or “better” than the old pyramid. Even if the plate was full of nothing but locally grown, organic, fresh produce, that image would only serve as a painful reminder to too many Americans that eating that way on a regular basis is sadly out of reach. Only policy can change that.
Originally posted on Edelman Digital
Visual storytelling is nothing new. We only need to look to the earliest signs of humanity for proof: simple paintings on the walls of caves tell the story that people are a visual tribe. Today, it seems, communications must be visual in order to be compelling, as well as to compete with the massive amount of information available to us at any given moment (even Google acknowledged this in 2001 by introducing image search). Whether it’s a web video, infographic, or illustration, visual assets can communicate a wealth of information rapidly, and in ways that our brains process differently than other, more traditional mediums.
The secret to producing these compelling, yet bite-sized morsels of information is having “visual literacy,” or being able to think in pictures. Don’t confuse this with being an artist or designer. Anyone can think visually (or learn to look at the world through this type of lens) and then work with a visual communicator (a designer or producer) to craft a digestible visual deliverable, which earns our time and attention and encourages us to take action.
As someone who thinks visually, I want to share five tips that I believe will work for anyone who is looking to communicate and influence through a medium that transcends the written word:
1. Empathize: See the world as a child
Most of us drew pictures before we began writing. But now that words dominate our communications, it’s possible we have to do some neurological re-wiring to take our brains back to that point where simple, elegant pictures help us tell stories. I recommend three steps: 1) Observe everything, especially the minute details. 2) Ask questions; especially the ones that make you feel unenlightened. 3) Resurrect your sense of exploration; in other words, re-ignite the curious portion of your brain. Children have a way of noticing the little things we take for granted. They are immensely curious and never lack for questions. Putting yourself in a more “child-like” mindset will set the stage for all kinds of thinking, including visual.
2. Memorize: Commit thoughts to memory
Words can be fleeting; they can at times be like the wind, but images often sear into our memory. To start the visual thinking process, it’s helpful to capture thoughts not just in words but also in simple pictures. Stick people and basic shapes are your biggest allies in this stage of transforming yourself into a more visual communicator, and we should never allow our fear of “drawing” to get in the way. My friend, Dave Gray, a great visual thinker, draws better than I do, but I still scrawl down messy shapes when I do my most strategic thinking. What’s important is capturing a visual thought in the moment, not the artistic quality of what you are documenting visually.
3. Analyze: Take a step back
The first two steps are meant to open your mind and get you capturing visual thoughts while getting some creative juices flowing. If you’ve done this right, you’re going to be attached to your visual subject. This is where you need to take a step back. Look at the visual story you’re developing objectively. Are you focusing on form over function? Is it compelling and worth sharing? Is it objective or opinionated enough? Take a step back and think of yourself as the end audience. Get feedback from others, but analyze that objectively as well.
4. Synthesize: Filter signal from noise
If you’re a word person, you might relate to this process as “editing,” but for really effective visual thinking I think a better word would be “synthesis.” Good synthesis involves taking a lot of information and distilling it down to a core set of thoughts fueled by an insight into what will connect with your viewer. This is where the “art”—for lack of a better word—comes into play. A word of warning: this takes practice. Being able to synthesize complex thoughts and boil them down to an essence means finding that “nugget” which will resonate. The only advice I can give here is that you’ll know it when you see it, and sometimes it’s more obvious than you think.
5. Visualize: See it, then do it
The final step is to think of the right visual model to help tell your story, and to execute it well. Focus on visual metaphors to tell your story. See the idea in your mind and then direct it so that it comes to life. If you need help, hire a creative team and work with them to improve your visual thinking.
When I created the “agency ecosystem” (above) several years ago, the visual thinking started as circles in a four-way Venn diagram. I thought that the circles looked like leaves, so I used the metaphor of a plant, which made the story even better because the roots served as a powerful metaphor for foundational needs.
Final Thoughts
By now, the little voice in your head might be saying, “That’s great, but I’m not creative, I don’t think that way.” Ignore that voice. You may be on a path in life that has rewarded other parts of your brain, but we are all born with the ability to create. If you want to communicate visually, you have to think visually. You don’t need to be able to execute those ideas yourself, but you can practice the above steps to start the visual thinking process. I am not going to recommend any books to get you started, because the reading may distract you from actually doing. My suggestion is to start by “drawing” out the things that you see as obstacles to thinking creatively (e.g., a clock if you don’t have the time), then develop a strategy for overcoming them.
Mo Zhou was snapped up by I.B.M. last summer, as a freshly minted Yale M.B.A., to join the technology company's fast-growing ranks of data consultants. They help businesses make sense of an explosion of data - Web traffic and social network comments, as well as software and sensors that monitor shipments, suppliers and customers - to guide decisions, trim costs and lift sales.
Can data save the world? Not on its own. As an age of technology-fueled transparency, open innovation and big data dawns around the world, the success of new policy won’t depend on any single chief information officer, chief executive or brilliant developer. Data for the public good will be driven by a distributed community of media, nonprofits, academics and civic advocates focused on better outcomes, more informed communities and the new news, in whatever form it is delivered.
Advocates, watchdogs and government officials now have new tools for data journalism and open government. Globally, there’s a wave of transparency that will wash over every industry and government, from finance to healthcare to crime.
In that context, open government is about much more than open data: just look at the issues that flow around the #opengov hashtag on Twitter, including the nature of identity, privacy, security, procurement, culture, cloud computing, civic engagement, participatory democracy, corruption, civic entrepreneurship and transparency.
If we accept the premise that Gov 2.0 is a potent combination of open government, mobile, open data, social media, collective intelligence and connectivity, the lessons of the past year suggest that a tidal wave of technology-fueled change is still building worldwide.
The Economist’s support for open government data remains salient today:
“Public access to government figures is certain to release economic value and encourage entrepreneurship. That has already happened with weather data and with America’s GPS satellite-navigation system that was opened for full commercial use a decade ago. And many firms make a good living out of searching for or repackaging patent filings.”
As Clive Thompson reported at Wired last year, public sector data can help fuel jobs, and “shoving more public data into the commons could kick-start billions in economic activity.” In the transportation sector, for instance, transit data is open government fuel for economic growth.
There is a tremendous amount of work ahead in building upon the foundations that civil society has constructed over decades. If you want a deep look at what the work of digitizing data really looks like, read Carl Malamud’s interview with Slashdot on opening government data.
Data for the public good, however, goes far beyond government’s own actions. In many cases, it will happen despite government action — or, often, inaction — as civic developers, data scientists and clinicians pioneer better analysis, visualization and feedback loops.
For every civic startup or regulation, there’s a backstory that often involves a large number of stakeholders. Governments have to commit to opening themselves up but will, in many cases, need external expertise or even funding to do so. Citizens, industry and developers have to show up to use the data, demonstrating that there’s not only demand, but also skill outside of government to put open data to work in the service of accountability, citizen utility and economic opportunity. Galvanizing the co-creation of civic services, policies or apps isn’t easy, but tapping the potential of the civic surplus has attracted the attention of governments around the world.
Many challenges must be overcome before that vision comes to pass. For one, data quality and access remain poor. Socrata’s open data study identified progress, but also pointed to a clear need for improvement: Only 30% of developers surveyed said that government data was available, and of that, 50% of the data was unusable.
Open data will not be a silver bullet to all of society’s ills, but an increasing number of states are assembling platforms and stimulating an app economy.
Results-oriented mayors like Rahm Emanuel and Mike Bloomberg are committing to opening government data in Chicago and New York City, respectively.
Following are examples of where data for the public good is already having an impact upon the world we live in, along with some ideas about what lies ahead.
Financial good
Anyone looking for civic entrepreneurship will be hard pressed to find a better recent example than BrightScope. The efforts of Mike and Ryan Alfred are in line with traditional entrepreneurship: identifying an opportunity in a market that no one else has created value around, building a team to capitalize on it, and then investing years of hard work to execute on that vision. In the process, BrightScope has made government data about the financial industry more usable, searchable and open to the public.
Due to the efforts of these two entrepreneurs and their California-based startup, anyone who wants to learn more about financial advisers before tapping one to manage their assets can do so online.
Prior to BrightScope, the adviser data was locked up at the Securities and Exchange Commission (SEC) and the Financial Industry Regulatory Authority (FINRA).
“Ryan and I knew this data was there because we were advisers,” said BrightScope co-founder Mike Alfred in a 2011 interview. “We knew data had been filed, but it wasn’t clear what was being done with it. We’d never seen it liberated from the government databases.”
While they knew the public data existed and had their idea years ago, Alfred said it didn’t happen because they “weren’t in the mindset of being data entrepreneurs” yet. “By going after 401(k) first, we could build the capacity to process large amounts of data,” Alfred said. “We could take that data and present it on the web in a way that would be usable to the consumer.”
Notably, the government data that BrightScope has gathered on financial advisers goes further than a given profile page. Over time, as search engines like Google and Bing index the information, the data has become searchable in places consumers are actually looking for it. That’s aligned with one of the laws for open data that Tim O’Reilly has been sharing for years: Don’t make people find data. Make data find the people.
As agencies adapt to new business relationships, consumers are starting to see increased access to government data. Now, more data that the nation’s regulatory agencies collected on behalf of the public can be searched and understood by the public. Open data can improve lives, not least through adding more transparency into a financial sector that desperately needs more of it. This kind of data transparency will give the best financial advisers the advantage they deserve and make it much harder for your Aunt Betty to choose someone with a history of financial malpractice.
The next phase of financial data for good will use big data analysis and algorithmic consumer advice tools, or “choice engines,” to help consumers make better decisions. The vast majority of consumers are unlikely to ever look directly at raw datasets themselves. Instead, they’ll use mobile applications, search engines and social recommendations to make smarter choices.
There are already early examples of such services emerging. Billshrink, for example, lets consumers get personalized recommendations for a cheaper cell phone plan based on calling histories. Mint makes specific recommendations on how a citizen can save money based upon data analysis of the accounts added. Moreover, much of the innovation in this area is enabled by the ability of entrepreneurs and developers to go directly to data aggregation intermediaries like Yodlee or CashEdge to license the data.
Transit data as economic fuel
Transit data continues to be one of the richest and most dynamic areas for co-creation of services. Around the United States and beyond, there has been a blossoming of innovation in the city transit sector, driven by the passion of citizens and fueled by the release of real-time transit data by city governments.
Francisca Rojas, research director at the Harvard Kennedy School’s Transparency Policy Project, has investigated the dynamics behind the disclosure of data by transit agencies in the United States, which she calls one of the most successful implementations of open government. “In just a few years, a rich community has developed around this data, with visionary champions for disclosure inside transit agencies collaborating with eager software developers to deliver multiple ways for riders to access real-time information about transit,” wrote Rojas.
The Massachusetts Bay Transportation Authority (MBTA) learned from TriMet in Portland, Oregon, that open data is better. “This was the best thing the MBTA had done in its history,” said Laurel Ruma, O’Reilly’s director of talent and a long-time resident of greater Boston, in her 2010 Ignite talk on real-time transit data. The MBTA’s move to make real-time data available and support it has spawned a new ecosystem of mobile applications, many of which are featured at MBTA.com.
There are now 44 different consumer-facing applications for the TriMet system. Chicago, Washington and New York City also have a growing ecosystem of applications.
As more sensors go online in smarter cities, tracking traffic patterns will enable public administrators to optimize routes, schedules and capacity, driving efficiency and a better allocation of resources.
Transparency and civic goods
As John Wonderlich, policy director at the Sunlight Foundation, observed last year, access to legislative data brings citizens closer to their representatives. “When developers and programmers have better access to the data of Congress, they can better build the databases and tools that let the rest of us connect with the legislature.”
That’s the promise of the Sunlight Foundation’s work, in general: Technology-fueled transparency will help fight corruption and fraud and reveal the influence behind policies. That work is guided by data generated, scraped and aggregated from government and regulatory bodies. The Sunlight Foundation has been focused on opening up Congress through technology since the organization was founded. Some of its efforts culminated recently with the publication of a live XML feed for the House floor and a transparency portal for House legislative documents.
There are other horizons for transparency through open government data, which broadly refers to public sector records that have been made available to citizens. For a canonical resource on what makes such releases truly “open,” consult the “8 Principles of Open Government Data.”
For instance, while gerrymandering has been part of American civic life since the birth of the republic, one of the best policy innovations of 2011 may offer hope for improving the redistricting process. DistrictBuilder, an open-source tool created by the Public Mapping Project, allows anyone to easily create legal districts.
“During the last year, thousands of members of the public have participated in online redistricting and have created hundreds of valid public plans,” said Micah Altman, senior research scientist at Harvard University’s Institute for Quantitative Social Science, in an email last year.
“In substantial part, this is due to the project’s effort and software. This year represents a huge increase in participation compared to previous rounds of redistricting; for example, the number of plans produced and shared by members of the public this year is roughly 100 times the number of plans submitted by the public in the last round of redistricting 10 years ago,” Altman said. “Furthermore, the extensive news coverage has helped make a whole new set of people aware of the issue and has reframed it as a problem that citizens can actively participate in to solve, rather than simply complain about.”
Principles for data in the public good
As a result of digital technology, our collective public memory can now be shared and expanded upon daily. In a recent lecture on public data for public good at Code for America, Michal Migurski of Stamen Design made the point that part of the global financial crisis came through a crisis in public knowledge, citing “The Destruction of Economic Facts,” by Hernando de Soto.
Citizens, regulators, executives and elected leaders are inundated with information. To create virtuous feedback loops that amplify the signals they need to make better decisions, data providers and infomediaries will need to embrace key principles, as Migurski’s lecture outlined.
First, “data drives demand,” wrote Tim O’Reilly, who attended the lecture and distilled Migurski’s insights. “When Stamen launched crimespotting.org, it made people aware that the data existed. It was there, but until they put visualization front and center, it might as well not have been.”
Second, “public demand drives better data,” wrote O’Reilly. “Crimespotting led Oakland to improve their data publishing practices. The stability of the data and publishing on the web made it possible to have this data addressable with public links. There’s an ‘official version,’ and that version is public, rather than hidden.”
Third, “version control adds dimension to data,” wrote O’Reilly. “Part of what matters so much when open source, the web, and open data meet government is that practices that developers take for granted become part of the way the public gets access to data. Rather than static snapshots, there’s a sense that you can expect to move through time with the data.”
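The third principle, moving through time with the data rather than relying on static snapshots, can be illustrated with a minimal Python sketch that compares two dated snapshots of a dataset and reports what was added, removed or changed. The record IDs, fields and values below are invented for illustration; they are not drawn from Crimespotting or from Oakland’s data.

```python
def diff_snapshots(old, new):
    """Compare two snapshots, each a dict mapping a record ID to a record dict.

    Returns the IDs that were added, removed, or changed between them.
    """
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in new if k in old and new[k] != old[k])
    return {"added": added, "removed": removed, "changed": changed}


# Invented snapshots of the same hypothetical dataset on two consecutive days.
SNAPSHOT_DAY_1 = {
    "r1": {"type": "theft", "status": "open"},
    "r2": {"type": "vandalism", "status": "open"},
}
SNAPSHOT_DAY_2 = {
    "r1": {"type": "theft", "status": "closed"},
    "r3": {"type": "burglary", "status": "open"},
}

if __name__ == "__main__":
    print(diff_snapshots(SNAPSHOT_DAY_1, SNAPSHOT_DAY_2))
```

Keeping every dated snapshot, instead of overwriting the previous one, is what makes this kind of comparison possible at all.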
The case for open data
Accountability and transparency are important civic goods, but adopting open data requires grounded arguments for a city chief financial officer to support these initiatives. When it comes to making a business case for open data, John Tolva, the chief technology officer for Chicago, identified four areas that support the investment in open government:
- Trust — “Open data can build or rebuild trust in the people we serve,” Tolva said. “That pays dividends over time.”
- Accountability of the work force — “We’ve built a performance dashboard with KPIs [key performance indicators] that track where the city directly touches a resident.”
- Business building — “Weather apps, transit apps … that’s the easy stuff,” he said. “Companies built on reading vital signs of the human body could be reading the vital signs of the city.”
- Urban analytics — “Brett [Goldstein] established probability curves for violent crime. Now we’re trying to do that elsewhere, uncovering cost savings, intervention points, and efficiencies.”
New York City is also using data internally. The city is doing things like applying predictive analytics to building code violations and housing data to try to understand where potential fire risks might exist.
“The thing that’s really exciting to me, better than internal data, of course, is open data,” said New York City chief digital officer Rachel Sterne during her talk at Strata New York 2011. “This, I think, is where we really start to reach the potential of New York City becoming a platform like some of the bigger commercial platforms and open data platforms. How can New York City, with the enormous amount of data and resources we have, think of itself the same way Facebook has an API ecosystem or Twitter does? This can enable us to produce a more user-centric experience of government. It democratizes the exchange of information and services. If someone wants to do a better job than we are in communicating something, it’s all out there. It empowers citizens to collaboratively create solutions. It’s not just the consumption but the co-production of government services and democracy.”
The promise of data journalism
The ascendance of data journalism in media and government will continue to gather force in the years ahead.
Journalists and citizens are confronted by unprecedented amounts of data and an expanded number of news sources, including a social web populated by our friends, family and colleagues. Newsrooms, the traditional hosts for information gathering and dissemination, are now part of a flattened environment for news. Developments often break first on social networks, and that information is then curated by a combination of professionals and amateurs. News is then analyzed and synthesized into contextualized journalism.
Data is being scraped by journalists, generated from citizen reporting, or gleaned from massive information dumps — such as with the Guardian’s formidable data journalism, as detailed in a recent ebook. ScraperWiki, a favorite tool of civic coders at Code for America and elsewhere, enables anyone to collect, store and publish public data. As we grapple with the consumption challenges presented by this deluge of data, new publishing platforms are also empowering us to gather, refine, analyze and share data ourselves, turning it into information.
There are a growing number of data journalism efforts around the world, from New York Times interactive features to the award-winning investigative work of ProPublica. Here are just a few promising examples:
- Spending Stories, from the Open Knowledge Foundation, is designed to add context to news stories based upon government data by connecting stories to the data used.
- Poderopedia is trying to bring more transparency to Chile, using data visualizations that draw upon a database of editorial and crowdsourced data.
- The State Decoded is working to make the law more user-friendly.
- Public Laboratory is a tool kit and online community for grassroots data gathering and research that builds upon the success of Grassroots Mapping.
- Internews and its local partner Nai Mediawatch launched a new website that shows incidents of violence against journalists in Afghanistan.
Open aid and development
The World Bank has been taking unprecedented steps to make its data more open and usable to everyone. The data.worldbank.org website that launched in September 2010 was designed to make the bank’s open data easier to use. In the months since, more than 100 applications have been built using the data.
“Up until very recently, there was almost no way to figure out where a development project was,” said Aleem Walji, practice manager for innovation and technology at the World Bank Institute, in an interview last year. “That was true for all donors, including us. You could go into a data bank, find a project ID, download a 100-page document, and somewhere it might mention it. To look at it all on a country level was impossible. That’s exactly the kind of organization-centric search that’s possible now with extracted information on a map, mashed up with indicators. All of a sudden, donors and recipients can both look at relationships.”
Open data efforts are not limited to development. More data-driven transparency in aid spending is also going online. Last year, the United States Agency for International Development (USAID) launched a public engagement effort to raise awareness about the devastating famine in the Horn of Africa. The FWD campaign includes a combination of open data, mapping and citizen engagement.
“Frankly, it’s the first foray the agency is taking into open government, open data, and citizen engagement online,” said Haley Van Dyck, director of digital strategy at USAID, in an interview last year.
“We recognize there is a lot more to do on this front, but are happy to start moving the ball forward. This campaign is different than anything USAID has done in the past. It is based on informing, engaging, and connecting with the American people to partner with us on these dire but solvable problems. We want to change not only the way USAID communicates with the American public, but also the way we share information.”
USAID built and embedded interactive maps on the FWD site. The agency created the maps with open source mapping tools and published the datasets it used to make these maps on data.gov. All are available to the public and media to download and embed as well.
Publishing maps online together with the open data that drives them is a significant step forward for any government agency, and it sets a worthy bar for future efforts to meet. USAID accomplished this by migrating its data to an open, machine-readable format.
“In the past, we released our data in inaccessible formats (mostly PDFs) that are often unable to be used effectively,” said Van Dyck. “USAID is one of the premier data collectors in the international development space. We want to start making that data open, making that data sharable, and using that data to tell stories about the crisis and the work we are doing on the ground in an interactive way.”
Crisis data and emergency response
Unprecedented levels of connectivity now exist around the world. According to a 2011 survey from the Pew Internet & American Life Project, more than 50% of American adults use social networks, 35% of American adults have smartphones, and 78% of American adults are connected to the Internet. When combined, those factors mean that we now see earthquake tweets spread faster than the seismic waves themselves. Networked publics can now share the effects of disasters in real time, providing officials with unprecedented insight into what’s happening. Citizens act as sensors in the midst of the storm, creating an ad hoc system of networked accountability through data.
The growth of an Internet of Things is an important evolution. What we saw during Hurricane Irene in 2011 was the increasing importance of an Internet of people, where citizens act as sensors during an emergency. Emergency management practitioners and first responders have woken up to the potential of using social data for enhanced situational awareness and resource allocation.
An historic emergency social data summit in Washington in 2010 highlighted how relevant this area has become. And last year’s hearing in the United States Senate on the role of social media in emergency management was “a turning point in Gov 2.0,” said Brian Humphrey of the Los Angeles Fire Department.
The Red Cross has been at the forefront of using social data in a time of need. That’s not entirely by choice, given that news of disasters has consistently broken first on Twitter. The challenge is for the men and women entrusted with coordinating response to identify signals in the noise.
First responders and crisis managers are using a growing suite of tools for gathering information and sharing crucial messages internally and with the public. Structured social data and geospatial mapping suggest one direction where these tools are evolving in the field.
A web application from ESRI deployed during historic floods in Australia demonstrated how crowdsourced social intelligence provided by Ushahidi can enable emergency social data to be integrated into crisis response in a meaningful way.
The Australian flooding web app includes the ability to toggle layers from OpenStreetMap, satellite imagery, and topography, and then filter by time or report type. By adding structured social data, the web app provides geographic information system (GIS) operators with valuable situational awareness that goes beyond standard reporting, including the locations of property damage, roads affected, hazards, evacuations and power outages.
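As a rough illustration of what “structured” social data makes possible, here is a minimal Python sketch of filtering crowdsourced reports by time window and report type. The field names, report categories, coordinates and timestamps are all invented for illustration; they are not taken from the ESRI application or from Ushahidi’s actual data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Report:
    """One crowdsourced report. All fields here are hypothetical."""
    kind: str              # e.g. "road_affected", "power_outage"
    lat: float
    lon: float
    reported_at: datetime


def filter_reports(reports, kinds=None, start=None, end=None):
    """Return reports of the given kinds that fall within [start, end].

    Any filter left as None is simply not applied.
    """
    selected = []
    for r in reports:
        if kinds is not None and r.kind not in kinds:
            continue
        if start is not None and r.reported_at < start:
            continue
        if end is not None and r.reported_at > end:
            continue
        selected.append(r)
    return selected


# Invented sample data, for illustration only.
SAMPLE_REPORTS = [
    Report("road_affected", -27.47, 153.02, datetime(2011, 1, 11, 9, 30)),
    Report("power_outage", -27.50, 153.00, datetime(2011, 1, 11, 14, 0)),
    Report("road_affected", -27.56, 152.95, datetime(2011, 1, 12, 8, 15)),
    Report("evacuation", -27.61, 152.76, datetime(2011, 1, 12, 10, 45)),
]

if __name__ == "__main__":
    roads = filter_reports(
        SAMPLE_REPORTS,
        kinds={"road_affected"},
        start=datetime(2011, 1, 12),
        end=datetime(2011, 1, 13),
    )
    for r in roads:
        print(r.kind, r.lat, r.lon, r.reported_at.isoformat())
```

Because each report carries a type, a location and a timestamp, the same records can feed a map layer, a timeline or a summary count without being re-entered by hand.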
Long before the floods or the Red Cross joined Twitter, however, Brian Humphrey of the Los Angeles Fire Department (LAFD) was already online, listening. “The biggest gap directly involves response agencies and the Red Cross,” said Humphrey, who currently serves as the LAFD’s public affairs officer. “Through social media, we’re trying to narrow that gap between response and recovery to offer real-time relief.”
After the devastating 2010 earthquake in Haiti, the evolution of volunteers working collaboratively online also offered a glimpse into the potential of citizen-generated data. Crisis Commons has acted as a sort of “geeks without borders.” Around the world, developers, GIS engineers, online media professionals and volunteers collaborated on information technology projects to support disaster relief for post-earthquake Haiti, mapping streets on OpenStreetMap and collecting crisis data on Ushahidi.
Healthcare
What happens when patients find out how good their doctors really are? That was the question that Harvard Medical School professor Dr. Atul Gawande asked in the New Yorker, nearly a decade ago.
The narrative he told in that essay makes the history of quality improvement in medicine compelling, connecting it to the creation of a data registry at the Cystic Fibrosis Foundation in the 1950s. As Gawande detailed, that data was privately held. After it became open, life expectancy for cystic fibrosis patients tripled.
In 2012, the new hope is in big data, where techniques for finding meaning in the huge amounts of unstructured data generated by healthcare diagnostics offer immense promise.
The trouble, say medical experts, is that data availability and quality remain significant pain points that are holding back existing programs.
There are, literally, bright spots that suggest what’s possible. Dr. Gawande’s 2011 essay, which considered whether “hotspotting” using health data could help lower medical costs by giving the neediest patients better care, offered another perspective on the issue. Early outcomes made the approach look compelling. As Dr. Gawande detailed, when a Medicare demonstration program offered medical institutions payments that financed the coordination of care for its most chronically expensive beneficiaries, hospital stays and trips to the emergency rooms dropped more than 15% over the course of three years. A test program adopting a similar approach in Atlantic City saw a 25% drop in costs.
Through sharing data and knowledge, and then creating a system to convert ideas into practice, clinicians in the ImproveCareNow network were able to improve the remission rate for Crohn’s disease from 49% to 67% without the introduction of new drugs.
In Britain, researchers found that the outcomes for adult cardiac patients improved after the publication of information on death rates. With the release of meaningful new open government data about performance and outcomes from the British national healthcare system, similar improvements may be on the way.
“I do believe we are at the beginning of a revolutionary moment in health care, when patients and clinicians collect and share data, working together to create more effective health care systems,” said Susannah Fox, associate director for digital strategy at the Pew Internet & American Life Project, in an interview in January. Fox’s research has documented the social life of health information, the concept of peer-to-peer healthcare, and the role of the Internet among people living with chronic disease.
In the past few years, entrepreneurs, developers and government agencies have been collaboratively exploring the power of open data to improve health. In the United States, the open data story in healthcare is evolving quickly, from new mobile apps that lead to better health decisions to data spurring changes in care at the U.S. Department of Veterans Affairs.
Since he entered public service, Todd Park, the first chief technology officer of the U.S. Department of Health and Human Services (HHS), has focused on unleashing the power of open data to improve health. If you aren’t familiar with this story, read the Atlantic’s feature article that explores Park’s efforts to revolutionize the healthcare industry through better use of data.
Park has focused on releasing data at Health.Data.Gov. In a speech to a Hacks and Hackers meetup in New York City in 2011, Park emphasized that HHS wasn’t just releasing new data: “[We’re] also making existing data truly accessible or usable,” he said, taking “stuff that’s in a book or on a website and turning it into machine-readable data or an API.”
Park said it’s still quite early in the project and that the work isn’t just about data — it’s about how and where it’s used. “Data by itself isn’t useful. You don’t go and download data and slather data on yourself and get healed,” he said. “Data is useful when it’s integrated with other stuff that does useful jobs for doctors, patients and consumers.”
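To make the idea of turning “stuff that’s in a book or on a website” into machine-readable data concrete, here is a minimal Python sketch. It parses a small comma-separated table, of the kind that might be copied from a web page, into typed records and serializes them as JSON. The table contents, column names and function name are hypothetical and do not come from Health.Data.Gov or any HHS dataset.

```python
import csv
import io
import json

# Hypothetical text copied from a web page; the values are invented.
RAW_TABLE = """\
county,hospital_beds,year
Example County,120,2010
Sample County,85,2010
"""


def table_to_records(raw_text):
    """Parse comma-separated text into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(raw_text))
    records = []
    for row in reader:
        records.append({
            "county": row["county"].strip(),
            "hospital_beds": int(row["hospital_beds"]),
            "year": int(row["year"]),
        })
    return records


if __name__ == "__main__":
    # JSON output can be consumed by other programs or served from an API.
    print(json.dumps(table_to_records(RAW_TABLE), indent=2))
```

Converting numbers from text to integers is a small step, but it is the kind of step that lets other software integrate the data “with other stuff that does useful jobs.”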
What lies ahead
There are four trends that warrant special attention as we look to the future of data for public good: civic network effects, hybridized data models, personal data ownership and smart disclosure.
Civic network effects
Community is a key ingredient in successful open government data initiatives. It’s not enough to simply release data and hope that venture capitalists and developers magically become aware of the opportunity to put it to work. Marketing open government data is what repeatedly brought federal Chief Technology Officer Aneesh Chopra and Park out to Silicon Valley, New York City and other business and tech hubs.
Despite the addition of topical communities to Data.gov, conferences and new media efforts, government’s attempts to act as an “impatient convener” can only go so far. Civic developer and startup communities are creating a new distributed ecosystem that will help create that community, from BuzzData to Socrata to new efforts like Max Ogden’s DataCouch.
Smart disclosure
There are enormous economic and civic good opportunities in the “smart disclosure” of personal data, whereby a private company or government institution provides a person with access to his or her own data in open formats. Cass Sunstein, administrator of the White House Office of Information and Regulatory Affairs, defines smart disclosure as a process that “refers to the timely release of complex information and data in standardized, machine-readable formats in ways that enable consumers to make informed decisions.”
For instance, the quarterly financial statements of the top public companies in the world are now available online through the Securities and Exchange Commission.
Why does it matter? The interactions of citizens with companies or government entities generate a huge amount of economically valuable data. If consumers and regulators had access to that data, they could tap it to make better choices about everything from finance to healthcare to real estate, much in the same way that web applications like Hipmunk and Zillow let consumers make more informed decisions.
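To make the idea concrete, here is a minimal sketch of what a standardized, machine-readable disclosure makes possible. Everything in it is assumed for illustration: the plan names, the column names (`plan`, `monthly_fee`, `price_per_kwh`) and the prices are invented, and no real disclosure format is implied. The point is only that once the data is structured, a few lines of code can compare the options for one consumer's actual usage.

```python
import csv
import io

# Hypothetical disclosure file: column names and figures are invented
# for illustration and do not come from any real provider or standard.
DISCLOSED_PLANS = """plan,monthly_fee,price_per_kwh
Basic,5.00,0.15
Saver,12.00,0.11
Flat,60.00,0.00
"""


def monthly_cost(plan, kwh_used):
    """Return the cost of one plan for a given monthly usage in kWh."""
    return float(plan["monthly_fee"]) + float(plan["price_per_kwh"]) * kwh_used


def cheapest_plan(csv_text, kwh_used):
    """Return (plan name, cost) for the lowest-cost plan at this usage."""
    plans = list(csv.DictReader(io.StringIO(csv_text)))
    best = min(plans, key=lambda p: monthly_cost(p, kwh_used))
    return best["plan"], round(monthly_cost(best, kwh_used), 2)


if __name__ == "__main__":
    # The best choice changes with the consumer's own usage.
    for usage in (100, 400, 600):
        print(usage, cheapest_plan(DISCLOSED_PLANS, usage))
```

The same comparison done by hand from a PDF or a printed statement would be tedious and error-prone, which is why the format of the release matters as much as the release itself.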
Personal data assets
When a trend makes it to the World Economic Forum (WEF) in Davos, it’s generally evidence that the trend is gathering steam. A report titled “Personal Data Ownership: The Emergence of a New Asset Class” suggests that 2012 will be the year when citizens start thinking more about data ownership, whether that data is generated by private companies or the public sector.
“Increasing the control that individuals have over the manner in which their personal data is collected, managed and shared will spur a host of new services and applications,” wrote the paper’s authors. “As some put it, personal data will be the new ‘oil’ — a valuable resource of the 21st century. It will emerge as a new asset class touching all aspects of society.”
The idea of data as a currency is still in its infancy, as Strata Conference chair Edd Dumbill has emphasized. The Locker Project, which provides people with the ability to move their own data around, is one of many approaches.
The growth of the Quantified Self movement and online communities like PatientsLikeMe and 23andMe attests to the strength of the personal data movement. In the U.S. federal government, the Blue Button initiative, which enables veterans to download personal health data, has now spread to all federal employees and earned adoption at Aetna and Kaiser Permanente.
In early 2012, a Green Button was launched to unleash energy data in the same way. Venture capitalist Fred Wilson called the Green Button an “OAuth for energy data.”
“It is a simple standard that the utilities can implement on one side and web/mobile developers can implement on the other side. And the result is a ton of information sharing about energy consumption and, in all likelihood, energy savings that result from more informed consumers.”
Hybridized public-private data
Free or low-cost online tools are empowering citizens to do more than donate money or blood: Now, they can donate time or expertise, or even act as sensors. In the United States, we saw a leading edge of this phenomenon in the Gulf of Mexico, where Oil Reporter, an open source oil spill reporting app, provided a prototype for data collection via smartphone. In Japan, an analogous effort called Safecast grew and matured in the wake of the nuclear disaster that resulted from a massive earthquake and subsequent tsunami in 2011.
Open source software and citizens acting as sensors have steadily been integrated into journalism over the past few years, most dramatically in the videos and pictures uploaded after the 2009 Iran election and during 2011’s Arab Spring.
Citizen science looks like the next frontier. Safecast is combining open data collected by citizen science with academic, NGO and open government data (where available), and then making it widely available. It’s similar to other projects, where public data and experimental data are percolating.
Public data is a public good
Despite the myriad challenges presented by legitimate concerns about privacy, security, intellectual property and liability, the promise of more informed citizens is significant. McKinsey’s 2011 report dubbed big data the next frontier for innovation, with billions of dollars of economic value yet to be created. When that innovation is applied on behalf of the public good, whether it’s in city planning, transit, healthcare, government accountability or situational awareness, those effects will be extended.
We’re entering the feedback economy, where dynamic feedback loops between customers and corporations, partners and providers, citizens and governments, or regulators and companies can drive both greater efficiencies and leaner, smarter governments.
The exabyte age will bring with it the twin challenges of information overload and overconsumption, both of which will require organizations of all sizes to use the emerging toolboxes for filtering, analysis and action. To create public good from public goods — the public sector data that governments collect, the private sector data that is being collected and the social data that we generate ourselves — we will need to collectively forge new compacts that honor existing laws and visionary agreements that enable the new data science to put the data to work.
Photo: NYTimes: 365/360 – 1984 (in color) by blprnt_van, on Flickr