Data Journalism and Visualization with an Example

Guardian: Paul Bradshaw: “How to be a data journalist”

ProPublica: Jeff Larson: “The Rainbow Connection: How We Made Our CDO Connections Graphic” (tools mentioned: google-refine (formerly Gridworks), Raphaël, JSON)

Interested in data and visualizations?

Check out the Guardian’s Datablog, and while you are at it, read or watch the interview between the Guardian’s Simon Rogers and Jonathan Stray of Nieman Journalism Lab on the rise of data journalism and the tools they use.

Sean Blanda on Remixing the News

eMedia: Remix the News: “Remix the News: what news can learn from Last.fm and Pandora”: “there is no service that adequately customizes content to my tastes based on previous reading”

A good read with some important ideas. The only thing close I can think of is Google Reader’s recommendations, which are based upon my clicking activity in Google Reader.

One of the commenters on Sean’s post added some thoughts about ‘intelligent serendipity’. It will be all-important if we intend to help people get the news they need to hear but might not know to look for.

Some links on ‘intelligent serendipity’:

Jeff Jarvis: “Serendipity is unexpected relevance”

Chris Anderson: “What would it take to build a true “serendipity-maker”?”

Mathew Ingram: “In defence of newspapers and serendipity”

Inside Guardian.com: “The Random Guardian”

Somewhere in here is the news experience of the future: helping people connect with what they are interested in, and with what they would (should?) be interested in but just aren’t aware of yet. Isn’t that the essence of ‘news’?
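To make the idea a little more concrete, here is a minimal sketch of what "mostly what I already read, plus a little serendipity" could look like. Everything in it is an assumption for illustration: the `recommend` function, the tag lists, and the rule that an "unfamiliar" article is one sharing no tags with the reading history. It is not how Google Reader, Last.fm, or Pandora work.

```python
from collections import Counter


def recommend(history, candidates, n=3, serendipity_slots=1):
    """Pick n articles: mostly familiar topics, plus a few unfamiliar ones.

    history: list of tag lists, one per article already read.
    candidates: dict mapping article title -> list of tags.
    """
    # Count how often each tag shows up in the reading history.
    interest = Counter(tag for tags in history for tag in tags)

    def score(title):
        # Tags never seen before count as zero.
        return sum(interest[tag] for tag in candidates[title])

    # The sort is stable, so equally scored titles keep their input order.
    ranked = sorted(candidates, key=score, reverse=True)

    familiar = ranked[: n - serendipity_slots]
    # "Serendipity" here means: no tag overlap with the history at all.
    unfamiliar = [t for t in ranked if score(t) == 0 and t not in familiar]
    return familiar + unfamiliar[:serendipity_slots]


# Hypothetical reading history and candidate articles.
history = [["databases", "nosql"], ["databases", "journalism"]]
candidates = {
    "Riak in production": ["nosql", "databases"],
    "Data journalism tools": ["journalism", "visualization"],
    "City budget vote": ["local", "politics"],
    "Star schemas": ["databases"],
}
picks = recommend(history, candidates, n=3, serendipity_slots=1)
print(picks)
```

With this toy data, two slots go to the database articles and the last slot goes to the city budget story, which nothing in the history would have predicted. The hard part, of course, is choosing the unexpected item intelligently rather than by a crude "no overlap" rule.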

Database related reads (and videos) for January 25, 2010

Lambda the Ultimate: Why Normalization Failed to Become the Ultimate Guide for Database Designers?

Generation 5: Putting Freebase in a Star Schema

no:sql(east): video: Justin Sheehy, CTO of Basho Technologies, on Riak and more

ShopTalk Blog: Death to filesystems

The case for killing ‘WCM’ (Web Content Management)?

First, a disclaimer. The title refers to the term ‘WCM’, not the functionality implied by it.

WCM (Web Content Management) as defined by Wikipedia is a system that “allows non-technical users to make changes to a website with little training. A WCMS typically requires an experienced coder to set up and add features, but is primarily a Web-site maintenance tool for non-technical administrators.”

Sounds simple, but the definition is crazy expansive.

It’s so generic that a wide field of products can claim to satisfy the need. Check out this list: Bricolage, Alfresco, Interwoven, eZ Publish, Textpattern, MovableType, WordPress, Drupal, Jadu, Vignette, Day, Nuxeo, Radiant, Typo, FatWire, Clickability, Plone, SDL Tridion, Ektron. The list goes on and on. And the costs! From free to millions of dollars!

Couldn’t you consider page creation/site management tools like Dreamweaver in that definition? Sure you could. Many who think they want a CMS really want one of these, or a combination of one of these with a CMS. A Google search for the combination of Dreamweaver and CMS returns 2,370,000 hits.

WCM is thought of as a subset of ECM (Enterprise Content Management) concepts. ECM is defined by Wikipedia as “the technologies, strategies, methods and tools used to capture, manage, store, preserve, and deliver content and documents related to an organization and its processes. ECM tools allow the management of an enterprise level organization’s information.”

Among the list above are a few ECMs that have WCM functionality. Commonly mentioned are Alfresco, Interwoven, and Vignette.

ECM is then considered a subset of CMS (Content Management System) concepts. A CMS is defined by Wikipedia as “a collection of procedures used to manage work flow in a collaborative environment.”

Referring to the above list, many can be called out as CMSes. In fact, all of them consider themselves such.

And then there are frameworks like Ruby on Rails, Grails, and Django. They just beg you to build your own.

With an alphabet soup like the above, no wonder so many get confused. WCM, in particular, is overloaded. So much so that there are some folks in the industry arguing to eliminate the acronym altogether!

Jon Marks says in “WCM is for Losers”:

I can already see the news headlines: LONDON, 2009 – SHOCK HORROR! WCM Geek Demands Death of term WCM. But it’s true. I’m of the camp that wished the term WCM would cease to exist.

Jon Marks concludes by saying:

But sadly, my prediction it isn’t going to happen. I’m just going to have to keep thinking of a WCMS as a tightly coupled hybrid of a content management system and a delivery framework. On the plus side, I’ll continue to make money out of poor customers that think a “WCM migration/replacement” doesn’t involve a complete site rewrite as they’re throwing the delivery baby out with the content bath water. Losers.

Deep within the comments on Jon Marks’s post, NPR’s Daniel Jacobson added:

In my posts about COPE, I tried to make a distinction between tools that capture content in a presentation-agnostic way and those that capture them for one (or more) specific presentation. I call the latter WPT (web-publishing tool), although Peter Monk’s Presentation Management System is in some ways a better term in that it is broader, covering systems that don’t just apply to the web.

For me, however, the future of the content management systems (CMS, or whatever acronym you want to give it) is in their ability to capture the content in a clean, presentation-independent way. Then, as Pie states, the content should be retrievable through a series of API’s, enabling the content to be distributed to any other platform. If available through an API in a truly portable, presentation-agnostic way, the system can then service any presentation layer.

Alfresco’s consulting lead for North America, Peter Monks, shares on his blog how difficult this is and looks for new terminology in “The Case for Killing ‘WCM’”:

To start undoing the 15 years of mind share that the term “WCM” has enjoyed, it’s time to start thinking about new terminology that better describes these two functional categories. For several years I’ve been throwing around the terms “Content Production System” (CPS) and “Presentation Management System” (PMS), and in their COPE strategy NPR uses the terms “Content Management System” (CMS) and “Web Publishing Tool” (WPT).

Daniel Jacobson and Peter Monks are onto something. Jacobson wrote a piece for Programmable Web (“COPE: Create Once, Publish Everywhere”) I’ve linked to previously. The section “Build CMS, not WPT” makes some important distinctions:

COPE is the key difference between content management systems and web publishing tools, although these terms are often used interchangeably in our industry. The goal of any CMS should be to gather enough information to present the content on any platform, in any presentation, at any time. WPT’s capture content with the primary purpose of publishing web pages. As a result, they tend to manage the content in ways focused on delivering it to the web. Plug-ins are often available for distribution to other platforms, but applying tools on top of the native functions to manipulate the content for alternate destinations makes the system inherently unscalable. That is, for each new platform, WPT’s will need a new plug-in to tailor the presentation markup to that platform. CMS’s, on the other hand, store the content cleanly, enabling the presentation layers to worry about how to display the content not on how to transform the markup embedded within it.

True CMS’s are really just content capturing tools that are completely agnostic as to how or where the content will be viewed, whether it is a web page, mobile app, TV or radio display, etc. Additionally, platforms that don’t yet exist are able to be served by a true CMS in ways that WPT’s may not be able to (even with plug-ins).
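The distinction Jacobson draws can be illustrated with a small sketch. This is not NPR's code; the `Story` class, its fields, and the two renderers are hypothetical names chosen for the example. The point is only that the stored content carries no markup, so each presentation decides for itself how to display it.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Story:
    """Content captured as plain, structured fields with no markup."""
    title: str
    teaser: str
    paragraphs: list = field(default_factory=list)


def render_html(story):
    """One presentation: a web page fragment."""
    body = "".join(f"<p>{escape(p)}</p>" for p in story.paragraphs)
    return f"<article><h1>{escape(story.title)}</h1>{body}</article>"


def render_text_alert(story, limit=80):
    """Another presentation: a short plain-text alert, truncated to limit."""
    text = f"{story.title}: {story.teaser}"
    return text if len(text) <= limit else text[: limit - 3] + "..."


# Hypothetical story used by both presentations.
story = Story(
    title="Budget passes",
    teaser="Council votes 5-4 after a long debate.",
    paragraphs=["The council met Tuesday.", "The vote was 5-4."],
)
print(render_html(story))
print(render_text_alert(story))
```

Adding a new platform here means writing one more renderer. Nothing stored in `Story` has to be untangled from web markup first, which is the problem the quote attributes to web publishing tools.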

COPE is an ecosystem and strategy. It is not an uber-CMS. Many of the vendors above claim their systems can provide you this almost-mythical beast. Indeed, many of them can, but that brings us back to the point Jon Marks was making, and to a common mistake many fall into.

De-couple, break it down into separate systems (Presentation, API/Mashup/Data, and CMS) as Daniel Jacobson and Peter Monks suggest, and you gain much in flexibility. The trade-off is flexibility’s evil twin: complexity. You have to be able to accept some complexity for the gains you draw in flexibility.

Let’s take a look at a diagram Jacobson provided on NPR’s COPE:

Look at where the ‘CMS’ is in what NPR calls its Content Management Pipeline (what I call an ecosystem). Look at where the Presentation Layer is. Notice what feeds it: the API Layer, which is capable of delivering content from multiple sources, including the Data Management Layer that the CMS feeds into.

Take another look. This is an inkblot test.

Some see a diagram like this and see the whole thing as CMS or WCM. When they say ‘Content Management System’ or ‘Web Content Management System’, they are thinking of a singular application that performs all of the duties of all of the layers detailed in the diagram. In NPR’s case, this diagram shows the ‘Content Management System’ as part of a tier on the backend, where content is consumed, stored, and maintained for reuse. Its goals are as Jacobson points out. The CMS’s role is constrained to a set of responsibilities, and in order to fulfill them it must integrate into a larger set of cooperating systems.
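For readers who think better in code than in boxes and arrows, here is a toy version of that layering. It is a sketch under loud assumptions: `DataLayer`, `ApiLayer`, `cms_publish`, `wire_ingest`, and the two presentation functions are all invented for illustration and say nothing about how NPR actually built its pipeline. What it does show is the CMS as just one of several writers into a data layer, with presentations talking only to the API.

```python
import json
from html import escape


class DataLayer:
    """Shared store that several content sources write into."""

    def __init__(self):
        self._items = {}

    def put(self, item_id, record):
        self._items[item_id] = record

    def get(self, item_id):
        return self._items[item_id]


class ApiLayer:
    """The only thing presentations talk to; it returns plain JSON."""

    def __init__(self, data_layer):
        self._data = data_layer

    def story_json(self, item_id):
        return json.dumps(self._data.get(item_id), sort_keys=True)


def cms_publish(data_layer, item_id, title, body):
    """The CMS is one writer into the data layer, not the whole system."""
    data_layer.put(item_id, {"source": "cms", "title": title, "body": body})


def wire_ingest(data_layer, item_id, title, body):
    """A second, non-CMS source (say, a wire feed) writes the same shape."""
    data_layer.put(item_id, {"source": "wire", "title": title, "body": body})


def web_page(api, item_id):
    """Presentation 1: an HTML fragment built from the API response."""
    story = json.loads(api.story_json(item_id))
    return f"<h1>{escape(story['title'])}</h1><p>{escape(story['body'])}</p>"


def mobile_headline(api, item_id):
    """Presentation 2: an upper-cased headline for a small screen."""
    story = json.loads(api.story_json(item_id))
    return story["title"].upper()


data = DataLayer()
cms_publish(data, "s1", "Budget passes", "The vote was 5-4.")
wire_ingest(data, "w1", "Storm warning", "Heavy rain expected.")
api = ApiLayer(data)
print(web_page(api, "w1"))
print(mobile_headline(api, "s1"))
```

Neither presentation function knows or cares whether a story came from the CMS or the wire feed. That is the flexibility; the extra classes and hops are the complexity you pay for it.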

You might not need a system as comprehensive as NPR’s, but you won’t know until you answer a few questions: “Where is it you want your business to go?” and “What is the Content Strategy?”, followed, importantly, by “Show me how you do things now” and “Let’s figure out a better way of getting this done.” Starting from the simplest thing that can possibly work and allowing for evolution towards your end goal is always the way. For example, you could start with a combined Data Layer, API Layer, and Filtering Layer (what I call a “3 Box Content Management Ecosystem” below), and then decompose that into separate systems down the line. If you do have answers to these questions, and they resemble NPR’s, Jacobson has provided a great high-level view of what this looks like. He deserves thanks for sharing it.

Related:

A friend at work passed along a great link, Blend Interactive’s “Thoughts on Content Management & Information Architecture”. I’ve linked to Gadgetopia.com, the official blog of Blend Interactive, before, but this index is, as he suggests, “quite possibly the best single source of CMS-related questions, insights, etc. that I’ve ever found.” Bookmark it. Their “What Makes a Content Management System?” piece provides you with the best checklist of functionality to consider when looking at CMSes.

And lastly:

Because of the confusion, I sometimes think I want to publish a series that describes the various layouts that define CMSes in really, really simple terms. Here’s a first pass:

The ‘2 Box Content Management System’ – Presentation and content maintenance functionality using same software, with shared storage. (WordPress, Drupal)

The ‘3 Box Content Management Ecosystem’ – Presentation running its own software, content maintenance running its own software, with shared storage. (MovableType, WordPress and Drupal, ez Publish, Bricolage, Alfresco in some implementations)

The ‘4 Box Content Management Ecosystem’ – Presentation running its own software, content maintenance running its own software, a data tier for presentation, a data tier for content storage. (Alfresco in some implementations)

The ‘5 Box Content Management Ecosystem’ – Presentation running its own software, API/Mashup running its own software, data tier for presentation, content maintenance running its own software, a data tier for content storage. (Alfresco in some implementations, NPR’s COPE content pipeline)

And so on. Someday I might get around to it. The terminology soup is so oppressive and obscuring.

More from Daniel Jacobson on NPR’s content management ecosystem

Programmable Web: Daniel Jacobson: “Content Portability: Building an API is Not Enough”

Previous entries in the series:

Programmable Web: Daniel Jacobson: Content Modularity: More Than Just Data Normalization

Programmable Web: Daniel Jacobson: COPE: Create Once, Publish Everywhere

You can read much more from the NPR team on their blog at Inside NPR.org. A recent post on the blog from Jason Grosman that caught my attention was “What Happens When Stuff Breaks On NPR.org”.

Related:

Justin Cormack has some thoughts on the above series, in particular on content portability, that are worth reading.

Also related to content portability (I think; okay, maybe a stretch, but worth thinking about) is “Dive into history, 2009 edition”: “HTML is not an output format. HTML is The Format. Not The Format Of Forever, but damn if it isn’t The Format Of The Now.”

Also Related:

AIGA: Callie Neylan: Case Study: NPR.org

Think you have statistical chops? Help predict homicides in Philadelphia

The Analytics X Prize is “to use statistical techniques and any data sets you can find to predict where crime, specifically homicides, will occur in the city”.

Drew Conway at Zero Intelligence Agents has posted some of his progress so far using spatial regression.