Storm: the Hadoop of Realtime Stream Processing

This presentation was a great way to get a peek at what Twitter’s Storm is about: YouTube: PyCon US 2012: Gabriel Grant:


Twitter Engineering: “A Storm is coming: more details and plans for release”

GitHub: Storm

Data Journalism and Visualization with an Example

Guardian: Paul Bradshaw: “How to be a data journalist”

ProPublica: Jeff Larson: “The Rainbow Connection: How We Made Our CDO Connections Graphic” (tools mentioned: google-refine (formerly Gridworks), Raphaël, JSON)

A Yahoo! Pipes-influenced Python toolset for ETL

PyF looks interesting. In a similar vein are Ruffus and Orange (Orange looks impressive and has data analysis capability to boot).

What is ETL and CMS?

You’re a programmer with a task: retrieve information from some source, manipulate and massage it, and deploy it somewhere.

As with all things in programming, there is an acronym for that: “ETL”.

ETL stands for Extract, Transform, and Load. The Wikipedia page summarizes the topic thoroughly and reviews many of the typical functions an ETL process needs to perform to accomplish its task.
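
To make the three steps concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the sample CSV data, the `people` table, and the function names `extract`, `transform`, `load`, and `run_etl` are hypothetical and don't come from any of the tools or posts linked here.

```python
import csv
import io
import sqlite3

# Hypothetical sample input standing in for whatever source you extract from.
RAW_CSV = """name,city
 alice ,new york
BOB,chicago
,boston
"""


def extract(text):
    """Extract: parse rows out of the raw source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy up the values and drop rows that have no name."""
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        if name:
            cleaned.append((name, city))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, city TEXT)")
    conn.executemany("INSERT INTO people (name, city) VALUES (?, ?)", records)
    conn.commit()


def run_etl(text, conn):
    """Chain the three steps: extract, then transform, then load."""
    load(transform(extract(text)), conn)


# An in-memory SQLite database plays the role of the destination.
conn = sqlite3.connect(":memory:")
run_etl(RAW_CSV, conn)
result = conn.execute("SELECT name, city FROM people ORDER BY name").fetchall()
```

Real ETL jobs add plenty on top of this (validation, logging, retries, incremental loads), but keeping the three steps as separate functions makes each one easy to test and swap out on its own.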

The problem is that ETL doesn’t roll off the tongue so easily. The acronym provides a weak set of metaphors for programmers to map familiar concepts to.

Rafe Colburn provides a great mental model to apply when developing ETL scripts and applications. It’s one I follow, but have lacked the words to describe. Go read his post.

Here’s a thought to challenge you if you are a CMS developer, now that you have read the above: are the forms you build to let people contribute and manage content in a CMS a kind of ETL process? Does the Wikipedia description for “Extract, Transform, and Load” contain functions that you would expect a CMS to encompass?

And speaking of CMS, Gadgetopia has a terrific article on what a CMS is. It is difficult to be clear in a world where hype and acronyms get thrown about so much (like this very post!), but the Gadgetopia piece certainly is. It helps outline the functionality you should expect from a CMS implementation.

NoSQL, Relational Database, ETL Link-a-rama for November 25th, 2009

Jon Moore: NoSQL East 2009 Redux

Dare Obasanjo: Building Scalable Databases: Perspectives on the War on Soft Deletes

Explain Extended: What is a relational database?

Explain Extended: What is the entity-relationship model?

Data Doghouse: Data Integration: Hand-coding Using ETL Tools

Data Doghouse: Data Integration: Hand-coding Using ETL Tools Part 2

Smart Data Collective: ETL tools: Don’t Forget About the Little Dogs

Smart Data Collective: Data Integration: Hand-coding Using ETL Tools

Communications of the ACM: Extreme Agility at Facebook

Dare Obasanjo: Facebook Seattle Engineering Road Show: Mike Shroepfer on Engineering at Scale at Facebook

Hive, Hadoop at Facebook, Yahoo

Engineering@Facebook: Hive – A Petabyte Scale Data Warehouse using Hadoop

Yahoo! Developer Blog: Announcing the Yahoo! Distribution of Hadoop

Reading up on ETL (Extract, Transform, Load) processing

Wikipedia: Extract, transform, load

Wikipedia: Talend Open Studio

Talend Open Studio: Tutorials

Manageability: Open Source ETL (Extraction, Transform, Load) Written in Java Data Migration Done Right

kJube: Vendors and tools – ETL

AlfrescoForge: ETL Connector

Talend job for Job Scheduler implement

High Scalability: How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

NYTimes: Announcing the Map/Reduce Toolkit

Andreas Kostyrka: Re: hadoop in the ETL process