Friday, December 11, 2015

Data Lake? More like the Data Junk Closet

Data Lake is the hottest term being pushed by consulting companies. But what is it?  In summary, it is a place where you store all the incoming data, in whatever form it might be in. For a tool like Hadoop, which was based on published Google papers, the overall intent was to capture data as quickly as possible, in its original form. Once loaded, you start to make sense of it and, effectively, provide structure. It makes sense, then, to use Hadoop. Another key feature of Hadoop is redundancy: by default, HDFS stores each block of data on three different nodes. So for on-premises deployments, Hadoop is a good place to store data and then organize it.
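To make the "capture first, structure later" idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the in-memory list `raw_lake`, the functions `ingest` and `read_sensor_readings`, and the sample records are all made up for illustration. The point is only that nothing is validated on the way in, and structure is applied when somebody finally reads the data.

```python
import json

# Hypothetical stand-in for the lake: records are kept exactly as they arrive.
raw_lake = []

def ingest(record: str) -> None:
    """Store the record untouched; no parsing, no validation."""
    raw_lake.append(record)

def read_sensor_readings(lake):
    """Apply structure at read time.

    Keeps only records that parse as JSON objects with 'sensor' and
    'value' fields, and skips everything else.
    """
    readings = []
    for line in lake:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON; it stays in the lake for some other reader
        if isinstance(obj, dict) and "sensor" in obj and "value" in obj:
            readings.append((obj["sensor"], float(obj["value"])))
    return readings

# Mixed formats land in the same place.
ingest('{"sensor": "t1", "value": "21.5"}')
ingest('2015-12-11,tweet,"data lakes are great"')
ingest('{"sensor": "t2", "value": 19}')

readings = read_sensor_readings(raw_lake)
```

Notice that the CSV-style line is still sitting in `raw_lake` after the read. Whether anyone ever comes back for it is exactly the question the rest of this post is about.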

But, a "Data Lake"?  Well, it's a diplomatic term, and sounds good for business presentations. But perhaps it's really more like a data junk closet. I've also seen the term "Data Swamp" used. The place where you put all kinds of stuff - and tell yourself that you'll organize it tomorrow. Perhaps.

So which is it...

... a sound and wise approach to storing data, or just a dustbin for that which will not be used?  That all depends on the organization. Most likely, it will become both, with only discipline minimizing the chaos.

A real purpose

As organizations start to move towards fast data, the data lake will finally take on a real purpose. Sensor data, social media feeds, weather data, traffic data feeds, and the list goes on. If it applies to your business, you or your competitor will start to capture it, store it in a data lake, and start to analyze how it impacts your business.

You can find an excellent write up of the uses of a data lake here:
