Archive for the ‘Grid Computing’ Category

Hadoop Summit: Facebook creates business intelligence tool called Hive

Thursday, March 27th, 2008

Hive was developed iteratively by a two- or three-person team (I think Jeff Hammerbacher was also involved), making it easy for business analysts to ask ad hoc questions of terabytes’ worth of logfile data by abstracting MapReduce into a SQL-like dialect. Think of it as a data warehouse sitting on top of thousands of servers’ logfiles. Beneath the surface, Hive leverages Hadoop and translates SQL-like imperatives into MapReduce jobs.

http://blog.blist.com/index.php/2008/03/26/hadoop-summit-best-in-show/

I like seeing SQL-like dialects put on top of MapReduce operations. I’m working on my own… WesQL, j/k. :)
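To make the idea of a SQL-like dialect on top of MapReduce concrete, here is a minimal sketch in plain Python of how a grouped count might break down into map, shuffle, and reduce steps. The query in the comment, the sample log lines, and every function name are hypothetical illustrations; this is not Hive’s actual syntax or query planner.

```python
from collections import defaultdict

# Hypothetical SQL-like question an analyst might ask of the logs:
#   SELECT page, COUNT(*) FROM hits GROUP BY page

def map_phase(log_lines):
    """Emit a (page, 1) pair for every "user page" log line."""
    for line in log_lines:
        _user, page = line.split()
        yield page, 1

def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key to produce the final counts."""
    return {key: sum(values) for key, values in groups.items()}

def count_by_page(log_lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(log_lines)))

# Made-up sample data in an assumed "user page" format.
sample_logs = ["alice /home", "bob /home", "alice /photos"]
print(count_by_page(sample_logs))
```

On a real cluster the map and reduce steps would run on many machines at once; the point here is only that a GROUP BY maps naturally onto the key used in the shuffle.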

Hive is in use by ~40 people, or ~25% of Facebook’s engineering team (thus Facebook’s engineering team size is 40*4 = 160). It stores a total of 22 TB of compressed data, with a daily increase of ~200 GB.

http://parand.com/say/index.php/2008/03/25/hadoop-summit-notes/
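The team-size estimate is just the user count divided by the quoted share. A quick sketch, using only the two approximate figures from the notes (the variable names are mine):

```python
# Figures quoted in the notes: ~40 Hive users, said to be ~25% of engineering.
hive_users = 40
share_of_engineering = 0.25

# Implied team size; only as accurate as the two rounded inputs.
team_size = hive_users / share_of_engineering
print(team_size)
```

Since both inputs are rounded, the result is a ballpark figure and not a headcount.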

Hive and its query language remind me of WebQL, except that it lacks strict MapReduce. Update: This model is similar to DryadLINQ, which “treats the data flow as a general graph instead of forcing it into map/reduce” (from parand.com).

[Image: the QL2 studio showing a graph of WebQL statement joins]

Who doesn’t need an Amazon EC2 Search Engine AMI?

Friday, May 4th, 2007

Last year I was just observing the revolution when I spoke of inverting the CPU model into a rented hourly service for machine images. Now Amazon is asking, “Who needs a search engine packed as an AMI?” I say, who doesn’t? We’ve been watching desktop search happen in every company from Google to Yahoo!, Microsoft, and of course Apple.

I’ve been waiting for an announcement by a big-iron-chewing, per-CPU-licensed application provider that they’ll support a per-hour model for using their EC2 AMIs, but I suspect they’ll avoid going that route to maintain their top-notch support contracts and hardware company kickbacks.

When PowerSet and Amazon announced a partnership, open source implementations were sure to come soon after, and they arrived only a few months later.

We have the pieces of the puzzle with Hadoop, EC2, Lucene, and Solr, and the final piece, presentation, has been demonstrated by “Open-Source Endeca in 250 Lines or Less.” It’s going to get exciting in the next year.

Need a cheap MapReduce? Amazon EC2 and Hadoop is your answer.

Thursday, January 11th, 2007

It’s time to re-examine those long-running batch jobs. Could you partition the data to allow for MapReduce? I bet you can. I know I’ve always wanted an affordable way to fire up 30 servers and run MapReduce operations against giant datasets. It’s confirmed: I’m a dork.
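As a rough sketch of what partitioning a batch job looks like, the Python below splits the input into independent slices, processes each slice separately, and then combines the partial results. Threads on one machine stand in for the servers, the per-record work is a placeholder, and all names are hypothetical; none of this is the Hadoop API.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(records, num_partitions):
    """Split records into num_partitions independent slices."""
    return [records[i::num_partitions] for i in range(num_partitions)]

def process_partition(records):
    """Map step: a placeholder for the per-record work of the batch job."""
    return sum(len(record) for record in records)

def run_job(records, num_workers=4):
    """Process each partition in parallel, then combine the partial results."""
    parts = partition(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_partition, parts))
    return sum(partial_results)  # reduce step

# Made-up input records.
records = ["alpha", "beta", "gamma", "delta"]
print(run_job(records))
```

The job qualifies only if each slice can be processed without looking at the others and the partial results can be merged afterward; the answer should not depend on how many workers you use.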

Tom White sent me a note this week to inform me that he had implemented a Hadoop file system on top of S3. This file system can be used as a full or partial replacement for HDFS, the Hadoop Distributed File System.

Because bandwidth between EC2 instances and data stored in S3 is not metered or billed, this is a very cost-effective way to process large amounts of data.

Hadoop Filesystem Using S3

Seattle Devs: Concurrent Software Development Talk Monday 12/11/2006

Friday, December 8th, 2006

Digipede Evangelista Kim Greenlee will be giving a talk on concurrency and software development at the .NET Developer Association Monday on the Microsoft campus. dan ciruli’s West Coast Grid

If I didn’t already have plans Monday, I’d be at this talk. Sure, it’s over in Redmond, but it sounds really interesting.

With the advent of dual-core and dual-processor machines, concurrent software development is breaking onto the scene–creating a paradigm shift like we haven’t seen since the OO movement. Two technologies that will help you add concurrency to your applications are threads and grid objects. While you can find a lot of documentation about adding threading to your apps, there isn’t very much available to tell you how threads really work. I’ll explain that to you. I will also use the Digipede Network to introduce you to grid objects. Grid objects give you the ability to easily distribute application functionality on a compute grid.

http://www.netda.net/Event/EventNewsletter.asp?EventDate=12/11/2006

IBM developerWorks: Introduction to Grid Computing

Thursday, September 7th, 2006

Grid computing is a critical shift in thinking about how to maximize the value of computing resources. The technology is still fairly nascent, but here at the developerWorks grid computing zone, we’re publishing a steady stream of new articles, tutorials, resources, and tools to bring developers up to speed on this important cutting-edge technology.

developerWorks Introduction to Grid Computing

Looks like a great series of articles; go check ’em out.

Amazon Elastic Compute Cloud (Amazon EC2) - Limited Beta

Friday, August 25th, 2006

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers. http://www.amazon.com/gp/browse.html?node=201590011

EC2 lets you load an arbitrary number of OS images known as AMIs (an Amazon Machine Image is a configured Linux distro ready to run your application) and then get charged for the number of CPU hours required to complete your processing. Oh, this is something to watch. It’ll be very interesting to see how companies that currently only offer per-CPU licensing react to this: perhaps with per-CPU-hour licensing, or could they now flip the model inside out and sell their software as a service?
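One consequence of being charged by the CPU hour is worth spelling out: if the work splits cleanly, many machines for a short time cost the same as one machine for a long time. A minimal sketch, assuming a perfectly parallel job and a made-up rate (the function name and the price are hypothetical, not Amazon’s published pricing):

```python
def job_cost_cents(instances, hours_each, rate_cents_per_instance_hour):
    """Cost of a job billed purely by instance-hours consumed."""
    return instances * hours_each * rate_cents_per_instance_hour

# Hypothetical rate, chosen only for illustration.
RATE_CENTS = 10

one_slow = job_cost_cents(instances=1, hours_each=30,
                          rate_cents_per_instance_hour=RATE_CENTS)
many_fast = job_cost_cents(instances=30, hours_each=1,
                           rate_cents_per_instance_hour=RATE_CENTS)
print(one_slow, many_fast)
```

Under those assumptions the bill is identical, but the answer arrives 30 times sooner. Real jobs rarely split perfectly, so treat this as the best case.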

Why is the Digipede network good for Windows environments?

Friday, June 2nd, 2006

Answer? You already have a Windows environment and an IT staff that can work in it. Retraining your staff to manage a new OS or configure dedicated hardware/infrastructure for your computing needs is unreasonable for most IT departments. This is where the Digipede network shines.

My friend Matt Michie is a big fan of all things open source, has far more experience than I when it comes to writing MPI code, and I believe he has actually worked with large-scale clusters for parallel processing. All considered, I don’t think he understands the business decisions that drive my recommendations for the Digipede Network.

It would be a lot nicer if I could do Grid Computing on an OS that didn’t require a GUI and a video card. Matt

Matt, how does ‘nicer’ compare to cost-effective?

Matt mentions energy:

Electricity is one of the biggest economic factors in large grids…. Matt

Why not use the wasted cycles on all those existing machines in your corporate network rather than increasing the total amount of energy used? If you want to install N computers in a data center, cool them all, and power the switches and all the new hardware to save energy, go for it. Or you can just use your existing computers in what Digipede calls the “Desktop Grid Configuration”. How say you, sir?

Oh, would you like a transitional solution? You should check out Hadoop, which implements MapReduce in Java.

Ok, I’m picking on Matt here. He knows I’m not a big fan of working on Windows, but he should also be aware that there are good reasons behind most of my recommendations. :)