Cloud computing is the big thing these days, whether you are an app developer using EC2 or Google App Engine, or a new company trying to build your own cloud product. If you are hosting or using a platform, it costs money. I hate to spend money, especially money that is for my company. If I find an idle server, I use it to 100% utilization (prior to the saturation point).
I needed to build a new application that periodically crawls a website to update various lists. Building a crawler is expensive, especially from scratch. First, you have to define how much lag is allowed between the crawled copy and the live copy. Of course the Project Manager does not want any lag: all events must be caught near real-time, without overloading the source of the data. But I am not hating, it is a challenge. Next, what technology to use, what language to write the app in, and what other considerations are left to be defined? How does one crawl gigabytes, even terabytes, of data within a guaranteed period? On top of that, how much additional hardware is this going to cost? In addition, to be a cloud it needs to have an API so app developers can set and get consistent data within an expected period. That is a lot of freaking requirements.
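To make that last requirement concrete, here is a rough sketch of the shape I have in mind for the API; the interface and field names below are illustrative, not the actual API we built:

```typescript
// Illustrative only: the rough shape of a set/get API for consumers of the
// crawled data. "maxAgeSeconds" expresses how stale a copy the caller will
// tolerate, which is the "expected period" guarantee described above.
interface CrawledItem {
  url: string;
  fetchedAt: number;  // epoch millis when a collector last fetched the URL
  checksum: string;   // checksum of the fetched body, used for validation
  body: string;
}

interface CrawlCloudApi {
  // Register a feed/URL so the collector pool starts keeping it fresh.
  set(url: string, refreshSeconds: number): Promise<void>;

  // Fetch the latest crawled copy, no older than maxAgeSeconds.
  get(url: string, maxAgeSeconds: number): Promise<CrawledItem>;
}
```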
Therefore, to solve this issue, I knew that MySQL would store the data, but getting the data is the hard part. That is what is going to cost money, lots of it. I looked around at common architectures and found that nothing would do what I wanted in a cost effective manner. So, I designed my own using SETI@home as the basis for the design.
Get to the point already, Dathan:
I have turned every user who views my applications into a collector, using idle bandwidth without revealing who is collecting the data. On spare cycles, my user base fetches a feed of my choosing and sends that data to my servers without any personal information attached. Instead of using an Amazon or Google service, I have turned my user base into a cloud to service their own needs.
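A minimal sketch of the client-side piece, assuming a hypothetical assignment endpoint and report endpoint (the real implementation also has to handle the cross-domain issues I mention later, which I am glossing over here):

```typescript
// Sketch of a browser-side collector. The endpoint paths and payload fields
// are hypothetical. It is meant to run on spare cycles (e.g. from an idle
// timer) so the end user never notices the extra requests.
async function collectOnSpareCycles(assignmentServer: string): Promise<void> {
  // 1. Ask my servers which feed this client should fetch right now.
  const assignment = await fetch(`${assignmentServer}/assignment`).then(r => r.json());
  if (!assignment || !assignment.url) {
    return; // nothing to collect at the moment
  }

  // 2. Fetch the feed using the user's idle bandwidth.
  const body = await fetch(assignment.url).then(r => r.text());

  // 3. Report the content back with no personal information attached.
  await fetch(`${assignmentServer}/report`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url: assignment.url, body }),
  });
}
```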
It is rather awesome, I must say. I am able to service the needs of more than 60 million users at the cost of development time, and NO NEW HARDWARE. The cloud does not have to be a service provider; it can be the end user, as long as the end user is not impacted by the requests. BTW, the team that I manage is freaking awesome: they built my vision with trial and error and a hand-waved spec.
Currently the system scales as long as there are enough end users. If I lose all my users then, well, I am boned, but to support the feeds all I need is 100K nodes at the current rate. With 60 million end nodes, I am cool.
Imagine if Google used the AdSense install base to tell it whether the data has changed for an arbitrary web address. All it needs is a few people to hit the same URL and inform Google that the address in question has a different checksum; at that point Google's crawlers can go fetch it, instead of constantly re-crawling data that doesn't change. Google would be able to reduce overall server cost significantly if it knew what data has changed instead of guessing.
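A back-of-the-envelope sketch of that change-detection loop (names and thresholds are hypothetical, and this is not a claim about how Google actually works): only enqueue a real crawl once a few independent clients agree that a URL's checksum has changed.

```typescript
// Hypothetical change-detection: only schedule a real crawl when several
// distinct clients report the same new checksum for a URL.
const lastKnownChecksum = new Map<string, string>();      // url -> checksum
const pendingReporters = new Map<string, Set<string>>();  // `${url}|${checksum}` -> reporter ids
const REPORTERS_NEEDED = 3; // a few independent confirmations before acting

function onClientReport(
  url: string,
  checksum: string,
  reporterId: string,
  enqueueCrawl: (url: string) => void,
): void {
  if (lastKnownChecksum.get(url) === checksum) {
    return; // the page has not changed, no crawl needed
  }

  const key = `${url}|${checksum}`;
  const reporters = pendingReporters.get(key) ?? new Set<string>();
  reporters.add(reporterId);
  pendingReporters.set(key, reporters);

  if (reporters.size >= REPORTERS_NEEDED) {
    // Enough distinct clients saw the same new checksum: fetch it for real.
    enqueueCrawl(url);
    lastKnownChecksum.set(url, checksum);
    pendingReporters.delete(key);
  }
}
```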
By next year's Velocity conference I hope to have a full disclosure of what technology my team used, how my team got around cross-domain issues, and how to compute checksums to validate the data.
PS: I designed this; with my team we made it much better, and one person implemented it and owns the product from this point on.
9 comments:
So... This must be "malware" to some degree. If you offload the work to your consumers, they are being forced to perform the cloud functionality for you - at their cost. Not particularly an honest approach, eh?
Mal- means bad in Latin. No identifiable information of the end user is collected. Plus, this is a tool to service their needs, i.e. they are asking for this service; to get that service for free they might be required to fetch some data to which they subscribe.
Think of it like a toolbar: if I visit a web page, my toolbar tells Yahoo or Google that said web page was visited. To use the toolbar, I accept the fact that anonymous information is used by them.
I do like the idea. It's come up before, and the key concern has been data source integrity... or in short: trust.
You'd probably have to get a certain amount of the work duplicated to catch bogus input from your cloud. I think it's do-able, but you'll have to use some of your cycles to ensure evilness in the cloud can't poison the system. With sufficient numbers, no worries.
Yeah, on top of that, sometimes the source of the data has what I like to call slave lag, or some content is not real-time in certain countries. So this case is a pain as well.
Cloud computing is good, but I don't build my key business on it. Security above everything.
How do you take care of a malicious user that doesn't like you using his computer to do your work and starts sending bogus data back?
@Guillaume:
A checksum is created and verified by multiple users in distinct geographic locations (roughly the sketch below).
Also, peeps can opt out of this, but in this age of revealing all your personal information on social networking sites, the vast majority of people do not care, based on my stats.
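For the curious, the verification is conceptually something like this sketch; the region field, the threshold, and the names are illustrative, not the production code:

```typescript
// Sketch: accept a crawled value only when collectors in several distinct
// geographic regions report the same checksum.
interface CollectorReport {
  url: string;
  checksum: string;
  region: string; // derived server-side from the reporting connection, not tied to a user
}

function acceptedChecksum(reports: CollectorReport[], minRegions = 3): string | null {
  // Count the distinct regions that confirmed each checksum value.
  const regionsByChecksum = new Map<string, Set<string>>();
  for (const r of reports) {
    const regions = regionsByChecksum.get(r.checksum) ?? new Set<string>();
    regions.add(r.region);
    regionsByChecksum.set(r.checksum, regions);
  }

  // Pick the checksum confirmed from the most regions, if it clears the bar.
  let best: string | null = null;
  let bestRegions = 0;
  for (const [checksum, regions] of regionsByChecksum) {
    if (regions.size > bestRegions) {
      best = checksum;
      bestRegions = regions.size;
    }
  }
  return bestRegions >= minRegions ? best : null;
}
```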
awesome stuff dathan!
I think it's a great idea, but you have to be sure of the data integrity. If you can find a way to prevent bogus data coming in then it'd be amazing.