NoSQL: If Only It Was That Easy

The biggest thing in web apps since “rails can’t scale” is this idea that “your rdbms doesn’t scale.” This has gone so far as to be dubbed the coming of age of “nosql,” with lots of blog posts and even a meetup. Indeed, there are many promising key-value stores, distributed key-value stores, document-oriented dbs, and column-oriented db projects on the radar. This is *definitely* a great thing for the web application scene, and this level of variety will open doors for organizations large and small in the near and long term.

However, along with these great tools, an attitude that “the rdbms is dead” has popped up, and while that may be true in the long run, in the short term, it’s definitely premature.

What is scaling?

First, let’s get a couple things straight:

to scale (third-person singular simple present scales, present participle scaling, simple past and past participle scaled)

  1. (transitive) To change the size of, maintaining proportion.
    We should scale that up by a factor of 10.
  2. (transitive) To climb.
    Hillary and Norgay were the first known to have scaled Everest.
  3. (intransitive) (computing) To tolerate significant increases in throughput or other potentially limiting factors.
    That architecture won’t scale to real-world environments.

The first thing we need to agree on is what it means to scale. According to the definitions above, it’s to change size while maintaining proportion, and in computing that usually means tolerating increased throughput. What scaling isn’t: performance.

performance (plural performances)

  1. The act of performing; carrying into execution or action; execution; achievement; accomplishment; representation by action; as, the performance of an undertaking of a duty.
  2. In computer science: the amount of useful work accomplished by a computer system compared to the time and resources used. Better performance means more work accomplished in less time and/or using fewer resources.

In reality, scaling doesn’t have anything to do with being fast. It has only to do with size. If your request takes 12 seconds, that doesn’t matter; what matters is whether you can serve 1, 10, 100, or 1000 of those 12-second requests per second.

Now, scaling and performance do relate: typically, if something is performant enough, it may not actually need to scale, because your upper limit is high enough that you never have to worry about it. The problem with rails is not that it doesn’t scale (it happens to scale pretty easily), it’s that you have to scale it almost immediately. The problem with RDBMSs isn’t that they don’t scale, it’s that they are incredibly hard to scale. Sharding is the most obvious way to scale things, and sharding multiple tables which can be accessed by any column pretty quickly gets insane (see the sketch below). Furthermore, you might be able to use something other than an RDBMS that you won’t need to scale at all, because it’s more performant or efficient at the work you’re currently doing in an RDBMS.
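
To make that concrete, here’s a minimal Ruby sketch (the connections and table are invented for illustration) of a users table sharded by id. Lookups by the shard key stay cheap, but lookups by any other column have to fan out to every shard, and every new column you want to query by makes this worse.

    # Hypothetical setup: three MySQL connections (db0, db1, db2 assumed
    # to exist), with the users table sharded by numeric user id.
    SHARDS = [db0, db1, db2]

    def shard_for(user_id)
      SHARDS[user_id % SHARDS.size]
    end

    # Lookup by the shard key: one query, one server.
    def find_user(user_id)
      shard_for(user_id).query("SELECT * FROM users WHERE id = #{user_id}").fetch_row
    end

    # Lookup by any other column: every shard has to be asked.
    def find_user_by_email(email)
      SHARDS.each do |db|
        row = db.query("SELECT * FROM users WHERE email = '#{email}'").fetch_row
        return row if row
      end
      nil
    end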

So NoSQL then…

At my previous job, my co-workers and I evaluated most of the current nosql solutions to varying degrees. All of the projects were evaluated both for use as simple “tables” of data, such as storing a single type of key/value data, and for use as a document db to be our primary data store. We have a curious data set which consists of a single set of parent objects with many child objects that relate 1-1 or 1-n with that set (I call these primary objects). We then have a secondary set of objects that stores changes made to the primary objects, which we use primarily for auditing. Our current db setup is standard master-slave replication in MySQL, with 1 master and up to 3 slaves depending on usage. The primary objects are mostly changed via UPDATEs, and the secondary objects are all inserted at the ends of the tables. We also have a few random other data sets which loosely relate to the primary parent objects.

To get a couple out of the way, I’m not going to cover memcached (because it’s not a db), memcachedb (general sentiment that it is immature), couchdb (because we didn’t want to use map/reduce to pull information, and there are questions about its performance and replication), dynomite (seen as immature), Amazon SimpleDB (because of size limits), or Lightcloud (seen as immature). As for the ones I deemed immature, I’m sure there are people out there using these things and having a great time, but our research into them, and word of mouth from others who have tried them, kept us from really going deep.

Tokyo *

url: http://tokyocabinet.sourceforge.net/
type: Key/Value store with Full Text Search*
Conclusion: Doesn’t scale.

We liked Tokyo Tyrant so much, we put it in production. In fact, every request to AboutUs.org hits Tokyo. One of the uses is as a persistent memcached replacement for caching 10 million+ wiki pages (as a json document of all the pieces of our page, which comes out to around 51 GB of data; updated from the original 14 GB, see the comments), and it works great. It runs on a single server, it serves up a single type of data, very quickly, and has been a pleasure to use. We keep other ancillary data sets on some other servers too, and it’s great for this. Tokyo Tyrant is a great example of very performant software, but it doesn’t scale. If you’d like to make it scale, it’s not very hard: you scale it exactly like memcached, by some sort of application-side hashing of keys (sketched below). You can have as many servers as you’d like, but you can’t easily add servers to a cluster (increase in size while maintaining proportion), and therefore you can’t tolerate significant increases in throughput. The good news is that here “significant” is relatively massive, and you probably won’t need to scale it any time soon.
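
For the curious, “scale it exactly like memcached” boils down to something like this sketch (the server addresses are hypothetical); the modulo at the end is also exactly why you can’t just add machines:

    require 'digest/md5'

    SERVERS = ['tt1:1978', 'tt2:1978']  # hypothetical Tokyo Tyrant nodes

    # Every client must agree on this key -> server mapping.
    def server_for(key)
      SERVERS[Digest::MD5.hexdigest(key).to_i(16) % SERVERS.size]
    end

    server_for('page:AboutUs')  # always the same node for a given key...
    # ...until SERVERS grows: add a third node and most keys suddenly
    # hash to a different server, stranding the data they point to.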

We tried to insert 160 million 2 KB to 20 KB documents into a single Tokyo Tyrant server, and performance quickly dropped off and kept going down. You could have had a nice holiday skiing on the graph of inserts/sec. This is pretty typical of anything that writes to a disk: the more you write, the slower it goes. And we couldn’t distribute the writes easily, because Tokyo doesn’t scale.

Tokyo does support replication, and a few other great things, but these don’t make for scaling.

* We don’t use the full text search, so I can’t comment there.

Redis

url: http://code.google.com/p/redis/
type: Key/Value store with collections and counters
Conclusion: Doesn’t scale.

Redis is also awesome like Tokyo. I would say the two are pretty comparable as simple k/v stores. The counters and collections are AWESOME, and if I were still at AboutUs, I think I’d be pushing to move a couple pieces of the infrastructure to Redis. I have less to say about Redis because I haven’t used it in production, but it looks great if it fits your bill. Does it scale? No. Like memcached and Tokyo Tyrant, it can be sharded by handling it in the client, and therefore you can’t just start adding new servers and increase your throughput. Nor is it fault tolerant: your Redis server dies, and there goes that data. And just as with Tokyo Tyrant and memcached, you probably won’t ever need to try to scale it. Redis also supports replication.
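
To give a taste of why the counters and collections are so nice, here’s a rough sketch using the redis-rb gem (connection details assumed):

    require 'redis'
    redis = Redis.new(:host => 'localhost')

    redis.incr('pageviews:home')           # atomic counter, no read-modify-write
    redis.sadd('tags:page42', 'nosql')     # sets, with cheap membership tests
    redis.lpush('recent_edits', 'page42')  # lists, handy as queues or logs
    redis.lrange('recent_edits', 0, 9)     # the ten most recent edits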

Project Voldemort

url: http://project-voldemort.com/
type: Distributed Key/Value store
Conclusion: Scales!

Voldemort is a very cool project that comes out of LinkedIn. They even seem to be providing a full-time guy doing development and support via a mailing list. Kudos to them, because Voldemort, as far as I can tell, is great. Best of all, it scales. You can add servers to the cluster, you don’t do any client-side hashing, and throughput increases as the size of the cluster increases. As far as I can tell, you can handle any increase in requests by adding servers, and those servers are fault tolerant, so a dead server doesn’t bring down the cluster.

Voldemort does have a downside for me, because I primarily use ruby and the provided client is written in java, so you either have to use JRuby (which is awesome but not always realistic) or Facebook Thrift to interact with Voldemort. This means thrift has to be compiled on all of your machines, and since Thrift uses the Boost C++ library, and Boost is both slow and painful to compile, the deployment time of Voldemort apps increases significantly.

Voldemort is also interesting because it has a pluggable data storage backend: the bulk of the project is about sharding and fault tolerance, and less about data storage. Voldemort might actually be a good layer on top of Redis or Tokyo Cabinet some day.

Voldemort, it should be noted, is also only going to be worth using if you actually need to spread your data out over a cluster of servers. If your data fits on a single server in Tokyo Tyrant, you are not going to gain anything by using Voldemort. Voldemort, however, might be a good migration path from Tokyo * when you do hit that wall where performance alone isn’t enough.

MongoDB

url: http://www.mongodb.org
type: Document Database
Conclusion: Doesn’t scale (yet!)

MongoDB is not a key/value store, it’s quite a bit more. It’s definitely not an RDBMS either. I haven’t used MongoDB in production, but I have used it a little building a test app, and it is a very cool piece of kit. It seems to be very performant, and it either has, or will soon have, fault tolerance and auto-sharding (aka it will scale). I think Mongo might be the closest thing to an RDBMS replacement that I’ve seen so far. It won’t work for all data sets and access patterns, but it’s built for your typical CRUD stuff. Storing what is essentially a huge hash, and being able to select on any of those keys, is what most people use a relational database for. If your DB is 3NF and you don’t do any joins (you’re just selecting from a bunch of tables and putting all the objects together, aka what most people do in a web app), MongoDB would probably kick ass for you.

Oh, and did I mention that, of all the NoSQL options out there, MongoDB is one of the only ones being developed as a business, with commercial support available? If you’re dealing with lots of other people’s data, and have a business built on the data in your DB, this isn’t trivial.

On a side note, if you use Ruby, check out MongoMapper for very easy, nice-to-use Ruby access.
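
Here’s a minimal MongoMapper sketch (the Page model and its keys are made up for illustration; assumes a local mongod and default connection settings) showing the “huge hash you can select on” idea:

    require 'mongo_mapper'

    class Page
      include MongoMapper::Document
      key :title, String
      key :views, Integer
    end

    Page.create(:title => 'NoSQL', :views => 0)
    Page.first(:title => 'NoSQL')  # select on any key, no migrations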

Cassandra

url: http://incubator.apache.org/cassandra/
type: Column Database
Conclusion: Probably scales

Cassandra is another very promising project that I wouldn’t use yet. Cassandra came out of Facebook and seems to be in use there powering search in your inbox. It’s described as a distributed key/value store, where values can be collections of other key/values (called column families). It is definitely supposed to scale, and probably does at Facebook, by simply adding another machine (they will hook up with each other using a gossip protocol), but the OSS version doesn’t seem to support some key things, like losing a machine altogether. They are also in the midst of changing how the basic data structures are stored on disk, and I don’t know that I’d trust my data to this sexy db until those things are worked out, which should be soon.
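
To picture the data model, here are plain nested Ruby hashes standing in for it (the names are invented, and this is not a client API): a row key maps to column families, whose values are themselves collections of key/values.

    # Illustration only: inbox search as a Cassandra-style structure.
    inbox_search = {
      'user:1234' => {                     # row key
        'TermColumns' => {                 # column family
          'nosql' => ['msg17', 'msg203'],  # column => values
          'mysql' => ['msg8'],
        },
      },
    }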

Cassandra also seems like a contender for a primary database or RDBMS replacement, as soon as it matures. The scaling possibilities are very attractive, and complex data structures shouldn’t be hard to model in it. I’m not going to go any deeper on Cassandra because Evan Weaver did a great job of it here, but I will say that Cassandra is very promising, and we were (when I left) looking at it very closely at AboutUs.org.

Amazon S3

url: http://aws.amazon.com/s3/
type: key/value store
Conclusion: Scales amazingly well

You’re probably all like “What?!?”. But guess what, S3 is a killer key/value store. It is not as performant as the other options, but it scales *insanely* well. It scales so well, you don’t do anything. You just keep sticking shit in it, and it keeps pumping it out. Sometimes it’s faster than other times, but most of the time it’s fast enough. In fact, it’s faster than hitting MySQL with 10 queries (for us). S3 is my favorite k/v data store of any out there.
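
If S3-as-a-k/v-store sounds odd, here’s roughly what it looks like with the aws-s3 gem (the bucket name and payload are placeholders, credentials assumed to be in the environment):

    require 'aws/s3'

    AWS::S3::Base.establish_connection!(
      :access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
      :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY']
    )

    page_json = '{"title": "AboutUs"}'  # made-up payload
    AWS::S3::S3Object.store('page:42', page_json, 'my-pages-bucket')  # put
    AWS::S3::S3Object.value('page:42', 'my-pages-bucket')             # get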

MySQL

url: http://www.mysql.com
type: RDBMS
Conclusion: Doesn’t Scale

Now you are probably like “Dude, what?!? You got some SQL in this NoSQL article”. I’ve got news for you guys: MySQL is a pretty bad ass key/value store. It can do everything that Tokyo and Redis can do, and it really isn’t that much slower. In fact, for some data sets, I’ve seen MySQL perform a LOT faster than Tokyo Tyrant (I’ll post my findings in a follow up). For most applications (and, say, FriendFeed), MySQL is plenty fast, and it’s familiar and ubiquitous. I’m sure the NoSQL guys reading this will all be saying “Yeah, but we are dealing with more data than MySQL can handle”. Well, you might be dealing with more data than MySQL used as an RDBMS can handle, but it’s just as easy (or easier) to shard MySQL as it is Tokyo or Redis, and it’s hard to argue that they win on many other points.
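
The whole trick, sketched here with the classic mysql gem (connection details assumed), is a two-column table:

    require 'mysql'
    db = Mysql.real_connect('localhost', 'user', 'pass', 'kv_store')

    db.query(<<-SQL)
      CREATE TABLE IF NOT EXISTS kv (
        k VARBINARY(255) PRIMARY KEY,
        v LONGBLOB
      ) ENGINE=InnoDB
    SQL

    db.query("REPLACE INTO kv (k, v) VALUES ('page:42', '...json...')")  # set
    db.query("SELECT v FROM kv WHERE k = 'page:42'").fetch_row           # get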

Conclusion

So, does the RDBMS scale? I would say the answer is: not any worse than lots of other things. Most of what doesn’t scale in an RDBMS is stuff people don’t use that often anyway. And does NoSQL scale? A couple solutions do; most don’t. You might even argue that it’s just as easy to scale MySQL (with sharding via MySQL Proxy) as it is to shard some of these NoSQL dbs. And I think it’s a pretty far leap to declare the RDBMS dead.

The real thing to point out is that if you are being held back from making something super awesome because you can’t choose a database, you are doing it wrong. If you know MySQL, just use it. Optimize when you actually need to. Use it like a k/v store, use it like an rdbms, but for god’s sake, build your killer app! None of this will matter to most apps. Facebook still uses MySQL, a lot. Wikipedia uses MySQL, a lot. FriendFeed uses MySQL, a lot. NoSQL is a great tool, but it’s certainly not going to be your competitive edge, it’s not going to make your app hot, and most of all, your users won’t give a shit about any of this.

What am I going to build my next app on? Probably Postgres. Will I use NoSQL? Maybe. I might also use Hadoop and Hive. I might keep everything in flat files. Maybe I’ll start hacking on Maglev. I’ll use whatever is best for the job. If I need reporting, I won’t be using any NoSQL. If I need caching, I’ll probably use Tokyo Tyrant. If I need ACIDity, I won’t use NoSQL. If I need a ton of counters, I’ll use Redis. If I need transactions, I’ll use Postgres. If I have a ton of documents of a single type, I’ll probably use Mongo. If I need to write 1 billion objects a day, I’d probably use Voldemort. If I need full text search, I’d probably use Solr. If I need full text search of volatile data, I’d probably use Sphinx.

If there’s anything to take away from the NoSQL debate, it’s just to be happy there are more tools, because more cool tools = more win for everyone.

  • I always enjoy learning what other people think about Amazon Web Services and how they use them. Check out my very own tool CloudBerry Explorer, which helps to manage S3 on Windows. It is freeware. http://cloudberrylab.com/

  • didip

    “(as a json document of all the pieces of our page, which comes out to around 14gb of data)”

    It has now grown to about 51 GB =)

  • Wow, that was something I never even considered. S3 as a database. Suddenly I can think of many applications for that. Thanks for sparking that inspiration for me!

    • BJ Clark

      Thanks Didip. Updated. 🙂

  • Jake

    Quick correction:

    “Your redis server dies, and there goes that data.” Redis is not memcached, in that it periodically saves all your k/v, list and set information to disk. The interval at which disk saves occur can be configured (by seconds elapsed, keys created, or keys created per second).

  • Just out of interest, when you say “I’ve got news for you guys, mysql is a pretty bad ass key/value store.”, do you just mean “mysql, but denormalized to hell”, or is there some more detail there …?

    • BJ Clark

      I mean, if you create a table in MySQL, with 2 columns, “key” and “value”, the performance is on par with most other key/value stores. Not to mention you can do all the sql-y things that hash tables don’t provide.

  • If you need to write one billion objects a day, Cassandra would be a better bet than Voldemort.

  • So there is scaling and there is scaling. It is extremely unlikely you are building something that is going to outscale an rdbms. I worked recently at a news site which delivered ~300k pages an hour, with all the data in an Oracle rdbms.

    • Jamieson Becker

      On the contrary, there are many, many applications that will scale beyond ~300k pages per hour. Typically the first wall you run into involves millions of user accounts. The other pertinent point is that although Oracle can indeed scale to heights that make most other RDBMSs look like toys, it does so at a cost that makes it irrelevant for most dot-com startups today. There is a reason why Google, Flickr, Amazon, Yahoo, LinkedIn, etc. have all created their own unique and specialized data stores.

      • Jamieson Becker

        (Of course, most applications, and indeed many parts of even Internet-scale applications, would be well-served by considering all options, including powerful RDBMs.)

  • I’m surprised you didn’t give HBase a try. It’s a database that scales very well, and though it suffers from some API problems it’s under active development as part of the Hadoop project.

  • Of course it is true that RDBMS is not appropriate for some applications. I spent the last six years working for a p2p data capture vendor using a proprietary Key/Value data store. Whilst this solution suited the domain wonderfully, we suffered badly when trying to report on that database.

    In the end we ended up bridging the data back to an RDBMS system purely for reporting purposes. That was/is a difficult proposition.

    In many cases this comes down to which compromises you’re ready to make to serve one goal or another. Not all use cases involve simply picking isolated data from a table, and while denormalised forms can be incredibly efficient for some cases they can also represent a maintenance nightmare.

    For me the golden rules are: Think. Use your own brain. Never trust in magic bullets, no matter how many people are sucking on the koolaid.

  • Gary

    NoSQL: If Only It *Were* That Easy

  • Robert

    “We tried to insert 160 million 2 KB to 20 KB documents into a single Tokyo Tyrant server, and performance quickly dropped off and kept going down.”

    Um… what the hell did you think would happen? That doesn’t test scalability at all.

  • Re Cassandra:

    > the OSS version doesn’t seem to support some key things, like losing a machine altogether.

    That’s something you have to repair manually right now, yes. But the other systems you mention here are either in the same boat (Voldemort) or don’t even try to autopartition your data (everyone else*).

    > They are also in the midst of changing how the basic datastructures are stored on disk

    That doesn’t affect how you design your app, though. PostgreSQL has changed their on-disk format incompatibly with _each major version_ for the past nine or so years, so clearly this isn’t a show-stopper. 🙂

    * (technically, MongoDB has alpha support for partitioning now, so maybe it belongs in the “partitions, but doesn’t handle full machine loss automatically” category instead.)

  • (Apache) Thrift has a patch ready to remove the Boost dependency for the compiler; some kinks are just being worked out. This would remedy the time-to-deploy. I think the Ruby lib might still depend on it, but I don’t use the Ruby lib so I’m not positive.

    I’m surprised you didn’t bring up HBase, but I am currently undergoing a switch from Dynomite + HBase to Cassandra (reasons: Windows compatibility, easier setup, fast writes of Dynomite with accessibility of HBase, no SPOF). So far, it has surpassed my expectations.

  • yachris

    Another face of scaling that you haven’t mentioned is thread/process scaling on large multi-CPU machines. We’re running on 32, 64 and larger dual-core HP, Sun, SGI, etc. machines (one has a terabyte of RAM). It’s the future for everyone, of course, but we’re there now.

    So our problem is finding a k/v store which doesn’t stop in the face of multiple threads. For instance, Berkeley DB (don’t blame me, I wasn’t here when it was chosen 🙂 ) is “Multi-Thread Safe” by which they mean it locks the database to do the update. This is *safe* (in the ACID sense) but stops other threads.

    Couch claims to *never* stop reads (lock-free algorithms FTW!) so we’re watching it closely.

    • Hi, I’m Greg, a product manager (and former engineer) on Berkeley DB.

      Berkeley DB locks at the page level, not the database level. Page size is configurable. Pages contain data to manage the BTREE as well as the key and value data. Threads will block when there is a lock on a page, that normally happens when someone is executing some transactional operation. However there are many ways to reduce locks and lock contention. You can change page size, or use multi-version concurrency control (using MVCC means that you never stop reads), you can use different deadlock detection methods, configure a transaction for dirty-reads (a different level of Isolation, the “I” in ACID). Essentially Berkeley DB is as concurrent as you’d like it to be. In addition in our 4.8 release we’ve added a new latch based architecture to improve performance (dramatically) and scalability on multi-core (CMP) and SMP systems.

      In short, if Berkeley DB “stops in the face of multiple threads” it is highly likely that you’ve not configured it properly, and we’re happy to help you do that. Just come to the OTN Forum on Berkeley DB and show us some sample code.

      Or you could read the docs. 😉

      All the *newer* “lock-free” databases still have to make some trade-offs related to ACID. Which part will be sacrificed? There is always something. With Berkeley DB we try to allow the developer to make those trade-offs and to choose those trade-offs at run time and if possible on a transaction-by-transaction basis. For example, one transaction may be okay with “dirty data” (uncommitted data) so turn on dirty reads. Another transaction may need to never block on reads, turn on MVCC. This is our philosophy, that we’re a database and you’re the developer. We’re the swiss army knife, you choose the blade (or blades) for the job.

      The system sizes you mention are common for Berkeley DB installations; email Howard Chu of Symas, developer of OpenLDAP, and ask him to tell you a story or two about Berkeley DB in huge-scale OpenLDAP systems in production.

      I hope this clears up some common misconceptions related to Berkeley DB.

      We like NoSQL; we’ve always been k/v!

      -greg

      • yachris

        Hey Greg,

        Thanks for the quick and informative reply! I’ll definitely check out the resources you mentioned… much appreciated.

  • MongoDB has sharding support in early alpha (code complete, but I would not use it in mission-critical production yet). Additionally, the design of the product is for horizontal scalability: it does not contain features (such as complex transactions) that would make various scaling strategies difficult in the future.

    Replication is available and production ready.

    thanks

  • There is also Schemafree.

    http://code.google.com/p/schemafree/

    We also use MySQL as storage, stream objects inside an SFEntity class (table), and we also have support for lists that are concurrently insertable. Memcached integration comes out of the box.

  • What about Berkeley DB? Doesn’t it count as a nosql option?

  • crap, i spoke too soon. nevermind!

  • I’d really like to see a response from Damien Katz or one of the other CouchDB guys. It sounds (to me) like your tests are probably missing a lot of important points that one of them could address quite well.

    Kudos for including Mongo, though. Another awesome (from what I can see) alternative DB.

  • Wouldn’t using a “consistent hashing” algorithm like ketama solve the client-side hashing issues you’ve mentioned above around Tokyo* and Redis?

    You would be able to add servers to the cluster without reshuffling all the keys around then. It would of course require a config change on all the client apps to make it actually happen though.

  • > Wouldn’t using a “consistent hashing” algorithm like ketama solve the client-side hashing issues you’ve mentioned above around Tokyo* and Redis? You would be able to add servers to the cluster without reshuffling all the keys around then.

    The devil is in the details. 🙂

    This is basically exactly what things like Dynomite or Voldemort give you (I believe both offer pluggable back-end storage). So better to write an adapter for your engine of choice than to reinvent that particular wheel.

    • BJ Clark

      Jonathan is right: if you’re going to go that far, you might as well just use Voldemort. There’s nothing that Tokyo provides that is going to be worth rolling your own consistent hashing algo instead of using an existing one.

  • Stu Hood

    > This is pretty typical of anything that you write to a disk. The more you write, the slower it goes.

    This tends to be the case for B-Trees, but there are alternative structures that don’t suffer this limitation, like the log structured merge ‘trees’ in HBase/Hypertable/Cassandra.

    • BJ Clark

      That’s a good point, Stu. Tokyo also supports a hash table format that should be good for this; however, I’m not clear how it impacts read performance.

  • I’m surprised you haven’t considered native XML databases. A lot of data spends its entire life as XML, and most data can be modeled as XML in a way that captures semantics and relationships while still allowing efficient access that is both performant and scalable. Moreover, the newest generation of XML databases are based on widely supported standards like XQuery and XPath, vs. ad hoc and proprietary access methods that seem to be used in many of the products discussed here. Check out http://developer.emc.com/xmltech for some in-depth technical content on native XML databases and related topics.

    • BJ Clark

      Our application is in Ruby (which has mostly terribly slow XML capabilities) and none of our data is in XML, so we would probably never consider XML.

  • Yeah, rolling your own would be stupid. That’s why I would use one of the existing ones 🙂

    If you combine consistent hashing with really any of the k/v stores, you can scale like crazy.

  • ryan

    Currently I’m using HBase in a production environment, both in a real-time manner backing a website, and in a more batch fashion handling a large data set of about 1200 GB.

    The bigtable/hbase model is one known to scale, and so far it is kicking ass for us, and we are using it for more and more stuff.

    HBase solves the big data problem, but I’m not sure you have that, due to the following quote:

    “Voldemort, it should be noted, is also only going to be worth using if you actually need to spread your data out over a cluster of servers.”

    If your data set size isn’t larger than a single machine, then why are you writing a blog post?

    Right now HBase scales across over a hundred machines, holds TBs of data and has a really great API and feature set that the competition doesn’t:
    – indexes and ordered keys
    – partial left key lookups
    – cluster-up table modification, multiple tables per cluster
    – row-level atomic updates
    – incrementColumnValue – good for counters or sequences
    – automated cluster recovery and failover (automatic log recovery not tied to individual machines)
    – HBase itself has no spof – master isn’t involved in read path at all
    – Can run on HDFS or KFS (via the Hadoop KFS adapter, I don’t do this however)
    – Large, active community. 45+ people RSVPed for the latest user group meeting!

  • Maglev will probably change the game if it fulfills its promises. No need for an ORM anymore. Their Smalltalk VM/OODBMS is pretty awesome (see http://discuss.joelonsoftware.com/default.asp?biz.5.594244.20 for testimonials); we’ll see how it works with Ruby…

  • rodrigob

    What about db4o? It is a NoSQL alternative.

  • Quick correction about your comment about Thrift: very long ago Boost C++ used to be difficult to install. Now you just have to do this:

    apt-get install libboost.*-dev

    (or whatever your distro variant)

    Even windows has an installer, and precompiled binaries:

    http://www.boostpro.com/download

    • BJ Clark

      My problem with boost isn’t that it’s hard to install, it’s that it takes hours to compile.

  • I just can’t let the FUD of boost go.

    1. Thrift only depends on a relatively old version of Boost (1.33.1+), which is available in binary packages on most platforms. Installing the package takes less than a minute.

    2. Thrift only needs some headers from the Boost library. NO compilation of the entire Boost libraries is required. A header-only install takes less than a minute.

  • >>If I need to write 1 billion objects a day, I’d probably use Voldemort.

    and

    >>Voldemort, it should be noted, is also only going to be worth using if you actually need to spread your data out over a cluster of servers.

    so why not S3 which “scales *insanely* well”?

  • The gist of what I got from this article is that some of the distributed DBs out now are “high-throughput, high-latency”. That’s pretty much true — for now. HBase is as fast at scans as an RDBMS now, and Streamy.com serves pages directly from it.

    The problem is that you can’t *do* anything with these databases besides high-latency calculations and low-latency gets/scans.

    I think that’s about to change with some new functionality being added to HBase that allows rapid aggregation — in real-time. Check it out:

    • i think it varies by product. things coming from a map/reduce heritage will naturally be oriented towards throughput more than latency.

      but some solutions are good for low latency — e.g. i think mongodb is.

  • mongolove

    I am loving mongodb right now, it’s awesome.

  • Keith Tomas

    Thanks for the great information. As I have been primarily studying Hadoop I am naturally curious as to why HBase and the rest of the Hadoop family were not considered.

  • CinCy

    Very informative!

    There’s a Dr. Dobb’s article that examines database scalability. It discusses NoSQL databases but also looks at tuple spaces, in-memory databases, XML, RDF and SQL databases.

    “Databases in the Cloud: Elysian Fields or Briar Patch?” is at
    http://www.ddj.com/database/218900502

  • Wow! That was a great article, got me thinking a lot. I found it looking for applications for nosql hehe 🙂

  • Mo’SQL please!

  • JPP

    Thank you. Your second-to-last paragraph nailed it for me. MySQL it is. The last week of trying to pick a suitable NoSQL solution has been interesting, but it’s been holding me back from what is important: getting it done. Cheers.
