NoSQL databases: What you should know

The relational database model has been with us for around a quarter of a century (so much time, right?), but a new class of database has emerged in the enterprise. I’m talking about NoSQL.

 

What is NoSQL?

NoSQL, also known as “non-relational” or “cloud”, is a broad class of database management systems with significant differences from the classic relational database management system (RDBMS). The stored data do not require fixed table schemas, join operations are usually avoided, and the systems typically scale horizontally.
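A quick sketch of what “no fixed table schema” means in practice: in the toy collection below (plain Java maps, not any particular product’s API), two documents with completely different fields live side by side in the same collection.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a schema-free "collection": each document is just a map of
// field names to values, so two documents need not share any fields.
public class SchemaFreeDemo {
    public static Map<String, Map<String, Object>> collection = new HashMap<>();

    public static void insert(String id, Map<String, Object> doc) {
        collection.put(id, doc);
    }

    public static Object field(String id, String key) {
        Map<String, Object> doc = collection.get(id);
        return doc == null ? null : doc.get(key);
    }

    public static void main(String[] args) {
        Map<String, Object> user = new HashMap<>();
        user.put("name", "Alice");
        user.put("email", "alice@example.com");
        insert("u1", user);

        // A completely different "shape" in the same collection.
        Map<String, Object> event = new HashMap<>();
        event.put("type", "login");
        event.put("retries", 3);
        insert("e1", event);

        System.out.println(field("u1", "name"));    // Alice
        System.out.println(field("e1", "retries")); // 3
    }
}
```

There is no schema to migrate when a new field appears; the application simply starts writing it.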

 

Architecture with NoSQL

A NoSQL database is characterized by a move away from the complexity of SQL-based servers. The logic of validation, access control, mapping queryable indexed data, correlating related data, conflict resolution, maintaining integrity constraints, and triggered procedures is moved out of the database layer. This lets NoSQL database engines focus on exceptional performance and scalability.
A key concept of NoSQL systems is to have the database focus on the task of high-performance scalable data storage, providing only low-level access to the data management layer.

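To make that separation concrete, here is a minimal sketch (with illustrative names, not any real driver API) of an integrity check that an RDBMS would enforce with constraints, but that a NoSQL architecture pushes up into the application layer:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: the store itself is a dumb key-value layer, so checks an RDBMS
// would enforce (NOT NULL, required columns, triggers) move into
// application code like this. All names here are made up for illustration.
public class AppSideValidation {
    private static final Map<String, Map<String, Object>> store = new HashMap<>();

    // The "integrity constraint" now lives in the application, not the DB.
    public static boolean put(String key, Map<String, Object> value) {
        if (key == null || key.isEmpty()) return false;          // key must be present
        if (value == null || !value.containsKey("id")) return false; // required field
        store.put(key, value);
        return true;
    }

    public static Map<String, Object> get(String key) {
        return store.get(key);
    }
}
```

The trade-off is exactly the one described above: the database layer stays simple and fast, and correctness becomes the application’s job.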
Pros & Cons

Pros

  • Improved performance – Performance metrics have shown significant improvements over relational access. For example, this benchmark compares MySQL vs Cassandra for Facebook search:
    Facebook Search
    MySQL, > 50 GB of data
    – Writes average: ~300 ms
    – Reads average: ~350 ms
    Rewritten with Cassandra, > 50 GB of data
    – Writes average: 0.12 ms
    – Reads average: 15 ms
  • Scaling – NoSQL databases are designed to expand transparently and they’re usually designed with low-cost commodity hardware in mind.
  • Big data handling – Over the last decade, the volume of data has increased massively. NoSQL systems can handle big data in the same way the biggest RDBMSs do.
  • Less DBA time – NoSQL databases are generally designed to require less management: automatic repair, automatic data distribution, and simpler data models.
  • Reduced costs – RDBMSs rely on expensive proprietary servers and storage systems, while NoSQL databases use clusters of cheap commodity servers. So the cost per gigabyte, or per transaction per second, can be many times less for NoSQL.
  • Flexible data models – NoSQL key-value stores and document databases allow the application to store virtually any structure it wants in a data element.
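The transparent-scaling point above can be illustrated with consistent hashing, the placement technique behind stores like Cassandra and Dynamo: keys are spread over a ring of commodity nodes, and adding a node only remaps a small fraction of them. The toy ring below is a sketch with made-up node names, not any product’s actual implementation:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a consistent-hashing ring: nodes own arcs of the hash space,
// and a key is stored on the first node at or after its hash position.
public class HashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        // A few virtual points per node smooth out the distribution.
        for (int i = 0; i < 3; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public String nodeFor(String key) {
        int h = hash(key);
        // First ring position at or after the key's hash, wrapping to the start.
        SortedMap<Integer, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        // Spread the JDK string hash a bit; a real system would use MD5/Murmur.
        int h = s.hashCode();
        h ^= (h >>> 16);
        return h & 0x7fffffff;
    }
}
```

When a node joins, only the keys on the arcs it takes over move; every other key stays where it was, which is what makes the expansion transparent.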

Cons

  • Maturity – RDBMS systems are stable and richly functional. But most NoSQL alternatives are in pre-production versions with many key features yet to be implemented.
  • Support – Most NoSQL systems are open source projects, with only a couple of small companies offering support for each NoSQL database.
  • Analytics & BI – NoSQL databases do not offer facilities for ad-hoc query and analysis. Commonly used business intelligence tools do not provide connectivity to NoSQL systems.
  • Administration – Although the design goal of NoSQL systems is a zero-admin solution, today they still require a lot of skill to install and a lot of effort to maintain.
  • Expertise – As NoSQL is a new paradigm, all developers are still in learning mode.

Use an RDBMS or a NoSQL Database?

It depends mainly on what you’re trying to achieve. NoSQL is certainly mature enough to use, but few applications really need to scale that massively; for most, a traditional RDBMS is sufficient. However, with internet usage becoming more ubiquitous all the time, applications that do need that scale will quite likely become more common (though probably not dominant).

NoSQL Implementations

There are currently more than 122 NoSQL databases. They can be categorized by their manner of implementation:
  • Wide column store / column families – Cassandra, Hadoop, Hypertable, Cloudata, Cloudera, Amazon SimpleDB, SciDB
  • Document store – MongoDB, CouchDB, Terrastore, ThruDB, OrientDB, RavenDB, Citrusleaf, SisoDB
  • Key Value / Tuple store – Azure Table Storage, MEMBASE, Riak, Redis, Chordless, GenieDB, Scalaris, BerkeleyDB, MemcacheDB.
  • Eventually consistent Key Value store – Amazon Dynamo, Voldemort, Dynomite, KAI
  • Graph databases – Neo4J, Infinite Graph, Sones, InfoGrid, HyperGraphDB, Trinity, etc
  • …and many others
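For the eventually consistent stores in the list, replicas may briefly disagree, and a merge rule decides which value survives. Dynamo-style systems actually use vector clocks for this; the sketch below shows the simpler last-write-wins variant, only to illustrate the idea of reconciliation:

```java
// Sketch of last-write-wins reconciliation, a simplified stand-in for the
// vector-clock versioning used by Dynamo-style stores: two replicas may
// hold different values, and merging keeps the one with the newer version.
public class LwwRegister {
    public final String value;
    public final long timestamp;   // logical or wall-clock version

    public LwwRegister(String value, long timestamp) {
        this.value = value;
        this.timestamp = timestamp;
    }

    // Replica reconciliation: keep the newer write, regardless of order.
    public static LwwRegister merge(LwwRegister a, LwwRegister b) {
        return (a.timestamp >= b.timestamp) ? a : b;
    }
}
```

The price of this simplicity is that concurrent writes can silently discard one value, which is exactly why real systems track causality with vector clocks instead of a single timestamp.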

Early Adopters of NoSQL

Social media corporations are the primary trailblazers of NoSQL implementations. The list includes:
  • Facebook
  • Twitter
  • MySpace
  • Google (Hadoop, Google App Engine)
  • Amazon (Dynamo)

Recommended Books & Papers

  • Professional NoSQL (Wiley/Wrox. 2011)
  • NoSQL Database Technology (CouchBase. 2011)
  • NoSQL Handbook (Mathias Meyer)
  • No Relation: The Mixed Blessings of Non-Relational Databases (Ian Thomas Varley. 2009)
  • Cassandra: The Definitive Guide (Eben Hewitt. 2010)
  • CouchDB: The Definitive Guide: Time to Relax (J. Chris Anderson, Jan Lehnardt, Noah Slater. 2010)
  • Hadoop in Action (Chuck Lam. 2010)
  • HBase: The Definitive Guide (Lars George. 2011)
  • MongoDB in Action (Kyle Banker. 2010)
  • Beginning SimpleDB (Kevin Marshall, Tyler Freeling. 2009)

Conclusions

NoSQL databases solve problems born of the global digital data growth, problems that DBAs have until now handled with the well-known RDBMS.
Outside of scalability, it really seems that NoSQL databases do not have a killer feature.
Even so, I think this is a new opportunity to become a professional in this paradigm.


New Oracle Lucene Domain Index release based on Lucene 3.0.2

Just a few words to announce a new release of Oracle Lucene Domain Index. This zip is valid for the 10g and 11g database versions (10g uses classes back-ported from 1.5 to 1.4).
This release is compiled against Lucene 3.0.2 and incorporates a set of new features. Here is the list:

  • Added a long-awaited piece of functionality: a parallel/shared/slave search process used during start-fetch-close and the CountHits function
  • Added the lfreqterms ancillary operator, returning the freq terms array of the rows visited
  • Added the lsimilarity ancillary operator, returning a computed Levenshtein distance for the row visited
  • Added an ldidyoumean pipeline table function using DidYouMean.indexDictionary storage
  • Added tests using SQLUnit

The biggest addition is the parallel/shared/slave search process. This architectural change had been on my to-do list for a long time, and I finally added it in this release :)
The idea behind it is to have a new Oracle process, started by the DBMS_SCHEDULER subsystem during database startup and stopped immediately before shutdown.
This process is responsible for implementing the ODCI methods start-fetch-close/count-hit on behalf of the client process (the process associated with a specific user session), which connects to the shared-slave process using RMI.
With this new architecture we have two principal benefits:

  • Reduced memory consumption
  • Increased Lucene cache hits

Memory consumption is lower because the internal OJVM implementation is attached to a client session, so the Java space used by Lucene structures is isolated and independent from other concurrent sessions; now all Lucene memory structures used during the index scan process are created in one shared process and are not replicated per session.
Also, if one session submits a Lucene search, that search is cached for subsequent queries; any subsequent query from the same client session, or from any other session, that is associated with the same index and the same query string results in a hit.
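A minimal sketch of that caching behavior (class and method names are illustrative, not the actual Lucene Domain Index code): results are keyed by index name plus query string, so a repeated query is a hit regardless of which session submits it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a shared-process query cache: one cache serves all sessions,
// keyed by (index name, query string). Names here are illustrative only.
public class SharedQueryCache {
    private final Map<String, List<Long>> cache = new HashMap<>();
    public int hits = 0;
    public int misses = 0;

    public List<Long> search(String indexName, String query) {
        String key = indexName + "|" + query;
        List<Long> rowIds = cache.get(key);
        if (rowIds != null) { hits++; return rowIds; }
        misses++;
        rowIds = runLuceneSearch(indexName, query);   // stand-in for the real index scan
        cache.put(key, rowIds);
        return rowIds;
    }

    // Placeholder for the actual Lucene search done in the shared slave process.
    private List<Long> runLuceneSearch(String indexName, String query) {
        return java.util.Arrays.asList(1L, 2L, 3L);
    }
}
```

Because the cache lives in the shared process rather than per session, the second session to run a popular query pays nothing for the index scan.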
I’ll explain this new architecture in more detail in another post, also showing how many parallel processes can work together when using parallel indexing and searching.
On another note, next week I’ll be at Oracle OpenWorld 2010 in San Francisco presenting this session:

Schedule: Tuesday: 09:30AM
Session ID: S315660
Title: Database Applications Lifecycle Management
Event: JavaOne and Oracle Develop
Stream(s): ORACLE DEVELOP
Track(s): Database Development
Abstract: Complex applications, such as Java running inside the database, require application lifecycle management to develop and deliver good code. This session will cover some best practices, tools, and experience managing and delivering code for running inside the database, including tools for debugging, automatic testing, packaging, deployment, and release management. Some of the tools presented will include Apache Maven, JUnit, log4j, Oracle JDeveloper, and others integrated into the Oracle Java Virtual Machine (JVM) development environment.

See you there, or at any of the planned networking events :)

Dealing with JDK1.5 libraries on Oracle 10g

Modern libraries are compiled with JDK 1.5, and the question is how to deal with these libraries on an Oracle 10g OJVM.
Some examples are the Lucene 3.x branch and Hadoop. The solution that I tested uses Java Retrotranslator and some complementary libraries.
I have tested this solution on the Lucene Domain Index 3.x branch with success.
As you can see in the CVS, there is a build.xml file which performs all the retro-translation steps. Here is a step-by-step explanation of the process:

  1. Load all the required libraries provided by the Retrotranslator project, which implement features not available in the JDK 1.4/1.3 runtime; this is done in the target load-retrotranslator-sys-code. This target loads these libraries into the SYS schema because they are immutable, or at least have a low probability of change: they will only change if we upgrade the Retrotranslator version. All the libraries are then compiled to native code using the NCOMP utility (target ncomp-runtime-retrotranslator-sys-code).
  2. Then we can convert the libraries compiled with JDK 1.5 (in this build.xml, the Lucene and Lucene Domain Index implementations) to a JDK 1.4 target runtime. This is done in the targets backport-code-lucene and backport-code-odi. The first target converts all the Lucene libraries, excluding JUnit and test code; these libraries require the JUnit and Retrotranslator jars as dependencies. The second target converts the Lucene Domain Index jar, which depends on the Lucene core and Oracle’s libs. The back-port operation generates a file named lucene-odi-all-${version}.jar with the Lucene and Lucene Domain Index code ready to run on a JDK 1.4 runtime.
  3. Once we have the code back-ported to a JDK 1.4 runtime, we can upload it and NCOMP it into Oracle 10g; this is done in the targets load-lucene-odi-backported-code and ncomp-lucene-all.
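For reference, here is a sketch of what the back-port step in such a build.xml might look like. The target names come from the description above, but the exact Retrotranslator main class and arguments are from memory of its CLI and should be checked against the project documentation:

```xml
<!-- Illustrative sketch only; the real targets live in the project's build.xml. -->
<target name="backport-code-odi" depends="backport-code-lucene">
  <java classname="net.sf.retrotranslator.transformer.Retrotranslator" fork="true">
    <classpath refid="retrotranslator.classpath"/>
    <arg line="-srcjar lucene-odi-${version}.jar -target 1.4
               -destjar lucene-odi-all-${version}.jar"/>
  </java>
</target>
```

The pattern is the same for the Lucene jars: one retro-translation invocation per source jar, with the Retrotranslator runtime jars on the classpath.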

And that’s all! The code works fine on my Oracle 10.2 database on Linux :) Finally, users of 11g and 10g databases can deploy the Lucene Domain Index implementation using one distribution file.