Cassandra is a column-oriented NoSQL database. NoSQL databases come in four main flavours:
- Key-value datastores
- Column-oriented datastores
- Document-oriented datastores
- Graph datastores
A columnar database is a database management system (DBMS) that stores data in columns instead of rows.
The goal of a columnar database is to write data to and read data from disk storage efficiently, in order to reduce the time it takes to return a query result.
Assume that you have the following data:
Rajesh | Menon | 45 | M |
Harish | Nair | 55 | M |
Jyoti | Pradhan | 27 | F |
In a row-oriented database, such as a typical RDBMS, the data would be stored as follows:
Rajesh,Menon,45,M,Harish,Nair,55,M,Jyoti,Pradhan,27,F
In a columnar database, the same data is stored column by column:
Rajesh,Harish,Jyoti,Menon,Nair,Pradhan,45,55,27,M,M,F
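The two layouts above can be reproduced with a short Python sketch. It is only an illustration of the ordering, not of how any particular database writes to disk; the function names `row_layout` and `column_layout` are made up for this example.

```python
# Illustrative only: flatten the same three records in row order and in column order.
records = [
    ("Rajesh", "Menon", 45, "M"),
    ("Harish", "Nair", 55, "M"),
    ("Jyoti", "Pradhan", 27, "F"),
]


def row_layout(rows):
    """Keep the fields of each record next to each other."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Keep the values of each column next to each other."""
    # zip(*rows) transposes the records into one tuple per column.
    return [value for column in zip(*rows) for value in column]


print(",".join(map(str, row_layout(records))))
print(",".join(map(str, column_layout(records))))
```

The first line printed matches the row-oriented listing and the second matches the columnar listing.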
One of the main benefits of a columnar database is that data can be highly compressed, because each column holds values of the same type. Compression, together with the contiguous column layout, allows columnar operations such as MIN, MAX, SUM, COUNT and AVG to be performed very rapidly. Another benefit is that, because a column-based DBMS is self-indexing, it uses less disk space than a relational database management system (RDBMS) containing the same data.
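To make the compression and aggregation points concrete, here is a minimal Python sketch using the sample data. Run-length encoding is chosen here only as one common columnar compression technique; the text does not say which scheme a given database uses, and `run_length_encode` is a hypothetical helper.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Each column is held as its own contiguous list.
ages = [45, 55, 27]
genders = ["M", "M", "F"]

# Repeated neighbouring values in a column compress well.
encoded = run_length_encode(genders)

# Aggregates only need to touch the one column they operate on.
stats = {
    "min": min(ages),
    "max": max(ages),
    "sum": sum(ages),
    "count": len(ages),
    "avg": sum(ages) / len(ages),
}

print(encoded)
print(stats)
```

Because the ages sit together, the aggregates never read the name or gender values at all, which is the intuition behind fast columnar scans.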
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacentres is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for de-normalization and materialized views, and powerful built-in caching.
Basic structure of data in Cassandra
- A keyspace is kind of like a schema in a relational database.
- A column family is kind of like a table in a relational database.
- By default, keys are placed in random (hashed) order. This is the recommended way of using Cassandra, because it distributes data evenly over your cluster.
- With ordered keys, you have to be careful not to create hotspots in your cluster.
- Column names are sorted in alphabetic order
- You can optionally type values
- Every value has a timestamp associated with it, which is used for conflict resolution, so make sure that the clocks on your nodes are in sync.
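Two of the points in this list, random key placement and timestamp-based conflict resolution, can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Cassandra's implementation: the node names are hypothetical, MD5 with a modulo stands in for a real partitioner and token ring, and ties on equal timestamps are broken arbitrarily in favour of the first argument.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for_key(key, nodes=NODES):
    """Pick a node from a hash of the key, so placement ignores key order."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    token = int.from_bytes(digest, "big")
    # Simplification: a real cluster maps token ranges to nodes, not a modulo.
    return nodes[token % len(nodes)]


def resolve(write_a, write_b):
    """Last write wins: keep the (value, timestamp) pair with the larger timestamp."""
    return write_a if write_a[1] >= write_b[1] else write_b


for user_id in (1744, 1745, 1746):
    print(user_id, "->", node_for_key(user_id))

# Two replicas disagree about a value; the newer timestamp is kept.
print(resolve(("smith", 100), ("doe", 200)))
```

The second function also shows why clock drift matters: if a node stamps a write with a timestamp that is too low, that write loses to an older one.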
Now start cqlsh, the CQL shell (it is included when you download Cassandra from the project site).
First, create a keyspace — a namespace of tables.
CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
Second, switch to the new keyspace:
USE mykeyspace;
Third, create a users table:
CREATE TABLE users ( user_id int PRIMARY KEY, fname text, lname text );
Now you can store data into users:
INSERT INTO users (user_id, fname, lname) VALUES (1745, 'john', 'smith');
INSERT INTO users (user_id, fname, lname) VALUES (1744, 'john', 'doe');
INSERT INTO users (user_id, fname, lname) VALUES (1746, 'john', 'smith');
Now let’s fetch the data you inserted:
SELECT * FROM users;
You should see output reflecting your new rows:
 user_id | fname | lname
---------+-------+-------
    1745 |  john | smith
    1744 |  john |   doe
    1746 |  john | smith
You can retrieve data about users whose last name is smith by creating an index, then querying the table as follows:
CREATE INDEX ON users (lname);
SELECT * FROM users WHERE lname = 'smith';
 user_id | fname | lname
---------+-------+-------
    1745 |  john | smith
    1746 |  john | smith
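As a rough mental model of what the index buys you, the Python sketch below builds a lookup from last name to primary keys over the same three rows. It is a conceptual illustration only and makes no claim about how Cassandra stores or maintains its secondary indexes; `build_index` is a hypothetical helper.

```python
from collections import defaultdict

# The users table, keyed by its primary key (user_id).
users = {
    1745: {"fname": "john", "lname": "smith"},
    1744: {"fname": "john", "lname": "doe"},
    1746: {"fname": "john", "lname": "smith"},
}


def build_index(table, column):
    """Map each value of `column` to the primary keys of the rows holding it."""
    index = defaultdict(list)
    for primary_key, row in table.items():
        index[row[column]].append(primary_key)
    return index


lname_index = build_index(users, "lname")

# Equivalent in spirit to: SELECT * FROM users WHERE lname = 'smith';
smiths = [(user_id, users[user_id]) for user_id in lname_index["smith"]]
for user_id, row in smiths:
    print(user_id, row["fname"], row["lname"])
```

Without the lookup, finding every smith would mean scanning all rows; with it, the query goes straight to the matching primary keys.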
Apache Cassandra is a very powerful tool and is well suited to high-performance, fault-tolerant database applications.