A Relational Analogy for the Cassandra Data Model

The idea here is to use Twissandra to clarify the column-oriented data model employed by Cassandra, which goes beyond the simple key/value model and is often misunderstood. It’s not very hard to understand if you have a little familiarity with JSON-ish data structures, but it can be quite hard if you don’t. On the other hand, almost everybody knows the relational data model. So my idea is to use the relational data model as an analogy.

Wait! I don’t mean to suggest that you can use Cassandra in the same fashion as a relational database. Instead, I’d like to introduce you to its paradigm and give you some ground for further learning. Hopefully, you’ll gain insight into potential use cases as well.

Here we go!

Let’s get everybody on the same page

Twissandra is a sample application whose aim is to demonstrate how to use Cassandra. It’s essentially a simplified Twitter clone, as you can see at http://twissandra.com.

Please take a look at the project README, specifically the section on Schema Layout, to get to know its data model (i.e. users, tweets, friends, etc.), because it gives you the background needed for the analogy that follows.

The Analogy

I assume that at this point you have read the schema layout of Twissandra. So now is a good time to put a conceptual model in place, in order to synthesize what you have read there.

Keyspace = {
    ColumnFamily1 = {
        RowKey1 = {
            ColumnName1 = ColumnValue,
            ColumnNameN = ColumnValue
        },
        RowKeyN = {
            ColumnName1 = ColumnValue,
            ColumnNameN = ColumnValue
        }
    },
    ColumnFamilyN = {
        RowKey1 = {
            ColumnName1 = ColumnValue,
            ColumnNameN = ColumnValue
        },
        RowKeyN = {
            ColumnName1 = ColumnValue,
            ColumnNameN = ColumnValue
        }
    }
}

What does it look like? Did you say a map of maps? Oh yeah, you’re right. Let’s walk through each piece now.
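
To make the map-of-maps picture concrete, here is a minimal Python sketch that models the structure with plain nested dictionaries. It is only an in-memory illustration, not Cassandra client code; the sample values and the `get_column` helper are assumptions made for the example.

```python
# In-memory illustration only:
# keyspace -> column family -> row key -> column name -> column value.
keyspace = {
    "Users": {
        "hermes": {"password": "secret", "email": "hermes@example.com"},
    },
    "Friends": {
        "hermes": {"larry": "1267413962580791"},
    },
}


def get_column(keyspace, column_family, row_key, column_name):
    """Hypothetical helper: walk the nested maps down to one column value."""
    return keyspace[column_family][row_key][column_name]


email = get_column(keyspace, "Users", "hermes", "email")
```

Each level of the lookup corresponds to one of the pieces described in the sections that follow.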

Keyspaces

Analogy: Database, schema
There: Twissandra

This is the top-level identifier of our schema. As such, we usually have one per application.

Column Families

Analogy: Table
There:

  • Users
  • Tweets
  • Friends
  • Followers
  • Timeline
  • Userline

Analogously to tables, column families are containers for rows. Each row has a key and contains a number of columns.

Columns

Analogy: Field
There:

  • Users:
    • password = string
    • email = string
  • Tweets:
    • username = string
    • body = string
  • Friends:
    • [{user.name} = timestamp]
  • Followers:
    • [{user.name} = timestamp]
  • Timeline:
    • [{timestamp} = {tweet.id}]
  • Userline:
    • [{timestamp} = {tweet.id}]

Columns are name/value pairs. Their names can be strings, numbers, etc., and they work as an index, since columns are stored sorted by name. To keep it simple, let’s take the Friends column family as an example, in which each row is keyed by a username.

Friends = {
    'hermes': {
        # friend id: timestamp of when the friendship was added
        'larry': '1267413962580791',
        'curly': '1267413990076949',
        'moe'  : '1267414008133277',
    },
}

Each row has a number of friend usernames as column names (i.e. column names can themselves be used to store data) and timestamps as their values, so we can easily find out which users a given user follows and since when.
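
Because column names are kept in sorted order, a range of columns can be read without scanning the whole row. The sketch below imitates that idea with a sorted list and the standard `bisect` module; `slice_columns` is a hypothetical helper for illustration, not a Cassandra API, and plain string ordering is assumed for the column names.

```python
from bisect import bisect_left, bisect_right

# One row of the Friends column family (the row keyed by 'hermes').
hermes_friends = {
    "larry": "1267413962580791",
    "curly": "1267413990076949",
    "moe": "1267414008133277",
}


def slice_columns(row, start, finish):
    """Hypothetical helper: return (name, value) pairs with start <= name <= finish."""
    names = sorted(row)  # Cassandra keeps names ordered; here we sort on demand
    lo = bisect_left(names, start)
    hi = bisect_right(names, finish)
    return [(name, row[name]) for name in names[lo:hi]]


# Friends whose usernames sort between "l" and "n".
l_to_n = slice_columns(hermes_friends, "l", "n")
```

Here `l_to_n` holds the columns for `larry` and `moe`, in that order, while `curly` falls outside the requested range.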

This data model breaks away from the common relational approach of selecting records through joins across many normalised tables. Here we often design our “tables” around how the data will be read rather than how it is stored. In other words, what is read together is stored together.
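
As a sketch of designing for reads, the code below writes one tweet into several column families at once, so that a user’s timeline can later be read from a single row. It uses plain dictionaries and borrows the column family names from the Twissandra schema; the `post_tweet` and `read_timeline` functions, the sample data and the integer timestamps are assumptions made for the example, not Twissandra’s actual code.

```python
from collections import defaultdict

tweets = {}                   # Tweets: tweet id -> {username, body}
userline = defaultdict(dict)  # Userline: username -> {timestamp: tweet id}
timeline = defaultdict(dict)  # Timeline: username -> {timestamp: tweet id}
followers = {"hermes": {"larry": 1, "moe": 2}}  # Followers: username -> {follower: timestamp}


def post_tweet(tweet_id, username, body, timestamp):
    """Hypothetical write path: store the tweet once, then fan its id out."""
    tweets[tweet_id] = {"username": username, "body": body}
    userline[username][timestamp] = tweet_id
    # Each follower gets a pointer to the tweet in their own timeline row.
    for follower in followers.get(username, {}):
        timeline[follower][timestamp] = tweet_id


def read_timeline(username, limit=10):
    """Read one row, newest first, and resolve each tweet id to its body."""
    row = timeline[username]
    newest = sorted(row, reverse=True)[:limit]
    return [tweets[row[ts]]["body"] for ts in newest]


post_tweet("t1", "hermes", "hello", 100)
post_tweet("t2", "hermes", "world", 200)
larry_view = read_timeline("larry")
```

The price is a few extra writes per tweet; in exchange, reading a timeline touches one row instead of joining several tables.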

Going to the next page

Super Columns

There is also another type of column, not used in Twissandra, called a Super Column. It’s a special type of column that contains a number of regular columns, which leads us to something like a map of maps of maps.

Keyspace = {
    ColumnFamily1 = {
        RowKey1 = {
            SuperColumnName1 = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            },
            SuperColumnNameN = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            }
        },
        RowKeyN = {
            SuperColumnName1 = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            },
            SuperColumnNameN = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            }
        }
    },
    ColumnFamilyN = {
        RowKey1 = {
            SuperColumnName1 = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            },
            SuperColumnNameN = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            }
        },
        RowKeyN = {
            SuperColumnName1 = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            },
            SuperColumnNameN = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            }
        }
    }
}

This might be useful if we decided, for example, that we should keep friend details, e.g. their long names, right in the Friends column family. In this case, our Friends column family would actually be a super column family, as follows.

Friends = {
    'hermes': {
        # friend id: timestamp of when the friendship was added and his/her name
        'larry': {
            'longname': 'Larry Page',
            'since': '1267413962580791'
        },
        'curly': {
            'longname': 'Curly Howard',
            'since': '1267413990076949'
        },
        'moe': {
            'longname': 'Moe',
            'since': '1267414008133277'
        }
    }
}

This model is in line with our aforementioned philosophy of “what is read together is stored together”. It’s also worth mentioning that super columns are stored sorted by name, i.e. they’re indexed, just like regular columns are.
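
A minimal sketch of reading from such a super column family, again with nested dictionaries standing in for Cassandra. The `friends_of` helper is hypothetical and only illustrates that a single row lookup returns each friend together with their details, ordered by super column name.

```python
# Friends as a super column family: row key -> super column -> columns.
friends = {
    "hermes": {
        "larry": {"longname": "Larry Page", "since": "1267413962580791"},
        "curly": {"longname": "Curly Howard", "since": "1267413990076949"},
        "moe": {"longname": "Moe", "since": "1267414008133277"},
    }
}


def friends_of(username):
    """Hypothetical helper: list (friend, long name, since) sorted by super column name."""
    row = friends[username]
    return [(name, row[name]["longname"], row[name]["since"]) for name in sorted(row)]


hermes_friends = friends_of("hermes")
```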

Going further

This blog post is undoubtedly a simplified explanation of the Cassandra data model, relying on an analogy to help people who may already be familiar with the relational data model get started with the Cassandra data model.

So now that you have a basic understanding, I’d strongly suggest that you read the official explanation on Cassandra’s wiki, as well as other good explanations.

I hope this has helped you!

Author: Leandro Silva

I do code for a happy living.

One thought on “A Relational Analogy for the Cassandra Data Model”

  1. I’ve been really struggling with understanding the column model employed by Cassandra, but seeing it in a JSON-like format makes so much more sense.

    Thanks for a great walkthrough!
