A Relational Analogy for the Cassandra Data Model

The idea here is to use Twissandra to clarify the column-oriented data model employed by Cassandra, which goes beyond the simple key/value model and is often misunderstood. It’s not hard to understand if you have a little familiarity with JSON-ish data structures, but it can be quite hard if you don’t. On the other hand, almost everybody knows the relational data model. So my idea is to take the relational data model as an analogy.

Wait! I’m not trying to make you think that one can use Cassandra in the same fashion as a relational database. Instead, I want to introduce you to this paradigm and give you some ground for further learning. Hopefully, you’ll gain some insight into potential use cases as well.

Here we go!

Let’s get everybody on the same page

Twissandra is a sample application whose aim is to demonstrate how to use Cassandra. It’s essentially a simplified Twitter clone, as you can see at http://twissandra.com.

I kindly ask you to take a look at the project README, specifically the section on Schema Layout, to get to know its data model (i.e. users, tweets, friends, etc.), because that will give you the background you need to follow the analogy below.

The Analogy

I assume that at this point you have read Twissandra’s schema layout. So now is a good time to put a conceptual model in place, in order to synthesize what you have read there.

Keyspace = {
    ColumnFamily1 = {
        RowKey1 = {
            ColumnName1 = ColumnValue,
            ColumnNameN = ColumnValue
        },
        RowKeyN = {
            ColumnName1 = ColumnValue,
            ColumnNameN = ColumnValue
        }
    },
    ColumnFamilyN = {
        RowKey1 = {
            ColumnName1 = ColumnValue,
            ColumnNameN = ColumnValue
        },
        RowKeyN = {
            ColumnName1 = ColumnValue,
            ColumnNameN = ColumnValue
        }
    }
}

What does it look like? Did you say a map of maps? Oh yeah, you’re right. Let’s walk through each piece now.

Keyspaces

Analogy: Database, schema
There: Twissandra

This is the top-level identifier of our schema. As such, we usually have one per application.

Column Families

Analogy: Table
There:

  • Users
  • Tweets
  • Friends
  • Followers
  • Timeline
  • Userline

Analogously to tables, column families are containers for rows. Each row has a key and contains a number of columns.

Columns

Analogy: Field
There:

  • Users:
    • password = string
    • email = string
  • Tweets:
    • username = string
    • body = string
  • Friends:
    • [{user.name} = timestamp]
  • Followers:
    • [{user.name} = timestamp]
  • Timeline:
    • [{timestamp} = {tweet.id}]
  • Userline:
    • [{timestamp} = {tweet.id}]

Columns are name/value pairs. Their names can be strings, numbers, etc., and they work as indexes, since the columns of a row are stored sorted by name. To put it simply, let’s take the Friends column family as an example, in which each row is keyed by a username.

Friends = {
    'hermes': {
        # friend id: timestamp of when the friendship was added
        'larry': '1267413962580791',
        'curly': '1267413990076949',
        'moe'  : '1267414008133277',
    },
}

Each row has a number of friend usernames as column names (i.e. column names can be used to store values) and timestamps as their values, so we can easily tell which users a given user has added as friends and since when.
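To make the "column names work as an index" idea a bit more concrete, here is a minimal sketch in plain Python. It is not Cassandra's API: `friends` is just an in-memory dict standing in for the column family, and `column_slice` is a hypothetical helper I made up to mimic reading a range of columns from one row in name order.

```python
from bisect import bisect_left, bisect_right

# Hypothetical in-memory stand-in for the Friends column family:
# row key -> {column name -> column value}.
friends = {
    "hermes": {
        "larry": "1267413962580791",
        "curly": "1267413990076949",
        "moe": "1267414008133277",
    },
}


def column_slice(column_family, row_key, start="", finish=None, count=100):
    """Return up to `count` (name, value) pairs of one row, ordered by column name.

    `start` and `finish` are inclusive bounds on the column name;
    finish=None means "until the end of the row".
    """
    row = column_family.get(row_key, {})
    # A real store keeps the names sorted on disk; this sketch sorts on read.
    names = sorted(row)
    lo = bisect_left(names, start)
    hi = len(names) if finish is None else bisect_right(names, finish)
    return [(name, row[name]) for name in names[lo:hi][:count]]


if __name__ == "__main__":
    # All of hermes' friends, in column-name order.
    print(column_slice(friends, "hermes"))
    # Only the friends whose username sorts at or after "l".
    print(column_slice(friends, "hermes", start="l"))
```

The point of the sketch is only that a row can be read as an ordered range of columns, without scanning anything outside that row.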

This data model breaks away from the common relational concept of selecting records through joins among many normalised tables. Here we often design our “tables” with future reads in mind, rather than the way the data is written. I mean, what is read together is stored together.
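Here is a small sketch of what "designing for reads" can look like, again in plain Python with dicts standing in for column families. It is only loosely modelled on the Twissandra schema: the function names `post_tweet` and `read_timeline` are my own, the timestamps are plain integers, and I am assuming that the author's own tweets also show up in his or her timeline.

```python
# Hypothetical in-memory column families (row key -> {column name -> value}).
tweets = {}    # tweet id -> {'username': ..., 'body': ...}
followers = {"hermes": {"larry": "1267413962580791", "moe": "1267414008133277"}}
timeline = {}  # username -> {timestamp -> tweet id}
userline = {}  # username -> {timestamp -> tweet id}


def post_tweet(tweet_id, username, body, timestamp):
    """Store the tweet once, then copy its id into every row that will be read later."""
    tweets[tweet_id] = {"username": username, "body": body}
    userline.setdefault(username, {})[timestamp] = tweet_id
    # Fan out on write: the author and each follower get a column in their own timeline row.
    for reader in [username, *followers.get(username, {})]:
        timeline.setdefault(reader, {})[timestamp] = tweet_id


def read_timeline(username, limit=10):
    """Read a single row, newest first, and fetch each tweet by key. No join involved."""
    row = timeline.get(username, {})
    newest = sorted(row, reverse=True)[:limit]
    return [tweets[row[ts]] for ts in newest]


if __name__ == "__main__":
    post_tweet("t1", "hermes", "first", 1)
    post_tweet("t2", "hermes", "second", 2)
    print(read_timeline("larry"))
```

The write does more work (one extra column per follower) so that the read can stay a single-row lookup, which is the trade-off the paragraph above describes.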

Going to the next page

Super Columns

There is also another type of column, not used in Twissandra, called a Super Column. It’s a special type of column that contains a number of regular columns, which leads us to something like a map of maps of maps.

Keyspace = {
    ColumnFamily1 = {
        RowKey1 = {
            SuperColumnName1 = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            },
            SuperColumnNameN = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            }
        },
        RowKeyN = {
            SuperColumnName1 = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            },
            SuperColumnNameN = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            }
        }
    },
    ColumnFamilyN = {
        RowKey1 = {
            SuperColumnName1 = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            },
            SuperColumnNameN = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            }
        },
        RowKeyN = {
            SuperColumnName1 = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            },
            SuperColumnNameN = {
                ColumnName1 = ColumnValue,
                ColumnNameN = ColumnValue
            }
        }
    }
}

It might be useful if we decided, for example, that we should have friend details right in the Friends column family, e.g. their long names. In this case, our Friends column family would actually be a super column family, as follows.

Friends = {
    'hermes': {
        # friend id: timestamp of when the friendship was added and his/her name
        'larry': {
            'longname': 'Larry Page',
            'since': '1267413962580791'
        },
        'curly': {
            'longname': 'Curly Howard',
            'since': '1267413990076949'
        },
        'moe': {
            'longname': 'Moe',
            'since': '1267414008133277'
        }
    }
}

This model is in line with our aforementioned philosophy of “what is read together is stored together”. It’s also worth mentioning that super column names are stored sorted by name, i.e. they’re indexed, just like regular columns are.
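As a rough illustration, reading from such a super column family could be sketched like this in plain Python. Again, this is a dict standing in for the real thing, and `get_super_column` and `list_friends` are hypothetical helpers of mine, not calls from any Cassandra client.

```python
# Hypothetical in-memory stand-in for the Friends super column family:
# row key -> {super column name -> {column name -> column value}}.
friends = {
    "hermes": {
        "larry": {"longname": "Larry Page", "since": "1267413962580791"},
        "curly": {"longname": "Curly Howard", "since": "1267413990076949"},
        "moe": {"longname": "Moe", "since": "1267414008133277"},
    }
}


def get_super_column(column_family, row_key, super_column):
    """Return all the regular columns stored under one super column of one row."""
    return dict(column_family[row_key][super_column])


def list_friends(column_family, row_key):
    """Return (username, long name) pairs, ordered by super column name."""
    row = column_family.get(row_key, {})
    return [(name, columns["longname"]) for name, columns in sorted(row.items())]


if __name__ == "__main__":
    print(get_super_column(friends, "hermes", "moe"))
    print(list_friends(friends, "hermes"))
```

Everything needed to render a friend list (username, long name, since when) comes out of a single row.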

Far more

This blog post was undoubtedly a simplified explanation of the Cassandra data model, relying on an analogy to help people who may already be familiar with the relational data model get started with it.

So now that you have a basic understanding, I’d strongly suggest you read the official explanation on Cassandra’s wiki and other good explanations, like the following:

I hope I have helped you!

klogd2: My second try to route Syslog messages to Kafka

It was really cool to play around with klogd, but I have to confess that I wanted to have more fun. So that is my aim with klogd2.

Klogd2 is essentially a new implementation of klogd, but in Java, relying on Syslog4j. As I said in klogd2’s README:

I’d want to try Syslog4j on the server side, because I know it’s a rock solid stuff and all those cool kids are using it, e.g. Graylog2.

Take a couple of minutes to have a look at it when you can. As usual, I’d really appreciate your feedback and possibly a pull request.

klogd: What about routing Syslog messages to Kafka?

Today I was searching for a way to route Syslog messages to Kafka, since Syslog is the standard bucket for logs on Unix-like operating systems and there are many legacy applications that use it and cannot be changed to use something else. Unfortunately, I didn’t find anything, so I decided to write something to try it.

Kafka is a pretty interesting high-throughput distributed messaging system from LinkedIn’s Data Team, whose aim is to serve as the foundation for LinkedIn’s activity stream and operational data processing pipeline. They have used it to handle lots of real-time data every day and have open sourced it as an Apache project. I suggest you take a look at its design and use cases.

The result of my first try is klogd.

It’s a dead simple Python program that just listens for UDP packets on port 1514 and sends them to a Kafka server. That’s it. I know, of course, that there are many things left to do, because klogd is still too naive. This is just the beginning.
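Just to give an idea of the shape of such a program, here is a minimal sketch using only Python's standard `socket` module. It is not klogd's actual code: the function names are made up for this post, and the Kafka side is deliberately left as a `publish` callable that you would have to supply yourself (e.g. something wrapping the producer of whatever Kafka client you use).

```python
import socket


def open_syslog_socket(host="0.0.0.0", port=1514):
    """Create a UDP socket bound to the address where Syslog packets will arrive."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def relay_once(sock, publish, bufsize=65535):
    """Receive a single datagram and hand its raw payload to `publish`.

    `publish` is an assumption of this sketch: any callable that takes bytes
    and ships them to Kafka. Returns the sender's address.
    """
    payload, sender = sock.recvfrom(bufsize)
    publish(payload)
    return sender


def relay_forever(sock, publish):
    """Keep relaying datagrams until the process is stopped."""
    while True:
        relay_once(sock, publish)


if __name__ == "__main__":
    # For a dry run, "publishing" is just printing the packet.
    relay_forever(open_syslog_socket(), print)
```

Keeping the Kafka producer behind a plain callable makes the UDP part easy to try on its own, without a broker running.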

Take some time to try it and give me your feedback. Better yet, fork it, hack it, and send me a pull request.

MessagePack-RPC with Clojure and the AOT compilation

A few days ago, during a sleepless night, I decided to play a little bit with MessagePack-RPC for Java. It was absolutely a fun time, but I ended up feeling that it could be better. So I decided to make it more fun and rewrote those client/server programs in Clojure.

As usual, the code is on my GitHub account:

Those are two really simple samples, far from those of the Real World, of course, but they are still good enough to try the underlying concept and have fun.

One or two interesting things there

During my experiment, I noticed some interesting things about compiling Clojure code ahead of time (AOT) and about Java interop that I thought I might share briefly here. They came up because of the way the MessagePack-RPC for Java library does data serialization and deserialization.

Here you will see the compilation of Clojure code and what it produces. If you want to run the samples, which is worth doing, you can find instructions in the project’s README.

So let’s take a look…

Getting the Clojure project’s code

Since the whole project’s code is on GitHub, this is quite a simple task:

$ git clone https://github.com/leandrosilva/msgpackrpc-sample-clojure

Important files in the project:

In this blog post we are going to focus only on server.clj and client.clj.

Inspecting the server

Now that you have the code, compile only the server code first.

$ lein compile msgpackrpc-sample.server

OK. So what did the compilation above generate?

$ cd target/classes

$ ls -la

  MathServer.class
  msgpackrpc_sample

The compilation generated:

  • MathServer class
  • And the msgpackrpc_sample directory, which is the sanitized name for the msgpackrpc-sample part of the namespace (hyphens become underscores), as you can imagine

Right. This is going to be interesting. Let’s inspect the msgpackrpc_sample directory.

$ ls -la msgpackrpc_sample/

  server$_add.class
  server$_div.class
  server$_main.class
  server$_mul.class
  server$_sub.class
  server$loading__4784__auto__.class
  server__init.class

Its content is:

  • A class to load the namespace code
  • A class to initialize the namespace
  • And a class for every function included in the namespace

Really interesting, non? But let’s go further and inspect the MathServer class now.

$ javap -c MathServer

Output:

  public class MathServer extends java.lang.Object {
      public static {};
        Code:
        ...
      public MathServer();
        Code:
        ...
      public java.lang.Object clone();
        Code:
        ...
      public int hashCode();
        Code:
        ...
      public java.lang.String toString();
        Code:
        ...
      public boolean equals(java.lang.Object);
        Code:
        ...
      public int add(int, int);
        Code:
        ...
      public int sub(int, int);
        Code:
        ...
      public int mul(int, int);
        Code:
        ...
      public double div(int, int);
        Code:
        ...
  }

I’m not sure whether you are familiar with Clojure’s gen-class macro, but it is the magic that generated the Java class above.

(gen-class
  :name MathServer
  :methods [[add [int int] int]
            [sub [int int] int]
            [mul [int int] int]
            [div [int int] double]])

Put simply, the code above defines a kind of public interface for the MathServer class, and the code below implements its methods, as follows.

(defn -add [this a b]
  (+ a b))

(defn -sub [this a b]
  (- a b))

(defn -mul [this a b]
  (* a b))

(defn -div [this a b]
  (/ a b))

This is pretty exciting, non? If you need your Clojure program/library to interoperate with any Java program/library (legacy or not) that requires a specific interface, you can design it quickly and painlessly using gen-class. Furthermore, you can design it up front, compile it AOT, and finally deliver it as a regular Java .class file. Awesome!

Are you curious about the hyphen prefix on every function?

This notation says that a given method implementation is bound to the function whose name is the method’s name with a hyphen prefix. The hyphen is the default, but it is possible to define a different prefix using the :prefix option of the gen-class macro. So if you set the :prefix option to “banana-” for the MathServer class, every function that implements a method of the MathServer class should start with “banana-”.

You can find an example here.

Inspecting the client

Just as you did for the server code, now compile the client code.

$ lein compile msgpackrpc-sample.client

OK. Since you already did this before, let’s move a little faster here (with less verbiage on my end, which is good for you :)).

$ cd target/classes

$ ls -la

  IMath.class
  MathServer.class
  msgpackrpc_sample

$ ls -la msgpackrpc_sample/

  client$_main.class
  client$loading__4784__auto__.class
  client__init.class
  ifaces$loading__4784__auto__.class
  server$_add.class
  server$_div.class
  server$_main.class
  server$_mul.class
  server$_sub.class
  server$loading__4784__auto__.class
  server__init.class

I’m pretty sure you got it, since it is close to what you saw for the server before, but I’d like to comment on it a little bit. The compilation generated:

  • The IMath class, which is actually an interface
  • Classes for the client and ifaces namespaces, pretty much as happened for the server (note that those two namespaces are declared in the same file, client.clj)

So let’s inspect the IMath interface.

$ javap -c IMath

Output:

  public interface IMath {
      public abstract int add(int, int);
      public abstract int sub(int, int);
      public abstract int mul(int, int);
      public abstract double div(int, int);
  }

This is the magic of the gen-interface macro!

(gen-interface
  :name IMath
  :methods [[add [int int] int]
            [sub [int int] int]
            [mul [int int] int]
            [div [int int] double]])

And after that “magic”, this interface is imported like any other regular Java class, as follows.

(ns msgpackrpc-sample.client
  (:import IMath)
  (:import [org.msgpack.rpc.loop EventLoop])
  (:import [org.msgpack.rpc Client]))

Wow! I love it, baby! Don’t you?

The End

There are still many other interesting things about MessagePack, MessagePack-RPC and, of course, Clojure compilation, so I hope you have become interested in them as well and will go on to learn and try more, because I definitely will.

Enjoy it!

A brief introduction to Cameron

Hi everybody!

I’ve started to write a kind of documentation for Cameron, in the spirit of a presentation, really brief and direct. It’s still a work in progress, like Cameron itself, but I think it’s already able to give you an idea of what Cameron aims to be.

Stay tuned…

A brief introduction to Cameron

Who is Cameron?

Cameron is an Erlang-based workflow engine that I have been working on for a few weeks.

It has been built as an Erlang/OTP system with a REST-like API, powered by Misultin, through which one can POST a JSON request to run a given process workflow, which will be executed in the background [in parallel with other running ones], and then GET its JSON results as soon as they become available. And obviously, it uses Redis for the win.

What about Process Workflows?

Process workflows are defined in terms of REST-like web services, written in virtually any language, which basically must speak a simple JSON contract.

These web services are the activities that define a process workflow; they are the tasks needed to achieve a given target. And as you can imagine, yes, an activity can cascade into many others; it is pipeline-based as well.

So if you have any background job that must cascade through many tasks to achieve a goal, maybe it fits your needs.
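To picture the cascading idea only, here is a tiny sketch in Python. It has nothing to do with Cameron's real code or its JSON contract: plain local functions stand in for the web services, and I am assuming, purely for illustration, that each activity returns its result together with the names of the activities it wants to trigger next.

```python
from collections import deque


def run_workflow(activities, first, payload):
    """Run `first`, then every activity it cascades into, each one at most once.

    `activities` maps a name to a callable taking the payload and returning
    (result, [names of the next activities]). This shape is hypothetical.
    """
    results = {}
    pending = deque([first])
    while pending:
        name = pending.popleft()
        if name in results:
            continue  # already ran, e.g. requested by two different activities
        result, next_names = activities[name](payload)
        results[name] = result
        pending.extend(next_names)
    return results


if __name__ == "__main__":
    activities = {
        "start": lambda p: (p["n"] + 1, ["double", "square"]),
        "double": lambda p: (p["n"] * 2, []),
        "square": lambda p: (p["n"] ** 2, ["double"]),
    }
    print(run_workflow(activities, "start", {"n": 3}))
```

In the real thing the activities are remote services and run in the background; the sketch just runs them one after another in the order they were requested.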

Does it work?

Although it is still experimental, a work in progress, it works reasonably well, at least in my tests.

So if you have time, take a look at the documentation.