Thursday, 31 May 2012

Setting up a multi-node Cassandra cluster on a single Windows machine

In Windows Explorer, go to “C:\Windows\System32\drivers\etc”.
Copy the file called “hosts” to your desktop (or any editable location).
Open the hosts file from the desktop and add the following to the end of the file:

#cassandra nodes
127.0.0.1               127.0.0.2
127.0.0.1               127.0.0.3
127.0.0.1               127.0.0.4
127.0.0.1               127.0.0.5
127.0.0.1               127.0.0.6

Each line represents a node in your cluster-to-be. You can replace the names in the second column (127.0.0.2 to 127.0.0.6)
with any host names you like, such as node1, node2, etc.; it should still work.
Next, rename the hosts file in “C:\Windows\System32\drivers\etc” to hosts.bak, i.e.
“C:\Windows\System32\drivers\etc\hosts.bak”. You may need admin permission to do this.
Now copy the modified hosts file from your desktop (or wherever you copied it to) to
“C:\Windows\System32\drivers\etc”. If you have a web server running, you can access it by typing
any of the node addresses you entered, e.g. 127.0.0.2.
Download the latest version of Cassandra from the Cassandra download page.
Now, there are several ways you could do this part, but the method I use presents far fewer problems later on.
Create a folder at the root of your drive called cassandra, i.e. “C:\cassandra”, and in the cassandra folder create
6 folders named 1 to 6 (or however many nodes you want).
You should now have a folder structure looking like this:

C:\cassandra
            \1
            \2
            \3
            \4
            \5
            \6
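
The layout above can also be created with a short script. This is only a sketch: the function name `make_node_dirs` and the default of six nodes are illustrative choices, not anything Cassandra provides.

```python
from pathlib import Path


def make_node_dirs(root, count=6):
    """Create root/1 .. root/count and return the created paths."""
    paths = []
    for n in range(1, count + 1):
        node_dir = Path(root) / str(n)
        # exist_ok lets the script be re-run without failing
        node_dir.mkdir(parents=True, exist_ok=True)
        paths.append(node_dir)
    return paths
```

Calling `make_node_dirs(r"C:\cassandra")` on the Windows machine would produce the six numbered folders.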

Extract the files from the Cassandra download once into each sub-directory,
i.e. put the Cassandra files/folders (bin, conf, interface, lib, javadoc, etc.) in each folder (1…6).
Starting with the folder named “1”, open the cassandra.yaml file for editing, i.e. open

"C:\cassandra\1\conf\cassandra.yaml"

Give the cluster a meaningful name:

cluster_name: 'Awesomeness'

We want to make the first and second nodes seed nodes, so ensure that auto_bootstrap is
false for these two but true for the others:

auto_bootstrap: false

Provide the seed nodes for each cassandra.yaml file as 127.0.0.1 and 127.0.0.2

seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring.  You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "127.0.0.1,127.0.0.2"

IMPORTANT: You need to ensure that the following values are changed for each node:
# directories where Cassandra should store data on disk.

data_file_directories:
    - /cassandra/1/var/lib/cassandra/data

# commit log
commitlog_directory: /cassandra/1/var/lib/cassandra/commitlog

# saved caches
saved_caches_directory: /cassandra/1/var/lib/cassandra/saved_caches

Notice the “/cassandra/1/var/lib/” prefix: for node 2 this path changes to “/cassandra/2/var/lib/”, and so on. It's important to tell each node to
use a different location to store its data.
Finally, make the following changes to the first file:

listen_address: 127.0.0.1
rpc_address: 127.0.0.1

Once that’s done, save the cassandra.yaml file. Copy the edited file into the conf folder of each of the other nodes
(making any other tweaks you like to the configuration).
Edit each cassandra.yaml file as described above, making sure you change the paths and the host address as well as the auto_bootstrap
option.
So in the end your configuration will be similar to:

Node 1 = 127.0.0.1 
Node 2 = 127.0.0.2
Node 3 = 127.0.0.3
Node 4 = 127.0.0.4
Node 5 = 127.0.0.5
Node 6 = 127.0.0.6
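
The per-node differences described above follow a simple pattern, so they can be written down as a function of the node number. The following is a minimal sketch that only computes the values; it does not read or write cassandra.yaml, and the names `node_settings` and `SEED_NODES` are made up for illustration.

```python
SEED_NODES = (1, 2)  # the first two nodes are the seeds in this walkthrough


def node_settings(n):
    """Return the cassandra.yaml values that differ for node number n."""
    base = "/cassandra/%d/var/lib/cassandra" % n
    address = "127.0.0.%d" % n
    return {
        "data_file_directories": [base + "/data"],
        "commitlog_directory": base + "/commitlog",
        "saved_caches_directory": base + "/saved_caches",
        "listen_address": address,
        "rpc_address": address,
        # seeds do not bootstrap; every other node does
        "auto_bootstrap": n not in SEED_NODES,
    }


for n in range(1, 7):
    s = node_settings(n)
    print(n, s["listen_address"], s["auto_bootstrap"], s["commitlog_directory"])
```

Printing the values for all six nodes makes it easy to check each edited file against what it should contain.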

IMPORTANT: Go into each node’s bin directory and edit the cassandra.bat file. For each node, find the line that says “-Dcom.sun.management.jmxremote.port=7199^” and change 7199 to a unique number. This is the JMX port that allows you to connect to your cluster using nodetool or JConsole. Each node needs its own port, so every bat file must be edited and given a unique port.
Once you have edited cassandra.bat, go to each of your bin folders and double-click cassandra.bat to start each node.
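
One way to keep the JMX ports unique is to derive each one from the node number. The sketch below rewrites the port inside the text of a cassandra.bat file. The `7199 + (n - 1)` numbering scheme and the function name `set_jmx_port` are assumptions for illustration; any unused, unique port will do.

```python
import re

DEFAULT_JMX_PORT = 7199


def set_jmx_port(bat_text, node_number):
    """Return bat_text with the JMX port replaced by a per-node port."""
    port = DEFAULT_JMX_PORT + (node_number - 1)  # assumed numbering scheme
    return re.sub(
        r"(-Dcom\.sun\.management\.jmxremote\.port=)\d+",
        lambda m: m.group(1) + str(port),
        bat_text,
    )


line = " -Dcom.sun.management.jmxremote.port=7199^"
print(set_jmx_port(line, 3))
```

With this scheme node 1 keeps 7199 and node 3 gets 7201.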

Using Cassandra in Ubuntu

1. First upgrade your software as is with the following two commands (just for good measure):

sudo apt-get update
sudo apt-get upgrade

2. Now, open up your Debian package sources list with Nano for editing using the following command:

sudo nano /etc/apt/sources.list

3. Next, add the following sources to your /etc/apt/sources.list file.

deb http://www.apache.org/dist/incubator/cassandra/debian unstable main
deb-src http://www.apache.org/dist/incubator/cassandra/debian unstable main

After you add these two lines, press Ctrl+X to close Nano. It’ll ask “Save modified buffer?” Press Y. Press Enter when Nano asks “File Name to Write.”
4. Refresh the package lists so that the new Cassandra source is picked up:

sudo apt-get update

5. ERROR! At this point you receive an error similar to this:

W: GPG error: http://www.apache.org unstable Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY F758CE318D77295D

6. Use the following three commands to import the missing public key and refresh the package lists again:
NOTE: You must replace the key value ‘F758CE318D77295D’ with the key value you received in your error message.

gpg --keyserver wwwkeys.eu.pgp.net --recv-keys F758CE318D77295D
sudo apt-key add ~/.gnupg/pubring.gpg
sudo apt-get update
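
Since the key value has to be copied out of the error message, here is a small sketch that pulls it out with a regular expression. It assumes the key ID is the 16 hexadecimal characters that follow NO_PUBKEY, as in the message above; the function name `missing_keys` is illustrative.

```python
import re


def missing_keys(apt_output):
    """Return the key IDs that follow NO_PUBKEY in apt-get output."""
    return re.findall(r"NO_PUBKEY\s+([0-9A-Fa-f]{16})", apt_output)


message = (
    "W: GPG error: http://www.apache.org unstable Release: The following "
    "signatures couldn't be verified because the public key is not "
    "available: NO_PUBKEY F758CE318D77295D"
)
print(missing_keys(message))
```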

7. Install Cassandra:

sudo apt-get install cassandra

8. Next you need to change Cassandra’s default JMX port number from 8080 to something else, because port 8080 is commonly used by other services (web servers and proxies, for example). Use Nano to open up the Cassandra configuration file using the following command:

sudo nano /usr/share/cassandra/cassandra.in.sh

9. Then change the port number 8080 on the following line to 10036, and save the file:

-Dcom.sun.management.jmxremote.port=10036 \

10. Start Cassandra with the command:

/etc/init.d/cassandra start

Once you have Cassandra running, test it with Cassandra’s command line tool CLI.

Starting the CLI

You can start the CLI using the bin/cassandra-cli script in your Cassandra installation (bin\cassandra-cli.bat on Windows). If you are evaluating a local Cassandra node, be sure that it has been correctly configured and successfully started before starting the CLI.
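
A quick way to check that a node is actually listening before launching the CLI is to try opening a TCP connection to its client port. This is a generic sketch using only the standard library; `port_open` is an illustrative name, and a successful connection shows only that something is listening, not that it is a healthy Cassandra node.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For a local node, `port_open("localhost", 9160)` should return True once Cassandra has started (9160 is the port used in the connect command in this post).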

If successful you will see output similar to this:

Welcome to cassandra CLI.

Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.

You must then specify a system to connect to:

connect localhost/9160;

Creating a Keyspace

We first create a keyspace to run our examples in.

create keyspace Twissandra;

Selecting the keyspace to use

We must then select our example keyspace as our new context before we can run any queries.

use Twissandra;

To Create A Column Family

We can then create a column family to play with.

create column family User with comparator = UTF8Type;

For the later examples to work you must also update the schema using the following command. This will set the return type for the first and last name to make them human readable. It will also add an index for the age field so that you can filter your gets on the User's age field.

update column family User with
        column_metadata =
        [
        {column_name: first, validation_class: UTF8Type},
        {column_name: last, validation_class: UTF8Type},
        {column_name: age, validation_class: UTF8Type, index_type: KEYS}
        ];

To Add Data

To add data to our new column family we must first specify our default key type; otherwise we would have to specify it for each key using the format [utf8('keyname')]. That is probably advisable if you have mixed key types, but it makes simple cases harder to read.

So we run the command below, which lasts for the length of your CLI session. On quitting and restarting we must run it again.

assume User keys as utf8;

and then we add our data.

set User['jsmith']['first'] = 'John';
set User['jsmith']['last'] = 'Smith';
set User['jsmith']['age'] = '38';

If you get an error like “cannot parse 'John' as hex bytes”, then it is likely that you either haven't set your default key type or haven't updated your schema as in the create column family example.

The set command uses API#insert

To Update Data

If we need to update a value we simply set it again.

set User['jsmith']['first'] = 'Jack';

To Get Data

Now let's read back the jsmith row to see what it contains:

get User['jsmith'];

The get command uses API#get_slice

To Query Data

get User where age = '12';

For help

help;

To Quit

quit;

To Execute Script

bin/cassandra-cli -host localhost -port 9160 -f script.txt
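
A script file is just the same statements you would type at the prompt, one after another. As a rough sketch, the statements used earlier could be generated like this; `set_statements` is a made-up helper, and it does no escaping, so it assumes names and values contain no single quotes.

```python
def set_statements(column_family, row_key, columns):
    """Build one CLI 'set' statement per column, in the order given."""
    return [
        "set %s['%s']['%s'] = '%s';" % (column_family, row_key, name, value)
        for name, value in columns
    ]


lines = ["use Twissandra;", "assume User keys as utf8;"]
lines += set_statements("User", "jsmith",
                        [("first", "John"), ("last", "Smith"), ("age", "38")])
print("\n".join(lines))
```

Saving the printed output as script.txt gives a file that can be passed to the -f option.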

Getting Started Using the Cassandra CLI

The Cassandra CLI client utility can be used to do basic data definition (DDL) and data manipulation (DML) within a Cassandra cluster. It is located in /usr/bin/cassandra-cli in packaged installations or <install_location>/bin/cassandra-cli in binary installations.
To start the CLI and connect to a particular Cassandra instance, launch the script together with -host and -port options. It will connect to the cluster name specified in the cassandra.yaml file (which is Test Cluster by default). For example, if you have a single-node cluster on localhost:

$ cassandra-cli -host localhost -port 9160

Or to connect to a node in a multi-node cluster, give the IP address of the node:

$ cassandra-cli -host 110.123.4.5 -port 9160

To see help on the various commands available:

[default@unknown] help;

For detailed help on a specific command, use help <command>;. For example:

[default@unknown] help SET;

Note

A command is not sent to the server unless it is terminated by a semicolon (;). Hitting the return key without a semicolon at the end of the line echoes an ellipsis ( . . . ), which indicates that the CLI expects more input.

Creating a Keyspace

You can use the Cassandra CLI commands described in this section to create a keyspace. In this example, we create a keyspace called demo, with a replication factor of 1 and using the SimpleStrategy replica placement strategy.
Note the single quotes around the string value of placement_strategy:

[default@unknown] CREATE KEYSPACE demo
with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
and strategy_options = [{replication_factor:1}];

You can verify the creation of a keyspace with the SHOW KEYSPACES command. The new keyspace is listed along with the system keyspace and any other existing keyspaces.

Creating a Column Family

First, connect to the keyspace where you want to define the column family with the USE command.

[default@unknown] USE demo;

In this example, we create a users column family in the demo keyspace. In this column family we are defining a few columns; full_name, email, state, gender, and birth_year. This is considered a static column family - we are defining the column names up front and most rows are expected to have more-or-less the same columns.
Notice the settings of comparator, key_validation_class and validation_class. These are setting the default encoding used for column names, row key values and column values. In the case of column names, the comparator also determines the sort order.

[default@unknown] USE demo;

[default@demo] CREATE COLUMN FAMILY users
WITH comparator = UTF8Type
AND key_validation_class=UTF8Type
AND column_metadata = [
{column_name: full_name, validation_class: UTF8Type},
{column_name: email, validation_class: UTF8Type},
{column_name: state, validation_class: UTF8Type},
{column_name: gender, validation_class: UTF8Type},
{column_name: birth_year, validation_class: LongType}
];

Next, create a dynamic column family called blog_entry. Notice that here we do not specify column definitions as the column names are expected to be supplied later by the client application.

[default@demo] CREATE COLUMN FAMILY blog_entry
WITH comparator = TimeUUIDType
AND key_validation_class=UTF8Type
AND default_validation_class = UTF8Type;

Creating a Counter Column Family

A counter column family contains counter columns. A counter column is a specific kind of column whose user-visible value is a 64-bit signed integer that can be incremented (or decremented) by a client application. The counter column tracks the most recent value (or count) of all updates made to it. A counter column cannot be mixed in with regular columns of a column family; you must create a column family specifically to hold counters.
To create a column family that holds counter columns, set the default_validation_class of the column family to CounterColumnType. For example:

[default@demo] CREATE COLUMN FAMILY page_view_counts
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

To insert a row and counter column into the column family (with the initial counter value set to 0):

[default@demo] INCR page_view_counts['www.datastax.com'][home] BY 0;

To increment the counter:

[default@demo] INCR page_view_counts['www.datastax.com'][home] BY 1;

Inserting Rows and Columns

The following examples illustrate using the SET command to insert columns for a particular row key into the users column family. In this example, the row key is bobbyjo and we are setting each of the columns for this user. Notice that you can only set one column at a time in a SET command.

[default@demo] SET users['bobbyjo']['full_name']='Robert Jones';

[default@demo] SET users['bobbyjo']['email']='bobjones@gmail.com';

[default@demo] SET users['bobbyjo']['state']='TX';

[default@demo] SET users['bobbyjo']['gender']='M';

[default@demo] SET users['bobbyjo']['birth_year']='1975';

In this example, the row key is yomama and we are just setting some of the columns for this user.

[default@demo] SET users['yomama']['full_name']='Cathy Smith';

[default@demo] SET users['yomama']['state']='CA';

[default@demo] SET users['yomama']['gender']='F';

[default@demo] SET users['yomama']['birth_year']='1969';

In this example, we are creating an entry in the blog_entry column family for row key yomama:

[default@demo] SET blog_entry['yomama'][timeuuid()] = 'I love my new shoes!';

Note

The Cassandra CLI uses a default consistency level of ONE for all write and read operations. Specifying different consistency levels is not supported within Cassandra CLI.

Reading Rows and Columns

Use the GET command within Cassandra CLI to retrieve a particular row from a column family. Use the LIST command to return a batch of rows and their associated columns (default limit of rows returned is 100).
For example, to return the first 100 rows (and all associated columns) from the users column family:

[default@demo] LIST users;

Cassandra stores all data internally as hex byte arrays by default. If you do not specify a default row key validation class, column comparator and column validation class when you define the column family, Cassandra CLI will expect input data for row keys, column names, and column values to be in hex format (and data will be returned in hex format).
To pass and return data in human-readable format, you can pass a value through an encoding function. Available encodings are:

ascii
bytes
integer (a generic variable-length integer type)
lexicalUUID
long
utf8

For example, to return a particular row key and column in UTF8 format:

[default@demo] GET users[utf8('bobbyjo')][utf8('full_name')];

You can also use the ASSUME command to specify the encoding in which column family data should be returned for the entire client session. For example, to return row keys, column names, and column values in ASCII-encoded format:

[default@demo] ASSUME users KEYS AS ascii;
[default@demo] ASSUME users COMPARATOR AS ascii;
[default@demo] ASSUME users VALIDATOR AS ascii;
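
To get a feel for what "hex format" means here, the sketch below converts a string to the hex form of its UTF-8 bytes and back. It only illustrates the encoding and does not talk to Cassandra; the function names are illustrative.

```python
def to_hex(text):
    """UTF-8 encode text and return the bytes as a hex string."""
    return text.encode("utf-8").hex()


def from_hex(hex_string):
    """Reverse of to_hex: decode a hex string back into text."""
    return bytes.fromhex(hex_string).decode("utf-8")


print(to_hex("bobbyjo"))
print(from_hex("626f6262796a6f"))
```

The first call prints 626f6262796a6f and the second turns it back into bobbyjo.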

Setting an Expiring Column

When you set a column in Cassandra, you can optionally set an expiration time, or time-to-live (TTL) attribute for it.
For example, suppose we are tracking coupon codes for our users that expire after 10 days. We can define a coupon_code column and set an expiration time on that column:

[default@demo] SET users['bobbyjo']
[utf8('coupon_code')] = utf8('SAVE20') WITH ttl=864000;

After ten days (864,000 seconds) have elapsed since the setting of this column, its value will be marked as deleted and no longer be returned by read operations. Note, however, that the value is not actually deleted from disk until normal Cassandra compaction processes are completed.
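
TTL values are given in seconds, so it helps to compute them rather than work them out by hand. A tiny sketch (the helper name `ttl_seconds` is illustrative):

```python
from datetime import timedelta


def ttl_seconds(days):
    """Convert a number of days into a TTL value in whole seconds."""
    return int(timedelta(days=days).total_seconds())


print(ttl_seconds(10))  # 864000
```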

Indexing a Column

The CLI can be used to create secondary indexes (indexes on column values). You can add a secondary index when you create a column family or add it later using the UPDATE COLUMN FAMILY command.
For example, to add a secondary index to the birth_year column of the users column family:

[default@demo] UPDATE COLUMN FAMILY users
WITH comparator = UTF8Type
AND column_metadata = [{column_name: birth_year, validation_class: LongType, index_type: KEYS}];

Because of the secondary index created for the column birth_year, its values can be queried directly for users born in a given year as follows:

[default@demo] GET users WHERE birth_year = 1969;

Deleting Rows and Columns

The Cassandra CLI provides the DEL command to delete a row or column (or subcolumn).
For example, to delete the coupon_code column for the yomama row key in the users column family:

[default@demo] DEL users ['yomama']['coupon_code'];

[default@demo] GET users ['yomama'];

Or to delete an entire row:

[default@demo] DEL users ['yomama'];

Dropping Column Families and Keyspaces

With Cassandra CLI commands you can drop column families and keyspaces in much the same way that tables and databases are dropped in a relational database. This example shows the commands to drop our example users column family and then drop the demo keyspace altogether:

[default@demo] DROP COLUMN FAMILY users;

[default@demo] DROP KEYSPACE demo;

Getting Started with CQL

Developers can access CQL commands in a variety of ways. Drivers are available for Python, Twisted Python, and JDBC-based client programs.
For the purposes of administrators, the most direct way to run simple CQL commands is via the Python-based cqlsh command-line client.

Starting the CQL Command-Line Program (cqlsh)

As of Apache Cassandra version 1.0.5 and DataStax Community version 1.0.1, the cqlsh client is installed with Cassandra in <install_location>/bin/cqlsh for tarball installations, or /usr/bin/cqlsh for packaged installations.
When you start cqlsh, you must provide the IP of a Cassandra node to connect to (default is localhost) and the RPC connection port (default is 9160). For example:

$ cqlsh 103.263.89.126 9160
cqlsh>

To exit cqlsh type exit at the command prompt.

cqlsh> exit

Running CQL Commands with cqlsh

Commands in cqlsh combine SQL-like syntax that maps to Cassandra concepts and operations. If you are just getting started with CQL, make sure to refer to the CQL Reference.
As of CQL version 2.0, cqlsh has the following limitations in support for Cassandra operations and data objects:

Super Columns are not supported; column_type and subcomparator arguments are not valid
Composite columns are not supported
Only a subset of all the available column family storage properties can be set using CQL.

The rest of this section provides some guidance with simple CQL commands using cqlsh. This is a similar (but not identical) set of commands as the set described in Using the Cassandra Client.

Creating a Keyspace

You can use the cqlsh commands described in this section to create a keyspace. In creating an example keyspace for Twissandra, we will assume a desired replication factor of 3 and implementation of the NetworkTopologyStrategy replica placement strategy. For more information on these keyspace options, see About Replication in Cassandra.
Note the single quotes around the string value of strategy_class:

cqlsh> CREATE KEYSPACE twissandra WITH
       strategy_class = 'NetworkTopologyStrategy'
       AND strategy_options:DC1 = 3;

Creating a Column Family

For this example, we use cqlsh to create a users column family in the newly created keyspace. Note the USE command to connect to the twissandra keyspace.

cqlsh> USE twissandra;

cqlsh> CREATE COLUMNFAMILY users (
 ...  KEY varchar PRIMARY KEY,
 ...  password varchar,
 ...  gender varchar,
 ...  session_token varchar,
 ...  state varchar,
 ...  birth_year bigint);

Inserting and Retrieving Columns

Though in production scenarios it is more practical to insert columns and column values programmatically, it is possible to use cqlsh for these operations. The example in this section illustrates using the INSERT and SELECT commands to insert and retrieve some columns in the users column family.
The following commands create and then get a user record for “jsmith.” The record includes a value for the password column we created when we created the column family, as well as an expiration time for the password column. Note that the user name “jsmith” is the row key, or in CQL terms, the primary key.

cqlsh> INSERT INTO users (KEY, password) VALUES ('jsmith', 'ch@ngem3a') USING TTL 86400;
cqlsh> SELECT * FROM users WHERE KEY='jsmith';
u'jsmith' | u'password',u'ch@ngem3a' | u'ttl', 86400

Adding Columns with ALTER COLUMNFAMILY

The ALTER COLUMNFAMILY command lets you add new columns to a column family. For example, to add a coupon_code column with the varchar validation type to the users column family:

cqlsh> ALTER TABLE users ADD coupon_code varchar;

This creates the column metadata and adds the column to the column family schema, but does not update any existing rows.

Altering Column Metadata

With ALTER COLUMNFAMILY, you can change the type of a column any time after it is defined or added to a column family. For example, if we decided the coupon_code column should store coupon codes in the form of integers, we could change the validation type as follows:

cqlsh> ALTER TABLE users ALTER coupon_code TYPE int;

Note that existing coupon codes will not be validated against the new type, only newly inserted values.

Specifying Column Expiration with TTL

Both the INSERT and UPDATE commands support setting a column expiration time (TTL). In the INSERT example above for the key jsmith we set the password column to expire at 86400 seconds, or one day. If we wanted to extend the expiration period to five days, we could use the UPDATE command as shown:

cqlsh> UPDATE users USING TTL 432000 SET 'password' = 'ch@ngem3a' WHERE KEY = 'jsmith';

Dropping Column Metadata

If your aim is to remove a column’s metadata entirely, including the column name and validation type, you can use ALTER TABLE <columnFamily> DROP <column>. The following command removes the name and validator without affecting or deleting any existing data:

cqlsh> ALTER TABLE users DROP coupon_code;

After you run this command, clients can still add new columns named coupon_code to the users column family – but they will not be validated until you explicitly add a type again.

Indexing a Column

cqlsh can be used to create secondary indexes, or indexes on column values. In this example, we will create an index on the state and birth_year columns in the users column family.

cqlsh> CREATE INDEX state_key ON users (state);
cqlsh> CREATE INDEX birth_year_key ON users (birth_year);

Because of the secondary index created for the two columns, their values can be queried directly as follows:

cqlsh> SELECT * FROM users
 ... WHERE gender='f' AND
 ...  state='TX' AND
 ...  birth_year='1968';
u'user1' | u'birth_year',1968 | u'gender',u'f' | u'password',u'ch@ngem3' | u'state',u'TX'

Deleting Columns and Rows

cqlsh provides the DELETE command to delete a column or row. In this example we will delete user jsmith’s session token column, and then delete jsmith’s row entirely.

cqlsh> DELETE session_token FROM users where KEY = 'jsmith';
cqlsh> DELETE FROM users where KEY = 'jsmith';

Note, however, that the phenomenon called “range ghosts” in Cassandra may mean that keys for deleted rows are still retrieved by SELECT statements and other “get” operations. Deleted values, including range ghosts, are removed completely by the first compaction following deletion.

Dropping Column Families and Keyspaces

With cqlsh commands you can drop column families and keyspaces in much the same way that tables and databases are dropped in relational models. This example shows the commands to drop our example users column family and then drop the twissandra keyspace altogether:

cqlsh> DROP COLUMNFAMILY users;
cqlsh> DROP KEYSPACE twissandra;

Install Cassandra on Ubuntu

Installing the JRE on Debian or Ubuntu Systems

The Oracle Java Runtime Environment (JRE) has been removed from the official software repositories of Ubuntu, and Oracle only provides it as a binary (.bin) download. You can get the JRE from the Java SE Downloads page.

Download the appropriate version of the JRE, such as jre-6u31-linux-i586.bin, for your system and unpack it directly under /opt/java/<32 or 64>.

Make the file executable:

sudo chmod 755 /opt/java/32/jre-6u31-linux-i586.bin

Go to the new folder:
```
cd /opt/java/32
```
Execute the file:
```
sudo ./jre-6u31-linux-i586.bin
```
If needed, accept the license terms to continue installing the JRE.

Tell the system that there’s a new Java version available:

sudo update-alternatives --install "/usr/bin/java" "java" "/opt/java/32/jre1.6.0_31/bin/java" 1

Note

If updating from a previous version that was removed manually, execute the above command twice, because you’ll get an error message the first time.

Set the new JRE as the default:

sudo update-alternatives --set java /opt/java/32/jre1.6.0_31/bin/java

Make sure your system is now using the correct JRE:

$ sudo java -version

java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b04, mixed mode)

Installing Cassandra Debian Packages

DataStax provides a debian package repository for Apache Cassandra.
These instructions assume that you have the aptitude package management application installed, and that you have root access on the machine where you are installing.

Note

By downloading community software from DataStax you agree to the terms of the DataStax Community EULA (End User License Agreement) posted on the DataStax web site.

Edit the aptitude repository source list file (/etc/apt/sources.list).
```
$ sudo vi /etc/apt/sources.list
```

In this file, add the DataStax Community repository.

deb http://debian.datastax.com/community stable main

(Debian Systems Only) Find the line that describes your source repository for Debian and add contrib non-free to the end of the line. This allows installation of the Oracle JVM instead of the OpenJDK JVM. For example:
```
deb http://some.debian.mirror/debian/ $distro main contrib non-free
```

Save and close the file when you are done adding/editing your sources.

Add the DataStax repository key to your aptitude trusted keys.

$ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -

If needed, install the Python CQL driver.

sudo apt-get install python-cql=1.0.10-1

Install the package.
```
$ sudo apt-get update
$ sudo apt-get install cassandra=1.0.10 dsc
```
This installs the Cassandra, DataStax Community demos, and OpsCenter packages. By default, the Debian packages start the Cassandra service automatically.
To stop the service and clear the initial gossip history that gets populated by this initial start:
```
$ sudo service cassandra stop
$ sudo bash -c 'rm /var/lib/cassandra/data/system/*'
```

Configuring and Starting a Cassandra Cluster

The process for initializing a Cassandra cluster (be it a single node, multiple node, or multiple data center cluster) is to first correctly configure the Node and Cluster Initialization Properties in each node’s cassandra.yaml configuration file, and then start each node individually starting with the seed node(s).
For more guidance on choosing the right configuration properties for your needs, see Choosing Node Configuration Options.

Initializing a Multi-Node or Multi-Data Center Cluster

To correctly configure a multi-node or multi-data center cluster, you must determine the following information:

A name for your cluster.
How many total nodes your cluster will have, and how many nodes per data center (or replication group).
The IP addresses of each node.
The token for each node (see Calculating Tokens).
If you are deploying a multi-data center cluster, make sure to assign tokens so that data is evenly distributed within each data center or replication grouping (see Calculating Tokens for a Multi-Data Center Cluster).
Which nodes will serve as the seed nodes.
If you are deploying a multi-data center cluster, the seed list (a comma-delimited list of addresses) should include a node from each data center or replication group. Cassandra nodes use this host list to find each other and learn the topology of the ring.
The snitch you plan to use.

This information is used to configure the Node and Cluster Initialization Properties in the cassandra.yaml configuration file on each node in the cluster. Each node should be correctly configured before starting up the cluster, one node at a time (starting with the seed nodes).
For example, suppose you are configuring a 6 node cluster spanning 2 racks in a single data center. The nodes have the following IPs, and one node per rack will serve as a seed:

node0 110.82.155.0 (seed1)
node1 110.82.155.1
node2 110.82.155.2
node3 110.82.156.3 (seed2)
node4 110.82.156.4
node5 110.82.156.5

The cassandra.yaml files for each node would then have the following modified property settings.
node0

cluster_name: 'MyDemoCluster'
initial_token: 0
seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
            - seeds: "110.82.155.0,110.82.156.3"
listen_address: 110.82.155.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

node1

cluster_name: 'MyDemoCluster'
initial_token: 28356863910078205288614550619314017621
seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
            - seeds: "110.82.155.0,110.82.156.3"
listen_address: 110.82.155.1
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

node2

cluster_name: 'MyDemoCluster'
initial_token: 56713727820156410577229101238628035242
seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
            - seeds: "110.82.155.0,110.82.156.3"
listen_address: 110.82.155.2
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

node3

cluster_name: 'MyDemoCluster'
initial_token: 85070591730234615865843651857942052864
seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
            - seeds: "110.82.155.0,110.82.156.3"
listen_address: 110.82.156.3
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

node4

cluster_name: 'MyDemoCluster'
initial_token: 113427455640312821154458202477256070485
seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
            - seeds: "110.82.155.0,110.82.156.3"
listen_address: 110.82.156.4
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

node5

cluster_name: 'MyDemoCluster'
initial_token: 141784319550391026443072753096570088106
seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
            - seeds: "110.82.155.0,110.82.156.3"
listen_address: 110.82.156.5
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch
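
The configurations above use RackInferringSnitch, which, as far as I know, takes the data center from the second octet of a node's IP address and the rack from the third. The sketch below applies that rule to the six node addresses listed for this example; `infer_location` is an illustrative helper, not Cassandra code.

```python
from collections import defaultdict

NODES = {
    "node0": "110.82.155.0",
    "node1": "110.82.155.1",
    "node2": "110.82.155.2",
    "node3": "110.82.156.3",
    "node4": "110.82.156.4",
    "node5": "110.82.156.5",
}


def infer_location(ip):
    """Return (data_center, rack) as the 2nd and 3rd octets of ip."""
    octets = ip.split(".")
    return octets[1], octets[2]


racks = defaultdict(list)
for name, ip in sorted(NODES.items()):
    racks[infer_location(ip)].append(name)

for (dc, rack), names in sorted(racks.items()):
    print("dc %s rack %s: %s" % (dc, rack, ", ".join(names)))
```

Under that rule the six nodes fall into one data center (82) with two racks (155 and 156), three nodes each.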

Calculating Tokens

Tokens are used to assign a range of data to a particular node. Assuming you are using the RandomPartitioner (the default partitioner), the approaches described in this section will ensure even data distribution.
Each node in the cluster should be assigned a token before it is started for the first time. The token is set with the initial_token property in the cassandra.yaml configuration file.

Calculating Tokens for Multiple Racks

If you have multiple racks in a single data center or in a multi-data center cluster, you can use the same formula for calculating the tokens. However, you should assign the tokens to nodes in alternating racks. For example: rack1, rack2, rack3, rack1, rack2, rack3, and so on. Be sure to have the same number of nodes in each rack.
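
The alternating assignment can be sketched as follows. This is only an illustration: the function name and the rack names are made up, and it assumes every rack holds the same number of nodes, as recommended above.

```python
def assign_tokens_by_rack(racks, nodes_per_rack):
    """Pair evenly spaced tokens with racks in alternating order.

    `racks` is a list of hypothetical rack names; every rack is assumed
    to hold `nodes_per_rack` nodes.
    """
    total = len(racks) * nodes_per_rack
    assignments = []
    for i in range(total):
        token = i * (2**127) // total
        rack = racks[i % len(racks)]  # rack1, rack2, rack3, rack1, ...
        assignments.append((rack, token))
    return assignments


if __name__ == "__main__":
    for rack, token in assign_tokens_by_rack(["rack1", "rack2", "rack3"], 2):
        print("%s: %d" % (rack, token))
```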

Calculating Tokens for a Single Data Center

Create a new file for your token generator program:
```
vi tokengentool
```

Paste the following Python program into this file:

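#!/usr/bin/env python3
# Print evenly spaced RandomPartitioner tokens over the range 0 .. 2**127.
import sys

if len(sys.argv) > 1:
    num = int(sys.argv[1])
else:
    num = int(input("How many nodes are in your cluster? "))
for i in range(num):
    # Integer division keeps each token an exact whole number.
    print('token %d: %d' % (i, i * (2**127) // num))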

Save and close the file and make it executable:
```
chmod +x tokengentool
```
Run the script:
```
./tokengentool
```

When prompted, enter the total number of nodes in your cluster:

How many nodes are in your cluster? 6
token 0: 0
token 1: 28356863910078205288614550619314017621
token 2: 56713727820156410577229101238628035242
token 3: 85070591730234615865843651857942052864
token 4: 113427455640312821154458202477256070485
token 5: 141784319550391026443072753096570088106

On each node, edit the cassandra.yaml file and enter its corresponding token value in the initial_token property.

Calculating Tokens for a Multi-Data Center Cluster

In multi-data center deployments, replica placement is calculated per data center using the NetworkTopologyStrategy replica placement strategy. In each data center (or replication group) the first replica for a particular row is determined by the token value assigned to a node. Additional replicas in the same data center are placed by walking the ring clockwise until it reaches the first node in another rack.
If you do not calculate partitioner tokens so that the data ranges are evenly distributed for each data center, you could end up with uneven data distribution within a data center. The goal is to ensure that the nodes for each data center are evenly dispersed around the ring, or to calculate tokens for each replication group individually (without conflicting token assignments).
One way to avoid uneven distribution is to calculate tokens for all nodes in the cluster, and then alternate the token assignments so that the nodes for each data center are evenly dispersed around the ring.
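
A minimal sketch of this alternating approach, assuming every data center has the same number of nodes (the function name and the data center names are placeholders):

```python
def alternate_tokens(datacenters, nodes_per_dc):
    """Spread each data center's nodes evenly around one shared ring.

    Tokens are calculated for all nodes in the cluster, then handed out
    to the data centers in turn: DC1, DC2, DC1, DC2, ...
    """
    total = len(datacenters) * nodes_per_dc
    ring = {dc: [] for dc in datacenters}
    for i in range(total):
        token = i * (2**127) // total
        ring[datacenters[i % len(datacenters)]].append(token)
    return ring


if __name__ == "__main__":
    for dc, tokens in alternate_tokens(["DC1", "DC2"], 3).items():
        for token in tokens:
            print("%s: %d" % (dc, token))
```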

Another way to assign tokens in a multi data center cluster is to generate tokens for the nodes in one data center, and then offset those token numbers by 1 for all nodes in the next data center, by 2 for the nodes in the next data center, and so on. This approach is good if you are adding a data center to an established cluster, or if your data centers do not have the same number of nodes.
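
The offset approach might look like this. It is a sketch under the assumptions stated in the comments; the function name and data center names are invented for the example.

```python
def offset_tokens(dc_sizes):
    """Calculate tokens per data center, offsetting each one to avoid clashes.

    `dc_sizes` maps a (placeholder) data center name to its node count.
    The first data center gets offset 0, the next 1, then 2, and so on.
    """
    ring = {}
    for offset, (dc, size) in enumerate(dc_sizes.items()):
        ring[dc] = [i * (2**127) // size + offset for i in range(size)]
    return ring


if __name__ == "__main__":
    for dc, tokens in offset_tokens({"DC1": 3, "DC2": 2}).items():
        for token in tokens:
            print("%s: %d" % (dc, token))
```

Because each data center is calculated on its own, the data centers do not need to be the same size.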

Starting and Stopping a Cassandra Node

After you have installed and configured Cassandra on all nodes, you are ready to start your cluster. On initial start-up, each node must be started one at a time, starting with your seed nodes.
Packaged installations include startup scripts for running Cassandra as a service. Binary packages do not.

Starting/Stopping Cassandra as a Stand-Alone Process

You can start the Cassandra Java server process as follows:

$ cd <install_location>
$ sh bin/cassandra -f

To stop the Cassandra process, find the Cassandra Java process ID (PID), and then kill -9 the process using its PID number. For example:

$ ps ax | grep java
$ kill -9 1539

Starting/Stopping Cassandra as a Service

Packaged installations provide startup scripts in /etc/init.d for starting Cassandra as a service. The service runs as the cassandra user. You must have root or sudo permissions to start or stop services.
To start the Cassandra service (as root):

# service cassandra start

To stop the Cassandra service (as root):

# service cassandra stop

Note

On Enterprise Linux systems, the Cassandra service runs as a java process. On Debian systems, the Cassandra service runs as a jsvc process.

# enable add-apt-repository
sudo apt-get install python-software-properties
# add repository for java
sudo add-apt-repository ppa:ferramroberto/java
# update
sudo apt-get update
# install Sun (I hate Oracle) java
sudo apt-get install sun-java6-jdk sun-java6-plugin
# create directory for installation
sudo mkdir /opt/cassandra
# add cassandra user [set password]
sudo adduser cassandra
# change owner of installation directory
sudo chown cassandra:cassandra /opt/cassandra/
# switch to cassandra user
su -l cassandra
# go to installation directory
cd /opt/cassandra
# download latest version (check address on cassandra.apache.org)
wget http://www.apache.net.pl//cassandra/1.0.7/apache-cassandra-1.0.7-bin.tar.gz
# untar
tar xvzf apache-cassandra-1.0.7-bin.tar.gz
# back to admin account, create cassandra var directory
logout
sudo mkdir /var/lib/cassandra/
sudo chown cassandra:cassandra /var/lib/cassandra/
sudo mkdir /var/log/cassandra/
sudo chown cassandra:cassandra /var/log/cassandra/
# switch again to cassandra user
su -l cassandra
mkdir /var/lib/cassandra/data
mkdir /var/lib/cassandra/commitlog
mkdir /var/lib/cassandra/saved_caches

To install Cassandra on Debian or other Debian derivatives like Ubuntu, LinuxMint…, use the following:
1- First upgrade your software :

sudo apt-get upgrade

2- Now open sources.list

sudo vi /etc/apt/sources.list

3- Add the following lines to your sources.list

deb http://www.apache.org/dist/cassandra/debian unstable main

deb-src http://www.apache.org/dist/cassandra/debian unstable main

4- Run update

sudo apt-get update

Now you will see an error similar to this:
GPG error: http://www.apache.org unstable Release: The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY F758CE318D77295D
This simply means you need to add the PUBLIC_KEY. You do that like this:
gpg --keyserver wwwkeys.eu.pgp.net --recv-keys F758CE318D77295D
gpg --export --armor F758CE318D77295D | sudo apt-key add -
5- Run update again and install cassandra

sudo apt-get update && sudo apt-get install cassandra

6- Now start Cassandra :

sudo /etc/init.d/cassandra start

Wednesday, 30 May 2012

Installing Cassandra on windows xp

First we need to download Cassandra. To download it, click this link: Download. Now unzip the file and put it on the drive where you want to install Cassandra. You can rename the folder, because the default name, "apache-cassandra-0.5.1", is too long to remember easily; I renamed my unzipped folder to "cassandra". Now we have Cassandra on a specific drive. Let's install and configure the Cassandra database on Windows XP. You also need to install Java 1.6, which is available from the Java download page.
i) First step:
First we need to add two environment variables: one for Java and one for the Cassandra home directory. The steps below show how to set environment variables on Windows XP.

Click Start, right-click My Computer and choose Properties. This opens the System Properties dialog. Now select the Advanced tab.

Now click the Environment Variables button. This opens the Environment Variables window.

Now click New in the System variables section. This opens a new window.

Set the variable name to "JAVA_HOME" and the variable value to your Java JDK path. I installed Java in my C:\program files folder, so my JDK path is "C:\program Files\java\jdk1.6.0_18"; replace this with your own Java installation path. Set CASSANDRA_HOME in the same way. My Cassandra path is "F:\cassandra", because I put the unzipped Cassandra folder on the F drive and renamed it to cassandra; replace it with your own Cassandra location.

Now click OK, then Apply, and your environment variable settings are complete. Now let's run Cassandra and connect to the database. Click Start again, click Run, type cmd in the text box of the Run window and press Enter. This opens a command prompt window.
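
If you happen to have Python installed, a quick check like the one below can confirm that both variables are visible to new command prompt windows. It is only a sketch; the function name is made up for this example.

```python
import os


def missing_variables(env, names=("JAVA_HOME", "CASSANDRA_HOME")):
    """Return the names that are unset or do not point at a directory."""
    return [n for n in names if not os.path.isdir(env.get(n, ""))]


if __name__ == "__main__":
    problems = missing_variables(os.environ)
    if problems:
        print("Check these variables: " + ", ".join(problems))
    else:
        print("JAVA_HOME and CASSANDRA_HOME both point at existing folders.")
```

Run it from a command prompt opened after the variables were saved, because windows that were already open keep the old environment.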

Now type f: and press Enter, then type cd cassandra\bin and press Enter. Here f is the drive where my cassandra folder is.

First you need to make some changes to the configuration file. To edit it, open the "storage-conf.xml" file in any text editor and write these lines under the

<ReplicationFactor>1</ReplicationFactor> tag.

<CommitLogDirectory>F:/cassandra/data/commitlog</CommitLogDirectory>
<DataFileDirectories>
<DataFileDirectory>F:/cassandra/data/data</DataFileDirectory></DataFileDirectories>
<CalloutLocation>F:/cassandra/data/callouts</CalloutLocation>
<StagingFileDirectory>F:/cassandra/data/staging</StagingFileDirectory>

You will find this configuration file in F:\cassandra\conf. Replace the directories with your own Cassandra directory. Now type cassandra.bat and press Enter. If it runs correctly, the command prompt shows the server's startup messages.

This means the Cassandra server is running correctly. If you check My Computer, you will find that an extra drive has been created with the same name as the one where your Cassandra folder is. Now type cassandra-cli.bat at the command prompt and press Enter. It shows a welcome message. Now connect to Thrift: type connect followed by the host and port, that is, connect localhost/9160, and press Enter. Here 9160 is the ThriftPort defined by default in the configuration file.

When it shows connected to localhost/9160, you are connected to Thrift. To check whether your connection is working, type show keyspaces and press Enter; it shows your default keyspace name.

To create a column family, we need to add some code to storage-conf.xml, which is found in the conf folder. Open storage-conf.xml in your favorite text editor and add the following code under the appropriate tag.

<Keyspace Name="Blog">
        <ColumnFamily Name="Post"
                    ColumnType="Super"
                    CompareWith="UTF8Type"
                    CompareSubcolumnsWith="UTF8Type"
                    Comment="A column family with supercolumns, whose column and subcolumn names are UTF8 strings"                      />
   </Keyspace>

Save it. The above configuration creates one column family (or table) called Post in a keyspace (or database) called Blog.

Tuesday, 29 May 2012

practical cassandra

http://skillsmatter.com/podcast/nosql/cassandra-meetup-march

About cassandra

Cassandra has been architected for consuming large amounts of data as fast as possible. To accomplish this, Cassandra first writes new data to a commit log to ensure it is safe. After that, the data is then written to an in-memory structure called a memtable. Cassandra deems the write successful once it is stored on both the commit log and a memtable, which provides the durability required for mission-critical systems.
Once a memtable’s memory limit is reached, all writes are then written to disk in the form of an SSTable (sorted strings table). An SSTable is immutable, meaning it is not written to ever again. If the data contained in the SSTable is modified, the data is written to Cassandra in an upsert fashion and the previous data is automatically removed.
Because SSTables are immutable and only written once the corresponding memtable is full, Cassandra avoids random seeks and instead only performs sequential IO in large batches, resulting in high write throughput.
A related factor is that Cassandra doesn’t have to do a read as part of a write (i.e. check index to see where current data is). This means that insert performance remains high as data size grows, while with b-tree based engines (e.g. MongoDB) it deteriorates.
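
The write path described above can be pictured with a toy model. This is an illustration only: the class and its names are invented for this post, and real Cassandra internals are far richer.

```python
class ToyNode:
    """Toy model of the commit log -> memtable -> SSTable write path."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # in-memory structure for recent writes
        self.sstables = []        # immutable, sorted snapshots "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. make the write durable
        self.memtable[key] = value             # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return True                            # no read was needed to write

    def flush(self):
        # Write the memtable out once, sorted by key, and never modify it again.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):    # newest SSTable wins
            for k, v in sstable:
                if k == key:
                    return v
        return None
```

Note that an update never touches an existing SSTable: the newer value simply shadows the older one at read time, which is the upsert behaviour mentioned above.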

Cassandra is architected in a peer-to-peer fashion and uses a protocol called “gossip” to communicate with other nodes in a cluster. The gossip process runs every second to exchange information across the cluster.
Gossip only includes information about the cluster itself (e.g., up/down, joining, leaving, version, schema) and does not manage the data. Data is transferred node-to-node using a message-passing like protocol on a distinct port from what client applications connect to. The Cassandra partitioner turns a column family key into a token, the replication strategy picks the set of nodes responsible for that token (using information from the snitch) and Cassandra sends messages to those replicas with the request (read or write).
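
To make the key-to-token-to-replicas step concrete, here is a simplified sketch. It assumes an MD5-based token like the RandomPartitioner's but does not reproduce Cassandra's exact calculation, and it ignores racks and data centers, which a real snitch-aware strategy would take into account. The function names are invented for the example.

```python
import bisect
import hashlib


def token_for(key):
    """Map a row key to a token in 0 .. 2**127 - 1 using MD5 (a simplification)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2**127)


def replicas_for(key, ring, replication_factor):
    """Pick replicas by walking the ring clockwise from the key's token.

    `ring` is a sorted list of (token, node_name) pairs.
    """
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping past the end.
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


if __name__ == "__main__":
    ring = [(i * (2**127) // 6, "node%d" % i) for i in range(6)]
    print(replicas_for("some-row-key", ring, 3))
```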

Unlike relational databases, Cassandra does not offer fully ACID-compliant transactions. There are no locks or transactional dependencies when concurrently updating multiple rows or column families. But if by “transactions” you mean real-time data entry and retrieval, with durability and tunable consistency, then yes.
Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation. Nor does it roll back when a write succeeds on one replica, but fails on other replicas. It is possible in Cassandra to have a write operation report a failure to the client, but still actually persist the write to a replica.
However, this does not mean that Cassandra cannot be used as an operational or real time datastore. Data is very safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.

In Cassandra, the keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group column families together. Typically, a cluster has one keyspace per application.
Replication is controlled on a per-keyspace basis, so data that has different replication requirements should reside in different keyspaces. Keyspaces are not designed to be used as a significant map layer within the data model, only as a way to control data replication for a set of column families.

When comparing Cassandra to a relational database, the column family is similar to a table in that it is a container for columns and rows. However, a column family requires a major shift in thinking for those coming from the relational world.
In a relational database, you define tables, which have defined columns. The table defines the column names and their data types, and the client application then supplies rows conforming to that schema: each row contains the same fixed set of columns.
In Cassandra, you define column families. Column families can (and should) define metadata about the columns, but the actual columns that make up a row are determined by the client application. Each row can have a different set of columns.
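
One way to picture this is as a dictionary of dictionaries, where each row carries its own set of columns. The data and names below are made up purely for illustration.

```python
# A column family modeled as: row key -> {column name: value}.
users = {
    "alice": {"email": "alice@example.com", "city": "Dhaka"},
    "bob": {"email": "bob@example.com", "phone": "555-0100", "twitter": "@bob"},
}


def column_names(column_family, row_key):
    """Return the sorted column names stored for one row."""
    return sorted(column_family.get(row_key, {}))


if __name__ == "__main__":
    for key in users:
        print(key, column_names(users, key))
```

Both rows live in the same column family even though they do not share the same columns.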

A Cassandra column family can contain regular columns (key/value pairs) or super columns. Super columns add another level of nesting to the regular column family column structure. Super columns are comprised of a (super) column name and an ordered map of sub-columns. A super column is a way to group multiple columns based on a common lookup value.

The primary use case for super columns is to denormalize multiple rows from other column families into a single row, allowing for materialized view data retrieval.
Super columns should not be used when the number of sub-columns is expected to be a large number. During reads, all sub-columns of a super column must be deserialized to read a single sub-column, so performance of super columns is not optimal if there are a large number of sub-columns. Also, you cannot create a secondary index on a sub-column of a super column.
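
Continuing the dictionary picture, a super column adds one more level of nesting. The sketch below uses invented data and an invented function name; it only mirrors the point that the whole super column is loaded to read one sub-column.

```python
# Super column family modeled as:
# row key -> {super column name -> {sub-column name: value}}.
posts = {
    "my-blog": {
        "first-post": {"title": "Hello", "body": "Welcome to the blog"},
        "second-post": {"title": "Tokens", "body": "Notes on the ring"},
    }
}


def read_subcolumn(column_family, row_key, super_column, sub_column):
    """Fetch one sub-column; the whole super column is loaded to get it."""
    sub_columns = column_family[row_key][super_column]  # every sub-column
    return sub_columns[sub_column]


if __name__ == "__main__":
    print(read_subcolumn(posts, "my-blog", "first-post", "title"))
```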

brighthouse.ini configuration in infobright

This is my brighthouse.ini file.
################## BrightHouse configuration file ####################
# To change values, uncomment the parameter and specify desired value.
############ Critical Disk Settings ############
# Data Folder: check where you installed brighthouse data folder (directory this file is in) - it should be on a fast disk.
# CacheFolder - a place in which temporary database objects (memory cache) are stored.
# Should be on a fast drive, possibly not the same as data. Allow at least 20 GB of free space (depending on database size).
CacheFolder = /usr/local/infobright/cache
############ Critical Memory Settings ############
# Note: the **default settings** below are for 2 GB machines. When more memory is available, set it higher.
# System Memory Server Main Heap Size Server Compressed Heap Size Loader Main Heap Size
# 3GB 1200 400
# ServerMainHeapSize - Size of the main memory heap in the server process, in MB
ServerMainHeapSize=400
# ServerCompressedHeapSize - Size of the compressed memory heap in the server process, in MB.
ServerCompressedHeapSize=200
# LoaderMainHeapSize - Size of the memory heap in the loader process, in MB.
LoaderMainHeapSize=300
############ Logging Settings ############
# ControlMessages - Set to 2 to turn the control messages on. This is usually needed by Infobright to support performance investigation.
# ControlMessages = 0
############ Other Settings ############
# ClusterSize - maximum size of data files in MB [10 - 2000].
# Decreasing ClusterSize may make differential backup easier, but overall performance may decrease for large databases.
# ClusterSize = 2000
# KNFolder - Directory where the Knowledge Grid is stored.
# KNFolder = BH_RSI_Repository
# AllowMySQLQueryPath can be set to 0 to disable MySQL Query path or 1 to enable it.
# AllowMySQLQueryPath = 0