Version: 2.2.6

Using Secondary Indexes

Note: Riak Search preferred for querying

If you're interested in non-primary-key-based querying in Riak, i.e. if you're looking to go beyond straightforward K/V operations, we now recommend Riak Search rather than secondary indexes for a variety of reasons. Most importantly, Riak Search has a far more capacious querying API and can be used with all of Riak's storage backends.

Secondary indexes (2i) in Riak enable you to tag objects stored in Riak, at write time, with one or more queryable values. Those values can then be used to find multiple objects in Riak. If you're storing user data, for example, you could tag each object associated with that user with a username or other unique marker. Once tagged, you could find all objects in a Riak bucket sharing that tag. Secondary indexes can be either a binary or string, such as sensor_1_data or admin_user or click_event, or an integer, such as 99 or 141121.

Riak Search serves analogous purposes but is quite different because it parses key/value data itself and builds indexes on the basis of Solr schemas.

Please note that 2i can be used only with the LevelDB and Memory backends.

Features

  • Allows two types of secondary attributes: integers and strings (aka binaries)
  • Allows querying by exact match or range on one index
  • Allows pagination of results
  • Allows streaming of results
  • Query results can be used as input to a MapReduce query
Note on 2i and strong consistency

Secondary indexes do not currently work with the strong consistency feature introduced in Riak version 2.0. If you store objects in strongly consistent buckets and attach secondary index metadata to those objects, you can still perform strongly consistent operations on those objects but the secondary indexes will be ignored.

When to Use Secondary Indexes

Secondary indexes are useful when you want to find data on the basis of something other than objects' bucket type, bucket, and key, i.e. when you want objects to be discoverable based on more than their location alone.

2i works best for objects whose value is stored in an opaque blob, like a binary file, because those objects don't offer any clues that enable you to discover them later. Indexing enables you to tag those objects and find all objects with the same tag in a specified bucket later on.
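
As a rough sketch of that pattern, the snippet below stores an opaque binary payload and tags it with a binary index so it can be found later by tag rather than by key. The bucket, key, and index names here are illustrative, not part of the examples that follow; the API calls are the same ones used throughout this page.

// A minimal sketch: tag an opaque binary object so it can be discovered later.
// Bucket, key, and index names are illustrative; rawBytes is an assumed byte[] payload.
Namespace blobBucket = new Namespace("default", "sensor_blobs");
Location blobKey = new Location(blobBucket, "reading-20130601-0001");

RiakObject blob = new RiakObject()
        .setContentType("application/octet-stream")
        .setValue(BinaryValue.create(rawBytes));

// The value itself is opaque, but the index metadata is queryable
blob.getIndexes().getIndex(StringBinIndex.named("sensor")).add("sensor_1_data");

StoreValue storeBlob = new StoreValue.Builder(blob)
        .withLocation(blobKey)
        .build();
client.execute(storeBlob);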

2i is thus recommended when your use case requires an easy-to-use search mechanism that does not require a schema (as does Riak Search) and a basic query interface, i.e. an interface that enables an application to tell Riak things like "fetch all objects tagged with the string Milwaukee_Bucks" or "fetch all objects tagged with numbers between 1500 and 1509."
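
The two query shapes described above look something like the following sketch in the Java client. The bucket and index names (nba_fans, team, points) are illustrative only.

// "Fetch all objects tagged with the string Milwaukee_Bucks"
Namespace fansBucket = new Namespace("default", "nba_fans");
BinIndexQuery teamQuery =
        new BinIndexQuery.Builder(fansBucket, "team", "Milwaukee_Bucks").build();
BinIndexQuery.Response teamResponse = client.execute(teamQuery);

// "Fetch all objects tagged with numbers between 1500 and 1509"
IntIndexQuery pointsQuery =
        new IntIndexQuery.Builder(fansBucket, "points", 1500L, 1509L).build();
IntIndexQuery.Response pointsResponse = client.execute(pointsQuery);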

2i is also recommended if your use case requires anti-entropy. Since secondary indexes are just metadata attached to key/value objects, 2i piggybacks off of read-repair.

When Not to Use Secondary Indexes

  • If your ring size exceeds 512 partitions, 2i can cause performance issues in large clusters.
  • When you need more than the exact match and range searches that 2i supports. If that's the case, we recommend checking out Riak Search.
  • When you want to use composite queries. A query like last_name=zezeski AND state=MD would have to be split into two queries and the results merged (or it would need to involve MapReduce); a client-side sketch of that split-and-merge approach follows this list.
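
Here is a rough sketch of the split-and-merge approach in the Java client, assuming hypothetical last_name and state binary indexes on a people bucket. The "AND" is computed in the application by intersecting the two key sets.

Namespace peopleBucket = new Namespace("default", "people");

BinIndexQuery byLastName =
        new BinIndexQuery.Builder(peopleBucket, "last_name", "zezeski").build();
BinIndexQuery byState =
        new BinIndexQuery.Builder(peopleBucket, "state", "MD").build();

// Collect the keys returned by each single-index query
Set<String> lastNameKeys = new HashSet<>();
for (BinIndexQuery.Response.Entry entry : client.execute(byLastName).getEntries()) {
    lastNameKeys.add(entry.getRiakObjectLocation().getKeyAsString());
}

Set<String> stateKeys = new HashSet<>();
for (BinIndexQuery.Response.Entry entry : client.execute(byState).getEntries()) {
    stateKeys.add(entry.getRiakObjectLocation().getKeyAsString());
}

// Keys present in both result sets satisfy the composite condition
lastNameKeys.retainAll(stateKeys);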

Query Interfaces and Examples

Typically, the result set from a 2i query is a list of object keys from the specified bucket that include the index values in question. As we'll see below, when executing range queries in Riak 1.4 or higher, it is possible to retrieve the index values along with the object keys.

Inserting Objects with Secondary Indexes

In this example, the key john_smith is used to store user data in the bucket users, which bears the default bucket type. Let's say that an application would like to add a Twitter handle and an email address to this object as secondary indexes.

Location johnSmithKey = new Location(new Namespace("default", "users"), "john_smith");

// In the Java client (and all clients), if you do not specify a bucket type,
// the client will use the default type. And so the following location would be
// equivalent to the one above:
Location johnSmithKey = new Location(new Namespace("users"), "john_smith");

RiakObject obj = new RiakObject()
        .setContentType("application/json")
        .setValue(BinaryValue.create("{'user_data':{ ... }}"));

obj.getIndexes().getIndex(StringBinIndex.named("twitter")).add("jsmith123");
obj.getIndexes().getIndex(StringBinIndex.named("email")).add("jsmith@basho.com");

StoreValue store = new StoreValue.Builder(obj)
        .withLocation(johnSmithKey)
        .build();
client.execute(store);

Getting started with Riak clients

If you are connecting to Riak using one of Basho's official client libraries, you can find more information about getting started with your client in the Developing with Riak KV: Getting Started section.

This has accomplished the following:

  • The object has been stored with a primary bucket/key of users/john_smith
  • The object now has a secondary index called twitter_bin with a value of jsmith123
  • The object now has a secondary index called email_bin with a value of jsmith@basho.com

Querying Objects with Secondary Indexes

Let's query the users bucket on the basis of Twitter handle to make sure that we can find our stored object:

Namespace usersBucket = new Namespace("users");
BinIndexQuery biq = new BinIndexQuery.Builder(usersBucket, "twitter", "jsmith123")
        .build();
BinIndexQuery.Response response = client.execute(biq);
List<BinIndexQuery.Response.Entry> entries = response.getEntries();
for (BinIndexQuery.Response.Entry entry : entries) {
    System.out.println(entry.getRiakObjectLocation().getKey());
}

The response:

john_smith

Examples

To run the following examples, make sure that Riak is configured to use an index-capable storage backend, such as LevelDB or Memory.
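
For instance, with the riak.conf-based configuration system, the backend is selected node-wide via the storage_backend setting (a sketch; each node must be restarted after the change):

## In riak.conf on each node: choose an index-capable backend.
## Use "memory" instead of "leveldb" for the Memory backend.
storage_backend = leveldb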

Indexing Objects

The following example indexes four different objects. Notice that we're storing both integer and string (aka binary) fields. Field names are automatically lowercased, some fields have multiple values, and duplicate fields are automatically de-duplicated, as in the following example:

Namespace peopleBucket = new Namespace("indexes", "people");

RiakObject larry = new RiakObject()
        .setValue(BinaryValue.create("My name is Larry"));
larry.getIndexes().getIndex(StringBinIndex.named("field1")).add("val1");
larry.getIndexes().getIndex(LongIntIndex.named("field2")).add(1001L);
StoreValue storeLarry = new StoreValue.Builder(larry)
        .withLocation(new Location(peopleBucket, "larry"))
        .build();
client.execute(storeLarry);

RiakObject moe = new RiakObject()
        .setValue(BinaryValue.create("My name is Moe"));
moe.getIndexes().getIndex(StringBinIndex.named("Field1")).add("val2");
moe.getIndexes().getIndex(LongIntIndex.named("Field2")).add(1002L);
StoreValue storeMoe = new StoreValue.Builder(moe)
        .withLocation(new Location(peopleBucket, "moe"))
        .build();
client.execute(storeMoe);

RiakObject curly = new RiakObject()
        .setValue(BinaryValue.create("My name is Curly"));
curly.getIndexes().getIndex(StringBinIndex.named("FIELD1")).add("val3");
curly.getIndexes().getIndex(LongIntIndex.named("FIELD2")).add(1003L);
StoreValue storeCurly = new StoreValue.Builder(curly)
        .withLocation(new Location(peopleBucket, "curly"))
        .build();
client.execute(storeCurly);

RiakObject veronica = new RiakObject()
        .setValue(BinaryValue.create("My name is Veronica"));
veronica.getIndexes().getIndex(StringBinIndex.named("field1"))
        .add("val4").add("val4").add("val4a").add("val4b");
veronica.getIndexes().getIndex(LongIntIndex.named("field2"))
        .add(1004L).add(1005L).add(1006L).add(1004L).add(1004L).add(1007L);
StoreValue storeVeronica = new StoreValue.Builder(veronica)
        .withLocation(new Location(peopleBucket, "veronica"))
        .build();
client.execute(storeVeronica);

The above objects will end up having the following secondary indexes, respectively:

  • Larry --- Binary index field1_bin and integer index field2_int
  • Moe --- Binary index field1_bin and integer index field2_int (note that the index names are set to lowercase by Riak)
  • Curly --- Binary index field1_bin and integer index field2_int (note again that the index names are set to lowercase)
  • Veronica --- Binary index field1_bin with the values val4, val4a, and val4b and integer index field2_int with the values 1004, 1005, 1006, and 1007 (note that redundancies have been removed)

As these examples show, there are safeguards in Riak that both normalize the names of indexes and prevent the accumulation of redundant indexes.

Invalid Field Names and Types

The following examples demonstrate what happens when an index field is specified with an invalid field name or type. The system responds with 400 Bad Request and a description of the error.

Invalid field name:

// The Java client will not allow you to provide invalid index names,
// because you are not required to add "_bin" or "_int" to the end of
// those names
Location key = new Location(new Namespace("people"), "larry");
RiakObject obj = new RiakObject();
obj.getIndexes().getIndex(LongIntIndex.named("field2")).add("bar");

// Because "field2" is an integer index, the Java compiler rejects the
// attempt to add a String value with a type mismatch error. The output
// may look something like this:

Error:(46, 68) java: no suitable method found for add(java.lang.String)
    method com.basho.riak.client.query.indexes.RiakIndex.add(java.lang.Long) is not applicable
      (argument mismatch; java.lang.String cannot be converted to java.lang.Long)
    method com.basho.riak.client.query.indexes.RiakIndex.add(java.util.Collection<java.lang.Long>) is not applicable
      (argument mismatch; java.lang.String cannot be converted to java.util.Collection<java.lang.Long>)

Querying

Note on 2i queries and the R parameter

For all 2i queries, the R parameter is set to 1, which means that queries that are run while handoffs and related operations are underway may not return all keys as expected.

Exact Match

The following examples perform an exact match index query.

Query a binary index:

Namespace myBucket = new Namespace("indexes", "people");
BinIndexQuery biq = new BinIndexQuery.Builder(myBucket, "field1", "val1").build();
BinIndexQuery.Response response = client.execute(biq);

Query an integer index:

Namespace myBucket = new Namespace("indexes", "people");
IntIndexQuery iiq = new IntIndexQuery.Builder(myBucket, "field2", 1001L)
        .build();
IntIndexQuery.Response response = client.execute(iiq);

The following example performs an exact match query and pipes the results into a MapReduce job:

curl -XPOST localhost:8098/mapred \
  -H "Content-Type: application/json" \
  -d @-<<EOF
{
  "inputs": {
    "bucket": "people",
    "index": "field1_bin",
    "key": "val3"
  },
  "query": [
    {
      "reduce": {
        "language": "erlang",
        "module": "riak_kv_mapreduce",
        "function": "reduce_identity",
        "keep": true
      }
    }
  ]
}
EOF

Range

The following examples perform a range query.

Query a binary index...

Namespace myBucket = new Namespace("indexes", "people");
BinIndexQuery biq = new BinIndexQuery.Builder(myBucket, "field1", "val2", "val4")
        .build();
BinIndexQuery.Response response = client.execute(biq);

Or query an integer index...

Namespace myBucket = new Namespace("indexes", "people");
IntIndexQuery iiq = new IntIndexQuery.Builder(myBucket, "field2", 1002L, 1004L)
        .build();
IntIndexQuery.Response response = client.execute(iiq);

The following example performs a range query and pipes the results into a MapReduce job:

curl -XPOST localhost:8098/mapred \
  -H "Content-Type: application/json" \
  -d @-<<EOF
{
  "inputs": {
    "bucket": "people",
    "index": "field2_int",
    "start": 1002,
    "end": 1004
  },
  "query": [
    {
      "reduce": {
        "language": "erlang",
        "module": "riak_kv_mapreduce",
        "function": "reduce_identity",
        "keep": true
      }
    }
  ]
}
EOF

Range with terms

When performing a range query, it is possible to retrieve the matched index values alongside the Riak keys using return_terms=true. An example from a small sampling of Twitter data with indexed hash tags:

Namespace tweetsBucket = new Namespace("indexes", "tweets");
BinIndexQuery biq = new BinIndexQuery.Builder(tweetsBucket, "hashtags", "rock", "rocl")
        .withKeyAndIndex(true)
        .build();
BinIndexQuery.Response response = client.execute(biq);

Response:

{
  "results": [
    {
      "rock": "349224101224787968"
    },
    {
      "rocks": "349223639880699905"
    }
  ]
}

Pagination

When asking for large result sets, it is often desirable to ask the servers to return chunks of results instead of a firehose. You can do so using max_results=<n>, where n is the number of results you'd like to receive.

Assuming more keys are available, a continuation value will be included in the results to allow the client to request the next page.

Here is an example of a range query with both return_terms and pagination against the same Twitter data set.

Namespace tweetsBucket = new Namespace("indexes", "tweets");
BinIndexQuery biq = new BinIndexQuery.Builder(tweetsBucket, "hashtags", "ri", "ru")
        .withMaxResults(5)
        .build();
BinIndexQuery.Response response = client.execute(biq);

Here is an example JSON response (your client-specific response may differ):

{
  "continuation": "g2gCbQAAAAdyaXBqYWtlbQAAABIzNDkyMjA2ODcwNTcxMjk0NzM=",
  "results": [
    { "rice": "349222574510710785" },
    { "rickross": "349222868095217664" },
    { "ridelife": "349221819552763905" },
    { "ripjake": "349220649341952001" },
    { "ripjake": "349220687057129473" }
  ]
}

Take the continuation value from the previous result set and feed it back into the query.

Namespace tweetsBucket = new Namespace("indexes", "tweets");
BinIndexQuery biq = new BinIndexQuery.Builder(tweetsBucket, "hashtags", "ri", "ru")
        .withContinuation(BinaryValue.create("g2gCbQAAAAdyaXBqYWtlbQAAABIzNDkyMjA2ODcwNTcxMjk0NzM="))
        .withMaxResults(5)
        .withKeyAndIndex(true)
        .build();
BinIndexQuery.Response response = client.execute(biq);

The result:

{
  "continuation": "g2gCbQAAAAlyb2Jhc2VyaWFtAAAAEjM0OTIyMzcwMjc2NTkxMjA2NQ==",
  "results": [
    {
      "ripjake": "349221198774808579"
    },
    {
      "ripped": "349224017347100672"
    },
    {
      "roadtrip": "349221207155032066"
    },
    {
      "roastietime": "349221370724491265"
    },
    {
      "robaseria": "349223702765912065"
    }
  ]
}

Streaming

It is also possible to stream results:

// Available in Riak Java Client 2.1.0 and later
int pollTimeoutMS = 200;
Namespace ns = new Namespace("indexes", "tweets");
String indexName = "hashtags";

BinIndexQuery indexQuery =
        new BinIndexQuery.Builder(ns, indexName, "ri", "ru").build();

final RiakFuture<BinIndexQuery.StreamingResponse, BinIndexQuery> streamingFuture =
        client.executeAsyncStreaming(indexQuery, pollTimeoutMS);

// For streaming commands, the future's value will be available before
// the future is complete, so you may begin to pull results from the
// provided iterator as soon as possible.
final BinIndexQuery.StreamingResponse streamingResponse = streamingFuture.get();

for (BinIndexQuery.Response.Entry e : streamingResponse)
{
    // Do something with key...
}

streamingFuture.await();
Assert.assertTrue(streamingFuture.isDone());

Streaming can also be combined with pagination and return_terms.
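
For example, a streamed, paginated query with returned terms could look something like the following sketch, which simply combines the builder options shown earlier in this section with executeAsyncStreaming.

// A sketch combining streaming with pagination (withMaxResults) and
// return_terms (withKeyAndIndex), using the same API calls shown above.
Namespace ns = new Namespace("indexes", "tweets");
BinIndexQuery pagedStreamingQuery =
        new BinIndexQuery.Builder(ns, "hashtags", "ri", "ru")
                .withMaxResults(5)
                .withKeyAndIndex(true)
                .build();

final RiakFuture<BinIndexQuery.StreamingResponse, BinIndexQuery> future =
        client.executeAsyncStreaming(pagedStreamingQuery, 200);

final BinIndexQuery.StreamingResponse streamed = future.get();
for (BinIndexQuery.Response.Entry e : streamed)
{
    // Each entry carries both the matched index value and the object key
}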

Sorting

As of Riak 1.4, the result set is sorted on index values (when executing range queries) and object keys. See the pagination example above: hash tags (2i keys) are returned in ascending order, and the object keys (Twitter IDs) for the messages which contain the ripjake hash tag are also returned in ascending order.

Retrieve all Bucket Keys via the $bucket Index

The following example retrieves the keys for all objects stored in the bucket people using an exact match on the special $bucket index.

curl localhost:8098/types/indexes/buckets/people/index/\$bucket/_

Count Bucket Objects via $bucket Index

The following example performs a secondary index lookup on the $bucket index like in the previous example and pipes this into a MapReduce that counts the number of records in the people bucket. In order to improve efficiency, the batch size has been increased from the default size of 20.

curl -XPOST localhost:8098/mapred \
  -H "Content-Type: application/json" \
  -d @-<<EOF
{
  "inputs": {
    "bucket": "people",
    "index": "\$bucket",
    "key": "people"
  },
  "query": [
    {
      "reduce": {
        "language": "erlang",
        "module": "riak_kv_mapreduce",
        "function": "reduce_count_inputs",
        "arg": {
          "reduce_phase_batch_size": 1000
        }
      }
    }
  ]
}
EOF