BlobStore Guide

The BlobStore API is a portable means of managing key-value storage providers such as Microsoft Azure Blob Service and Amazon S3. It offers both asynchronous and synchronous APIs, as well as Map-based access to your data. Our APIs are dramatically simplified from the providers, yet still offer enough sophistication to perform most work in a portable manner. We also have integrations underway for popular tools such as Apache commons VFS.

Like other components in jclouds, you always have means to gain access to the provider-specific interface if you need functionality that is not available in our abstraction.

Features

Location Aware API

Our location API helps you to portably identify a container within context, such as Americas or Europe.

We use the same model across the ComputeGuide which allows you to facilitate collocation of processing and data.

Asynchronous API

You have a choice of using either synchronous or asynchronous BlobStore API. If you choose to use the Asynchronous API, you'll benefit by gaining access to the most efficient means to achieve this, regardless of whether it is via threads, non-blocking io, or native async clients such as google appengine's async url fetch service.

Integration with non-java clients

Using our BlobRequestSigner, you can portably generate HTTP requests that can be passed to external systems for execution or processing. Use cases include javascript side-loading and curl-based processing on the bash prompt. Be creative!

Transient Provider

Our in-memory BlobStore allows you to test your storage code without credentials or a credit card!

Filesystem Provider

Our file system BlobStore allows you to use the same API when persisting to disk, memory, or a remote BlobStore like Amazon S3.

Supported Providers

See Blobstore API: Supported Providers for providers that can be used equally in any Blobstore API tool.

Concepts

The BlobStore API requires knowledge of 3 concepts: service, container, and blob.
A BlobStore is a key-value store such as Amazon S3, where your account exists, and where you can create containers.
A container is a namespace for your data, and you can have many of them.
Inside your container, you store data as a Blob referenced by a name. In all BlobStores the combination of your account, container, and blob relates directly to an HTTPs url.

Here are some key points about blobstores:

  • Globally addressable
  • Key, value with metadata
  • Accessed via HTTP
  • Containers are provisioned on demand through API calls
  • Unlimited scaling
  • Most are billed on a usage basis

Container

A container is a namespace for your objects. Depending on the service, the scope can be global, account, or sub-account scoped.
For example, in Amazon S3, containers are called buckets, and they must be uniquely named such that no-one else in the world conflicts. In other blobstores, the naming convention of the container is less strict.
All blobstores allow you to list your containers and also the contents within them. These contents can either be blobs, folders, or virtual paths.

Everything in a BlobStore is stored in a container. A container is like a website. So, if my container name is adrian, it will be created in an http accessible way.
For example, if I'm using Amazon S3, a container looks like this: http://adrian.s3.amazonaws.com.
If I store my photo with a key "mymug.jpg," you can guess it will end up here: http://adrian.s3.amazonaws.com/mymug.jpg

Blob

A blob is unstructured data that is stored in a container.
Some blobstores call them objects, blobs, or files. You lookup blobs in a container by a text key, which often relates directly to the HTTP url used to manipulate it. Blobs can be zero length or larger, some restricting size to 5GB, and others not restricting at all.

Finally, blobs can have metadata in the form of text key-value pairs you can store alongside the data.
When a blob is in a folder, its name is relative to that folder. Otherwise, it is its full path.

Folder

A folder is a subcontainer. It can contain blobs or other folders. The names of items in a folder are basenames. Blob names incorporate folders via "/" - just like you would with a "regular" file system.

Virtual Path

A virtual path can either be a marker file or a prefix. In either case, they are purely used to give the appearance of a hierarchical structure in a flat BlobStore.
When you perform a list at a virtual path, the blob names returned are absolute paths.

Access Control

By default, every item you put into a container is private, if you are interested in giving access to others, you will have to explicitly configure that.

Currently, means to expose your containers to the public are provider-specific.

limitations

Each blobstore has its own limitations.

  • S3 According to Amazon, it is better to create or delete buckets in a separate initialization or setup routine that you run less often. You are also only allowed 100 buckets per account, so be parsimonious (frugal).
  • Azure You have to wait 30 seconds before recreating a container with the same name.
  • Azure currently supports max 64MB files, google storage has a very large limit,
  • Amazon S3 introduced a multipart upload possibility which allow files until 5TB size.
  • S3 and Rackspace CloudFiles have a 5GB limit. We've engineered jclouds to not put further limits on blob size.

Using BlobStore API

Connecting to a BlobStore

A connection to a BlobStore like S3 in jclouds is called a BlobStoreContext. It should be reused for multiple requests and is thread-safe. An BlobStoreContext associates an identity on a provider to a set of network connections. At a minimum, you need to specify your identity (in the case of S3, AWS Access Key ID) and a credential (in S3, your Secret Access Key).
Once you have your credentials, connecting to your BlobStore service is easy:

BlobStoreContext context = new BlobStoreContextFactory().createContext("aws-s3", identity, credential);

This will give you a connection to the BlobStore, and if it is remote, it will be SSL unless unsupported by the provider. Everything you access from this context will use the same credentials and potentially the same objects.

Disconnecting

When you are finished with a context, you should close it using close method:

context.close();

There are many options available for creating contexts. Please see the Javadoc for ContextBuilder for detailed description.

APIs

You can choose from four APIs in increasing complexity: Map, BlobMap, BlobStore, and AsyncBlobStore.
For simple applications, you may find the most basic Map<String,InputStream> interface most appropriate.
As complexity increases, you are also able to use the AsyncBlobStore interface: FutureCommand. Let's review the Map APIs first.

InputStreamMap

If you don't want to be bothered with the details of a BlobStore like Amazon S3, you may consider just accessing containers as a plain Map<String, InputStream> object. Just create your context to to the BlobStore, choose the container of the stuff you want to manage, and get to work:

BlobStoreContext context = new BlobStoreContextFactory().createContext("aws-s3", identity, credential);
Map<String, InputStream> map = context.createInputStreamMap("adrian.photos");
// do work
context.close();
Tips
  • Always close your InputStreams When you do something like this, the InputStream returned may be holding a connection to the provider.
    Be sure to close your InputStream promptly.
InputStream aGreatMovie = map.get("theshining.mpg");
try {
      //watch
} finally {
if (aGreatMovie != null) aGreatMovie.close();
}
  • Extra put methods While you can feel free to use map.put("stuff", new FileInputStream("stuff.txt"), jclouds does provide some extra goodies.
    To use these, use the InputStreamMap class as opposed to Map<String,InputStream> when creating you Map view.
InputStreamMap map = context.createInputStreamMap("adrian.photos");
map.putFile("stuff", new File("stuff.txt"));
map.putBytes("secrets", Util.encrypt("secrets.txt"));
map.putString("index.html", "<html><body>hello world</body></html>");

There are also corresponding putAllFiles, Bytes, Strings methods if you have bulk stuff to store.

BlobMap

There are some limitations when using the Map<String, InputStream> API. For starters, you cannot pass any extra data to the provider. For example, if you want to pass a default filename via the Content-Disposition group, it cannot be done this way. BlobMap allows you do customize the data you are sending at the cost of coding to a jclouds API. Considering it is only one class at this point, this is a decent tradeoff for many.

Here is an example that shows how to use the BlobMap API:

BlobStoreContext context = new BlobStoreContextFactory().createContext("aws-s3", identity, credential);
BlobMap map = context.createBlobMap("adrian.photos");

Blob blob = map.blobBuilder("sushi.jpg")                                                                                                                                                  
               .payload(new File("sushi.jpg"))// or byte[]. InputStream, etc.                                                                                                             
               .contentDisposition("attachment; filename=sushi.jpg")                                                                                                                      
               .contentType("image/jpeg")                                                                                                                                                 
               .calculateMD5().build();

map.put(blob.getName(), blob);

context.close();

BlobStore (Synchronous)

Here is an example of the BlobStore interface.

// init
context = new BlobStoreContextFactory().createContext(
                                                "aws-s3",
                                                accesskeyid,
                                                secretaccesskey);
blobStore = context.getBlobStore();

// create container
blobStore.createContainerInLocation(null, "mycontainer");
  
// add blob
blob = blobStore.blobBuilder("test")  // you can use folders via newBlob(folderName + "/sushi.jpg")                                                                                     
                  .payload("testdata").build();
blobStore.putBlob(containerName, blob);
Creating a Container

If you don't already have a container, you will need to create one. First, get a BlobStore from your context:

BlobStore blobstore = context.getBlobStore();

Location is a region, provider or another scope in which a container can be created to ensure data locality. If you don't have a location concern, pass null to accept the default.

boolean created = blobStore.createContainerInLocation(null, String container);
if (created){
    // the container didn't exist, but does now
}else{
    // the container already existed
}
    

AsyncBlobStore

AsyncBlobStore is the third and most powerful way to interact with a BlobStore. The API are asynchronous even if the engine you choose is not asynchronous. AsyncBlobStore has the same methods as the BlobStore, but all commands return java.util.concurrent.Future results.
Using AsyncBlobStore you can perform a lot of commands simultaneously, such as aggregating hundreds of blobs for processing.
You can also attach listeners to the result, so that you can do things like publish to a message queue when an operation completes.

Here's an example of uploading tons of blobs at the same time:

import static org.jclouds.concurrent.FutureIterables.awaitCompletion;

Map<Blob, Future<?>> responses = Maps.newHashMap();
for (Blob blob : blobs) {
   responses.put(blob, context.getAsyncBlobStore().putBlob(containerName, blob));
}
exceptions = awaitCompletion(responses, 
      context.utils().userExecutor(),
      maxTime,
      logger,
      String.format("putting into containerName: %s", containerName));

Multipart upload

Providers may implement multipart upload for large or very files.
Here's an example of multipart upload using aws-s3 provider, which allow uploading files large as 5TB.

import static org.jclouds.blobstore.options.PutOptions.Builder.multipart;                                                                                                                 

  // init                                                                                                                                                                                 
  context = new BlobStoreContextFactory().createContext(                                                                                                                                  
                  "aws-s3",                                                                                                                                                               
                  accesskeyid,                                                                                                                                                            
                  secretaccesskey);                                                                                                                                                       
  AsyncBlobStore blobStore = context.getAsyncBlobStore();                                                                                                                                 
                                                                                                                                                                                          
  // create container                                                                                                                                                                     
  blobStore.createContainerInLocation(null, "mycontainer");                                                                                                                               
                                                                                                                                                                                          
  File input = new File(fileName);                                                                                                                                                        
  // Add a Blob                                                                                                                                                                           
  Blob blob = blobStore.blobBuilder(objectName).payload(input)                                                                                                                            
               .contentType(MediaType.APPLICATION_OCTET_STREAM).contentDisposition(objectName).build();                                                                                   
  // Upload a file                                                                                                                                                                        
  ListenableFuture<String> futureETag = blobStore.putBlob(containerName, blob, multipart());                                                                                              
                                                                                                                                                                                          
  // asynchronously wait for the upload                                                                                                                                                   
  String eTag = futureETag.get();                                                                                                                                                         

Logging

You can now see status of aggregate blobstore commands by enabling at least DEBUG on the log category: "jclouds.blobstore".

Here is example output:

2010-01-31 14:41:14,921 TRACE [jclouds.blobstore] (pool-4-thread-4) deleting from containerName: adriancole-blobstore2, 
completed: 5001/5001, errors: 0, rate: 14ms/op

If you are using the Log4JLoggingModule, here is an example log4j.xml stanza you can use to enable blobstore logging:

   <appender name="BLOBSTOREFILE" class="org.apache.log4j.DailyRollingFileAppender">
        <param name="File" value="logs/jclouds-blobstore.log" />
        <param name="Append" value="true" />
        <param name="DatePattern" value="'.'yyyy-MM-dd" />
        <param name="Threshold" value="TRACE" />
        <layout class="org.apache.log4j.PatternLayout">
            <param name="ConversionPattern" value="%d %-5p [%c] (%t) %m%n" />
        </layout>
    </appender>

    <appender name="ASYNCBLOBSTORE" class="org.apache.log4j.AsyncAppender">
        <appender-ref ref="BLOBSTOREFILE" />
    </appender>

   <category name="jclouds.blobstore">
        <priority value="TRACE" />
        <appender-ref ref="ASYNCBLOBSTORE" />
    </category>

Clojure

The above examples show how to use the BlobStore API in Java. You can also use the API in Clojure.

Setup

  • Install leiningen
  • lein new mygroup/myproject
  • cd myproject
  • vi project.clj
  • for jclouds 1.1 and earlier (clojure 1.2 only)
(defproject mygroup/myproject "1.0.0" 
  :description "FIXME: write"
  :dependencies [[org.clojure/clojure "1.2.0"]
                 [org.clojure/clojure-contrib "1.2.0"]
         [org.jclouds/jclouds-allblobstore "1.1.0"]])
* for jclouds 1.2 / snapshot (clojure 1.2 and 1.3)
(defproject mygroup/myproject "1.0.0" 
  :description "FIXME: write"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [org.clojure/core.incubator "0.1.0"]
                 [org.clojure/tools.logging "0.2.3"]
                 [org.jclouds/jclouds-allcompute "1.2.0-SNAPSHOT"]]
  :repositories {"jclouds-snapshot" "https://oss.sonatype.org/content/repositories/snapshots"}) 
  • Execute lein deps

Usage

Execute lein repl to get a repl, then paste the following or write your own code.
Clearly, you need to substitute your accounts and keys below.

(use 'org.jclouds.blobstore2)

(def *blobstore* (blobstore "azureblob" account encodedkey))                                                                                                                            
(create-container *blobstore* "mycontainer")                                                                                                                                            
(put-blob *blobstore* "mycontainer" (blob "test" :payload "testdata"))

Advanced Concepts

This section covers advanced topics typically needed by developers of clouds.

Signing requests

java example

HttpRequest request = context.getSigner().
                              signGetBlob("adriansmovies",
                                          "sushi.avi");

clojure example

(let [request (sign-blob-request "adriansmovies"
                                 "sushi.avi" {:method :get})])

Configure multipart upload strategies ==

There are two MultipartUploadStrategy implementations: SequentialMultipartUploadStrategy and ParallelMultipartUploadStrategy.
Default strategy is the ParallelMultipartUploadStrategy. With parallel strategy the number of threads running in parallel can be configured using jclouds.mpu.parallel.degree property, the default value is 4.

Design

Marker Files

Marker Files allow you to establish presence of directories in a flat key-value store.
Azure, S3, and Rackspace all use pseudo-directories, but in a different ways. For example, some tools look for a content type "application/directory", while others look for naming patterns such as a trailing slash or the suffix _$folder$.

In jclouds, we attempt to detect whether a blob is pretending to be a directory, and if so, type it as StorageType.RELATIVE_PATH.
Then, in a list() command, it will appear as a normal directory. The two objects responsible for this are IfDirectoryReturnNameStrategy and MkdirStrategy.

There is a problem with this approach, there are multiple ways to suggest presence of a directory. For example, it is entirely possible that both the trailing slash and _$folder$ suffixes exist. For this reason, a simple remove, or rmDir will not work, as may be the case that there are multiple tokens relating to the same directory.

For this reason, we have a DeleteDirectoryStrategy strategy. The default version of this used for flat trees removes all known deviations of directory markers.

Content Metadata : Content Disposition

You may be using jclouds to upload some photos to the cloud, show thumbnails of them to the user via a website and allow to download the original image. When the user clicks on the thumbnail, a the download dialog appears. To control the name of the file in the "save as" dialog, you must set Content Disposition. Here's how you can do it with BlobStore API:

Blob blob = context.getBlobStore().blobBuilder("sushi.jpg")                                                                                                                               
               .payload(new File("sushi.jpg"))// or byte[]. InputStream, etc.                                                                                                             
               .contentDisposition("attachment; filename=sushi.jpg")                                                                                                                      
               .contentType("image/jpeg")                                                                                                                                                 
               .calculateMD5().build();

Large Lists

A listing is a set of metadata about items in a container. It is normally associated with a single GET request against your container.

Large lists are those who exceed the default or maximum list size of the blob store. In S3, Azure, and Rackspace, this is 1000, 5000, and 10000 respectively. Upon hitting this threshold, you need to continue the list in another HTTP request.

In the new BlobStore API, list responses return a PageSet object.

A PageSet object is the same as a normal Set, except that it has a new method

String getNextMarker();

If this returns null, you have the entire listing. If not, you can choose to continue iterating the list by specifying the ListContainerOption afterMarker to this value.

Our Map API knows how to concatenate lists via the ListMetadataStrategy object.

Return Null on Not Found

All APIs, provider-specific or abstraction, must return null when an object is requested, but not found.
Throwing exceptions is only appropriate when there is a state problem, for example requesting an object from a container that does not exist is a state problem, and should throw an exception.

Uploading Large Files

As long as you use either InputStream or File as the payload your blob, you should be fine.
Note that in S3, you must calculate the length ahead of time, since it doesn't support chunked encoding. Our integration tests ensure that we don't rebuffer in memory on upload: testUploadBigFile.

This is verified against all of our http clients, conceding that it isn't going to help limited environments such as google app engine.

Downloading Large Files

A blob you've downloaded via blobstore.getBlob() can be accessed via blob.getPayload().getInput() or blob.getPayload().writeTo(outputStream). Since these are streaming, you shouldn't have a problem with memory unless you rebuffer it.