Tuesday, September 25, 2012

The page buffer

In this post I would like to talk a little bit about page buffering. We remember that the previous posts had subjects like File System Abstraction, where the focus was on reading and writing blocks from or to the disk. So what is a page? A page is just a block of bytes, so it is quite similar to the block which was explained previously. It is not merely a new name for the same thing, though; it differs regarding its use case. A block is something which is addressed physically on disk and which is accessed in an abstract way by the DBMS, whereas a page is a block of bytes which is kept in main memory. A page is managed within the page buffer and, for this purpose, has some additional information attached.

So the purpose of the buffer is to keep pages available in main memory, since main memory access is a lot faster than hard disk access. A buffer can be cold or hot. A hot buffer contains almost every typically used page, so the target is to reach a cache hit ratio of more than 90%. The cache hit ratio is the fraction of page requests that are cache hits, i.e. hits / (hits + misses). A cache hit means that a required page was available in the buffer; the opposite is a cache miss, which means that the page had to be fetched from disk.

You may ask why we explicitly implement a buffer instead of using the operating system's one. The answer is that the characteristics of an OS buffer are not always sufficient. Not every OS buffer knows, e.g., the pin operation. This operation tells the buffer that a specific page must not be moved out of the buffer. Let's assume a nested loop join, which means two nested loops that are accessing pages: a column value of a row of a page within the buffer should be compared with values of other pages. It's easy to see that we need to keep the outer page inside the buffer while comparing it with the other ones.
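The pin operation described above boils down to a small piece of bookkeeping next to the buffer: a pinned frame may never be chosen as a replacement victim. The following is a minimal sketch of that idea; the class and method names are assumptions for this illustration, not jCoreDB's actual interface.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of pin bookkeeping: a pinned page may not be
// chosen as a replacement victim. Names are illustrative only.
class PinRegistry {
    private final Set<Integer> pinned = new HashSet<>();

    void pin(int frameNo)   { pinned.add(frameNo); }
    void unpin(int frameNo) { pinned.remove(frameNo); }

    // The scheduler would ask this before evicting a frame.
    boolean isEvictable(int frameNo) { return !pinned.contains(frameNo); }
}
```

In the nested loop join scenario, the outer page's frame would stay pinned for the whole duration of the inner loop and be unpinned afterwards.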

A buffer has a maximum size, so it can only keep a specified number of pages. If the buffer is full, which means that every buffer frame is used, it is required to replace an existing page with the requested one. So strongly related to the buffer is the scheduler. A scheduler decides which page gets replaced. Schedulers can be very simple or more complex. Here are some possible scheduling algorithms:
  • F(irst) (I)n F(irst) O(ut): Replaces just the oldest page
  • L(east) F(requently) U(sed): Replaces the page which was least often referenced 
  • D(ynamic) G(eneralized) CLOCK: Goes clockwise through the frames, counting the local references and also keeping one global counter. If the global counter reaches a limit, a normalization happens.
jCoreDB at first implements LFU; afterwards we will implement DGCLOCK. LFU has the disadvantage that it only takes the number of references into account, but not the age of the page within the buffer. So the most often used pages are kept inside the buffer. We can easily see that this is not fair to just-added pages, which also have a high probability of being used next. So we should also take the age into account to get a better scheduler. DGCLOCK does this by combining a reference counter with an age counter.
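To make the LFU idea concrete, here is a minimal victim selection sketch. The class and method names are made up for this example; jCoreDB's actual LFUScheduler will look different.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal LFU victim selection: the frame with the smallest
// reference count is replaced. Illustrative sketch only.
class LfuSketch {
    private final Map<Integer, Integer> refCount = new HashMap<>();

    // Called on every page reference within a frame.
    void referenced(int frameNo) {
        refCount.merge(frameNo, 1, Integer::sum);
    }

    // Returns the frame with the fewest references, or -1 if empty.
    int victim() {
        int best = -1, min = Integer.MAX_VALUE;
        for (Map.Entry<Integer, Integer> e : refCount.entrySet()) {
            if (e.getValue() < min) { min = e.getValue(); best = e.getKey(); }
        }
        return best;
    }
}
```

The sketch also shows LFU's weakness directly: a page that was just added has a reference count of 1 and immediately becomes the preferred victim, regardless of how recently it arrived.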

Besides the scheduling strategy, a buffer also realizes a write strategy. The following strategies are available:
  • Write Through: Just write every change (Update, Insert) from the buffer directly to the underlying file system.
  • Write Behind: Perform the change at first in the buffer. If a page has a different state within the buffer than on disk, it is marked as 'dirty'. One or multiple writer threads afterwards write the dirty pages to the underlying file system.
jCoreDB at first implements Write Through, simply because the implementation effort is less than for Write Behind. Later we will implement Write Behind, which is necessary to realize a multi-threaded buffer. It will also become important in relation to transaction management.
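The essence of Write Through is easiest to see in code: every update goes straight to the file system, so the buffer never holds dirty pages. Below is a minimal sketch; the BlockSink interface stands in for the file system layer and is an assumption of this example, not jCoreDB's actual signature.

```java
// The sink represents the underlying file system layer (assumed shape).
interface BlockSink {
    void writeBlock(int containerId, int segId, int blockNo, byte[] data);
}

// Write Through: update the page and immediately persist it.
// No dirty flag and no background writer threads are needed.
class WriteThroughSketch {
    private final BlockSink fs;

    WriteThroughSketch(BlockSink fs) { this.fs = fs; }

    void updatePage(int containerId, int segId, int blockNo, byte[] data) {
        fs.writeBlock(containerId, segId, blockNo, data);
    }
}
```

Write Behind would instead only mark the page dirty here and leave the `writeBlock` call to a separate writer thread, which is exactly what makes it the harder but more scalable strategy.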

Currently the following classes have been implemented in order to realize the Page Buffer requirements:
  • IPageBuffer: Defines a wrapper for the file system functions by providing additional buffering
  • IScheduler: Defines which operations a scheduler should support
  • Page: A page extends a block. A page has a state within the buffer (pinned, dirty,...) and a type (data, index, overflow)
  • SchedulerFactory: Returns a scheduler implementation based on the configuration
  • LFUScheduler: The scheduler which realizes the LFU algorithm
  • PageIndex: A hash index which maps the block id to the buffer frame index
  • WriteThroughPageBuffer: The buffer which performs a Write Through
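To illustrate the PageIndex from the list above, here is a small sketch of a hash index from a block id to a buffer frame number. Encoding the block id as a "container/segment/block" string key is only an assumption made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the PageIndex idea: a hash map from a block id
// (encoded here as "container/segment/block") to the frame number.
class PageIndexSketch {
    private final Map<String, Integer> index = new HashMap<>();

    void put(int c, int s, int b, int frameNo) {
        index.put(c + "/" + s + "/" + b, frameNo);
    }

    // Returns the frame number, or -1 on a cache miss.
    int frameOf(int c, int s, int b) {
        return index.getOrDefault(c + "/" + s + "/" + b, -1);
    }
}
```

A lookup that returns -1 corresponds to a cache miss, after which the buffer would ask the scheduler for a victim frame and load the block from disk.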

Friday, July 20, 2012

RESTful Web Service Access

It was quite clear to me that jCoreDB should provide Web Services on each layer in order to have the most flexibility available if you want to use it in a service-oriented environment. The first implementation which I did for this purpose was a SOAP one, and I still think that SOAP has some advantages. SOAP is well defined by default. This allows you to create a Web Service client for about every platform with little effort; in fact the client code is generated automatically from the Web Service's descriptor. There are also disadvantages. One big one is that SOAP is very 'chatty': in order to be so well defined, it's required to transport a lot of data about the data (so meta data). This can become an issue if you transfer a lot of data or if performance is a requirement. So why does REST seem to be cooler? REST is stateless and it is lightweight (only the HTTP methods are used: GET, DELETE, PUT and POST). Not that amazing so far? It especially means that a RESTful Web Service provides you HTTP-based access to resources. So the client/server communication happens not message-based, but resource-based. In SOAP the client would send a message like 'Please call the following method on the server side and send me its result'. In the REST world you just directly access a resource which is then dynamically produced on the server side. Our RESTful Web Services will use JSON as the resource format. This has two reasons. The first one is that JSON is really as powerful as XML, but less 'chatty'. The second one is that JSON can be interpreted out of the box by every JavaScript application. So a JavaScript-based Web Application (and there are more and more...) can use the data just out of the box.
So what would a RESTful Service require?
  • It needs to talk HTTP, which means it should at least be able to handle GET requests. 
  • Dynamic resources are required. They are generated depending on the request parameters that the client passes to the server. The resources must be mapped to URLs.
  • Basic authentication should work. So it's only possible to access a specific resource if a previous authentication was successful.
Maybe the following implementation is a bit strange, because I know that there are RESTful frameworks based on JAX-RS in the Java world; so there seems to be no reason to do it on my own ;-).
So a very simple HTTP server was developed. One advantage of it is that you do not need a Servlet container in order to host the service (yes, we could have used the Endpoint classes of Java...). It's now possible to provide several services on several ports, even if you deploy the jCoreDB core to a Servlet container like Tomcat. So the services are Servlet container independent. Back to the HTTP server: my HTTPServer class implements the IServer interface. This means that it has a 'start', 'stop' and 'status' method. The primary method is named 'doGet'. It has 3 parameters: 'getHeader', 'authHeader' and 'out'. The first one is used to pass the GET request information. The second one contains the basic HTTP authentication information. 'out' is a PrintStream which is used to return the result to the client. The next part of this lightweight REST framework is an abstract class (implementing IResource) with the name AResource. Such a resource gets constructed by passing a HashMap of parameters and the target output stream. AResource has one abstract method which is named 'create'. This method is used to create the dynamic resource on demand. Such a resource needs to be annotated in order to map it to a URL. Therefore the annotation interface WebResourceURL was created. Then Java's reflection mechanisms are used to find the resource which is mapped to a specific URL. Here is the source code to create the resource as part of the HTTP server's doGet method:

Class<?> c = HTTPResourceFinder.getClassByURL(urlprefix);

if (c != null) {

    Constructor<?> ctx = c.getConstructor(
        new Class[] { PrintStream.class, HashMap.class });

    Object[] args = new Object[] { out, getParams(url) };

    IHTTPResource resource = (IHTTPResource) ctx.newInstance(args);

    // Build the dynamic resource on demand
    resource.create();
}


The annotation interface looks quite simple:

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// RUNTIME retention is required so that the reflection-based lookup can see it
@Retention(RetentionPolicy.RUNTIME)
public @interface WebResourceURL {

    String value();
}

To get the list of available containers a ContainerListResource was created. Here is its code:

@WebResourceURL(value = "/listcontainers")
public class ContainerListResource extends AHTTPResource {

    public ContainerListResource(PrintStream out, HashMap<String, String> params) {
        super(out, params);
    }

    public void create() {
        IFileSystem fs = FileSystemRegistry.get(Configs.getFSConfig().getRootDir());
        IContainer[] cs = fs.getContainers();
        out.println("{\"containers\":" + "[");
        for (int i = 0; i < cs.length; i++) {
             IContainer c = cs[i];

             if (i < cs.length - 1) out.println("{\"id\":" + c.getId() + "}" + ",");
             else out.println("{\"id\":" + c.getId() + "}");
        }
        out.println("]}");
    }
}


Another resource is the one to get the details of a specific container:

public class ContainerResource extends AHTTPResource {

    /**
     * The container's id
     */
    int id;

    /**
     * The resource constructor
     * @param out
     */
    public ContainerResource(PrintStream out, HashMap<String, String> params) {
        super(out, params);
        id = Integer.parseInt(params.get("id"));
    }

    public void create() {
        IFileSystem fs = FileSystemRegistry.get(Configs.getFSConfig().getRootDir());
        IContainer c = fs.getContainer(id);
        if (c != null) {
            out.println("{\"id\":" + c.getId() + ",");
            out.println("\"parentpath\":" + "\"" + c.getParentPath() + "\"" + ",");
            out.println("\"numofsegs\":" + c.getNumOfSegs() + ",");
            out.println("\"segments\":" + "[");
            for (int i = 0; i < c.getSegments().length; i++) {
                 ISegment s = c.getSegments()[i];

                 if (i < c.getSegments().length - 1) out.println("{\"id\":" + s.getId() + "}" + ",");
                 else out.println("{\"id\":" + s.getId() + "}");
            }
            out.println("]}");
        }
    }
}


I will soon publish the source code of the above mentioned components. So maybe you like this lightweight RESTful Web Service framework (only a few classes: HTTPServer, HTTPConstants, HTTPResourceFinder, IResource, AResource, WebResourceURL) which now exists in the context of jCoreDB.

Thursday, June 21, 2012

Parallel worlds

Since the single-threaded file system implementation is done, we may investigate another implementation which can write in parallel.

The first question regarding this subject is 'Is this really useful?' and the answer is 'It depends.'. The following article shows why this is the only possible answer.

At first let's talk about resources: the physical resource is given by, for instance, a single hard disk. We have multiple physical resources if we have an array of hard disks. So our expectation is: parallel writing to one single disk does not bring any benefit. Instead we assume an overhead which is caused by the fact that the threads have to be managed.

Additionally we have to face logical resources. A logical resource is a file which is provided by the disk's file system. Even if we have an array of physical disks available, we can only write to one file by using exactly one thread at a time.

Let's shed some light on my current idea of how to implement such parallel writing.

* A Segment is realized as one file. So one segment is exactly one logical resource.
* A parallel writer should be able to write multiple segments in parallel.

So I extended the file system to write and append arrays of blocks. To avoid resource blocking, I am using a 'slicing' mechanism.

At first we index the blocks by their segment id. Such a SegmentIndex is a hash index which has the segment id as the key. The value for each key is a list of blocks. It looks like the following:

- (0,1) -> [Block_011,...,Block_01n]
- (0,2) -> [Block_021,...,Block_02k]
- ...
- (c,s) -> [Block_cs1,...,Block_csb]
- ...

Here 'c' is the container id, 's' is the segment id within the container and 'b' is the block id within the segment.

Now we can just slice the index. The first slice is e.g.:

[Block_011, Block_021, ..., Block_cs1, ...]

Each entry of the slice points to another segment. What we now need is a pool of writer threads, whereby one thread writes exactly one of the entries of the slice. It's surely possible that there are fewer or more entries than writers available. If there are more, then the thread pool queues them.
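The slicing step described above can be sketched as follows. The index is represented here as a plain map from a segment key to its list of block names; this is only an illustration of the idea, not the actual jCoreDB code. The invariant is that no two entries of one slice belong to the same segment, so they never compete for the same file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Slice i collects the i-th block of every segment, so each slice
// can be written fully in parallel (one writer thread per entry).
class SliceSketch {
    static List<List<String>> slices(Map<String, List<String>> bySegment) {
        List<List<String>> result = new ArrayList<>();
        int i = 0;
        boolean more = true;
        while (more) {
            more = false;
            List<String> slice = new ArrayList<>();
            for (List<String> blocks : bySegment.values()) {
                if (i < blocks.size()) { slice.add(blocks.get(i)); more = true; }
            }
            if (more) result.add(slice);
            i++;
        }
        return result;
    }
}
```

Each resulting slice would then be handed to the writer thread pool, one entry per thread.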

I already added a simple benchmark test, which shows that now all 4 CPU cores of my test box are busy; before, it was just one CPU. But because my development box has just one single hard disk, I can't evaluate the benefits yet. Some meaningful benchmarks will follow. It may be quite interesting to test with an SSD, because there is no seek time. However, in summary we can say that writing to a disk is indeed absolutely IO bound and so less CPU bound. This raises again the question 'Is this useful?'. Further evaluation will show it.

Feedback? Just post a comment to this article.

Saturday, May 12, 2012

File system abstraction

I just began to create a file system abstraction layer for jCoreDB. This is the lowest layer of a Database System. It is used to interact with the Operating System's file system. Often DBMSs also have raw file system access implemented. This means that not the OS file system is used, but the bytes are directly written to a not yet formatted partition. Other abstractions are possible as well; such an abstracted FS could even use a web service based cloud store in the background. However, jCoreDB will currently abstract only the traditional file system which is provided by the underlying OS.
Here is some terminology:
  • A container contains one or more segments
  • A segment is a file with a preallocated number of blocks. We distinguish between data segments and header segments. A header segment belongs to a data segment and contains the information about the size of the segment, the block size inside the segment and a free memory bitmap.
  • A block has a fixed number of bytes
  • A block id is a tuple which contains the Container id, the Segment id and the position within a segment
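The block id tuple from the terminology above could be modeled like this; the class shape is illustrative, not the actual jCoreDB type.

```java
// A block id as described above: (container id, segment id, position).
class BlockId {
    final int container, segment, position;

    BlockId(int container, int segment, int position) {
        this.container = container;
        this.segment = segment;
        this.position = position;
    }

    @Override public String toString() {
        return "(" + container + "," + segment + "," + position + ")";
    }
}
```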

What does the file system abstraction have to provide?
  • Create, open and delete containers
  • Write a specific block
  • Read a specific block
  • Append a specific block
  • Delete a specific block
Behind the scenes it has to keep track of which blocks are available by doing basic free memory management. Segments are provided as necessary. So if a block is free inside a segment which belongs to the container, the block can be written or overwritten. It should also be possible to just append data, avoiding the free memory management overhead during bulk imports.
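The free memory management mentioned above can be sketched with a simple bitmap, one bit per block, which is what the header segment's free memory bitmap would hold. The class and method names here are assumptions made for this example.

```java
import java.util.BitSet;

// Sketch of a free-memory bitmap: one bit per block, set = occupied.
class FreeMap {
    private final BitSet used;
    private final int numBlocks;

    FreeMap(int numBlocks) {
        this.numBlocks = numBlocks;
        this.used = new BitSet(numBlocks);
    }

    // Returns the first free block position, or -1 if the segment is full.
    int allocate() {
        int pos = used.nextClearBit(0);
        if (pos >= numBlocks) return -1;
        used.set(pos);
        return pos;
    }

    void free(int pos) { used.clear(pos); }
}
```

An append-only bulk import would bypass `allocate` entirely and just write past the highest used position, which is exactly the overhead saving mentioned above.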

I will soon provide the first lines of source code regarding the file system implementation. The following thoughts should be taken into account:
  • File System Concurrency
  • Distributed File Systems
File system concurrency is interesting because a file is a kind of atomic resource. So there can't be multiple threads that write to one single file at the same time. This means that real multi-threading is not really possible. An emulated multi-threading would be possible, but I would not expect any benefit from it. So in summary let's stay with the statement: "A file is an atomic resource." The idea is then to realize a segment as one file, so it is in theory possible to write multiple segments in parallel. Assuming that we have one single hard disk, this would even have a negative effect because of the hard disk's seek time. If we assume a RAID of hard disks, whereby some blocks of the segments are stored on hard disk #1 and others on #2, we could at least expect a minimal benefit. What may bring the most is to have one thread per container. The advantage only exists when inserting into multiple containers at the same time. All these points are arguments why we begin by implementing our file system abstraction layer in a single-threaded mode. Later we will add a multi-threaded file system abstraction layer.

Distribution is kind of related to concurrency, but here we will focus more on thoughts regarding data distribution. A file system contains multiple containers. A container may be bound to a path. If you put container #I on a path which belongs to disk #1 and container #II on disk #2, then you have also achieved a simple distribution of data. So the idea is to use a container as a partition. Currently we are not interested in how data is distributed across the containers; this will be covered by the layers above. Another distribution approach could be a more service based one. Today everything is available in the cloud ;-), so it would also be possible (but not with the same performance) to build a web service on top of the file system. Then there could be a file system load balancer and registry. If a file system starts up, it registers with the registry and so it is taken into account by the load balancer. Load balancer rules are used to determine which block should be written to which file system. In a distributed mode file system #1 has only the containers #1 and #2, whereby file system #2 has the containers #3, #4 ... and so on. So each file system has two partitions. In a fail-over mode each write request is forwarded to every registered file system. The read requests could be scheduled by using the real load information, or at first by just using Round Robin.
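The Round Robin read scheduling mentioned at the end could look like the following minimal sketch. The registry of file systems is represented as a plain array of names; all names here are assumptions for this illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Round Robin over the registered file systems: each read request
// is handed to the next node in a fixed cyclic order.
class RoundRobin {
    private final String[] nodes;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobin(String... nodes) { this.nodes = nodes; }

    String pick() {
        return nodes[Math.floorMod(next.getAndIncrement(), nodes.length)];
    }
}
```

A load-aware scheduler would replace `pick` with a choice based on the real load information reported by each registered file system.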

I will soon publish some first (not yet tested) source code. After this code is tested and evaluated, the next layer will be the Page Buffer one. So I am really looking forward to writing the next blog post about page buffering and scheduling. Another important topic will be indexing and storage structures.


Welcome to jCoreDB's blog. This blog is more of a notebook for ideas. It is then up to you to post comments regarding them.