Saturday, May 12, 2012

File system abstraction

I just began to create a file system abstraction layer for jCoreDB. This is the lowest layer of a database system; it is used to interact with the operating system's file system. Many DBMSs also implement raw file system access, which means that the OS file system is bypassed and the bytes are written directly to an unformatted partition. Other abstractions are possible as well: such an abstracted FS could even use a web-service-based cloud store in the background. However, jCoreDB will currently abstract only the traditional file system provided by the underlying OS.
Here is some terminology:
  • A container contains one or more segments
  • A segment is a file with a preallocated number of blocks. We distinguish between data segments and header segments. A header segment belongs to a data segment and stores the segment's size, the block size inside the segment, and a free-memory bitmap.
  • A block has a fixed number of bytes
  • A block id is a tuple consisting of the container id, the segment id, and the position within the segment
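The block id tuple described above could be sketched as a small value class. This is a hypothetical sketch with made-up names; the actual jCoreDB classes may look different:

```java
// Hypothetical sketch of the block id tuple described above:
// (container id, segment id, position within the segment).
public final class BlockId {
    private final int containerId;
    private final int segmentId;
    private final int position;

    public BlockId(int containerId, int segmentId, int position) {
        this.containerId = containerId;
        this.segmentId = segmentId;
        this.position = position;
    }

    public int getContainerId() { return containerId; }
    public int getSegmentId()   { return segmentId; }
    public int getPosition()    { return position; }

    @Override
    public String toString() {
        return containerId + "/" + segmentId + "/" + position;
    }
}
```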

What does the file system abstraction have to provide:
  • Create, open and delete containers
  • Write a specific block
  • Read a specific block
  • Append a block (at the end, without a free-block lookup)
  • Delete a specific block
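The operations listed above could be captured in a small Java interface. All names here are illustrative sketches of mine, not the real jCoreDB interfaces, and error handling (IOException etc.) is omitted:

```java
// Hypothetical sketch of the file system abstraction's operations.
// Names are illustrative; error handling is omitted.
public interface FileSystem {

    // container life cycle
    void createContainer(int containerId);
    void openContainer(int containerId);
    void deleteContainer(int containerId);

    // block access via the (container id, segment id, position) tuple
    void writeBlock(int containerId, int segmentId, int position, byte[] data);
    byte[] readBlock(int containerId, int segmentId, int position);
    void deleteBlock(int containerId, int segmentId, int position);

    // append without a free-memory lookup, e.g. for bulk imports
    void appendBlock(int containerId, byte[] data);
}
```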
Behind the scenes it has to keep track of which blocks are available by doing basic free-memory management. Segments are provided as necessary: if a block inside a segment that belongs to the container is free, then that block can be written or overwritten. It should also be possible to just append data, avoiding the free-memory management overhead during bulk imports.
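The free-memory bookkeeping described above could be sketched with a bitmap per segment, using `java.util.BitSet`. This is a simplified in-memory illustration, not the real jCoreDB code: a write reuses the first free block, and a delete just clears the bit so the block can be overwritten later.

```java
import java.util.BitSet;

// Hypothetical in-memory sketch of a segment with a free-memory bitmap:
// a set bit means the block at that position is occupied.
public class InMemorySegment {
    private final int numBlocks;
    private final byte[][] blocks;
    private final BitSet used; // free-memory bitmap

    public InMemorySegment(int numBlocks) {
        this.numBlocks = numBlocks;
        this.blocks = new byte[numBlocks][];
        this.used = new BitSet(numBlocks);
    }

    // Write into the first free block; returns its position, or -1 if full.
    public int write(byte[] data) {
        int pos = used.nextClearBit(0);
        if (pos >= numBlocks) return -1;
        used.set(pos);
        blocks[pos] = data.clone();
        return pos;
    }

    public byte[] read(int pos) {
        return blocks[pos];
    }

    // Delete only clears the bitmap bit, so the block can be reused.
    public void delete(int pos) {
        used.clear(pos);
    }
}
```

Note how deleting block 0 and writing again reuses position 0 instead of growing the segment.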

I will soon provide the first lines of source code for the file system implementation. The following aspects should be taken into account:
  • File System Concurrency
  • Distributed File Systems
File system concurrency is interesting because a file is a kind of atomic resource: multiple threads cannot write to one single file at the same time. This means that real multi-threading is not really possible; emulated multi-threading would be, but I would not expect any benefit from it. So in summary, let's stay with the statement: "A file is an atomic resource." The idea is then to realize a segment as one file, so that it is in theory possible to write multiple segments in parallel. Assuming that we have one single hard disk, this would even have a negative effect because of the hard disk's seek time. If we assume a RAID of hard disks, where some blocks of the segments are stored on disk #1 and others on disk #2, we could at least expect a minimal benefit. What may bring the most is to have one thread per container, although the advantage only exists when inserting into multiple containers at the same time. All these points are arguments for beginning to implement our file system abstraction layer in a single-threaded mode. Later we will add a multi-threaded file system abstraction layer.
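The "one thread per container" idea could look roughly like this: each container id gets its own single-threaded executor, so writes to different containers can run in parallel while each container's files are only ever touched by one thread. A sketch under my own naming assumptions, not the planned jCoreDB code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of "one thread per container": every container id
// is mapped to its own single-threaded executor, keeping the container's
// files an atomic resource owned by exactly one thread.
public class ContainerScheduler {
    private final Map<Integer, ExecutorService> executors = new ConcurrentHashMap<>();

    public Future<?> submit(int containerId, Runnable writeTask) {
        ExecutorService ex = executors.computeIfAbsent(
                containerId, id -> Executors.newSingleThreadExecutor());
        return ex.submit(writeTask);
    }

    // Convenience wrapper that waits for the task to finish.
    public void submitAndWait(int containerId, Runnable writeTask) {
        try {
            submit(containerId, writeTask).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void shutdown() {
        for (ExecutorService ex : executors.values()) {
            ex.shutdown();
        }
    }
}
```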

Distribution is related to concurrency, but here we will focus more on thoughts regarding data distribution. A file system contains multiple containers, and a container may be bound to a path. If you put container #I on a path that belongs to disk #1 and container #II on disk #2, then you have already achieved a simple distribution of data. So the idea is to use a container as a partition. Currently we are not interested in why data is stored inside a particular container; this will be covered by the layers above. Another distribution approach could be a more service-based one. Today everything is available in the cloud ;-) , so it would also be possible (though not with the same performance) to build a web service on top of the file system. There could then be a file system load balancer and a registry: when a file system starts up, it registers with the registry and is from then on taken into account by the load balancer. Load balancer rules determine which block should be written to which file system. In a distributed mode, file system #1 holds only the containers #1 and #2, whereas file system #2 holds the containers #3 and #4, and so on, so each file system has two partitions. In a fail-over mode, each write request is forwarded to every registered file system. The read requests could be scheduled by using real load information, or at first just by using round robin.
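The round-robin read scheduling mentioned above can be sketched in a few lines. The registry and endpoint names here are invented for illustration:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of round-robin read scheduling: registered
// file systems are picked in turn for each read request.
public class RoundRobinBalancer {
    private final List<String> fileSystems; // e.g. registered endpoints
    private final AtomicInteger next = new AtomicInteger(0);

    public RoundRobinBalancer(List<String> fileSystems) {
        this.fileSystems = fileSystems;
    }

    // Pick the file system that should serve the next read request.
    public String nextForRead() {
        int i = Math.floorMod(next.getAndIncrement(), fileSystems.size());
        return fileSystems.get(i);
    }
}
```

A later refinement could replace the counter with real load information, as suggested above.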

I will soon publish some first (not yet tested) source code. After this code is tested and evaluated, the next layer will be the Page Buffer one. I am really looking forward to writing the next blog post about page buffering and scheduling. Another important topic will be indexing and storage structures.


  1. You're just about reinventing the old VMS filesystem/storage API (which can still be found in the mainframe world), a predecessor of the Unix/POSIX filesystem API, which was then trimmed down to the Plan9 FS/IPC API.

    Why not just take the old VMS API, along with its segmented memory model (maybe later drop in HSM, etc.), or take Plan9's distributed/grid model?

    IIRC, you just want (named) containers that hold segments consisting of an arbitrary number of equally-sized blocks, with block-based IO.

    This is exactly the traditional VMS filesystem concept.

    I'd suggest just writing a fast direct wrapper to the native posix filesystem via JNI, and then let the OS/kernel handle everything else (including buffer cache, etc).

  2. Hi Mr. Weigelt. :-) Thank you for joining the discussion. Yes, the API is that of a file system; I would say even the Windows FS API looks a bit similar. However, this API is just the Java counterpart for accessing the operating system's one. So you are absolutely right. We will see in the chapter 'Page Buffer Layer' that an OS file system buffer is not suitable for database systems; we will especially miss the 'pin' and 'unpin' functions to support specific access patterns. Let's also add a forum or mailing list to the project for some deeper discussions.

  3. However, we should keep the 'wrap VMS via JNI' idea in mind. It could be an alternative IFilesSystem implementation. It would be very interesting to benchmark both implementations: simple file-based access vs. a VMS wrapper. Sounds cool!

  4. I added some source code: