I have just started to create a file system abstraction layer for jCoreDB. This is the lowest layer of a database system. It is used to interact with the operating system's file system. DBMSs often also implement raw device access. This means that the OS file system is bypassed and the bytes are written directly to an unformatted partition. Other abstractions are possible as well: such an abstracted FS could even use a web-service-based cloud store in the background. However, jCoreDB will currently abstract only the traditional file system provided by the underlying OS.
Here is some terminology:
- A container contains one or more segments
- A segment is a file with a preallocated number of blocks. We distinguish between data segments and header segments. A header segment belongs to a data segment and contains the size of the segment, the block size inside the segment and a free memory bitmap.
- A block has a fixed number of bytes
- A block id is a tuple which contains the Container id, the Segment id and the position within a segment
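To make the terminology a bit more concrete, here is a minimal sketch of how a block id and a header segment might look in Java. The names `BlockId`, `SegmentHeader`, `byteOffset`, `allocate` and `free` are assumptions made for illustration and are not taken from the jCoreDB sources; the sketch also assumes Java 16 or newer for records.

```java
import java.util.BitSet;
import java.util.Objects;

/** Hypothetical block id: (container id, segment id, position within the segment). */
record BlockId(int containerId, int segmentId, int position) {

    /** Byte offset of the block inside its data segment file, assuming fixed-size blocks. */
    long byteOffset(int blockSize) {
        return (long) position * blockSize;
    }
}

/** Hypothetical in-memory view of a header segment. */
final class SegmentHeader {
    private final int blockCount; // size of the data segment in blocks
    private final int blockSize;  // number of bytes per block
    private final BitSet used;    // free memory bitmap: bit i set means block i is occupied

    SegmentHeader(int blockCount, int blockSize) {
        if (blockCount <= 0 || blockSize <= 0) {
            throw new IllegalArgumentException("blockCount and blockSize must be positive");
        }
        this.blockCount = blockCount;
        this.blockSize = blockSize;
        this.used = new BitSet(blockCount);
    }

    int blockSize() {
        return blockSize;
    }

    /** Marks the first free block as used and returns its position, or -1 if the segment is full. */
    int allocate() {
        int position = used.nextClearBit(0);
        if (position >= blockCount) {
            return -1;
        }
        used.set(position);
        return position;
    }

    /** Marks a block as free again. */
    void free(int position) {
        Objects.checkIndex(position, blockCount);
        used.clear(position);
    }
}
```

Because every block has the same size, the position within a segment is enough to compute where the block starts in the segment file, and the bitmap answers the question of which position to hand out next.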
What a file system has to abstract:
- Create, open and delete containers
- Write a specific block
- Read a specific block
- Append a specific block
- Delete a specific block
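The operations listed above could be expressed as a small Java interface. The following is only a sketch under assumptions: the names `BlockFileSystem` and `InMemoryFileSystem` are hypothetical, the implementation keeps blocks in memory instead of files, it uses a single segment (id 0) per container, and "open" is reduced to an existence check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Block id tuple from the terminology list (hypothetical Java form). */
record BlockId(int containerId, int segmentId, int position) {}

/** Hypothetical interface covering the listed operations. */
interface BlockFileSystem {
    void createContainer(int containerId);
    boolean openContainer(int containerId);
    void deleteContainer(int containerId);
    void write(BlockId id, byte[] data);
    byte[] read(BlockId id);
    BlockId append(int containerId, byte[] data);
    void delete(BlockId id);
}

/** In-memory stand-in: one segment (id 0) per container, no files involved. */
final class InMemoryFileSystem implements BlockFileSystem {
    private final int blockSize;
    private final Map<Integer, List<byte[]>> containers = new HashMap<>();

    InMemoryFileSystem(int blockSize) {
        this.blockSize = blockSize;
    }

    @Override public void createContainer(int containerId) {
        if (containers.putIfAbsent(containerId, new ArrayList<>()) != null) {
            throw new IllegalStateException("container exists: " + containerId);
        }
    }

    @Override public boolean openContainer(int containerId) {
        return containers.containsKey(containerId);
    }

    @Override public void deleteContainer(int containerId) {
        containers.remove(containerId);
    }

    @Override public void write(BlockId id, byte[] data) {
        blocksOf(id.containerId()).set(id.position(), checked(data));
    }

    /** Returns a copy of the block, or null if the block was deleted. */
    @Override public byte[] read(BlockId id) {
        byte[] block = blocksOf(id.containerId()).get(id.position());
        return block == null ? null : block.clone();
    }

    @Override public BlockId append(int containerId, byte[] data) {
        List<byte[]> blocks = blocksOf(containerId);
        blocks.add(checked(data));
        return new BlockId(containerId, 0, blocks.size() - 1);
    }

    /** Deleting keeps the position but drops the content. */
    @Override public void delete(BlockId id) {
        blocksOf(id.containerId()).set(id.position(), null);
    }

    private List<byte[]> blocksOf(int containerId) {
        List<byte[]> blocks = containers.get(containerId);
        if (blocks == null) {
            throw new IllegalStateException("no such container: " + containerId);
        }
        return blocks;
    }

    private byte[] checked(byte[] data) {
        if (data.length != blockSize) {
            throw new IllegalArgumentException("block must have " + blockSize + " bytes");
        }
        return data.clone();
    }
}
```

A file-based implementation would replace the lists with segment files, but callers would still only see block ids and fixed-size byte arrays.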
I will soon provide the first lines of source code for the file system implementation. The following topics should be taken into account:
- File System Concurrency
- Distributed File Systems
Distribution is somewhat related to concurrency, but here we will focus on thoughts regarding data distribution. A file system contains multiple containers. A container may be bound to a path. If you put container #I on a path which belongs to disk #1 and container #II on a path which belongs to disk #2, then you have also achieved a simple distribution of data. So the idea is to use a container as a partition. Currently we are not interested in why data is stored inside a particular container; this will be covered by the layers above.
Another distribution approach could be a more service-based one. Today everything is available in the cloud ;-), so it would also be possible (but not with the same performance) to build a web service on top of the file system. Then there could be a file system load balancer and a registry. When a file system starts up, it registers with the registry and is then taken into account by the load balancer. Load balancer rules determine which block is written to which file system. In distributed mode, file system #1 has only containers #1 and #2, whereas file system #2 has containers #3 and #4, and so on. So each file system has two partitions. In failover mode, each write request is forwarded to every registered file system. The read requests could be scheduled by using real load information or, at first, just round robin.
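The registry and load balancer idea might be sketched roughly as follows. Everything here is an assumption made for illustration: the class name `FileSystemRegistry`, identifying file systems by a plain string, container ids starting at 1, and a fixed number of containers per file system as the only load balancer rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical registry that also applies very simple load balancer rules. */
final class FileSystemRegistry {
    private final List<String> fileSystems = new ArrayList<>();
    private final AtomicInteger nextRead = new AtomicInteger();
    private final int containersPerFileSystem;

    FileSystemRegistry(int containersPerFileSystem) {
        if (containersPerFileSystem <= 0) {
            throw new IllegalArgumentException("containersPerFileSystem must be positive");
        }
        this.containersPerFileSystem = containersPerFileSystem;
    }

    /** Called by a file system when it starts up. */
    void register(String fileSystemName) {
        fileSystems.add(fileSystemName);
    }

    /**
     * Distributed mode: containers 1..k belong to the first registered
     * file system, k+1..2k to the second one, and so on.
     */
    String ownerOf(int containerId) {
        int index = (containerId - 1) / containersPerFileSystem;
        if (containerId < 1 || index >= fileSystems.size()) {
            throw new IllegalArgumentException("no file system for container " + containerId);
        }
        return fileSystems.get(index);
    }

    /** Failover mode: a write is forwarded to every registered file system. */
    List<String> writeTargets() {
        return List.copyOf(fileSystems);
    }

    /** Failover mode: reads are spread over the file systems by round robin. */
    String nextReadTarget() {
        if (fileSystems.isEmpty()) {
            throw new IllegalStateException("no file system registered");
        }
        int index = Math.floorMod(nextRead.getAndIncrement(), fileSystems.size());
        return fileSystems.get(index);
    }
}
```

With two registered file systems and two containers each, `ownerOf(3)` would pick the second file system, while `nextReadTarget()` alternates between both. Scheduling by real load information would replace the counter with a lookup of current load figures.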
I will soon publish some first (not yet tested) source code. After this code is tested and evaluated, the next layer will be the page buffer. I am really looking forward to writing the next blog post about page buffering and scheduling. Another important topic will be indexing and storage structures.
You're just about reinventing the old VMS filesystem/storage API (which can still be found in the mainframe world), a predecessor of the Unix/POSIX filesystem API, which was then trimmed down to the Plan9 FS/IPC API.
Why not just take the old VMS API, along with its segmented memory model (maybe later drop in HSM, etc.), or take Plan9's distributed/grid model?
IIRC, you just want (named) containers, that hold segments consisting of (arbitrary number of) equally-sized blocks, with block-based IO.
This is exactly the traditional VMS filesystem concept.
I'd suggest just writing a fast direct wrapper to the native posix filesystem via JNI, and then let the OS/kernel handle everything else (including buffer cache, etc).
Hi Mr. Weigelt. :-) Thank you for joining the discussion. Yes, the API is that of a file system. I would say even the Windows FS API looks a bit similar. However, this API is just the Java counterpart for accessing the operating system's one. So you are absolutely right. We will see in the chapter 'Page Buffer Layer' that an OS file system buffer is not suitable for database systems. We will especially miss the 'Pin' and 'Unpin' functions to support specific access patterns. Let's also add a forum or mailing list to the project for deeper discussions.
However, we should keep the 'Wrap VMS' idea using JNI in mind. It could be an alternative IFilesSystem implementation. It would be very interesting to benchmark both implementations: simple file-based access vs. VMS wrapper. Sounds cool!
I added some source code: http://sourceforge.net/p/jcoredb/code/ref/master~/