Filesystems Explained: Part 3 - RAID

Published on May 31, 2009 by Godji

Previous: Part 2 - Consistency and Journaling

A typical filesystem exists on a single block device. The filesystem is tied to its block device (usually a hard disk) in several ways:

1. The filesystem contains as much data as the size of the block device.
2. The data can only be accessed at the speed that the block device can sustain.
3. If the block device fails, the filesystem and all data on it are lost.

If the third one does not scare you, it should, because a hard disk is a mechanical device that can fail at any time for any reason.

There are two ways to remove the three limitations - decouple the concept of a block device from a physical hard disk, or redefine the filesystem to live on multiple block devices. RAID takes the former approach, while ZFS takes the latter.

RAID uses a set of block devices (hard disks) to implement a virtual block device (a RAID array), on which a filesystem can be placed. Depending on the type of RAID, the array can be larger, faster, and more reliable than a hard disk, or any combination of the three.

As a simple example, a RAID 1 is an array in which all blocks are replicated on N hard disks. (This is called data redundancy.) Assuming the disks are identical, the array has the same size and write speed as each drive, but it can read N times faster, and can sustain N - 1 disk failures. If any drive(s) in the array die, your data will be safe as long as there is at least one operational disk. You could then replace the failed disks with new empty disks, and the array would rebuild itself, in this case by copying the data from any healthy drive to the new one(s). Once the array is rebuilt, even if that healthy drive fails too, the data will be safe on the new disks. The cost for data redundancy is capacity - in this case N disks can store only one disk’s worth of data.

RAID is not a specific technology or implementation; rather it is a name given to any technology that combines several block devices into one to provide data redundancy or speed.

RAID Types

Different types of RAID include:
1. RAID 0 a.k.a. “scary RAID” needs at least 2 disks. Data is striped across N disks. Provides incredible speed benefits and no capacity cost, but if any single drive fails, all data is immediately gone, hence the nickname “scary RAID”.
2. RAID 1 needs at least 2 disks. Data is copied on all N disks. Provides maximum reliability (N - 1 drives can fail), but at a large capacity cost.
3. RAID 5 needs at least 3 disks. Data is striped across the disks, with one disk’s worth of capacity used for parity (in practice the parity blocks are spread across all disks rather than kept on one dedicated drive). If any single drive fails, the missing data or parity can be recalculated from the remaining drives. But if a second drive fails before the array has been rebuilt, all data is gone.
4. RAID 6 needs at least 4 disks. It is the same as RAID 5 but with two parity drives. It can tolerate the loss of two disks.
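
To make the trade-offs in the list above concrete, here is a minimal Python sketch that computes usable capacity and the number of disk failures each level survives. It assumes identical disks; the function name raid_summary and the 1000 GB disk size are made up for illustration.

```python
def raid_summary(level, n_disks, disk_size):
    """Return (usable_capacity, tolerated_disk_failures) for identical disks."""
    minimum_disks = {0: 2, 1: 2, 5: 3, 6: 4}
    if level not in minimum_disks:
        raise ValueError(f"unsupported RAID level: {level}")
    if n_disks < minimum_disks[level]:
        raise ValueError(f"RAID {level} needs at least {minimum_disks[level]} disks")
    if level == 0:
        return n_disks * disk_size, 0          # striping only, no redundancy
    if level == 1:
        return disk_size, n_disks - 1          # every disk holds a full copy
    if level == 5:
        return (n_disks - 1) * disk_size, 1    # one disk's worth of parity
    return (n_disks - 2) * disk_size, 2        # RAID 6: two disks' worth of parity


for level in (0, 1, 5, 6):
    capacity, failures = raid_summary(level, n_disks=4, disk_size=1000)
    print(f"RAID {level}: {capacity} GB usable, survives {failures} failed disk(s)")
```

With four disks, RAID 1 keeps only a quarter of the raw capacity but survives three failures, while RAID 0 keeps all of it and survives none.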

RAID can be provided either by a specialized hardware chip (hardware RAID), by the motherboard BIOS (hybrid RAID), or by the operating system (software RAID). A full discussion of their differences and which option is best in what cases would require its own article.

RAID and Data Safety

The key point of RAID is that although all hard disks eventually fail, the probability that several drives will fail at the same time is low. A RAID 5 array is normally enough to ensure the life of important data when combined with regular backups. This brings me to a very important point I could not overemphasize:

RAID does not replace backups.

When I first saw this statement, I questioned it. If RAID guarantees, with a very high probability, that your data will survive dying disks, then why should you worry about backing up?

Here is a list of things RAID will not save your data from:
1. Death of the entire array. This could happen, for instance, due to an electrical surge that fries several disks or due to an earthquake or a fire.
2. Death of the RAID controller, if there is one. This may or may not be fatal depending on the details of your RAID implementation, but it is worth contemplating.
3. Software bugs, a security breach, or human error. If you (or a hacker) ask the filesystem on a RAID to delete your files, it will. If the filesystem is destroyed by a software bug or for any other reason, the RAID will very reliably store a destroyed filesystem.
4. An out-of-sync RAID array.

Normally, a RAID array is in sync, or consistent*. In that state, the information stored on all disks constitutes a block device and all parity calculations are correct. Unfortunately sometimes one or more disks will contain wrong data and the array will be out of sync or inconsistent. Just like with file systems, this could be caused by power failures during writes, faulty hardware, software bugs, etc.

Consider a RAID 5 array of 3 disks (i.e. two disks for data, one for parity). Disk 1 and 2 contain bits, and these bits together form a block device with some data on it. But what if disk 3’s parity calculation is wrong? Is the parity information wrong, or does one of the first two disks contain the wrong data? There is no way to know, because both options are equally likely.
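
The dilemma can be shown in a few lines of Python. This is only a sketch: it assumes parity is the bytewise XOR of the data blocks (the conventional choice for single-parity RAID), and the disk contents and the helper xor_blocks are invented for the example.

```python
def xor_blocks(*blocks):
    """XOR equally sized byte strings together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


disk1 = b"HELLO, "
disk2 = b"WORLD!!"
parity = xor_blocks(disk1, disk2)      # what disk 3 should hold

# A disk that fails outright can be rebuilt from the two survivors.
assert xor_blocks(disk2, parity) == disk1
assert xor_blocks(disk1, parity) == disk2

# Now suppose disk 2 silently returns wrong data instead of failing.
corrupted_disk2 = b"WORLD!?"
mismatch = xor_blocks(disk1, corrupted_disk2) != parity

# The array sees only that the three disks disagree. Each possible
# "repair" is self-consistent, and nothing says which one is right.
candidates = {
    "disk 1 is wrong": xor_blocks(corrupted_disk2, parity),
    "disk 2 is wrong": xor_blocks(disk1, parity),
    "parity is wrong": xor_blocks(disk1, corrupted_disk2),
}
for guess, rewritten_block in candidates.items():
    print(guess, "->", rewritten_block)
```

Only the "disk 2 is wrong" guess restores the original data; the other two make the parity check pass while leaving the corruption in place.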

RAID implementations differ in the way they deal with the problem. RAID inconsistency is typically detected when one tries to read data, at which point it is usually too late to save the data. Most arrays run a periodic scrub, which reads all blocks on all disks and ensures that all parity calculations are correct. For blocks that are not, the array must decide which disk had the wrong data, which usually means choosing one disk at random. The blocks on that disk are then rewritten in such a way as to make the parity correct, although the data may be corrupted. While scrubbing is not a general solution, it can be used to ensure that there are no problems while the data can still be restored safely (i.e. before it is truly needed).
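
The detection half of a scrub is simple enough to sketch in Python. The version below assumes XOR parity and represents each stripe as a pair of (data blocks, stored parity); both the layout and the names parity_of and scrub are hypothetical, and a real implementation would read the blocks from the disks.

```python
from functools import reduce


def parity_of(blocks):
    """Bytewise XOR of equally sized data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub(stripes):
    """Return the indexes of stripes whose stored parity does not match."""
    return [
        index
        for index, (data_blocks, stored_parity) in enumerate(stripes)
        if parity_of(data_blocks) != stored_parity
    ]


stripes = [
    ([b"\x01\x02", b"\x10\x20"], b"\x11\x22"),  # consistent
    ([b"\xff\x00", b"\x0f\x0f"], b"\xf0\x0e"),  # parity should be f0 0f
]
bad_stripes = scrub(stripes)
print("inconsistent stripes:", bad_stripes)
```

Note that the scrub can only say which stripes are inconsistent. Deciding which disk to rewrite is the part that has no good answer.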

If your RAID implementation is particularly good, it supports a write intent bitmap, which is the RAID equivalent of a filesystem journal. This helps against inconsistencies caused by power loss during a write - by far the most common cause of out-of-sync RAID arrays.

If your RAID implementation is particularly bad, it will just read the data disks and not even check parity, until one of the data disks fails entirely. You will not even know there is a problem, and your data will be silently corrupted.

Conclusion

I have more to say about RAID - the differences between hardware and software RAID, and why it is critical that your RAID implementation is not proprietary - but that will wait for another day.

Despite the additional complexity it entails, RAID is invaluable. It can save you from sudden and catastrophic data loss due to disk failure, and that alone is worth the trouble. Just do not under any circumstances forget to back up your data.

Yet even if both RAID and filesystems were perfect (which they are not) and never had any inconsistencies or bugs, there is one more danger out there for your data - bit rot. In the final part of this article, I will tell you what that is and how to make it a non-issue with ZFS.

Footnotes

* The use of the same word - “consistency” - for RAID arrays and filesystems is intentional. The problem of RAID inconsistency is very similar to filesystem inconsistency.

Next: Part 4 - ZFS (coming soon)

Posted in Uncategorized

Programs and Poems

Published on April 7, 2009 by Godji

A friend of mine once read a poem and remarked: “I can’t believe a programmer wrote this!” At first it may sound strange that the same person can write both poems and computer programs. After all, we all know that programming is based on mathematics, which is cold and precise, and poetry is a form of art, which is warm and fuzzy. Are poetry and programming at the opposite ends of a spectrum?

Quite the contrary! Programming and poetry have remarkably many similarities. I will go as far as to say that they are almost the same.

Both are forms of writing that serve a function. A poem is written to convey a message, some emotion, or both, while the purpose of a program is to precisely define an algorithm that a computer can execute. Writing either verse or code requires a comparable creative process. The author has a set of constructs and constraints to work with - whether words, expressions, grammar, rhyme, and metaphors or keywords, expressions, functions, and loop constructs. He or she uses these basic building blocks to assemble larger pieces - stanzas or modules - which then work together to form poems or programs.

Each building block has its own set of requirements. Certain words can only be arranged in a given way to make a meaningful expression. A function has an interface that needs to be followed. In both cases, one searches to find the “best” way, for some definition of “best”, to make all the small pieces work together with the intended result. The outcome of this creative process is a complete work that hopefully serves its function.

Besides function, a poem or program is also characterized by its form. Form is essential: a fact that is obvious for poetry, but less obvious (yet just as true) for code. You can say that a program is beautiful or ugly just as you can say that about a poem. In both cases, defining what “beautiful” means is highly subjective and very difficult. (The only exception is that it is immediately obvious if a poem, or a program, is very badly written.)

Given the same task (“express this particular message/emotion” or “write a program that does this”), two people will invariably come up with unique solutions. This is the case because both types of writing have many possible rules that can be applied in all sorts of combinations. Such freedom allows (and requires) a certain amount of creativity to take advantage of. Mediocre authors will come up with solutions that are “just good enough” - a poem that gets its message across but with no impact on the reader, or a program that does what it must but in a suboptimal or difficult to understand way. Good programmers and poets, on the other hand, pay great attention to form. In extreme cases, form will take over function and a striking poem will express a slightly different meaning from the initial intention of the poet. Similarly, a very elegant algorithm might perform its task in a slightly different way in contrast to a straightforward, but less “beautiful” solution.

This is the reason why both programming and poetry can be so pleasant to write. If you do not enjoy either one, you can never be good at it. No poem or program is ever finished or perfect.

One very easy way to frustrate a programmer is to take away the beauty of programming - by throwing ugly code at them or forcing them to write ugly code with top-down decisions they are required to accept. That is what makes writing software in a team so difficult: what looks like good elegant code to one may poke out the eyes of another. Can you imagine several people writing a poem together and agreeing on the best way to go about it?

Often a good piece of writing, whether code or verse, will have no meaning by itself, because its meaning and exact purpose is dependent on its context, and only in the presence of the larger body of work can the fragment find its unique role. This concept of reusability, highly desirable and often required in programming, is surprisingly common in (great) poetry as well. Reading poetry and programs might require a very deep and thorough thought process to decipher why some piece is written exactly as it is written and why it appears exactly where it appears.

Every poem is written in some human language, and every program is written in some programming language. The former type of languages are vaguely defined*, with complex grammar and large vocabularies. Programming languages also have complex grammar, but on the other hand they have orders of magnitude fewer words. In fact, most programming boils down to defining one’s own vocabulary. Everything about a programming language is very precisely defined.

Still, just like most human languages, programming languages are equivalent in their expressiveness**. Every written work can be translated. More precisely, for any written work you can write, there is another written work in a different language that has the same function as yours. The latter is a translation of the former.

However, translation is a lossy process. Although it preserves function, it cannot preserve form. For poetry this is obvious - a translated poem certainly loses most or all of its beauty. In its original incarnation, it either used some of the peculiarities of its language which do not exist in the other language, or it was a bad poem to begin with. If you ever translate a poem, I think you should never call it by the original’s name, because it will have little in common with said original.

But a translated program loses its form, too. I remember learning Python with a heavy Java background - and what was a perfectly good Java design looked like an overdesigned ugly mess in Python (and it was). But then a good Python program directly translated into C++ would be the definition of chaos… and so on ad infinitum.

There are some differences between programs and poems, of course. For one, a typical program is much longer than a typical poem. The reason is that a poem is more dense in the sense that a sentence in a human language carries much more information per character (including emotion) than a statement in a program does. A secondary factor is the fact that programming languages, in being so much more formal, need to define everything very precisely, while human languages will get by with vague double meanings.

The similarities between poetry and programming are not well understood, partly because relatively few people are programmers, and partly because pure poetry is decreasing in popularity***. In addition, most programmers concern themselves entirely with function, and pay only as much attention to form as necessary to make their design workable. This is unfortunate, because the mindset required for writing good software and good poetry is so similar.

If you are a programmer, try writing a poem some time. If you have ever written a poem, check out one of the simpler programming languages and see if you like playing with it. If you have done neither, try either!

Footnotes

* Given a sentence in what looks like English, it is generally impossible to say whether it is in fact in English or not, because “English” has never been precisely defined. Otherwise we would have had very good natural language processing technology by now.

** There is a conjecture that every programming language that provides loops and branching is equivalent to every other programming language which does.

*** Music, which is how we experience most poetry today, is typically judged more by its musical qualities than as a written work.

Posted in Uncategorized

Newsflash: Your readers are not idiots! News at 11

Published on March 3, 2009 by Godji

Some websites, most notably news sites and marketing blogs, seem to think that their readers are morons. They highlight the key points of their text to the point where it becomes ridiculous and it no longer catches anyone’s attention, because everything else is already highlighted anyway.

I really have to wonder what goes on in the head of the editor in question:
1. My readers could never tell which part of the text matters!
2. I need to point my readers’ attention to this very important point!
3. If my readers have no time to read the whole article, I need to show them the high points quickly!

(OK, no more highlighting, you got the idea.)

Of course, all three come down to the same thing: the writer wants to tell us that some portions of the text are more important than others. Here is a list of things such writers (editors?) do not realize:
1. A single sentence or word does not tell me much outside of its context. I have to read the text around it anyway.
2. My attention span is good enough to allow me to notice your very very important point without you poking my eyes out.
3. Bold does not automatically create a stronger emotional impact.
4. There is a better way to guide a reader’s attention than highlighting, and it is covered in the first lesson of every writing class. It has something to do with topic sentences and the order of paragraphs in a written work.
5. When several things are highlighted, they no longer catch my attention, but distract me from other highlighted things.
6. If your text is so boring as to require you to scream where the interesting parts are, not having highlights is not the problem.
7. Too much highlighting = unreadable mess.

When a text does not look like a Christmas tree, it shows trust in and respect for your readers. It says “Dear reader, here is some text. I know you are capable of deciding on your own whether you want to read it, and assuming you do, where the important / interesting / sensational parts are.”

If you are considering whether to highlight some words or a sentence in your writing, you should follow this set of rules to make your decision:

Rules for Highlighting Text in a Written Work

1. Don’t.

Posted in Uncategorized

Filesystems Explained: Part 2 - Consistency and Journaling

Published on March 3, 2009 by Godji

Previous: Part 1 - What Is a Filesystem?

Filesystem Inconsistency

A filesystem can fail in one of three ways - inconsistency, block device failure, or bitrot.

A filesystem is just an algorithm to map blocks to files and vice versa. Not every sequence of block data constitutes a valid filesystem. When its blocks do not represent a valid input to the mapping algorithm, or contain wrong records about files, a filesystem is said to be inconsistent. Examples of inconsistency are a file that was created next month or data blocks that belong to no file. Filesystems provide tools that detect and correct inconsistencies. Sometimes the correction is harmless, such as resetting a future creation date to the current timestamp. Others herald data loss - such as declaring blocks that belong to no file to be free space.
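
As an illustration of the two examples above, here is a toy consistency check in Python. The data layout (a dictionary of files with a creation timestamp and a set of block numbers, plus a set of blocks marked as allocated) is entirely made up for the sketch and does not correspond to any real filesystem's structures.

```python
def find_inconsistencies(files, allocated_blocks, now):
    """Report future creation dates and allocated blocks owned by no file."""
    problems = []
    owned_blocks = set()
    for name, meta in sorted(files.items()):
        if meta["created"] > now:
            problems.append(f"{name}: created in the future")
        owned_blocks |= meta["blocks"]
    for block in sorted(allocated_blocks - owned_blocks):
        problems.append(f"block {block}: allocated but belongs to no file")
    return problems


files = {
    "hello.txt": {"created": 1_000, "blocks": {10, 11}},
    "notes.txt": {"created": 9_999, "blocks": {12}},
}
problems = find_inconsistencies(files, allocated_blocks={10, 11, 12, 13}, now=5_000)
for problem in problems:
    print(problem)
```

A real checker inspects far more than this, but the principle is the same: cross-check records that are supposed to agree and report the ones that do not.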

Inconsistency has many causes, but the most common one by far is improper shutdown. When a computer is powered off abruptly, it does not have a chance to perform cleanup on the blocks, such as writing unwritten buffers in memory to disk.

If detection tools are not run, inconsistency can go undetected for a long time. Running an inconsistent filesystem is very dangerous - accessing damaged files or directories can destroy them, or even destroy otherwise healthy files. For this reason, most operating systems run filesystem checks periodically or when they detect improper shutdown. But because correction can sometimes destroy data as well, the best way to combat inconsistency is to prevent it from happening in the first place, either with journaling or with copy-on-write.

Journaling

What happens when power goes out abruptly? Suppose you were writing a 5-megabyte file. The filesystem would either first make an entry for the file and then write the data, or first write the data and then make an entry. Suppose power goes out between these two operations.

In the first case, you have your 5-megabyte file, but the data in it will not be what you intended to write; it will be whatever happened to be in those blocks already. Worst of all, you will have no idea about the problem. In the second case, your data will be written… somewhere. The filesystem will have no idea where it is or even that such a file is supposed to exist. Either way, you lose.

Furthermore, a partial write to a directory might corrupt existing entries, the free block count, or something else, thus damaging files that you did not even touch.

A journaling filesystem has a special area of blocks, called the journal, which records what the filesystem intends to do. When you write your 5-megabyte file, the filesystem might (1) enter the fact that it wants to write the file into the journal, (2) write the file, (3) enter the fact that the file was written into the journal, (4) create the entry for it, and (5) clear the journal. If power goes out, the filesystem will first replay the journal the next time it runs. This means it will figure out what to do in a way that will leave you with a consistent filesystem. Consider what happens if power goes out between each pair of steps:

After (1) or (2): The filesystem knows it wanted to write a file, but even if it wrote something, that write was incomplete. It frees any blocks it may have allocated for the file and clears the journal. The file is gone, but the filesystem is OK.
After (3) or (4): The filesystem knows it wrote the file completely, but the entry may not exist or may be incorrect. Because the data is correct, the filesystem attempts to create a proper entry and ensure that there are no inconsistencies in that directory. If it succeeds, you have your file, and if it fails, it frees the blocks and deletes any partial entries. In the latter case you lose the file, but the filesystem is fine.
After (5): Everything was done successfully. The filesystem is OK. There is nothing to check or fix.
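
The three cases above can be expressed as a small decision procedure. The Python sketch below models only the five-step scheme described in this article, not any real filesystem's journal format; the record strings and function names are invented for illustration.

```python
def journal_after_crash(last_completed_step):
    """Journal contents on disk if power fails right after the given step."""
    journal = []
    if last_completed_step >= 1:
        journal.append("intent to write")   # step 1
    if last_completed_step >= 3:
        journal.append("data written")      # step 3
    if last_completed_step >= 5:
        journal.clear()                     # step 5
    return journal


def replay(journal, entry_can_be_rebuilt=True):
    """Decide what recovery does, given the records that reached the disk."""
    if not journal:
        return "nothing to fix"
    if "data written" not in journal:
        # Crash after step 1 or 2: the data may be incomplete.
        return "free blocks, file discarded"
    # Crash after step 3 or 4: the data is complete, the entry may not be.
    if entry_can_be_rebuilt:
        return "repair directory entry, file kept"
    return "free blocks and remove partial entry, file discarded"


for step in range(1, 6):
    print(f"crash after step {step}: {replay(journal_after_crash(step))}")
```

Every branch ends in a consistent state: the file is either fully present or fully absent.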

In all cases the filesystem is consistent after replaying the journal. You will either have a complete file with the correct data in it, or have nothing, but there will be no doubt if the data is correct. As an added benefit, the journal allows determining very quickly what may have gone wrong without running a full detection (which can easily take hours on multiterabyte filesystems).

Note that journaling is not perfect. In rare cases, the journal itself can become corrupted, making it useless at best or destructive at worst. There is an insignificantly small performance hit when journaling is enabled, and the blocks where the journal is will be written to repeatedly. This can wear out flash so never use journaling on cheap USB sticks*! Nevertheless, journaling is generally a very good thing.

Journaling is implemented on most filesystems used today, with the notable exceptions of FAT32 and ext2. Do not use either of these on a hard disk.

Of course, if a block device fails completely (and every hard disk eventually does), the entire filesystem goes with it. In the next part of this series, I will tell you about RAID, which is the classic safeguard against failing drives.

Footnotes

* Never ever store anything important on a USB stick unless you have another copy elsewhere.

Next: Part 3 - RAID

Posted in Uncategorized

Filesystems Explained: Part 1 - What Is a Filesystem?

Published on February 22, 2009 by Godji

Filesystems are a very important subject. One of the most important functions of any computer is, after all, to store data. Understanding how that works is important to every computer user for two reasons:

1. When you store data in a computer, you want to retrieve it later, despite the fact that both hardware and software can fail. Any discussion about reliability, however, requires a minimal set of terminology to be useful.

2. Knowledge of filesystems and related concepts is necessary in order to manage operating systems. From reinstalling the system you use to making clever and reliable system backups to making several systems live in peace on the same machine - it all requires understanding how storage concepts fit together.

This 4-part piece is my attempt at a clear, understandable explanation of filesystem-related concepts:

Part 1 explains the basic concepts that everyone should know - filesystems, partitions, and block devices.
Part 2 discusses some of the things that can go wrong with a filesystem. It introduces consistency and journaling.
Part 3 goes into somewhat more advanced concepts such as RAID.
Part 4 discusses a little-known but extremely powerful “special” filesystem - ZFS.

Block devices

A hard disk is a type of block device. A block device stores blocks of 512 bytes each. That is all a hard disk does. It does not store files, directories, or anything larger or smaller than 512 bytes. In fact, it has no such concepts at all. A hard disk can only understand two* basic commands: “read block number n”, which gives you the nth series of 512 bytes on the disk, and “write the following 512 bytes at block number n”, which writes data to the disk.
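
In code, the whole interface of a block device fits in a few lines. Here is a minimal in-memory model in Python; the class name MemoryBlockDevice and its method names are my own invention for this sketch, not any real driver API.

```python
BLOCK_SIZE = 512


class MemoryBlockDevice:
    """A pretend disk: numbered 512-byte blocks and two commands."""

    def __init__(self, block_count):
        self.block_count = block_count
        self._data = bytearray(block_count * BLOCK_SIZE)

    def _check(self, n):
        if not 0 <= n < self.block_count:
            raise IndexError(f"block {n} is outside the device")

    def read_block(self, n):
        """Return the 512 bytes stored in block n."""
        self._check(n)
        start = n * BLOCK_SIZE
        return bytes(self._data[start:start + BLOCK_SIZE])

    def write_block(self, n, data):
        """Overwrite block n with exactly 512 bytes."""
        self._check(n)
        if len(data) != BLOCK_SIZE:
            raise ValueError("a block device only accepts whole blocks")
        start = n * BLOCK_SIZE
        self._data[start:start + BLOCK_SIZE] = data


disk = MemoryBlockDevice(block_count=8)
disk.write_block(3, b"A" * BLOCK_SIZE)
print(disk.read_block(3)[:4])
```

Notice that nothing here knows about files or directories. Everything beyond whole blocks is somebody else's job.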

Filesystems

So how does one store files of arbitrary length, directories (also called folders), and their attributes - size, date, permissions, etc. - on a block device? This is the job of a filesystem. A filesystem is an algorithm which translates complex commands (e.g. “write 600 bytes into the file hello.txt, set its creation time to the time now, and make it readable only by me”) into block device read and write commands. In the example above, a filesystem would probably find two blocks that it has not used for anything else, write 600 bytes of data (leaving 424 bytes unchanged), read one or more blocks that store the contents of the directory the file is to be written in, and change and write back one of them to add a hello.txt entry with its permissions and creation date.
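
The block arithmetic in that example is easy to check. A short Python sketch, assuming 512-byte blocks as above (the helper name blocks_needed is made up):

```python
BLOCK_SIZE = 512


def blocks_needed(n_bytes):
    """Smallest number of whole blocks that can hold n_bytes."""
    return -(-n_bytes // BLOCK_SIZE)  # ceiling division


file_size = 600
block_count = blocks_needed(file_size)
untouched_bytes = block_count * BLOCK_SIZE - file_size
print(f"{file_size} bytes -> {block_count} blocks, {untouched_bytes} bytes left unchanged")
```

The 424 untouched bytes at the end of the second block are simply wasted, since a block cannot be shared by two files in this simple scheme.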

You may have heard the expression “to format a disk”. It simply means to create a new filesystem on a given block device. Creating a filesystem is a process specific to the filesystem, but it generally means modifying a few blocks to represent a new empty filesystem. During that process blocks from the previous filesystem (if any) are overwritten, which destroys any files that may have existed there.

Partitions

One block device can contain at most one filesystem. For various reasons it is useful to store multiple (smaller) filesystems on a single block device, which is achieved using partitions. (Note: I only describe the simplest way to implement partitions, called DOS partitions. It is used by Linux and Windows and is thus the most common method. Other systems such as BSD or Solaris have an additional layer of complexity called slices, which I do not discuss here.) Each partition is a block device in itself and has independent block numbers, i.e. each one begins counting from 0. Partitions do not overlap - although theoretically they can, I have yet to see anyone do that for any reason.
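
The independent block numbering amounts to adding an offset. The Python sketch below models the parent device as a plain list of 512-byte blocks; the Partition class and the partition boundaries are hypothetical and chosen only for the example.

```python
class Partition:
    """A window onto a parent device whose block numbers restart at 0."""

    def __init__(self, device_blocks, first_block, block_count):
        if first_block < 0 or block_count < 0:
            raise ValueError("negative partition bounds")
        if first_block + block_count > len(device_blocks):
            raise ValueError("partition does not fit on the device")
        self._device = device_blocks
        self._first = first_block
        self.block_count = block_count

    def _absolute(self, n):
        """Translate a partition-relative block number to a device one."""
        if not 0 <= n < self.block_count:
            raise IndexError(f"block {n} is outside the partition")
        return self._first + n

    def read_block(self, n):
        return self._device[self._absolute(n)]

    def write_block(self, n, data):
        self._device[self._absolute(n)] = data


disk = [bytes(512) for _ in range(100)]
first = Partition(disk, first_block=1, block_count=49)
second = Partition(disk, first_block=50, block_count=50)

second.write_block(0, b"B" * 512)   # block 0 of the partition...
print(disk[50][:4])                 # ...is block 50 of the disk
```

A filesystem placed on the second partition never sees blocks 0 to 49 of the disk, which is what keeps the two filesystems from stepping on each other.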

Under the DOS partitioning scheme, the first block of a block device (block 0) is special. It contains two things - the block device’s partition table and master boot record (MBR). The partition table has up to 4 entries. Each entry consists of a starting block (the first block of the partition), the ending block (the last block of the partition), an ID (a number that indicates what kind of filesystem is on the partition), and a flag indicating whether the partition is bootable**. Note that Linux ignores the ID and bootable flag - it recognizes the filesystem type automatically and can boot from every partition.
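
For the curious, here is a hedged Python sketch that decodes the four entries from block 0. It relies on the standard DOS/MBR layout: the table starts at byte 446, each entry is 16 bytes, and the block ends with the signature bytes 0x55 0xAA. On disk, each entry records the first block and a block count rather than the last block, so the sketch derives the ending block. The function name and the sample values are made up, and the sample block is built in memory rather than read from a real disk.

```python
import struct

BLOCK_SIZE = 512
TABLE_OFFSET = 446
ENTRY_SIZE = 16
# status byte, 3 legacy CHS bytes (skipped), type ID, 3 more CHS bytes
# (skipped), first block, block count; all little-endian
ENTRY_FORMAT = "<B3xB3xII"


def parse_partition_table(block0):
    """Return the non-empty primary partition entries found in block 0."""
    if len(block0) != BLOCK_SIZE:
        raise ValueError("block 0 must be exactly 512 bytes")
    if block0[510:512] != b"\x55\xaa":
        raise ValueError("missing 0x55AA signature")
    partitions = []
    for index in range(4):
        start = TABLE_OFFSET + index * ENTRY_SIZE
        status, type_id, first, count = struct.unpack(
            ENTRY_FORMAT, block0[start:start + ENTRY_SIZE]
        )
        if type_id == 0:  # type 0 marks an unused entry
            continue
        partitions.append({
            "number": index + 1,
            "bootable": status == 0x80,
            "type_id": type_id,
            "first_block": first,
            "last_block": first + count - 1,
        })
    return partitions


# Build a sample block 0 with a single bootable Linux (type 0x83) partition.
sample = bytearray(BLOCK_SIZE)
sample[TABLE_OFFSET:TABLE_OFFSET + ENTRY_SIZE] = struct.pack(
    ENTRY_FORMAT, 0x80, 0x83, 2048, 204800
)
sample[510:512] = b"\x55\xaa"
table = parse_partition_table(bytes(sample))
print(table)
```

The first 446 bytes of the block, which the sketch leaves as zeros, are where the master boot record's boot code lives.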

By the way, there is a way to have more than 4 partitions on a single block device by using an extended partition. While I will not go into details, the basic idea is that one of the four partitions (typically the last one) is designated to be extended and contains no filesystem itself. Other partitions reside inside of the extended partition, which begins with something that resembles a variable-length partition table. While you are likely to see this used often, I tend to avoid creating extended partitions myself because they are an ugly hack. In most cases you can easily fit your needs into 4 partitions, e.g. two operating systems, swap space, and data. If you need more than that, you should probably consider multiple physical hard drives.

On Linux-based systems, you can use the command fdisk -l to see what your partition table looks like.

Partition and filesystem names

An operating system needs a way to identify a partition, and thus the filesystem on it. There are two popular ways to do this.

The UNIX way implies a single filesystem tree starting at /, called root. Partitions (and all other devices) are nodes somewhere under the root. For example, hard drives under modern Linux are called /dev/sda, /dev/sdb and so on. Partitions are /dev/sda1, /dev/sda2, etc. on the first hard drive, /dev/sdb1, etc. on the second one, and so on. These nodes are files that are not actually stored anywhere, but generated by the operating system in order to provide access to the hard drives and partitions on them. A filesystem has a name different from the partition it is on, which is typically derived from the filesystem label, a string denoting the name of the filesystem. So a filesystem called Penguin, residing on any partition, would probably be accessible under /media/Penguin/.

The Windows way is entirely different. There is no root filesystem, and neither hard drives nor partitions have a file that refers to them. (Instead, they have special names.) Every filesystem gets a letter. For historical reasons, A: and B: are reserved for floppy disk drives, and C: is the partition the OS was booted from. Letters after C are assigned somewhat arbitrarily. Each letter is the root in the filesystem it denotes. Windows takes all filesystems on all devices it can find, and gives each a successive letter. This is a bad design for several reasons:

1. You can only have 24 filesystems (C-Z). If that sounds like a lot, consider that each network share also gets a letter.

2. To refer to a file, you need to know the letter of its filesystem.

3. The letter a filesystem will be assigned depends on the number of filesystems present on drives before the one in question.

To mitigate the disastrous effect of 2. and 3., letters are made persistent only for devices present during installation (I think) and only for non-removable devices (I think) such as hard drives, and (surprisingly) optical drives. So if you want that CD-ROM to always be D:, make sure the drive is plugged in during installation…

Popular filesystems

There are hundreds of different filesystems, but only a few are in widespread use. Here are descriptions of the most popular ones.

FAT and its variant FAT32 are the most basic filesystems still in use. Although FAT32 is increasingly rare on computers, it is the de facto standard on USB sticks, memory cards in cameras and phones, and anywhere else that simplicity is the most important goal. FAT employs no tricks to help the filesystem remain consistent, and its performance quickly degrades as files are written, deleted, and replaced by other files. It has no advanced features whatsoever. It has significant limitations; on FAT32, for example, files cannot be larger than 4 gigabytes. I recommend against using FAT32 for data that is either important at all or needs to be accessed and changed often.

NTFS is the filesystem used on Windows (NT4, 2000, XP, Vista, 7, and probably future versions). It has many features and is relatively fast. The main problem with NTFS is that its specification is secret - and while it has been largely reverse-engineered, in theory only Microsoft (and any corporation that pays them) knows exactly how it works. Linux can only read NTFS files - but not write to NTFS, unless an additional piece of software is used. It is also very Windows-centric, and while it supports Windows concepts such as Windows permissions, it does not support the (standardized) UNIX permissions. For these reasons, I only recommend NTFS if you use Windows exclusively. Recent Windows versions require NTFS for their C: partition.

The EXT family of filesystems are the most widely used filesystems on Linux. The family includes ext1, ext2, ext3, and ext4. Nobody at all uses ext1 anymore. ext2 is a very reliable filesystem that has stood the test of time (15 years at the time of this writing). ext3, the de facto standard Linux filesystem, is practically the same as ext2 except that it also supports journaling. While there are many filesystems that are faster than ext2 and ext3, I believe that reliability is a more important concern when it comes to data storage, and I have found ext3 to be very dependable. Because ext2 and ext3 are so common, there are implementations for all major (and many minor) operating systems. Thus, I recommend ext3 (or ext4) for important data unless performance is critical. I also recommend it as the shared data filesystem if you use both Linux and another operating system. Even Windows can be made to use ext3 by installing a driver for it.

ext4 is an evolution over ext3 that adds several desirable features that ext3 does not have. It is very new; the first production-quality implementation is part of the 2.6.28 Linux kernel, released as a Christmas present to all geeks in 2008. This version of Linux, and therefore ext4, is only just making it into distributions. It is as reliable as ext3, and I already use it for most of my filesystems. The only reason to stick to ext3 is interoperability with older Linux that does not support ext4 yet, or with other operating systems.

There are many other less popular filesystems such as reiser4, XFS, and more. They are all faster than ext3 and equally or less reliable than the latter (depending on who you ask). I have personally lost data to both, so I recommend against either of them. Btrfs is another ext3 replacement currently in heavy development. It has many great features and is the Linux answer to ZFS. It looks very promising and might some day become the best filesystem on Linux, although I cannot say if it will be as good as ZFS.

ZFS is, in my opinion, the ultimate filesystem. It is used on Solaris, although there are experimental and not especially reliable implementations for Linux and some BSDs. Strictly speaking, ZFS goes beyond the definition of a filesystem by incorporating volume management and RAID into itself. I will discuss both RAID and ZFS itself in subsequent parts of this article.

There are many other filesystems, such as the ones used on MacOS X or the BSDs. There are literally tens of experimental ones, too. In this article I only mention the ones I have actually used, and I leave further research as an exercise for the reader.

Conclusion

These are the basic concepts everyone should know about filesystems. If you want to know more, read on to the next part, where I explain things that can go wrong with a filesystem, as well as concepts such as consistency and journaling.

Footnotes

* This is an oversimplification. Modern hard disks have other commands such as “write everything currently buffered in your internal cache now” or “check yourself for errors”. But the read and write commands are theoretically all you need to have a block device.

** To boot a system from a block device means to start the computer and execute the operating system found on that block device.

Next: Part 2 - Consistency and Journaling

Posted in Uncategorized
