Monday, January 22, 2007

ZFS features

Here's a post I just entered on the Nexenta/gnusolaris Beginners Forum that has some good info about ZFS. Apparently the formatting got eaten on the mailing list so I'm reposting it here:


Hi all,

Can I have it installed concurrently with Linux and allocate Linux partitions to the RAID-Z, or does RAID-Z take whole disks?


There are two "layers" of partitions in OpenSolaris: the first is managed with the "fdisk" utility, the second with the "format" utility - those partitions are also known as "slices". I'm not an expert, but I believe the fdisk-managed partitions are the pieces that Linux/Windows/etc. see. You would first allocate one of those partitions to Solaris, and from there you can further split that fdisk partition into root/swap/data "slices". The Linux partitions you have now should be visible via the "fdisk" command.

According to some of the ZFS FAQ/wiki resources, ZFS works "better" if it manages the entire disk; however, it will work just fine managing either "partitions" or "slices". You can even make a ZFS pool out of individual files.
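For example - just a quick sketch, and the file names and pool name here are made up - you can build a throwaway pool out of plain files (they need absolute paths and at least 64MB each):

root@medb01:~# mkfile 128m /var/tmp/zfile1 /var/tmp/zfile2
root@medb01:~# zpool create filepool mirror /var/tmp/zfile1 /var/tmp/zfile2
root@medb01:~# zpool status filepool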

Here is an example of one of my disks. There is one "fdisk" partition, and a few "slices":


root@medb01:~# fdisk -g /dev/rdsk/c0t0d0p0
* Label geometry for device /dev/rdsk/c0t0d0p0
*     PCYL     NCYL     ACYL     BCYL     NHEAD     NSECT     SECSIZ
     48638    48638        2        0       255        63        512

root@medb01:~# prtvtoc /dev/rdsk/c0t0d0p0
* /dev/rdsk/c0t0d0p0 partition map
*
* Dimensions:
* 512 bytes/sector
* 63 sectors/track
* 255 tracks/cylinder
* 16065 sectors/cylinder
* 48640 cylinders
* 48638 accessible cylinders
*
* Flags:
* 1: unmountable
* 10: read-only
*
*                             First       Sector      Last
* Partition  Tag  Flags      Sector        Count     Sector  Mount Directory
       0      0    00         16065      8401995    8418059
       1      0    00       8418060     16787925   25205984
       2      5    01             0    781369470  781369469
       6      0    00      25205985    756147420  781353404
       7      0    00     781353405        16065  781369469
       8      1    01             0        16065      16064


Note that in the following examples, I'll create ZFS pools with "c0tXd0s6", that is, slice 6 from each disk's Solaris partition table.


Alternatively, can I mount my Linux RAID partitions on Nexenta, at least for migration purposes? What about the LVM disks?


As far as I know, there is no LVM support or Linux filesystem driver built into OpenSolaris/Nexenta - i.e. you could not just "mount -t ext3" a Linux filesystem and be able to read it. Since you've mentioned that you're running VMware Server, it may be possible to have both guest operating systems running and copy the data over the network. Also, Nexenta likely won't know about LVM-managed volumes; the source would have to be a real honest-to-goodness partition.
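For example (hostnames and paths here are made up, and I haven't tested this exact pipeline), something like tar-over-ssh or rsync from the Linux guest into a ZFS filesystem would do it:

linux-guest$ tar cf - /data | ssh root@medb01 'cd /u01/data && tar xf -'
linux-guest$ rsync -a /data/ root@medb01:/u01/data/      # or rsync, if it's installed on both ends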


What about RAID-Z features:
Can I hot-swap a defective disk?


This should be possible, assuming that your hardware supports it. You may need to force a rescan of the devices after you swap the disk - check devfsadm. Reintegrating the new disk into the pool is done with "zpool replace pool device [new_device]".
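Something along these lines - the device names are made up, and I haven't actually pulled a drive on this box:

root@medb01:~# devfsadm                               # rebuild /dev links so the new disk shows up
root@medb01:~# zpool replace u01 c0t2d0s6 c0t4d0s6    # swap the failed slice for the replacement
root@medb01:~# zpool status u01                       # shows the resilver in progress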


Can I add a disk to the server and tell it to enlarge the pool, to make more space available on the preexisting RAID?


Yes, with a caveat - ZFS doesn't do any magic stripe re-balancing. If you have a 4-disk raidz pool and add a single disk, what you really have is a 4-disk raidz with one disk tacked on at the end with no redundancy. Best practice is to add space in "chunks" of several disks. Fortunately I am in the middle of building a Nexenta-based box with 4 SATA drives, so I can play around with some of the commands and show you the output:

Here is a 4-disk zpool using raidZ:


root@medb01:~# zpool create u01 raidz c0t0d0s6 c0t1d0s6 c0t2d0s6 c0t3d0s6
root@medb01:~# zpool status u01
  pool: u01
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        u01           ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0s6  ONLINE       0     0     0
            c0t1d0s6  ONLINE       0     0     0
            c0t2d0s6  ONLINE       0     0     0
            c0t3d0s6  ONLINE       0     0     0



Here is a 3-disk raidZ pool that I "grow" by adding a single additional disk. Note the subtle indentation difference on c0t3d0s6 in this example; it is not part of the original raidz1 and is just a standalone disk in the pool.


root@medb01:~# zpool destroy u01
root@medb01:~# zpool create u01 raidz c0t0d0s6 c0t1d0s6 c0t2d0s6
root@medb01:~# zpool add u01 c0t3d0s6
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses raidz and new vdev is disk
root@medb01:~# zpool add -f u01 c0t3d0s6
root@medb01:~# zpool status u01
  pool: u01
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        u01           ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0s6  ONLINE       0     0     0
            c0t1d0s6  ONLINE       0     0     0
            c0t2d0s6  ONLINE       0     0     0
          c0t3d0s6    ONLINE       0     0     0




Here is an example of adding space in "chunks"; note that the size of the pool differs in the "zpool list" output before and after.


root@medb01:~# zpool destroy u01
root@medb01:~# zpool create u01 mirror c0t0d0s6 c0t1d0s6
root@medb01:~# zpool list u01
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
u01                     360G   53.5K    360G     0%  ONLINE     -
root@medb01:~# zpool add u01 mirror c0t2d0s6 c0t3d0s6
root@medb01:~# zpool list u01
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
u01                     720G    190K    720G     0%  ONLINE     -
root@medb01:~# zpool status u01
  pool: u01
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        u01           ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s6  ONLINE       0     0     0
            c0t1d0s6  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t2d0s6  ONLINE       0     0     0
            c0t3d0s6  ONLINE       0     0     0


PS, doing it this way appears to stripe writes across the two mirrored "subvolumes".
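If you want to eyeball that striping yourself, "zpool iostat -v" breaks the I/O out per vdev; during a big write you should see traffic on both mirrors:

root@medb01:~# zpool iostat -v u01 5     # per-vdev bandwidth, sampled every 5 seconds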


Does it have a facility similar to LVM, where I can create 'logical volumes' on top of the RAID and allocate/deallocate space as needed for flexible storage management (without putting the machine offline)?


Yes. There are two layers in ZFS: pool management, handled through the "zpool" command, and filesystem management, handled through the "zfs" command. Individual filesystems are created as subdirectories of the base pool, or can be relocated with the "zfs set mountpoint" option if you prefer. Here I create a ZFS filesystem at /u01/opt with a 100MB quota, and then increase the quota to 250MB.


root@medb01:~# zfs create -o quota=100M u01/opt
root@medb01:~# df -k /u01 /u01/opt
Filesystem            kbytes    used     avail capacity  Mounted on
u01                743178240      26 743178105     1%    /u01
u01/opt               102400      24    102375     1%    /u01/opt
root@medb01:~# zfs set quota=250m u01/opt
root@medb01:~# df -k /u01 /u01/opt
Filesystem            kbytes    used     avail capacity  Mounted on
u01                743178240      26 743178105     1%    /u01
u01/opt               256000      24    255975     1%    /u01/opt
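And since I mentioned "zfs set mountpoint" above, relocating a filesystem is just another property change (the path here is made up); ZFS unmounts it and remounts it at the new location for you:

root@medb01:~# zfs set mountpoint=/opt/app u01/opt
root@medb01:~# df -k /opt/app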


Also, things like atime updates, compression, etc. can be set on a per-filesystem basis.
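For instance (the particular settings here are just illustrative):

root@medb01:~# zfs set compression=on u01/opt     # compress new writes on this filesystem only
root@medb01:~# zfs set atime=off u01/opt          # skip access-time updates on this filesystem
root@medb01:~# zfs get compression,atime,quota u01/opt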



Can I do fancy stuff like plug an e-SATA disk into my machine and tell it to 'ghost' a 'logical volume' on-the-fly, online, without unmounting the volume?


Yes, this is possible. ZFS supports "snapshots" - moment-in-time copies of an entire ZFS filesystem. ZFS also supports "send" and "receive" of a snapshot, so you can take that moment-in-time copy of your filesystem and replicate it somewhere else (or just leave the snapshot lying around for recovery purposes).
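As a quick aside (the snapshot name here is made up), getting back to a snapshot you left lying around is a one-liner:

root@medb01:/# zfs snapshot u01/data@before_change
root@medb01:/# zfs rollback u01/data@before_change   # puts the filesystem back to the snapshot's state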

The procedure would be to create a ZFS pool on your external drive, and then "zpool import" that pool each time you plug it in. Then create a snapshot of your filesystem and "send" it to the external drive, like so. (I don't have an external drive to import, so I'll just create two pools.) I test by creating a filesystem, creating a file in that filesystem, then snapshotting and sending that snapshot to a different pool. Note that the file I created exists in the destination when I'm done.


root@medb01:/# zpool destroy u01
root@medb01:/# zpool destroy u02
root@medb01:/# zpool create u01 mirror c0t0d0s6 c0t1d0s6
root@medb01:/# zpool create u02 mirror c0t2d0s6 c0t3d0s6
root@medb01:/# zfs create u01/data
root@medb01:/# echo "test test test" > /u01/data/testfile.txt
root@medb01:/# zfs snapshot u01/data@send_test
root@medb01:/# zfs send u01/data@send_test | zfs receive u02/u01_copy
root@medb01:/# ls -l /u02/u01_copy
total 1
-rw-r--r-- 1 root root 15 Jan 23 04:49 testfile.txt
root@medb01:/# cat /u02/u01_copy/testfile.txt
test test test
root@medb01:/#
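With a real external drive, the plug/unplug cycle plus an incremental refresh would look roughly like this (the second snapshot name is made up, and I haven't run this exact sequence here):

root@medb01:/# zpool export u02        # detach the pool cleanly before unplugging the drive
root@medb01:/# zpool import u02        # bring it back after plugging the drive in again
root@medb01:/# zfs snapshot u01/data@send_test2
root@medb01:/# zfs send -i send_test u01/data@send_test2 | zfs receive -F u02/u01_copy   # incremental update of the copy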


Hope all this helps (and maybe makes it into the wiki too :-) )
