Thursday, May 19, 2016

GlusterFS in a simple way

Here is the story of how I managed to install a two-node GlusterFS setup on CentOS, plus one client for test purposes.
In my case the hostnames and IPs were:

192.168.183.235 s1
192.168.183.236 s2
192.168.183.237 c1

Append these lines to the end of /etc/hosts to make sure that simple name resolution works.
Execute the following on both servers:

rpm -ivh  http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm 
wget  -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/3.7/3.7.5/CentOS/glusterfs-epel.repo 
yum -y install glusterfs glusterfs-fuse glusterfs-server

There is no need to install any of the Samba packages if you don't intend to use SMB.

systemctl enable glusterd.service
Created symlink from /etc/systemd/system/multi-user.target.wants/glusterd.service to /usr/lib/systemd/system/glusterd.service.

Both servers had a second 20 GB disk named sdb. I created two LVs for the two bricks.
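The volume group itself has to exist on both servers before the LVs can be created; the transcript below only shows vgcreate on s1, so here is a sketch of the equivalent step on s2 (assuming /dev/sdb is the spare disk there too):

# run on s2 as root; vgcreate initializes /dev/sdb as an LVM physical volume implicitly
pvcreate /dev/sdb
vgcreate glustervg /dev/sdb
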

[root@s2 ~]# lvcreate -L 9G -n brick2 glustervg
 Logical volume "brick2" created.
[root@s2 ~]# lvcreate -L 9G -n brick1 glustervg
 Logical volume "brick1" created.
[root@s1 ~]# vgcreate glustervg /dev/sdb
 Volume group "glustervg" successfully created
[root@s1 ~]# lvcreate -L 9G -n brick2 glustervg
 Logical volume "brick2" created.
[root@s1 ~]# lvcreate -L 9G -n brick1 glustervg
 Logical volume "brick1" created.
[root@s2 ~]# pvdisplay

  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               glustervg
  PV Size               20.00 GiB / not usable 4.00 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              5119
  Free PE               511
  Allocated PE          4608
  PV UUID               filZyX-wR7W-luFX-Asyn-fYA3-f7tf-q4xGyU
[...]

[root@s2 ~]# lvdisplay

  --- Logical volume ---
  LV Path                /dev/glustervg/brick2
  LV Name                brick2
  VG Name                glustervg
  LV UUID                Rx3FPi-S3ps-x3Z0-FZrU-a2tq-IxS0-4gD2YQ
  LV Write Access        read/write
  LV Creation host, time s2, 2016-05-18 16:02:41 +0200
  LV Status              available
  # open                 0
  LV Size                9.00 GiB
  Current LE             2304
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:3

  --- Logical volume ---
  LV Path                /dev/glustervg/brick1
  LV Name                brick1
  VG Name                glustervg
  LV UUID                P5slcZ-dC7R-iFWv-e0pY-rvyb-YrPm-FM7YuP
  LV Write Access        read/write
  LV Creation host, time s2, 2016-05-18 16:02:43 +0200
  LV Status              available
  # open                 0
  LV Size                9.00 GiB
  Current LE             2304
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:4
[...]

 

[root@s1 ~]# lvdisplay
  --- Logical volume ---
  LV Path                /dev/glustervg/brick2
  LV Name                brick2
  VG Name                glustervg
  LV UUID                7yC2Wl-0lCJ-b7WZ-rgy4-4BMl-mT0I-CUtiM2
  LV Write Access        read/write
  LV Creation host, time s1, 2016-05-18 16:01:56 +0200
  LV Status              available
  # open                 0
  LV Size                9.00 GiB
  Current LE             2304
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:2

  --- Logical volume ---
  LV Path                /dev/glustervg/brick1
  LV Name                brick1
  VG Name                glustervg
  LV UUID                X6fzwM-qdRi-BNKH-63fa-q2O9-jvNw-u2geA2
  LV Write Access        read/write
  LV Creation host, time s1, 2016-05-18 16:02:05 +0200
  LV Status              available
  # open                 0
  LV Size                9.00 GiB
  Current LE             2304
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:3
[...]
 

[root@s1 ~]# mkfs.xfs /dev/glustervg/brick1
 

meta-data=/dev/glustervg/brick1  isize=256    agcount=4, agsize=589824 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=2359296, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


[root@s1 ~]# mkfs.xfs /dev/glustervg/brick2

meta-data=/dev/glustervg/brick2  isize=256    agcount=4, agsize=589824 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=2359296, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
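
The bricks have to be formatted on s2 as well (the output is the same as above), so only a sketch of the commands run there:

mkfs.xfs /dev/glustervg/brick1
mkfs.xfs /dev/glustervg/brick2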


[root@s1 ~]# mkdir -p /gluster/brick{1,2}
[root@s2 ~]# mkdir -p /gluster/brick{1,2}
[root@s1 ~]# mount /dev/glustervg/brick1 /gluster/brick1 && mount /dev/glustervg/brick2 /gluster/brick2
[root@s2 ~]# mount /dev/glustervg/brick1 /gluster/brick1 && mount /dev/glustervg/brick2 /gluster/brick2



Add the following lines to the end of /etc/fstab on both servers:


/dev/mapper/glustervg-brick1 /gluster/brick1 xfs rw,relatime,seclabel,attr2,inode64,noquota 0 0
/dev/mapper/glustervg-brick2 /gluster/brick2 xfs rw,relatime,seclabel,attr2,inode64,noquota 0 0
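
To check the entries without a reboot, the bricks can be unmounted and remounted straight from fstab (a quick sanity check, not part of the original session):

umount /gluster/brick1 /gluster/brick2
mount -a
df -h /gluster/brick1 /gluster/brick2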


[root@s1 etc]# systemctl start glusterd.service

Making sure it is running:
[root@s1 etc]# ps ax|grep gluster

 1010 ?        Ssl    0:00 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

[root@s1 etc]# gluster peer probe s2
peer probe: success.


[root@s2 etc]# gluster peer status
Number of Peers: 1
Hostname: 192.168.183.235
Uuid: f5bdc3f3-0b43-4a83-86c1-c174594566b9
State: Peer in Cluster (Connected)


[root@s1 etc]# gluster pool list
UUID                                    Hostname        State
01cf8a70-d00f-487f-875e-9e38d4529b57    s2              Connected
f5bdc3f3-0b43-4a83-86c1-c174594566b9    localhost       Connected

[root@s1 etc]# gluster volume status
No volumes present

[root@s2 etc]# gluster volume info
No volumes present

[root@s1 etc]# mkdir /gluster/brick1/mpoint1
[root@s2 etc]# mkdir /gluster/brick1/mpoint1
[root@s1 gluster]# gluster volume create myvol1 replica 2 transport tcp s1:/gluster/brick1/mpoint1 s2:/gluster/brick1/mpoint1

volume create: myvol1: failed: Staging failed on s2. Error: Host s1 is not in 'Peer in Cluster' state

Ooooops....
[root@s2 glusterfs]# ping s1
ping: unknown host s1

I forgot to check name resolution on s2. When I fixed this and tried to create the volume again, I got:
[root@s1 glusterfs]# gluster volume create myvol1 replica 2 transport tcp s1:/gluster/brick1/mpoint1 s2:/gluster/brick1/mpoint1
volume create: myvol1: failed: /gluster/brick1/mpoint1 is already part of a volume
 
 WTF ??
[root@s1 glusterfs]# gluster volume get myvol1 all
volume get option: failed: Volume myvol1 does not exist
[root@s1 glusterfs]# gluster
gluster>
exit         global       help         nfs-ganesha  peer         pool         quit         snapshot     system::     volume
gluster> volume
add-brick      bitrot         delete         heal           inode-quota    profile        remove-brick   set            status         tier
attach-tier    clear-locks    detach-tier    help           list           quota          replace-brick  start          stop           top
barrier        create         get            info           log            rebalance      reset          statedump      sync

gluster> volume l
list  log
gluster> volume list
No volumes present in cluster

That's odd! Hmm. I thought it'd work: 
[root@s1 /]# rm -rf /gluster/brick1/mpoint1
[root@s1 /]# gluster volume create myvol1 replica 2 transport tcp s1:/gluster/brick1/mpoint1 s2:/gluster/brick1/mpoint1
volume create: myvol1: success: please start the volume to access data

[root@s1 /]# gluster volume list

myvol1

Yep. Success. Phuhh.
[root@s1 /]# gluster volume start myvol1
volume start: myvol1: success
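
For the record: the earlier "already part of a volume" error most likely came from the first, half-failed create attempt, which had already stamped the brick directory with GlusterFS extended attributes. Instead of deleting the directory, the markers can also be cleared by hand (a hedged sketch for the affected brick; setfattr comes from the attr package):

setfattr -x trusted.glusterfs.volume-id /gluster/brick1/mpoint1   # remove the volume-id marker
setfattr -x trusted.gfid /gluster/brick1/mpoint1                  # remove the gfid marker
rm -rf /gluster/brick1/mpoint1/.glusterfs                         # remove gluster's internal metadata directory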

[root@s2 etc]# gluster volume list

myvol1
[root@s2 etc]# gluster volume status
Status of volume: myvol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick s1:/gluster/brick1/mpoint1            49152     0          Y       2528
Brick s2:/gluster/brick1/mpoint1            49152     0          Y       10033
NFS Server on localhost                     2049      0          Y       10054
Self-heal Daemon on localhost               N/A       N/A        Y       10061
NFS Server on 192.168.183.235               2049      0          Y       2550
Self-heal Daemon on 192.168.183.235         N/A       N/A        Y       2555

Task Status of Volume myvol1
------------------------------------------------------------------------------
There are no active volume tasks

[root@s1 ~]# gluster volume create myvol2 s1:/gluster/brick2/mpoint2 s2:/gluster/brick2/mpoint2  force
volume create: myvol2: success: please start the volume to access data
[root@s1 ~]# gluster volume start myvol2
volume start: myvol2: success
[root@s1 ~]# gluster volume info
Volume Name: myvol1
Type: Replicate
Volume ID: 633b765b-c630-4007-91ca-dc42714bead4
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: s1:/gluster/brick1/mpoint1
Brick2: s2:/gluster/brick1/mpoint1
Options Reconfigured:
performance.readdir-ahead: on

Volume Name: myvol2
Type: Distribute
Volume ID: ebfa9134-0e6a-40be-8045-5b16436b88ed
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: s1:/gluster/brick2/mpoint2
Brick2: s2:/gluster/brick2/mpoint2
Options Reconfigured:
performance.readdir-ahead: on

On the client:

[root@c1 ~]# wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo
[...]
[root@c1 ~]# yum -y install glusterfs glusterfs-fuse
[....]
[root@c1 ~]# mkdir  /g{1,2}
[root@c1 ~]# mount.glusterfs s1:/myvol1 /g1
[root@c1 ~]# mount.glusterfs s1:/myvol2 /g2
[root@c1 ~]# mount
[...]
s1:/myvol1 on /g1 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
s2:/myvol2 on /g2 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
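
To make the client mounts survive a reboot, fstab entries along these lines could be added on c1 (a sketch; _netdev delays mounting until the network is up):

s1:/myvol1  /g1  glusterfs  defaults,_netdev  0 0
s1:/myvol2  /g2  glusterfs  defaults,_netdev  0 0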
[root@c1 ]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root   28G  1.1G   27G   4% /
devtmpfs                 422M     0  422M   0% /dev
tmpfs                    431M     0  431M   0% /dev/shm
tmpfs                    431M  5.7M  426M   2% /run
tmpfs                    431M     0  431M   0% /sys/fs/cgroup
/dev/sda1                494M  164M  331M  34% /boot
tmpfs                     87M     0   87M   0% /run/user/0
s1:/myvol1               9.0G   34M  9.0G   1% /g1
s2:/myvol2                18G   66M   18G   1% /g2

myvol1 shows 9G, because it is replicated (think RAID1 over the network: 9G mirrored to 9G), while myvol2 shows 18G, because it is distributed (think JBOD over the network: 9G+9G concatenated).
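
A quick way to see the difference from the client (a sketch, not part of the original session): a file written to the replicated volume shows up in brick1 on both servers, while a file written to the distributed volume lands in brick2 on only one of them.

# on c1
touch /g1/replicated.txt /g2/distributed.txt
# on s1 and on s2
ls /gluster/brick1/mpoint1   # replicated.txt is there on both servers
ls /gluster/brick2/mpoint2   # distributed.txt is there on exactly one server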

What is the difference between distributing and striping? Here are two short snippets from the glusterhacker blog:
Distribute: A distribute volume is one in which all the data of the volume is distributed throughout the bricks. Based on an algorithm that takes into account the size available in each brick, the data will be stored in any one of the available bricks. [...] The default volume type is distribute, hence my myvol2 became distributed.
Stripe: A stripe volume is one in which the data being stored in the backend is striped into units of a particular size among the bricks. The default unit size is 128KB, but it's configurable. If we create a striped volume of stripe count 3 and then create a 300KB file at the mount point, the first 128KB will be stored in the first sub-volume (a brick in our case), the next 128KB in the second, and the remaining 44KB in the third. The number of bricks should be a multiple of the stripe count.
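
For comparison, a striped volume would be created with the same 3.7-era CLI like this (a hypothetical example; mpoint3 is a made-up brick directory, and with two bricks the stripe count has to be 2):

gluster volume create mystriped stripe 2 transport tcp s1:/gluster/brick2/mpoint3 s2:/gluster/brick2/mpoint3 force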

The very usable official howto is here.
   
Performance test, split brain, to be continued....
