Data Deduplication ?


Postby sloony67 » Fri Jan 22, 2010 10:59 pm

Just kidding .. :lol:
Last edited by sloony67 on Sat Feb 20, 2010 6:40 am, edited 1 time in total.
sloony67
Member
 
Posts: 10
Joined: Thu Jan 14, 2010 9:55 pm

Re: Data deduplication ?

Postby nia » Sat Jan 23, 2010 12:51 am

Nice wish actually ... Hardware-based target deduplication is becoming very popular with enterprise VTLs nowadays ..
~nia
nia
Forum Founder
 
Posts: 273
Joined: Sat Dec 12, 2009 12:51 pm
Location: USA

Re: Data deduplication ?

Postby cup » Thu Feb 18, 2010 4:03 am

Together with lessfs http://www.lessfs.com/wordpress/ may work.
Or zfs deduplication.
8-)
cup
Registered
 
Posts: 8
Joined: Thu Feb 18, 2010 3:59 am

Re: Data deduplication ?

Postby nia » Fri Feb 19, 2010 1:51 pm

Together with lessfs http://www.lessfs.com/wordpress/ may work.
Or zfs deduplication.



Thanks, I did not know about lessfs. I was still waiting for zfs on Linux to support dedupe which I don't think it does yet.
~nia
nia
Forum Founder
 
Posts: 273
Joined: Sat Dec 12, 2009 12:51 pm
Location: USA

Re: Data deduplication ?

Postby nia » Fri Feb 19, 2010 11:59 pm

I found this also:
http://www.tummy.com/journals/entries/j ... 209_050553

Wednesday December 09, 2009 at 05:27
Subject: ZFS dedup Available in ZFS-FUSE
Keywords: Dedup, Linux, Technical, ZFS
Posted by: Sean Reifschneider


The 0.6.0 ZFS-FUSE release doesn't include dedup, not surprisingly. I did some digging around and I found this git repository which has a version of ZFS-FUSE that includes the dedup code: 

git clone 'http://rainemu.swishparty.co.uk/git/zfs' zfs-fuse-dedupe



I have not tested it yet .. but it would be interesting if it works as expected. :D
~nia
nia
Forum Founder
 
Posts: 273
Joined: Sat Dec 12, 2009 12:51 pm
Location: USA

Re: Data deduplication ?

Postby nia » Sat Feb 20, 2010 6:10 am

I have actually downloaded zfs-fuse, the dedup version, and got it installed already .. ;) 

I had to get it from http://rainemu.swishparty.co.uk/cgi-bin ... ;a=summary 
because git gave me trouble.

Now I have a Gentoo box running mhvtl with an iSCSI target, and /opt/mhvtl is a ZFS file system with the property dedup=on :D


CODE: SELECT ALL
scst-mhvtl ~ # zfs list mhvtl/library
NAME            USED  AVAIL  REFER  MOUNTPOINT
mhvtl/library  1.99G  1.95T  1.99G  /opt/mhvtl

scst-mhvtl ~ # zfs get -r dedup mhvtl/library
NAME           PROPERTY  VALUE          SOURCE
mhvtl/library  dedup     on             local
scst-mhvtl ~ #
~nia
nia
Forum Founder
 
Posts: 273
Joined: Sat Dec 12, 2009 12:51 pm
Location: USA

Re: Data Deduplication ?

Postby sloony67 » Sat Feb 20, 2010 6:44 am

WOW .. :o
Cool .... 8-) 

I thought I was really kidding, but it is true .. I did not know about "lessfs" and "zfs" filesystems that can do deduplication ... I learn something new every day ...

Good Stuff ..
sloony67
Member
 
Posts: 10
Joined: Thu Jan 14, 2010 9:55 pm

Re: Data Deduplication ?

Postby nia » Sun Feb 21, 2010 7:19 pm

I am still unable to verify if dedup is actually working !!! .. 

I still have not seen any change in status, see below:

CODE: SELECT ALL
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl  1.98T  14.7G  1.97T     0%  1.00x  ONLINE  -
scst-mhvtl ~ #
~nia
nia
Forum Founder
 
Posts: 273
Joined: Sat Dec 12, 2009 12:51 pm
Location: USA

Re: Data Deduplication ?

Postby nia » Mon Feb 22, 2010 1:00 am

zfs dedup does not work on mhvtl tape data format !!! :x 

CODE: SELECT ALL
scst-mhvtl ~ # file /opt/mhvtl/SDLT01S3/data
/opt/mhvtl/SDLT01S3/data: VAX COFF executable - version 7926
scst-mhvtl ~ #


I will have to read up more on this ..
~nia
nia
Forum Founder
 
Posts: 273
Joined: Sat Dec 12, 2009 12:51 pm
Location: USA

Re: Data Deduplication ?

Postby nia » Mon Feb 22, 2010 4:56 am

I found out that with dedup enabled, ZFS will identify and remove duplicate blocks regardless of the data format.

So in the case of mhvtl, the tape data files have to contain the same data, e.g.:

CODE: SELECT ALL
2049646 -rw-rw---- 1 vtl vtl 2097157724 Feb 21 20:58 /opt/mhvtl/SDLT01S3/data
2049643 -rw-rw---- 1 vtl vtl 2097157724 Feb 21 21:02 /opt/mhvtl/SDLT02S3/data


zpool list now shows:

CODE: SELECT ALL
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl  1.98T  16.8G  1.97T     0%  1.12x  ONLINE  -


But as soon as I dump a few more files onto one of the tapes, dedup is gone. :cry: 

Conclusion:
ZFS dedup will not be a practical solution for mhvtl, as duplicate data is highly unlikely.
~nia
nia
Forum Founder
 
Posts: 273
Joined: Sat Dec 12, 2009 12:51 pm
Location: USA

Re: Data Deduplication ?

Postby sloony67 » Mon Feb 22, 2010 4:04 pm

Bummer :x
sloony67
Member
 
Posts: 10
Joined: Thu Jan 14, 2010 9:55 pm

Re: Data Deduplication ?

Postby nia » Tue Feb 23, 2010 12:35 am

Update and Good news :) 

I am able to use ZFS deduplication feature with mhvtl after all :mrgreen: 

The key was to turn off compression in mhvtl .. yes.. This is just what I did and now I got some deduped data in zfs for /opt/mhvtl. 
File type is now called "data" instead of "VAX COFF"
CODE: SELECT ALL
scst-mhvtl ~ # file /opt/mhvtl/SDLT63S3/data
/opt/mhvtl/SDLT63S3/data: data


I ended up using zfs built-in compression instead of mhvtl native.


Now, ZFS shows deduped stats as listed below:
CODE: SELECT ALL
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl  49.8G  3.51G  46.2G     7%  1.37x  ONLINE  -


But ... :lol: it only appears to work for backup application tape cloning, inline copy and duplication jobs .. That is what I have noticed so far. Since I have not tested enough, I will have to confirm ..

More to come later ..
~nia
nia
Forum Founder
 
Posts: 273
Joined: Sat Dec 12, 2009 12:51 pm
Location: USA

Re: Data Deduplication ?

Postby nia » Wed Feb 24, 2010 12:36 am

UPDATE:

Ok, I got it all wrong again .. This has been somewhat confusing.. 

First off mhvtl tape compression has nothing to do with not being able to dedup in zfs ..

Here is some more testing that I did which will make the picture more clear:

I am using mtx to control the library robot and tar to write data:


mtx -f /dev/sg10 load 1 1
mtx -f /dev/sg10 load 2 2


Test #1
>>> scst-mhvtl ~ # tar -cvf /dev/st1 /root/*
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl   398G   117M   398G     0%  1.00x  ONLINE  -

Test #2
>>> tar -cvf /dev/st2 /root/*
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl   398G   175M   398G     0%  1.99x  ONLINE  -

Test #3
>>> tar -rf /dev/st1 /root/*
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl   398G   253M   398G     0%  1.49x  ONLINE  -

Test #4
>>> tar -rf /dev/st2 /root/*
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl   398G   232M   398G     0%  1.99x  ONLINE  -

Test #5
>>> tar -rf /dev/st1 /usr/x86_64-pc-linux-gnu/*
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl   398G   235M   398G     0%  1.98x  ONLINE  -

Test #6
>>> tar -rf /dev/st2 /usr/x86_64-pc-linux-gnu/*
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl   398G   235M   398G     0%  1.99x  ONLINE  -


So far so good as expected and hoped, now this:

Test #7
>>> tar -cvf /dev/st1 /var/log/*
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl   398G   434M   398G     0%  1.00x  ONLINE  -

As you see we lost all deduped data, which is also expected.

Test #8
>>> tar -rf /dev/st2 /var/log/*
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl   398G   666M   397G     0%  1.00x  ONLINE  -

Test #9
>>> tar -rf /dev/st1 /root/*
scst-mhvtl ~ # zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
mhvtl   398G   838M   397G     0%  1.00x  ONLINE  -

The last two tests appended the same data as before to both tapes, but in a different order. The result is no dedup.


Conclusion:

As you can see, dedup is only achieved when exactly the same data is written to multiple volumes at exactly the same block positions, i.e. in the same order on each tape ..

Makes sense, right? .. This is sequential tape, not random-access disk.

I found that NetBackup in-line copy and Vault duplication work pretty well, but NetWorker cloning not so much.
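
To illustrate the conclusion above, here is a toy model in Python. It is not ZFS code: it assumes a 4096-byte block size, SHA-256 block hashes, and random bytes standing in for the tar streams, and `rand_bytes` and `dedup_ratio` are names made up for this sketch. It compares two tapes holding the same payload at the same offset with two tapes where the payload on the second tape starts after some other data of odd length.

```python
import hashlib
import random

BLOCK = 4096  # assumed block size for this toy model, not a ZFS default


def rand_bytes(rng, n):
    """Return n pseudo-random bytes (stand-in for backup data)."""
    return bytes(rng.getrandbits(8) for _ in range(n))


def dedup_ratio(*volumes):
    """Referenced blocks divided by unique blocks across all volumes."""
    hashes = []
    for vol in volumes:
        for off in range(0, len(vol), BLOCK):
            hashes.append(hashlib.sha256(vol[off:off + BLOCK]).digest())
    return len(hashes) / len(set(hashes))


rng = random.Random(0)
payload = rand_bytes(rng, 64 * BLOCK)      # e.g. the tar stream of /root
other = rand_bytes(rng, 10 * BLOCK + 123)  # e.g. /var/log, length not block-aligned

# Case 1: both tapes hold the payload starting at offset 0.
same_layout = dedup_ratio(payload, payload)

# Case 2: tape 2 holds the same payload, but only after 'other'.
shifted = dedup_ratio(payload, other + payload)

print(f"same layout: {same_layout:.2f}x")
print(f"shifted:     {shifted:.2f}x")
```

In this model the first case comes out at 2.00x and the second at 1.00x, because no 4096-byte block on the shifted tape lines up with a block on the other tape. Real tape images also carry mhvtl and tar metadata, which the model ignores.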
~nia
nia
Forum Founder
 
Posts: 273
Joined: Sat Dec 12, 2009 12:51 pm
Location: USA

Re: Data Deduplication ?

Postby rami766 » Sat Feb 27, 2010 1:52 am

This may be very useful in some cases. 
I still think mhvtl should develop native dedupe in the future, which could be a lot better than what zfs-fuse can do right now.
Rami
rami766
Member
 
Posts: 42
Joined: Sat Aug 14, 2010 12:04 am

Re: Data Deduplication ?

Postby markh794 » Sat Feb 27, 2010 6:28 am

By the time the data stream is cut & diced and within a SCSI command block, I feel the chances of finding another block the same will be slim.

It needs to be chopped & diced at the source (i.e. Reading the file(s) at the file system) rather than after <insert backup software here> has chopped & diced, added its bit to it and packaged it up for writing to the tape device.

Even if the backup software is tar or cpio.

If the 'de-duplication engine' could be signaled the start of each file and start the de-duplication at the start of the file, we might have some hope.
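
A toy illustration of this point in Python, under stated assumptions: 4096-byte blocks, SHA-256 hashes, random bytes standing in for files, and no tar headers or backup-software framing at all. The "file-aware engine" is hypothetical; nothing like it exists in mhvtl as far as this thread shows. The sketch backs up the same three files in two different orders and compares blocking the whole stream from offset 0 with restarting the blocking at the first byte of every file.

```python
import hashlib
import random

BLOCK = 4096  # assumed block size for this toy model


def block_digests(data):
    """SHA-256 of each fixed-size block, counted from the first byte of data."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]


def ratio(digest_lists):
    """Referenced blocks divided by unique blocks."""
    flat = [d for lst in digest_lists for d in lst]
    return len(flat) / len(set(flat))


rng = random.Random(2)
# Three stand-in files whose lengths are not multiples of BLOCK.
files = [bytes(rng.getrandbits(8) for _ in range(n))
         for n in (5 * BLOCK + 100, 3 * BLOCK + 7, 8 * BLOCK + 999)]

backup_a = files                           # files in one order
backup_b = [files[2], files[0], files[1]]  # same files, different order

# Target-side view: each backup is one opaque byte stream.
stream_ratio = ratio([block_digests(b"".join(backup_a)),
                      block_digests(b"".join(backup_b))])

# Hypothetical file-aware engine: blocking restarts at every file boundary.
file_ratio = ratio([[d for f in backup_a for d in block_digests(f)],
                    [d for f in backup_b for d in block_digests(f)]])

print(f"stream-aligned blocks: {stream_ratio:.2f}x")
print(f"file-aligned blocks:   {file_ratio:.2f}x")
```

With these inputs the stream-aligned case finds no common blocks (1.00x), while the file-aligned case finds every block twice (2.00x).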

My 2c worth.

Cheers
Mark
markh794
MHVTL - Developer
 
Posts: 101
Joined: Sat Feb 20, 2010 6:30 pm
Location: Sydney, Australia

Re: Data Deduplication ?

Postby rami766 » Sat Feb 27, 2010 3:56 pm

It needs to be chopped & diced at the source


I don't know what kind of technology EMC® Data Domain® deduplication storage systems use, but they claim that it all happens on the target system; the backup application does not know about any data being deduped.
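
For what it's worth, one well-known way a target can find duplicates without help from the backup application is content-defined (variable-size) chunking. Whether any particular product works this way is not something this thread establishes. The Python sketch below is a toy under stated assumptions: a simple polynomial rolling hash over a 48-byte window, a cut wherever the low 11 hash bits are zero, and random bytes as data. All names and constants are made up for the example. It compares how many chunks survive when the same payload is shifted by 1234 bytes, against fixed 4096-byte blocks.

```python
import hashlib
import random

WINDOW = 48            # bytes covered by the rolling hash
MASK = (1 << 11) - 1   # cut where the low 11 hash bits are zero (about 2 KiB average)
BASE = 257
MOD = (1 << 61) - 1
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def cdc_chunks(data):
    """Split data at positions chosen by a rolling hash of the last WINDOW bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data, size=4096):
    """Split data into fixed-size blocks counted from offset 0."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digests(chunks):
    return {hashlib.sha256(c).digest() for c in chunks}


rng = random.Random(3)
payload = bytes(rng.getrandbits(8) for _ in range(200 * 1024))
prefix = bytes(rng.getrandbits(8) for _ in range(1234))
shifted = prefix + payload  # same payload, pushed back by 1234 bytes

cdc_a, cdc_b = digests(cdc_chunks(payload)), digests(cdc_chunks(shifted))
fix_a, fix_b = digests(fixed_chunks(payload)), digests(fixed_chunks(shifted))

print("content-defined chunks shared:", len(cdc_a & cdc_b), "of", len(cdc_a))
print("fixed-size blocks shared:     ", len(fix_a & fix_b), "of", len(fix_a))
```

Because each cut depends only on the last 48 bytes, the chunk boundaries fall back into step shortly after the inserted prefix, so at most the first chunk of the payload fails to match. The fixed-size blocks share nothing after the shift.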

Rami
rami766
Member
 
Posts: 42
Joined: Sat Aug 14, 2010 12:04 am

Re: Data Deduplication ?

Postby herve » Sat Mar 06, 2010 8:34 am

But as soon as I dump little more files into one of the tapes, dedup is gone. :cry:
Conclusion:
ZFS dedup will not be practical solution for mhvtl use as the odd of duplicate data is highly unlikely.


I don't understand. ZFS is supposed to do block-level dedup, not file-level, so adding data at the end of a tape shouldn't have an impact.
I use dedup with OpenSolaris, not FUSE, and I am perfectly happy with it.

Perhaps porting mhvtl to OpenSolaris would be a good idea :lol: (robust file system with compression and dedup, plus robust and easy SAN access with COMSTAR)
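
A toy model of that expectation in Python, assuming an idealized fixed-block deduplicator with 4096-byte blocks and SHA-256 hashes (names and sizes are made up for the sketch; it is not ZFS code and says nothing about how mhvtl lays data out inside its data file). Two identical volumes are compared before and after extra data is appended to one of them.

```python
import hashlib
import random

BLOCK = 4096  # assumed block size for this toy model


def block_hashes(data):
    """SHA-256 of each fixed-size block of data."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]


def dedup_ratio(*volumes):
    """Referenced blocks divided by unique blocks across all volumes."""
    hashes = [h for vol in volumes for h in block_hashes(vol)]
    return len(hashes) / len(set(hashes))


rng = random.Random(1)
common = bytes(rng.getrandbits(8) for _ in range(32 * BLOCK))  # data on both tapes
extra = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK))   # appended to tape 1 only

before = dedup_ratio(common, common)         # 64 referenced / 32 unique blocks
after = dedup_ratio(common + extra, common)  # 80 referenced / 48 unique blocks

print(f"before append: {before:.2f}x")
print(f"after append:  {after:.2f}x")
```

In this idealized model the ratio drops from 2.00x to about 1.67x after the append, because the shared blocks at the front still match; it does not fall back to 1.00x.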
herve
Registered
 
Posts: 9
Joined: Sat Mar 06, 2010 6:46 am

Re: Data Deduplication ?

Postby nia » Sat Mar 06, 2010 2:54 pm

I don't understand, ZFS is supposed to be block level dedup, not file level, adding data at the end of a tape shouldn't have impact 


I don't understand it either, and I'm not sure why it had an impact when I was trying to get mhvtl tapes to dedupe.


I use dedup with OpenSolaris, not FUSE, and I am perfectly happy with it.


I am not sure about the Linux version I am using but sure would like to see zfs-fuse dedupe released as stable in Linux so I can test again.


Perhaps porting mhvtl to OpenSolaris would be a good idea

It has been discussed several times in the OpenSolaris forums, but no project has started yet.

As of right now, mhvtl is the only open source VTL on the market. 

We don't need to wait for OpenSolaris and COMSTAR, Linux + mhvtl will just do. 8-)
~nia
nia
Forum Founder
 
Posts: 273
Joined: Sat Dec 12, 2009 12:51 pm
Location: USA

Re: Data Deduplication ?

Postby herve » Sun Mar 07, 2010 8:47 pm

There are two projects for a VTL on Solaris:

One from Nexenta: http://www.nexentastor.org/projects/vtape/repository
The second on GitHub: http://github.com/imp/stmfssd

but both are far from being usable.

You're right, MHVTL is the only solution "close to" a market solution.
I am a bit sceptical about MHVTL + SCST (too many errors), and STGT does not seem to be a short-term solution.
I'll do another test, replacing IBM LTO with SDLT.

We don't need to wait for OpenSolaris and COMSTAR, Linux + mhvtl will just do. 8-)


Did you try ZFS + COMSTAR?
herve
Registered
 
Posts: 9
Joined: Sat Mar 06, 2010 6:46 am

Re: Data Deduplication ?

Postby nia » Sun Mar 07, 2010 10:08 pm

I am a bit sceptical about MHVTL + SCST (too many errors)

I am surprised. I am actually very happy with it so far .. I hardly have any issues. I also have multiple systems connecting at the same time with no issues .. Mine is the Gentoo setup.

did you try ZFS + comstar ?

Yes for disk. Tape+Changer is not supported yet.
~nia