Hi Folks,
Just ran into an oddity when trying to configure a library with multiple drives. It varies between launches, but eventually one of the drives will report a hardware error of 0x0403 (LOGICAL UNIT NOT READY, MANUAL INTERVENTION REQUIRED). I have verified that all drives are seen as IBM TD5 and all of the tapes in the stack are labeled as L5 media. You can duplicate this with the default configuration of 4 drives in the first library with the following loop: for x in 0 1 2 3 do mtx -f /dev/sg6 load 1 $x sleep 2 mtx -f /dev/sg6 unload 1 $x sleep 2 done I added the sleep calls thinking that it may be a timing error with no difference. Once it happens, executing a single load to the failing drive will continue to result in this error. Configuration - CentOS 7.5 1804 Current mhvtl pull from github as of this morning (04/02) |
Administrator
|
Can you post messages (syslog) ?
Perhaps change VERBOSE = 3 in /etc/mhvtl/mhvtl.conf Rerun until failure, then post messages Sent from my iPhone On 3 Apr 2019, at 04:01, Tim Jones [via mhVTL - A Linux Virtual Tape Library] <[hidden email]> wrote:
Regards from Australia
Mark Harvey |
Administrator
|
In reply to this post by Tim Jones
I’ve reproduced the error.
Interesting No idea ATM. But hopefully it won’t be too challenging
Sent from my iPhone
Regards from Australia
Mark Harvey |
Administrator
|
This post was updated on .
Just to confirm it is the same issue.
run loop until it errors (take note of the drive the tape load failed). Or brute-force check all tape drives: My dev config [3:0:0:0] mediumx STK L700 0106 /dev/sch0 /dev/sg8 [3:0:1:0] tape IBM ULT3580-TD5 0106 /dev/st5 /dev/sg7 [3:0:2:0] tape IBM ULT3580-TD5 0106 /dev/st7 /dev/sg11 [3:0:3:0] tape IBM ULT3580-TD4 0106 /dev/st3 /dev/sg5 [3:0:4:0] tape IBM ULT3580-TD4 0106 /dev/st6 /dev/sg10 Then, # for a in 5 7 3 6; do echo st$a; mt -f /dev/nst$a status; done st5 SCSI 2 tape drive: File number=0, block number=0, partition=0. Tape block size 0 bytes. Density code 0x46 (LTO-4). Soft error count since last status=0 General status bits on (41010000): BOT ONLINE IM_REP_EN st7 SCSI 2 tape drive: File number=0, block number=0, partition=0. Tape block size 0 bytes. Density code 0x46 (LTO-4). Soft error count since last status=0 General status bits on (41010000): BOT ONLINE IM_REP_EN st3 SCSI 2 tape drive: File number=0, block number=0, partition=0. Tape block size 0 bytes. Density code 0x46 (LTO-4). Soft error count since last status=0 General status bits on (41010000): BOT ONLINE IM_REP_EN st6 SCSI 2 tape drive: File number=-1, block number=-1, partition=0. Tape block size 0 bytes. Density code 0x46 (LTO-4). Soft error count since last status=0 General status bits on (1010000): ONLINE IM_REP_EN [root@tempcentos ~]# (which is really, really bad !) And to confirm it is really, really bad... lsof shows the same virtual media opened by 4 different devices (daemons).. # lsof | grep E01001 vtltape 17913 root 5u REG 253,0 0 67880002 /opt/mhvtl/E01001L4/data vtltape 17913 root 6u REG 253,0 0 67880003 /opt/mhvtl/E01001L4/indx vtltape 17913 root 7u REG 253,0 1536 67880005 /opt/mhvtl/E01001L4/meta vtltape 17914 root 5u REG 253,0 0 67880002 /opt/mhvtl/E01001L4/data vtltape 17914 root 6u REG 253,0 0 67880003 /opt/mhvtl/E01001L4/indx vtltape 17914 root 7u REG 253,0 1536 67880005 /opt/mhvtl/E01001L4/meta vtltape 17915 root 5u REG 253,0 0 67880002 /opt/mhvtl/E01001L4/data vtltape 17915 root 6u REG 253,0 0 67880003 /opt/mhvtl/E01001L4/indx vtltape 17915 root 7u REG 253,0 1536 67880005 /opt/mhvtl/E01001L4/meta vtltape 17918 root 5u REG 253,0 0 67880002 /opt/mhvtl/E01001L4/data vtltape 17918 root 6u REG 253,0 0 67880003 /opt/mhvtl/E01001L4/indx vtltape 17918 root 7u REG 253,0 1536 67880005 /opt/mhvtl/E01001L4/meta
Regards from Australia
Mark Harvey |
Administrator
|
I've identified the root cause and have a fix in testing..
The error is due to a race condition on media unload. i.e. vtllibrary -> send msg via queue to vtltape : On media load, vtltape send ack msg back to vtllibrary before acknowledging the load was successful. This handshake did not occur on media unload.
Regards from Australia
Mark Harvey |
Administrator
|
Commit the fix to GitHub
commit cbbe7d47f538bccd2aa7e24219a9b19da539a064 Author: Mark Harvey <markh794@gmail.com> Date: Wed Apr 3 14:11:47 2019 +1100 Fix a media unload race condition http://mhvtl-a-linux-virtual-tape-library.966029.n3.nabble.com/Hardware-Error-0x0403-on-load-td4026226.html Signed-off-by: Mark Harvey <markh794@gmail.com> Please check it resolves the issue you were seeing.
Regards from Australia
Mark Harvey |
In reply to this post by Mark Harvey
The race is what I suspected - which is the reason for my sleep calls in the loop.
Good find. I'll update things on the test systems this morning. Also - you misspelled mktape (mkape) in the else portion of the check for mktape in the make_vtl_media script. You don't notice this if you pass in the -m argument. Finally, opeSUSE LEAP: make cd kernel make sudo -s make install cd .. make install ldconfig systemctl enable mhvtl.target systemctl daemon-reload reboot et voila! ~> lsscsi -g [0:0:0:0] disk ATA WDC WD5000BPKT-0 1A01 /dev/sda /dev/sg0 [0:0:1:0] disk ATA WDC WD5000BPKT-0 1A01 /dev/sdd /dev/sg3 [0:0:2:0] disk ATA WDC WD5000BPKT-0 1A01 /dev/sde /dev/sg4 [0:0:3:0] disk ATA WDC WD5000BPKT-0 1A01 /dev/sdf /dev/sg5 [1:0:0:0] disk ATA OCZ-VERTEX4 1.5 /dev/sdb /dev/sg1 [2:0:0:0] disk ATA SPCC Solid State 7A /dev/sdc /dev/sg2 [7:0:0:0] mediumx STK L700 0106 /dev/sch0 /dev/sg8 [7:0:1:0] tape IBM ULT3580-TD5 0106 /dev/st1 /dev/sg7 [7:0:2:0] tape IBM ULT3580-TD5 0106 /dev/st0 /dev/sg6 |
In reply to this post by Mark Harvey
Hi Mark,
Using the patch from yesterday and now all load attempts fail with the 0403 ASC/ASCQ. Tim |
Administrator
|
Ouch..
Can I ask for the output of: - lsof | grep vtl - mt -f /dev/nstX status (of all of the mhVTL tape devices). - /var/log/messages might be good too Or at least: - egrep "lload|msg :" /var/log/messages
Regards from Australia
Mark Harvey |
It must be something tied to openSUSE LEAP. My other systems seem happy with this current code.
I'll increase the debug level and do a bit more sleuthing tomorrow. |
Administrator
|
Perhaps a stale libvtl*.so being picked up ?
e.g. # ls -l /usr/lib/libvtl* -rwxr-xr-x. 1 root root 2377338 Apr 5 15:43 /usr/lib/libvtlcart.so -rwxr-xr-x. 1 root root 410555 Apr 5 15:43 /usr/lib/libvtlscsi.so And check with "ldd /usr/bin/vtltape" & "ldd /usr/bin/vtllibrary" as to which ones are being loaded.
Regards from Australia
Mark Harvey |
Administrator
|
My testing has exposed a logic flaw with the existing 'fix'.
if you "mt -f /dev/nstX offline' *before* attempting to move the tape from the slot, the move fails.. Looks like I need a 'tape loaded, but not threaded' and a 'tape loaded, threaded and ready for action'. More patches coming :)
Regards from Australia
Mark Harvey |
Here's two that we look for explicitly:
case 0x0401: strcat ( str, "Logical unit not ready, in process of becoming ready" ); break; case 0x0402: strcat ( str, "Logical unit not ready, initializing command required" ); break; The second is representative of the tape being unloaded from the drive (REWOFF), but still in the mouth of the drive. Sometime, we also see: case 0x3F11: strcat ( str, "Medium auxiliary memory accessible" ); break; This occurs when tape is offline (in the mouth of the drive) because the RFID/MIC reader can still read the MAM/MIC. |
I think I've found the culprit -
I changed the drive types to TD6 instead of TD5. When I changed back to TD5 and remade the tapes as L5 media, it's working again. I'll look into the model entries and get the 6, 7, and 8 devices added for IBM and HPE. Tim |
Free forum by Nabble | Edit this page |