MHVTL 1.6.2 instability on ppc64le


MHVTL 1.6.2 instability on ppc64le

rohr22
Hi, Mark.

We have been trying to get MHVTL 1.6.2 to work properly on our RHEL7 and RHEL8 machines that have the ppc64le (PowerPC Little Endian) architecture. The software seems to be unstable. (We have gotten the same 1.6.2 version to work well on our RHEL8 x86_64 system.) Often when we write files to tapes, the files get changed so that every byte is a NULL character. It doesn't happen every time, but it happens very frequently. The problem seems to go away for a while after recreating the virtual tapes.

Have you been able to test the software on a ppc64le architecture?

Attached are messages from /var/log/messages when files were transferred from disk to an MHVTL tape volume. Our test created 38 files on disk and then copied them to tape. I found 38 of these messages in the log file:

ERROR: retrieve_CDB_data(): Failed retriving data via ioctl(): Input/output error

Note that "retrieving" is misspelled as "retriving".

Also, I found 38 of these messages:
kernel: scatter-gather segment size larger than SG_SEGMENT_SZ (65536 > 8192)

Thank you,
Peter

Attachment: mhvtl_log.mhvtl_log

Re: MHVTL 1.6.2 instability on ppc64le

Mark Harvey
Administrator
Thanks for the bug report. Do you have a specific workload that can reproduce the error?

I do not have any applications to perform load/stress testing on the PPC platform.
Regards from Australia
Mark Harvey

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
Yes, I can reproduce the error with a set of automated tests that we run. Did the log file I provided help narrow down the source of the problem? Is there some additional logging that I can capture to help pinpoint the location of the problem? Do you have a patch file to add even more logging that I could apply? We would really like to get this problem resolved, since we have several ppc64le machines that cannot use real tape drives and on which we would like to be able to use mhvtl.

Re: MHVTL 1.6.2 instability on ppc64le

Mark Harvey
Administrator
I'm now in a position to review the logs - thanks.

Looking into it today.
Regards from Australia
Mark Harvey

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
I have logging going back further in time if you need it.

Re: MHVTL 1.6.2 instability on ppc64le

Mark Harvey
Administrator
I have increased the default kmem_cache bounce buffer size to 64k (from 8k)
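For illustration only (the identifiers below are approximations, not the actual mhvtl source), the change is along these lines:

    #include <linux/slab.h>

    /* Sketch: size the kmem_cache-backed bounce buffer to cover the largest
     * scatter-gather segment seen on ppc64le (64k instead of the old 8k).
     * Names are illustrative, not the literal mhvtl code. */
    #define SG_SEGMENT_SZ   (64 * 1024)     /* previously 8 * 1024 */

    static struct kmem_cache *sg_bounce_cache;

    static int bounce_cache_init(void)
    {
            sg_bounce_cache = kmem_cache_create("mhvtl_sg_bounce",
                                                SG_SEGMENT_SZ, 0,
                                                SLAB_HWCACHE_ALIGN, NULL);
            return sg_bounce_cache ? 0 : -ENOMEM;
    }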

My brief testing on PPC (dd if=/dev/zero of=/dev/st4 bs=256k count=4) succeeded with the latest change in place.

Changes have been pushed to GitHub..
Regards from Australia
Mark Harvey

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
OK, great. I'll try some tests with this new version today and let you know how it goes.

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
Hi, Mark. I tried your latest version of mhvtl on our RHEL 8.1 ppc64le machine. Unfortunately, when we ran a test that tried to move a little over 2 GB of data to tape, it appears that mhvtl crashed and the machine rebooted. The /var/log/messages log file shows:

Jan 28 20:45:10 tam /usr/bin/vtltape[99938]: processCommand(): 50th contiguous READ_6 request (38047) (delay 37850)
Jan 28 20:45:11 tam /usr/bin/vtltape[99948]: processCommand(): 50th contiguous WRITE_6 request (38056) (delay 41450)
Jan 28 20:45:11 tam /usr/bin/vtltape[99938]: processCommand(): 100th contiguous READ_6 request (38147) (delay 30650)
Jan 28 20:45:11 tam /usr/bin/vtltape[99948]: processCommand(): 100th contiguous WRITE_6 request (38155) (delay 30650)
Jan 28 20:45:11 tam /usr/bin/vtltape[99938]: processCommand(): 150th contiguous READ_6 request (38247) (delay 31850)
Jan 28 20:45:11 tam /usr/bin/vtltape[99948]: processCommand(): 150th contiguous WRITE_6 request (38256) (delay 33450)
Jan 28 20:45:11 tam /usr/bin/vtltape[99938]: processCommand(): 200th contiguous READ_6 request (38349) (delay 28650)
Jan 28 20:45:11 tam /usr/bin/vtltape[99948]: processCommand(): 200th contiguous WRITE_6 request (38355) (delay 31050)
Jan 28 20:45:11 tam /usr/bin/vtltape[99938]: processCommand(): 250th contiguous READ_6 request (38447) (delay 27050)
Jan 28 20:45:11 tam /usr/bin/vtltape[99948]: processCommand(): 250th contiguous WRITE_6 request (38456) (delay 30650)
Jan 28 20:45:11 tam /usr/bin/vtltape[99938]: processCommand(): 300th contiguous READ_6 request (38548) (delay 26650)
Jan 28 20:45:11 tam /usr/bin/vtltape[99948]: processCommand(): 300th contiguous WRITE_6 request (38556) (delay 30250)
Jan 28 20:45:11 tam /usr/bin/vtltape[99938]: processCommand(): 350th contiguous READ_6 request (38647) (delay 27850)
Jan 28 20:45:11 tam /usr/bin/vtltape[99948]: processCommand(): 350th contiguous WRITE_6 request (38656) (delay 31850)
Jan 28 20:45:11 tam /usr/bin/vtltape[99938]: processCommand(): 400th contiguous READ_6 request (38746) (delay 25050)
Jan 28 20:45:11 tam /usr/bin/vtltape[99948]: processCommand(): 400th contiguous WRITE_6 request (38756) (delay 29850)
Jan 28 20:45:11 tam /usr/bin/vtltape[99938]: processCommand(): 450th contiguous READ_6 request (38847) (delay 24650)
Jan 28 20:45:11 tam /usr/bin/vtltape[99948]: processCommand(): 450th contiguous WRITE_6 request (38856) (delay 29450)
Jan 28 20:45:11 tam /usr/bin/vtltape[99938]: processCommand(): 500th contiguous READ_6 request (38948) (delay 23850)
Jan 28 20:45:11 tam /usr/bin/vtltape[99948]: processCommand(): 500th contiguous WRITE_6 request (38956) (delay 29050)
Jan 28 20:45:11 tam /usr/bin/vtltape[99938]: processCommand(): 550th contiguous READ_6 request (39047) (delay 23850)
Jan 28 20:45:11 tam /usr/bin/vtltape[99948]: processCommand(): 550th contiguous WRITE_6 request (39056) (delay 29450)
Jan 28 20:45:11 tam /usr/bin/vtltape[99938]: processCommand(): 600th contiguous READ_6 request (39147) (delay 24650)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 600th contiguous WRITE_6 request (39155) (delay 29850)
Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 650th contiguous READ_6 request (39247) (delay 24250)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 650th contiguous WRITE_6 request (39256) (delay 29050)
Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 700th contiguous READ_6 request (39348) (delay 23850)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 700th contiguous WRITE_6 request (39356) (delay 28250)
Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 750th contiguous READ_6 request (39447) (delay 24250)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 750th contiguous WRITE_6 request (39456) (delay 28250)
Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 800th contiguous READ_6 request (39547) (delay 22250)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 800th contiguous WRITE_6 request (39556) (delay 26650)
Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 850th contiguous READ_6 request (39647) (delay 22650)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 850th contiguous WRITE_6 request (39656) (delay 28650)

Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 900th contiguous READ_6 request (39748) (delay 23850)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 900th contiguous WRITE_6 request (39757) (delay 28650)
Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 950th contiguous READ_6 request (39847) (delay 22650)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 950th contiguous WRITE_6 request (39856) (delay 28250)
Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 1000th contiguous READ_6 request (39947) (delay 24250)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 1000th contiguous WRITE_6 request (39955) (delay 27850)
Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 1050th contiguous READ_6 request (40047) (delay 25450)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 1050th contiguous WRITE_6 request (40056) (delay 28250)
Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 1100th contiguous READ_6 request (40147) (delay 26250)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 1100th contiguous WRITE_6 request (40155) (delay 28250)
Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 1150th contiguous READ_6 request (40248) (delay 29050)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 1150th contiguous WRITE_6 request (40256) (delay 30650)
Jan 28 20:45:12 tam /usr/bin/vtltape[99938]: processCommand(): 1200th contiguous READ_6 request (40349) (delay 31850)
Jan 28 20:45:12 tam /usr/bin/vtltape[99948]: processCommand(): 1200th contiguous WRITE_6 request (40356) (delay 31050)
Jan 28 20:45:13 tam /usr/bin/vtltape[99938]: processCommand(): 1250th contiguous READ_6 request (40447) (delay 26250)
Jan 28 20:45:13 tam /usr/bin/vtltape[99948]: processCommand(): 1250th contiguous WRITE_6 request (40456) (delay 30250)
Jan 28 20:45:13 tam /usr/bin/vtltape[99938]: processCommand(): 1300th contiguous READ_6 request (40549) (delay 24650)
Jan 28 20:45:13 tam /usr/bin/vtltape[99948]: processCommand(): 1300th contiguous WRITE_6 request (40557) (delay 29450)
Jan 28 20:45:13 tam /usr/bin/vtltape[99938]: processCommand(): 1350th contiguous READ_6 request (40647) (delay 27850)
Jan 28 20:45:13 tam /usr/bin/vtltape[99948]: processCommand(): 1350th contiguous WRITE_6 request (40656) (delay 31850)
Jan 28 20:45:13 tam /usr/bin/vtltape[99938]: processCommand(): 1400th contiguous READ_6 request (40748) (delay 27050)
Jan 28 20:45:13 tam /usr/bin/vtltape[99948]: processCommand(): 1400th contiguous WRITE_6 request (40755) (delay 30250)
Jan 28 20:45:13 tam /usr/bin/vtltape[99938]: processCommand(): 1450th contiguous READ_6 request (40847) (delay 27450)
Jan 28 20:45:13 tam /usr/bin/vtltape[99948]: processCommand(): 1450th contiguous WRITE_6 request (40856) (delay 29450)
Jan 28 20:45:13 tam /usr/bin/vtltape[99938]: processCommand(): 1500th contiguous READ_6 request (40947) (delay 27850)
Jan 28 20:45:13 tam /usr/bin/vtltape[99948]: processCommand(): 1500th contiguous WRITE_6 request (40955) (delay 29450)
Jan 28 20:45:13 tam /usr/bin/vtltape[99938]: processCommand(): 1550th contiguous READ_6 request (41047) (delay 30250)
Jan 28 20:45:13 tam /usr/bin/vtltape[99948]: processCommand(): 1550th contiguous WRITE_6 request (41056) (delay 31450)
Jan 28 20:45:13 tam /usr/bin/vtltape[99938]: processCommand(): 1600th contiguous READ_6 request (41147) (delay 28650)
Jan 28 20:45:13 tam /usr/bin/vtltape[99948]: processCommand(): 1600th contiguous WRITE_6 request (41156) (delay 29850)
Jan 28 20:45:13 tam /usr/bin/vtltape[99938]: processCommand(): 1650th contiguous READ_6 request (41247) (delay 24650)
Jan 28 20:45:13 tam /usr/bin/vtltape[99948]: processCommand(): 1650th contiguous WRITE_6 request (41256) (delay 29050)
Jan 28 20:45:13 tam /usr/bin/vtltape[99938]: processCommand(): 1700th contiguous READ_6 request (41349) (delay 29050)
Jan 28 20:45:13 tam /usr/bin/vtltape[99948]: processCommand(): 1700th contiguous WRITE_6 request (41356) (delay 31850)

[... long run of NUL characters (^@) omitted ...]
Jan 28 21:09:56 tam kernel: systemd: 21 output lines suppressed due to ratelimiting

I'll try to rerun the test with rate limiting turned off.

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
Hi, Mark. Here are MHVTL-related entries from /var/log/messages when a kernel crash occurred. It looks like MHVTL crashed around 14:00 on a read attempt.

Attachment: mhvtl_crash.gz

The test wrote about 10 GB of data to the tape device and then was trying to read it back. I have set up my mhvtl.conf file with a capacity value of 51200 to allow 50 GB files.
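For reference, I believe that corresponds to this line in my /etc/mhvtl/mhvtl.conf (value in megabytes) - correct me if I have the variable name wrong:

    # default media capacity in MB
    CAPACITY=51200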

Re: MHVTL 1.6.2 instability on ppc64le

Mark Harvey
Administrator
The kernel oops reports a bad memory access from vtl_c_ioctl()

The only 'variable' is the inode - so I've added a NULL check for it and return an error (plus log it)..
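Roughly this shape (a sketch only, not the literal patch - names are illustrative):

    /* Sketch: bail out early if the inode behind the char device is NULL,
     * rather than dereferencing it further down.  Not the literal patch. */
    static long vtl_c_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
    {
            struct inode *inode = file_inode(file);

            if (!inode) {
                    printk(KERN_ERR "mhvtl: %s: NULL inode, cmd 0x%x\n",
                           __func__, cmd);
                    return -ENODEV;
            }
            /* ... existing ioctl handling continues here ... */
    }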

Change pushed to GitHub now..

I've tar'ed up the VM a few times now, using various block sizes, and this change hasn't resulted in a kernel oops on my test setup.
Regards from Australia
Mark Harvey

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
Hi, Mark.

Could you explain where in the log file I included you saw the kernel oops from the bad memory access in vtl_c_ioctl()?

Also, your changes in kernel.c are to verify that the value of inode is not NULL. I can see how this would avoid a NULL pointer access, but doesn't it cause a new problem since the code is leaving the vtl_c_ioctl function without doing what it needs to?

In any case, I'll try out your new version and run some tests today. Also, on our test system, I ran 'dmesg -n 7' so I can see the messages from printk(KERN_DEBUG, ...). Won't that be helpful to find the source of the problem?

Thank you,
Peter

Re: MHVTL 1.6.2 instability on ppc64le

Mark Harvey
Administrator
I did not find anything "useful" in the logs you provided - just lots of NULL chars..
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

However, for the short while I had access to a PPC setup (CentOS 8.1 along with Kdump), I managed to obtain a couple of oops/panic/reboots while tar'ing to the virtual media.. All the oopses I saw were in vtl_c_ioctl().

I would like to say I understand the backtrace below, but in practice I simply reviewed the code for any values that could be NULL and added a test for them. If the changes prevent the panic, we're at least on the right track, and a more robust solution can be determined from there. Adding a whole heap of logic to a piece of code I'm not yet sure is the problem would be overkill at the moment.

crash> bt
PID: 13040  TASK: c000000003eb1f00  CPU: 4   COMMAND: "vtltape"
 #0 [c000000003f377e0] crash_kexec at c000000000247be0
 #1 [c000000003f37820] oops_end at c000000000029cf8
 #2 [c000000003f378a0] bad_page_fault at c00000000007b6ec
 #3 [c000000003f37910] slb_miss_large_addr at c000000000084030
 #4 [c000000003f37960] large_addr_slb at c000000000008bcc
 Data SLB Access [380] exception frame:
 R0:  d000000002e22638    R1:  c000000003f37c50    R2:  d000000002e2ef00   
 R3:  0000000000000000    R4:  0000000000000200    R5:  00007fffedf5e5d8   
 R6:  00007fffedf5e5d8    R7:  00000000000fffff    R8:  0000000000000000   
 R9:  0009000300010000    R10: c0000001f94a3518    R11: d000000002e241e0   
 R12: c000000000d47a80    R13: c00000003fff9c00    R14: 000000000000000d   
 R15: 0000000010056820    R16: 0000000000000000    R17: 00007fffbcd40788   
 R18: 0000000000000003    R19: 0000000010056818    R20: 0000000000000010   
 R21: 0000000010053dd0    R22: 0000000000000008    R23: 0000000010055e80   
 R24: 0000000010056828    R25: 00007fffedf5e5d8    R26: 000000000000000b   
 R27: 00007fffedf5e5d8    R28: c0000001f1a40068    R29: d000000002e27250   
 R30: d000000002e27380    R31: 000900030000ff78   
 NIP: d000000002e228c0    MSR: 8000000000009033    OR3: d000000002e228c8
 CTR: c000000000d47a80    LR:  d000000002e22638    XER: 0000000020000000
 CCR: 0000000028004448    MQ:  0000000000000000    DAR: 000900030000ff78
 DSISR: c000000003f37b70     Syscall Result: 0000000000000000
 [NIP  : vtl_c_ioctl+728]
 [LR   : vtl_c_ioctl+80]
 #5 [c000000003f37c50] vtl_c_ioctl at d000000002e228c0 [mhvtl]  (unreliable)
 #6 [c000000003f37d20] do_vfs_ioctl at c0000000004eb074
 #7 [c000000003f37de0] sys_ioctl at c0000000004ebbe4
 #8 [c000000003f37e30] system_call at c00000000000b388
 System Call [c00] exception frame:
 R0:  0000000000000036    R1:  00007fffedf5e370    R2:  00007fffbccb7100   
 R3:  0000000000000003    R4:  0000000000000200    R5:  00007fffedf5e5d8   
 R6:  0000000000000000    R7:  0000000000000000    R8:  0000000000000003   
 R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000   
 R12: 0000000000000000    R13: 00007fffbce1b9a0    R14: 000000000000000d   
 R15: 0000000010056820    R16: 0000000000000000    R17: 00007fffbcd40788   
 R18: 0000000000000003    R19: 0000000010056818    R20: 0000000000000010   
 R21: 0000000010053dd0    R22: 0000000000000008    R23: 0000000010055e80   
 R24: 0000000010056828    R25: 0000000000000000    R26: 0000000000000195   
 R27: 0000000010027ed0    R28: 0000000010056940    R29: 0000000010056d50   
 R30: 0000000010027ee0    R31: 00007fffedf5e5d8   
 NIP: 00007fffbcbd0940    MSR: 800000000000f033    OR3: 0000000000000003
 CTR: 0000000000000000    LR:  0000000010007890    XER: 0000000000000000
 CCR: 0000000048004448    MQ:  0000000000000000    DAR: 00000100389e0000
 DSISR: 0000000042000000     Syscall Result: 0000000000000000
crash> 
Regards from Australia
Mark Harvey

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
Hi, Mark.

I ran tests with your updated code on three different machines yesterday. On two machines all the tests passed. However, on a third machine, a test failed when it was trying to read two 5 GB files from tape. It appears that mhvtl crashed and rebooted the system.

The last messages logged were these:

Jan 31 02:27:53 thom /usr/bin/vtltape[7236]: processCommand(): 11700th contiguous READ_6 request (180509) (delay 94250)
Jan 31 02:27:54 thom /usr/bin/vtltape[7236]: processCommand(): 11750th contiguous READ_6 request (180559) (delay 103050)
Jan 31 02:49:13 thom kernel: systemd: 19 output lines suppressed due to ratelimiting

The reboot occurred at Jan 31 02:49:02:
# uptime -s
2020-01-31 02:49:02

So it is surprising that no messages were logged between 02:27:54 and 02:49:13. It seems that rate limiting is still turned on for this machine. We have a script we wrote to disable rate limiting, which we have been using on RHEL 7 machines. However, we are now testing on RHEL 8 machines, so the process must be a bit different there.
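I'm not yet sure which layer is doing the suppression on RHEL 8, but journald has its own rate limiting in addition to rsyslog, so as a first step I'll try turning that off with:

    # /etc/systemd/journald.conf  (then restart systemd-journald)
    # setting either value to 0 disables journald's rate limiting
    [Journal]
    RateLimitIntervalSec=0
    RateLimitBurst=0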

I think I figured out how to set my kernel log level correctly. I need to run this:

 echo 8 > /proc/sys/kernel/printk

I was writing 7 instead. By writing '8', the console log level is high enough to include KERN_DEBUG messages; by writing '7', I was filtering out the printk(KERN_DEBUG ...) messages in your kernel module.
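As I understand it, the first of the four values in that file is the console log level, and only messages with a priority numerically lower than it reach the console, so it needs to be 8 before KERN_DEBUG (priority 7) messages show up:

    # console / default message / minimum console / boot-time default
    cat /proc/sys/kernel/printk
    7       4       1       7
    echo 8 > /proc/sys/kernel/printk    # now KERN_DEBUG (7) messages are included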

I'll try to rerun the tests again after properly disabling rate limiting and setting the kernel printk level. Hopefully, I will have some better log files over the weekend or next week.

Thank you,
Peter

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
Mark, I found that a /var/crash/127.0.0.1-2020-01-31-02:34:07 directory had been created. I looked at the vmcore-dmesg.txt file and near the end found this:

[69749.019155] mhvtl: CDB (182011) 6 bytes
[69749.019157]  08 00 04 00 00 00
[69749.021404] mhvtl: CDB (182012) 6 bytes
[69749.021405]  08 00 04 00 00 00
[69749.022042] mhvtl: CDB (182013) 6 bytes
[69749.022043]  08 00 04 00 00 00
[69749.028750] mhvtl: CDB (182014) 6 bytes
[69749.028752]  08 00 04 00 00 00
[69749.079016] mhvtl: CDB (182015) 6 bytes
[69749.079017]  08 00 04 00 00 00
[69749.088754] mhvtl: CDB (182016) 6 bytes
[69749.088755]  08 00 04 00 00 00
[69749.090235] mhvtl: CDB (182017) 6 bytes
[69749.090236]  08 00 04 00 00 00
[69749.090884] mhvtl: CDB (182018) 6 bytes
[69749.090885]  08 00 04 00 00 00
[69749.159178] mhvtl: CDB (182019) 6 bytes
[69749.159180]  08 00 04 00 00 00
[69749.161425] mhvtl: CDB (182020) 6 bytes
[69749.161426]  08 00 04 00 00 00
[69749.169388] mhvtl: CDB (182021) 6 bytes
[69749.169390]  08 00 04 00 00 00
[69749.169396] Unable to handle kernel paging request for data at address 0x5deadbeef0000078
[69749.169403] Faulting instruction address: 0xd0000000025128c0
[69749.169408] Oops: Kernel access of bad area, sig: 11 [#1]
[69749.169410] LE SMP NR_CPUS=2048 NUMA pSeries
[69749.169415] Modules linked in: dm_mod mhvtl(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfsv3 nfs fscache st ch sg pseries_rng xts vmx_crypto nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables ext4 mbcache jbd2 sd_mod ibmvfc ibmvscsi scsi_transport_fc ibmveth scsi_transport_srp [last unloaded: mhvtl]
[69749.169446] CPU: 147 PID: 7236 Comm: vtltape Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-147.el8.ppc64le #1
[69749.169451] NIP:  d0000000025128c0 LR: d000000002512644 CTR: c000000000d47060
[69749.169454] REGS: c00000027b0d79d0 TRAP: 0380   Tainted: G           OE    --------- -  -  (4.18.0-147.el8.ppc64le)
[69749.169458] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28004448  XER: 20000000
[69749.169466] CFAR: d0000000025128c8 IRQMASK: 0
GPR00: d000000002512644 c00000027b0d7c50 d00000000251ee00 0000000000000000
GPR04: 0000000000000200 00007fffd556fdf8 00007fffd556fdf8 0000000000000000
GPR08: 0000000000000000 5deadbeef0000100 c0000002e4ef7f58 d000000002514180
GPR12: c000000000d47060 c000000007ec4200 000000000000000c 0000000010056820
GPR16: 0000000000000000 00007fffaf4e0788 0000000000000003 0000000010056818
GPR20: 0000000000000010 0000000010053dd0 0000000000000008 0000000010055e80
GPR24: 0000000010056828 00007fffd556fdf8 0000000000000010 00007fffd556fdf8
GPR28: c00000027b06e500 d000000002517158 d000000002517288 5deadbeef0000078
[69749.169506] NIP [d0000000025128c0] vtl_c_ioctl+0x2d8/0xb20 [mhvtl]
[69749.169510] LR [d000000002512644] vtl_c_ioctl+0x5c/0xb20 [mhvtl]
[69749.169513] Call Trace:
[69749.169516] [c00000027b0d7c50] [d000000002512644] vtl_c_ioctl+0x5c/0xb20 [mhvtl] (unreliable)
[69749.169522] [c00000027b0d7d20] [c0000000004eb0c4] do_vfs_ioctl+0xd4/0xa70
[69749.169526] [c00000027b0d7de0] [c0000000004ebc34] sys_ioctl+0xc4/0x160
[69749.169531] [c00000027b0d7e30] [c00000000000b388] system_call+0x5c/0x70
[69749.169534] Instruction dump:
[69749.169537] 394a0098 e92a0000 7faa4800 3be9ff78 409e0020 4800054c 60000000 60000000
[69749.169544] e93f0088 7faa4800 3be9ff78 419e0534 <8129ff78> 2f890001 409effe8 7f23cb78
[69749.169552] ---[ end trace e30efff85a7c7b99 ]---
[69749.251953]
[69749.252038] Sending IPI to other CPUs
[69749.268998] IPI complete
[69749.439350] kexec: Starting switchover sequence.

Does that shed any light on what the problem might be?

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
I'll try to install the kernel-debuginfo RPM so that I can run the 'crash' tool on the vmcore file. I'll let you know if I am able to do this and find something interesting.

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
I installed the kernel-debuginfo and kernel-debuginfo-common RPMs on our test machine. Here is what I saw:

root@thom /var/crash/127.0.0.1-2020-01-31-02:34:07
# crash vmcore /usr/lib/debug/lib/modules/4.18.0-147.el8.ppc64le/vmlinux

This GDB was configured as "powerpc64le-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/4.18.0-147.el8.ppc64le/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 160
        DATE: Fri Jan 31 02:28:18 2020
      UPTIME: 02:00:44
LOAD AVERAGE: 0.27, 0.22, 0.19
       TASKS: 1741
    NODENAME: thom
     RELEASE: 4.18.0-147.el8.ppc64le
     VERSION: #1 SMP Thu Sep 26 15:44:34 UTC 2019
     MACHINE: ppc64le  (3425 Mhz)
      MEMORY: 12 GB
       PANIC: "Unable to handle kernel paging request for data at address 0x5deadbeef0000078"
         PID: 7236
     COMMAND: "vtltape"
        TASK: c00000027cf71f00  [THREAD_INFO: c00000027b0d4000]
         CPU: 147
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 7236   TASK: c00000027cf71f00  CPU: 147  COMMAND: "vtltape"
 #0 [c00000027b0d77e0] crash_kexec at c000000000247d40
 #1 [c00000027b0d7820] oops_end at c000000000029cf8
 #2 [c00000027b0d78a0] bad_page_fault at c00000000007b6dc
 #3 [c00000027b0d7910] slb_miss_large_addr at c000000000084020
 #4 [c00000027b0d7960] large_addr_slb at c000000000008bcc
 Data SLB Access [380] exception frame:
 R0:  d000000002512644    R1:  c00000027b0d7c50    R2:  d00000000251ee00  
 R3:  0000000000000000    R4:  0000000000000200    R5:  00007fffd556fdf8  
 R6:  00007fffd556fdf8    R7:  0000000000000000    R8:  0000000000000000  
 R9:  5deadbeef0000100    R10: c0000002e4ef7f58    R11: d000000002514180  
 R12: c000000000d47060    R13: c000000007ec4200    R14: 000000000000000c  
 R15: 0000000010056820    R16: 0000000000000000    R17: 00007fffaf4e0788  
 R18: 0000000000000003    R19: 0000000010056818    R20: 0000000000000010  
 R21: 0000000010053dd0    R22: 0000000000000008    R23: 0000000010055e80  
 R24: 0000000010056828    R25: 00007fffd556fdf8    R26: 0000000000000010  
 R27: 00007fffd556fdf8    R28: c00000027b06e500    R29: d000000002517158  
 R30: d000000002517288    R31: 5deadbeef0000078  
 NIP: d0000000025128c0    MSR: 8000000000009033    OR3: d0000000025128c8
 CTR: c000000000d47060    LR:  d000000002512644    XER: 0000000020000000
 CCR: 0000000028004448    MQ:  0000000000000000    DAR: 5deadbeef0000078
 DSISR: c00000027cf72b60     Syscall Result: 0000000000000000
 [NIP  : vtl_c_ioctl+728]
 [LR   : vtl_c_ioctl+92]
 #5 [c00000027b0d7c50] vtl_c_ioctl at d0000000025128c0 [mhvtl]  (unreliable)
 #6 [c00000027b0d7d20] do_vfs_ioctl at c0000000004eb0c4
 #7 [c00000027b0d7de0] sys_ioctl at c0000000004ebc34
 #8 [c00000027b0d7e30] system_call at c00000000000b388
 System Call [c00] exception frame:
 R0:  0000000000000036    R1:  00007fffd556fb90    R2:  00007fffaf457100  
 R3:  0000000000000003    R4:  0000000000000200    R5:  00007fffd556fdf8  
 R6:  0000000000000000    R7:  0000000000000000    R8:  0000000000000003  
 R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000  
 R12: 0000000000000000    R13: 00007fffaf5bb9a0    R14: 000000000000000c  
 R15: 0000000010056820    R16: 0000000000000000    R17: 00007fffaf4e0788  
 R18: 0000000000000003    R19: 0000000010056818    R20: 0000000000000010  
 R21: 0000000010053dd0    R22: 0000000000000008    R23: 0000000010055e80  
 R24: 0000000010056828    R25: 0000000000000000    R26: 0000000000000195  
 R27: 0000000010027eb0    R28: 0000000010056940    R29: 0000000010056d50  
 R30: 0000000010027ec0    R31: 00007fffd556fdf8  
 NIP: 00007fffaf362bf0    MSR: 800000000000d033    OR3: 0000000000000003
 CTR: 0000000000000000    LR:  0000000010007890    XER: 0000000000000000
 CCR: 0000000048004448    MQ:  0000000000000000    DAR: 00007fffaf2e0160
 DSISR: 0000000040000000     Syscall Result: 0000000000000002

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
I'm not familiar with debugging kernel crashes. Is there any way to correlate the address referenced in the crash backtrace to the line in the source code?
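From what I've read so far (I haven't tried it yet), if the module is built with debug info something like this should map the vtl_c_ioctl+0x2d8 offset to a source line:

    # inside the crash utility: load the module's debug symbols, then
    # disassemble the faulting address with source line annotations
    crash> mod -s mhvtl /path/to/mhvtl.ko
    crash> dis -l vtl_c_ioctl+0x2d8

    # or, with gdb against a module object compiled with -g
    gdb mhvtl.ko
    (gdb) list *(vtl_c_ioctl+0x2d8)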

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
Hi, Mark. I rebuilt the mhvtl code with VERBOSE=3.

I recreated the crash by running one of our tests multiple times. On the 5th execution of the test, I got a kernel crash. When this occurred, the last lines in /var/log/messages before the crash were:

Feb  2 17:14:58 tam /usr/bin/vtltape[74219]: rw_6(): READ: 1 block of 262144 bytes (134179) **
Feb  2 17:14:58 tam /usr/bin/vtltape[74219]: readBlock(): Request to read: 262144 bytes, SILI: 0
Feb  2 17:14:58 tam /usr/bin/vtltape[74219]: read_tape_block(): Reading blk 2345, size: 263176
Feb  2 17:14:58 tam /usr/bin/vtltape[74219]: read_tape_block(): Read block, now positioning to next header: 2346
Feb  2 17:14:58 tam /usr/bin/vtltape[74219]: read_header(): Reading header for block 2346
Feb  2 17:14:58 tam /usr/bin/vtltape[74219]: read_header(): Reading header 2346 at offset 613990111, type: DATA, size: 262144

I analyzed the kernel crash at /var/crash/127.0.0.1-2020-02-02-17:20:59:
      KERNEL: ../vmlinux                        
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 160
        DATE: Sun Feb  2 17:15:03 2020
      UPTIME: 03:17:44
LOAD AVERAGE: 2.73, 0.90, 0.36
       TASKS: 1767
    NODENAME: tam.clearlake.ibm.com
     RELEASE: 4.18.0-147.el8.ppc64le
     VERSION: #1 SMP Thu Sep 26 15:44:34 UTC 2019
     MACHINE: ppc64le  (3425 Mhz)
      MEMORY: 12 GB
       PANIC: "Unable to handle kernel paging request for data at address 0x33312865686360db"
         PID: 74219
     COMMAND: "vtltape"
        TASK: c0000002badfac00  [THREAD_INFO: c0000002bafe0000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 74219  TASK: c0000002badfac00  CPU: 1   COMMAND: "vtltape"
 #0 [c0000002bafe37e0] crash_kexec at c000000000247d40
 #1 [c0000002bafe3820] oops_end at c000000000029cf8
 #2 [c0000002bafe38a0] bad_page_fault at c00000000007b6dc
 #3 [c0000002bafe3910] slb_miss_large_addr at c000000000084020
 #4 [c0000002bafe3960] large_addr_slb at c000000000008bcc
 Data SLB Access [380] exception frame:
 R0:  d0000000025e2b28    R1:  c0000002bafe3c50    R2:  d0000000025ef000  
 R3:  0000000000000000    R4:  0000000000000200    R5:  00007fffc7485e98  
 R6:  00007fffc7485e98    R7:  0000000000000000    R8:  0000000000000000  
 R9:  3331286568636163    R10: c0000002d2f10f98    R11: d0000000025e42b0  
 R12: c000000000d47060    R13: c000000007faee00    R14: 000000000000000c  
 R15: 0000000010056820    R16: 0000000000000000    R17: 00007fffb6830788  
 R18: 0000000000000003    R19: 0000000010056818    R20: 0000000000000010  
 R21: 0000000010053dd0    R22: 0000000000000008    R23: 0000000010055e80  
 R24: 0000000010056828    R25: 00007fffc7485e98    R26: 0000000000000013  
 R27: 00007fffc7485e98    R28: c0000002764d47a8    R29: d0000000025e7360  
 R30: d0000000025e7490    R31: 33312865686360db  
 NIP: d0000000025e2da0    MSR: 8000000000009033    OR3: d0000000025e2da8
 CTR: c000000000d47060    LR:  d0000000025e2b28    XER: 0000000020000000
 CCR: 0000000028004428    MQ:  0000000000000000    DAR: 33312865686360db
 DSISR: c0000002badfb860     Syscall Result: 0000000000000000
 [NIP  : vtl_c_ioctl+728]
 [LR   : vtl_c_ioctl+96]
 #5 [c0000002bafe3c50] vtl_c_ioctl at d0000000025e2da0 [mhvtl]  (unreliable)
 #6 [c0000002bafe3d20] do_vfs_ioctl at c0000000004eb0c4
 #7 [c0000002bafe3de0] sys_ioctl at c0000000004ebc34
 #8 [c0000002bafe3e30] system_call at c00000000000b388
 System Call [c00] exception frame:
 R0:  0000000000000036    R1:  00007fffc7485c30    R2:  00007fffb67a7100  
 R3:  0000000000000003    R4:  0000000000000200    R5:  00007fffc7485e98  
 R6:  0000000000000000    R7:  0000000000000000    R8:  0000000000000003  
 R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000  
 R12: 0000000000000000    R13: 00007fffb690b9a0    R14: 000000000000000c  
 R15: 0000000010056820    R16: 0000000000000000    R17: 00007fffb6830788  
 R18: 0000000000000003    R19: 0000000010056818    R20: 0000000000000010  
 R21: 0000000010053dd0    R22: 0000000000000008    R23: 0000000010055e80  
 R24: 0000000010056828    R25: 0000000000000000    R26: 0000000000000195  
 R27: 0000000010027eb0    R28: 0000000010056940    R29: 0000000010056d50  
 R30: 0000000010027ec0    R31: 00007fffc7485e98  
 NIP: 00007fffb66b2bf0    MSR: 800000000000d033    OR3: 0000000000000003
 CTR: 0000000000000000    LR:  0000000010007890    XER: 0000000000000000
 CCR: 0000000048004428    MQ:  0000000000000000    DAR: 00000100031d7678
 DSISR: 0000000042000000     Syscall Result: 0000000000000003

crash> files
PID: 74219  TASK: c0000002badfac00  CPU: 1   COMMAND: "vtltape"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 c0000002e554d000 c0000002f905c300 c0000002f08cf3d8 CHR  /dev/null
  1 c0000002e5547200 c0000002ca450c00 c0000002ca3e8430 SOCK UNIX
  2 c0000002e5547200 c0000002ca450c00 c0000002ca3e8430 SOCK UNIX
  3 c00000026e881900 c0000002808d5c40 c0000002764d47a8 CHR  /dev/mhvtl19
  4 c00000027b7a6f00 c0000001d8dbd5c0 c0000001d8de0f30 SOCK UNIX
  5 c00000026e88bf00 c0000001c9b14bc0 c0000001b4351c78 REG  /opt/vtl/F01031L5/data
  6 c00000026e88cd00 c0000001c9b1d080 c0000001b4351348 REG  /opt/vtl/F01031L5/indx
  7 c00000026e885300 c0000001c9b1f6c0 c0000001b4354138 REG  /opt/vtl/F01031L5/meta

Is there any other info I can display to help diagnose the problem?

The address specified (0x33312865686360db) is way too high considering we have 12 GB of RAM. What would be causing the vtltape executable to be referencing an out of bounds memory address? According to sites like https://www.linuxquestions.org/questions/linux-kernel-70/kgdb-debugging-bug-unable-to-handle-kernel-paging-request-4175618282/, this could be caused by an invalid pointer reference in the code. I wonder if you can verify all pointer references before trying to use them. Could there be something unique about the ppc64le architecture that is causing these issues? Is there a setting that we need to change on our operating system to make mhvtl work correctly?

Thank you,
Peter

Re: MHVTL 1.6.2 instability on ppc64le

rohr22
Hi, Mark.

Any clues as to what could be wrong? We are continuing to get some of these crashes on our ppc64le machine using your latest code.

Thank you,
Peter

Re: MHVTL 1.6.2 instability on ppc64le

Mark Harvey
Administrator
Somewhere in the ioctl() there is an access to an incorrect pointer.. but it only seems to happen on PPC.

The whole ioctl() "interface" I cobbled together all those years ago (which has held together not too badly) is one of those 'todo' items that needs work.

I'm starting on an implementation based on 'netlink'..
It should do away with the char device driver hacked onto the SCSI HBA driver.

The end goal will be that when a SCSI op code hits the lld driver, it will send a message to the user-space daemon via netlink for processing..

st -> lld -> netlink -> vtltape
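Very roughly, the kernel side would push each CDB up to vtltape with something like this (placeholder names only - none of this exists yet):

    /* Placeholder sketch of the netlink direction; nl_sk would be created at
     * module init with netlink_kernel_create().  Nothing here is real mhvtl
     * code yet. */
    #include <linux/netlink.h>
    #include <net/sock.h>

    static struct sock *nl_sk;

    static void vtl_nl_send_cdb(u32 daemon_portid, const u8 *cdb, int len)
    {
            struct sk_buff *skb = nlmsg_new(len, GFP_KERNEL);
            struct nlmsghdr *nlh;

            if (!skb)
                    return;
            nlh = nlmsg_put(skb, 0, 0, NLMSG_DONE, len, 0);
            memcpy(nlmsg_data(nlh), cdb, len);
            nlmsg_unicast(nl_sk, skb, daemon_portid);   /* deliver to vtltape */
    }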

The current flow is
st -> lld (where the lld stores an 'outstanding opcode' flag). vtltape then polls the lld via ioctl(); if the flag is set, the command is copied from the kernel lld and processed.. otherwise the poll returns 'no flag set' and vtltape sits in a continual loop until an 'opcode set' is detected..
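In other words, the vtltape side looks roughly like this (identifiers are illustrative only, not the actual source):

    /* Illustrative sketch of the current poll-based flow - not the actual code. */
    #include <sys/ioctl.h>
    #include <unistd.h>

    struct vtl_ds { int opcode_flag_set; unsigned char cdb[16]; };  /* hypothetical */
    #define VTL_POLL_STATUS 0x1000                                  /* hypothetical */

    extern void process_command(struct vtl_ds *ds);  /* stands in for processCommand() */

    static void poll_loop(int cdev_fd)
    {
            for (;;) {
                    struct vtl_ds ds;

                    if (ioctl(cdev_fd, VTL_POLL_STATUS, &ds) < 0)
                            break;                    /* device gone or real error */
                    if (!ds.opcode_flag_set) {        /* no flag set: keep polling */
                            usleep(50000);
                            continue;
                    }
                    /* flag set: copy the CDB (and data) from the kernel lld via
                     * further ioctl()s and process the SCSI command */
                    process_command(&ds);
            }
    }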

Regards from Australia
Mark Harvey