Kernel crash with mhvtl

rohr22
I received the following kernel crash while trying to write to an mhvtl tape with version 1.5.3:

      KERNEL: vmlinux                          
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 8
        DATE: Fri Dec 11 17:24:09 2015
      UPTIME: 1 days, 06:44:55
LOAD AVERAGE: 1.88, 1.45, 1.15
       TASKS: 10897
    NODENAME: ---------------
     RELEASE: 2.6.32-573.el6.ppc64
     VERSION: #1 SMP Wed Jul 1 18:21:11 EDT 2015
     MACHINE: ppc64  (3550 Mhz)
      MEMORY: 24 GB
       PANIC: "Unable to handle kernel paging request for data at address 0x00100070"
         PID: 30724
     COMMAND: "vtltape"
        TASK: c00000043794ce00  [THREAD_INFO: c0000005ea85c000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

The backtrace showed:
crash> bt
PID: 30724  TASK: c00000043794ce00  CPU: 0   COMMAND: "vtltape"
 #0 [c0000005ea85f4f0] .crash_kexec at c0000000000ec0e4
 #1 [c0000005ea85f6f0] .die at c000000000031638
 #2 [c0000005ea85f7a0] .bad_page_fault at c000000000044bd8
 #3 [c0000005ea85f820] handle_page_fault at c000000000005228
 Data Access error  [300] exception frame:
 R0:  0000000000000002    R1:  c0000005ea85fb10    R2:  d000000005fbcbf8  
 R3:  c00000044c9fb780    R4:  0000000000000200    R5:  00000fffffffe868  
 R6:  00000fffffffe868    R7:  0000000000000000    R8:  0000000000000005  
 R9:  0000000000100100    R10: c0000000001e6660    R11: c0000005eabc9718  
 R12: d000000005fb2ea8    R13: c000000001072500    R14: 0000000000000003  
 R15: 0000000000000000    R16: 00000000100377a0    R17: 000000802d59a980  
 R18: 0000000010021e10    R19: 00000fffffffe8b0    R20: 00000fffffffea70  
 R21: 00000000100376e0    R22: 0000000010021e18    R23: 0000000010037868  
 R24: 0000000000000000    R25: 00000000100378b8    R26: 0000000000000000  
 R27: 00000fffffffe868    R28: 0000000000000200    R29: ffffffffffffffed  
 R30: d000000005fbcc08    R31: 0000000000100070  
 NIP: d000000005fb2310    MSR: 8000000000009032    OR3: c000000000f1cb10
 CTR: c0000000005e7b80    LR:  d000000005fb202c    XER: 0000000000000000
 CCR: 0000000022002248    MQ:  0000000000000001    DAR: 0000000000100070
 DSISR: 0000000040000000     Syscall Result: 0000000000000000
 #4 [c0000005ea85fb10] .vtl_c_ioctl at d000000005fb2310 [mhvtl]
 [Link Register ]  [c0000005ea85fb10] .vtl_c_ioctl at d000000005fb202c  (unreliable)
 #5 [c0000005ea85fc00] .vfs_ioctl at c0000000001e5ce4
 #6 [c0000005ea85fc90] .do_vfs_ioctl at c0000000001e5f30
 #7 [c0000005ea85fd80] .sys_ioctl at c0000000001e6714
 #8 [c0000005ea85fe30] syscall_exit at c000000000008564
 syscall  [c00] exception frame:
 R0:  0000000000000036    R1:  00000fffffffe760    R2:  00000080785332d8  
 R3:  0000000000000003    R4:  0000000000000200    R5:  00000fffffffe868  
 R6:  0000000000000000    R7:  0000000000000000    R8:  0000000000000005  
 R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000  
 R12: 0000000000000000    R13: 000000807834df80    R14: 0000000000000003  
 R15: 0000000000000000    R16: 00000000100377a0    R17: 000000802d59a980  
 R18: 0000000010021e10    R19: 00000fffffffe8b0    R20: 00000fffffffea70  
 R21: 00000000100376e0    R22: 0000000010021e18    R23: 0000000010037868  
 R24: 0000000000000000    R25: 00000000100378b8    R26: 00000000100376e0  
 R27: 0000000010037790    R28: 00000000100377a0    R29: 0000000010038258  
 R30: 00000fffffffe858    R31: 00000fffffffe868  
 NIP: 0000008078470270    MSR: 800000000000d032    OR3: 0000000000000003
 CTR: 00000080784701d0    LR:  000000001000cbf0    XER: 0000000000000000
 CCR: 0000000048002248    MQ:  0000000000000001    DAR: 0000008078467d70
 DSISR: 0000000040000000     Syscall Result: 00000000014a8000

Is this a known issue and is a fix available?

Thank you,
Peter

Re: Kernel crash with mhvtl

Mark Harvey
Administrator
Hello Peter,

Unfortunately, this is a new bug report. It's also the first report I've seen of the vtl running on PPC :)

I have no way to troubleshoot or diagnose this. Analyzing a kernel oops is (unfortunately) beyond my debugging skills. Hopefully the syslog will show which ioctl() was in use at the time of the crash.

Do you have the syslog (typically /var/log/messages) leading up to this crash?
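
The syscall exception frame at the bottom of the backtrace may already answer part of this. As a rough sketch, assuming the usual powerpc Linux convention that R0 holds the syscall number and R3 to R5 hold the first three arguments, the ioctl request can be read from R4. The helper names parse_regs and ioctl_args are made up for illustration, and the register values are copied from the first dump in this thread.

```python
import re

# Syscall exception frame lines copied from the first dump in this thread.
FRAME = """
 R0:  0000000000000036    R1:  00000fffffffe760    R2:  00000080785332d8
 R3:  0000000000000003    R4:  0000000000000200    R5:  00000fffffffe868
"""

def parse_regs(text):
    """Return {register name: integer value} for every 'Rn: hex' pair."""
    return {name: int(val, 16)
            for name, val in re.findall(r"\b(R\d+):\s+([0-9a-fA-F]+)", text)}

def ioctl_args(regs):
    """Assumed powerpc convention: R0 = syscall number, R3..R5 = arguments."""
    return {"syscall": regs["R0"], "fd": regs["R3"],
            "request": regs["R4"], "arg": regs["R5"]}

args = ioctl_args(parse_regs(FRAME))
print("syscall %d, fd %d, request %#x, arg %#x"
      % (args["syscall"], args["fd"], args["request"], args["arg"]))
```

Syscall 54 is ioctl on powerpc, so this frame would be an ioctl on file descriptor 3 with request 0x200. What that request number means would still have to be checked against the mhvtl headers, and the syslog remains useful for the events leading up to the crash.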

Enabling kernel debugging may throw more light on what was occurring at the time.

Note: I would dearly love to move away from this custom kernel module (a hacked scsi_debug) to the newer SCSI target driver now shipped with the Linux kernel. I've not found the time to make the changes. With Christmas/New Year fast approaching, I cannot see any free time to do this until February at the earliest.

I wish I had better news for you.
Regards from Australia
Mark Harvey

Re: Kernel crash with mhvtl

rohr22
Mark, it is interesting that this is the first time you are aware of vtl running on PPC. Our PPC system uses the big-endian format for storing words. I think the kernel crash only occurs when we are trying to create files over 4 GB in size (sizes that need more than 32 bits). Something in that combination may be causing the crash. Could you take a brief look at the code to see whether that mix could be the cause?
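
To make the suspicion concrete, here is a small sketch of one way a size above 4 GB can behave differently on big-endian and little-endian machines: a 64-bit value read back through a 32-bit view. This only illustrates the general hazard and is not a claim about what the mhvtl code does. The 5 GiB size and the name first_word are invented for the example.

```python
import struct

SIZE = 5 * 1024**3  # 5 GiB = 0x140000000, needs more than 32 bits

def first_word(value, byteorder):
    """Store value as 64 bits, then read only the first 4 bytes as 32 bits.

    byteorder is '>' for big-endian or '<' for little-endian (struct syntax).
    This mimics code that treats a 64-bit field as a 32-bit one.
    """
    raw = struct.pack(byteorder + "Q", value)
    return struct.unpack(byteorder + "I", raw[:4])[0]

big = first_word(SIZE, ">")     # high half of the value
little = first_word(SIZE, "<")  # low half of the value
print("big-endian: %#x, little-endian: %#x" % (big, little))
```

The two byte orders disagree about what the truncated value is, which is the kind of difference that testing on x86 alone would not expose.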

Thank you,
Peter

Re: Kernel crash with mhvtl

rohr22
Hi, Mark. We are still running mhvtl on PPC64 systems and are still getting periodic kernel crashes. A kernel crash occurred yesterday, and 'crash vmcore /usr/lib/debug/lib/modules/2.6.32-573.el6.ppc64/vmlinux' showed:

This GDB was configured as "powerpc64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-573.el6.ppc64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 80
        DATE: Wed Oct  5 20:23:03 2016
      UPTIME: 05:00:15
LOAD AVERAGE: 4.87, 1.74, 1.24
       TASKS: 4549
    NODENAME: ................................
     RELEASE: 2.6.32-573.el6.ppc64
     VERSION: #1 SMP Wed Jul 1 18:21:11 EDT 2015
     MACHINE: ppc64  (3000 Mhz)
      MEMORY: 30 GB
       PANIC: "Unable to handle kernel paging request for data at address 0x5bc020000fffe8"
         PID: 8512
     COMMAND: "vtltape"
        TASK: c00000075ce5e5c0  [THREAD_INFO: c00000076c230000]
         CPU: 40
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 8512   TASK: c00000075ce5e5c0  CPU: 40  COMMAND: "vtltape"
 #0 [c00000076c2334f0] .crash_kexec at c0000000000ec0e4
 #1 [c00000076c2336f0] .die at c000000000031638
 #2 [c00000076c2337a0] .bad_page_fault at c000000000044bd8
 #3 [c00000076c233820] handle_page_fault at c000000000005228
 Data Access error  [300] exception frame:
 R0:  0000000000000000    R1:  c00000076c233b10    R2:  d000000005d8cbf8  
 R3:  c000000001041a00    R4:  0000000000000200    R5:  00000000008881f8  
 R6:  00000ffffb35c638    R7:  0000000000000000    R8:  0000000000000005  
 R9:  005bc02000100078    R10: c000000000d92000    R11: c00000075d183718  
 R12: d000000005d82ea8    R13: c000000001078900    R14: 0000000000000003  
 R15: 0000000000000000    R16: 00000000100377a0    R17: 000000801b38a980  
 R18: 0000000010021e10    R19: 00000ffffb35c680    R20: 00000ffffb35c840  
 R21: 00000000100376e0    R22: 0000000010021e18    R23: 0000000010037868  
 R24: 0000000000000000    R25: 00000000100378b8    R26: 0000000000000000  
 R27: 00000ffffb35c638    R28: 0000000000000200    R29: ffffffffffffffed  
 R30: d000000005d8cc08    R31: 005bc020000fffe8  
 NIP: d000000005d82310    MSR: 8000000000009032    OR3: c000000000f1cb10
 CTR: c0000000005e7b80    LR:  d000000005d8202c    XER: 0000000000000000
 CCR: 0000000022002248    MQ:  0000000000000001    DAR: 005bc020000fffe8
 DSISR: 0000000040000000     Syscall Result: 0000000000000000
 #4 [c00000076c233b10] .vtl_c_ioctl at d000000005d82310 [mhvtl]
 [Link Register ]  [c00000076c233b10] .vtl_c_ioctl at d000000005d8202c  (unreliable)
 #5 [c00000076c233c00] .vfs_ioctl at c0000000001e5ce4
 #6 [c00000076c233c90] .do_vfs_ioctl at c0000000001e5f30
 #7 [c00000076c233d80] .sys_ioctl at c0000000001e6714
 #8 [c00000076c233e30] syscall_exit at c000000000008564
 syscall  [c00] exception frame:
 R0:  0000000000000036    R1:  00000ffffb35c530    R2:  00000080227132d8  
 R3:  0000000000000003    R4:  0000000000000200    R5:  00000ffffb35c638  
 R6:  0000000000000000    R7:  0000000000000000    R8:  0000000000000005  
 R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000  
 R12: 0000000000000000    R13: 000000802252dfa0    R14: 0000000000000003  
 R15: 0000000000000000    R16: 00000000100377a0    R17: 000000801b38a980  
 R18: 0000000010021e10    R19: 00000ffffb35c680    R20: 00000ffffb35c840  
 R21: 00000000100376e0    R22: 0000000010021e18    R23: 0000000010037868  
 R24: 0000000000000000    R25: 00000000100378b8    R26: 00000000100376e0  
 R27: 0000000010037790    R28: 00000000100377a0    R29: 0000000010038258  
 R30: 00000ffffb35c628    R31: 00000ffffb35c638  
 NIP: 0000008022650270    MSR: 800000000000d032    OR3: 0000000000000003
 CTR: 00000080226501d0    LR:  000000001000cbf0    XER: 0000000000000000
 CCR: 0000000048002248    MQ:  0000000000000001    DAR: 0000010011900000
 DSISR: 0000000042000000     Syscall Result: 0000000000000000

I am using the mhvtl from mhvtl-2015-04-14.tgz. Do you think the above problem has been resolved in the most current version of mhvtl?
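
One detail the two ppc64 dumps share may help narrow this down: in both Data Access frames the faulting address (DAR, also in R31) sits exactly 0x90 below the value in R9. The sketch below just redoes that arithmetic with values copied from the dumps. Reading this as a pointer being converted to its containing structure, and comparing R9 with the LIST_POISON1 constant that kernels of that era wrote into deleted list entries (0x00100100), are assumptions, not something the dumps prove.

```python
# (R9, DAR) pairs copied from the Data Access frames of the two ppc64 dumps.
DUMPS = {
    "2015-12-11": (0x0000000000100100, 0x0000000000100070),
    "2016-10-05": (0x005bc02000100078, 0x005bc020000fffe8),
}

# Assumption: value of LIST_POISON1 in 2.6.32-era kernels with no poison offset.
LIST_POISON1 = 0x00100100

for date, (r9, dar) in DUMPS.items():
    offset = r9 - dar
    poisoned = (r9 == LIST_POISON1)
    print("%s: R9 - DAR = %#x, R9 is LIST_POISON1: %s" % (date, offset, poisoned))
```

If that reading holds, both crashes would come from the same code path in vtl_c_ioctl following a stale pointer, which might point at a locking or lifetime problem rather than at file size.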

Thank you,
Peter

Re: Kernel crash with mhvtl

rohr22
This bug has probably already been fixed in mhvtl 1.6.2 with the mhvtl kernel version 0.18.28. I have verified that the problem no longer happens on ppc64le, but not on ppc64. We might not use mhvtl on ppc64 again, so it may be hard to verify that the problem no longer occurs in that environment.

Re: Kernel crash with mhvtl

rohr22
Mark, I noticed a new crash with mhvtl 1.6.3 and mhvtl kernel version 0.18.28 (dated 20200303-0) on a RHEL 8.2 machine. I ran 'crash vmcore' on the crash dump file and saw:

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/4.18.0-193.el8.ppc64le/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 160
        DATE: Sun Nov 29 03:34:14 2020
      UPTIME: 4 days, 08:45:23
LOAD AVERAGE: 27.03, 22.54, 12.25
       TASKS: 1665
    NODENAME: bain <domain omitted>
     RELEASE: 4.18.0-193.el8.ppc64le
     VERSION: #1 SMP Fri Mar 27 14:40:12 UTC 2020
     MACHINE: ppc64le  (3425 Mhz)
      MEMORY: 12 GB
       PANIC: "Unable to handle kernel paging request for data at address 0x00000000"

crash> bt
PID: 2088558  TASK: c0000002e9363400  CPU: 27  COMMAND: "vtltape"
 #0 [c0000002c7f537e0] crash_kexec at c000000000261fd0
 #1 [c0000002c7f53820] oops_end at c00000000002b918
 #2 [c0000002c7f538a0] bad_page_fault at c00000000007f42c
 #3 [c0000002c7f53910] handle_page_fault at c00000000000a720
 Data Access [300] exception frame:
 R0:  d00000000250285c    R1:  c0000002c7f53c00    R2:  c000000001920a00  
 R3:  0000000000000000    R4:  c0000002cb1a4bc0    R5:  c0000002cb1a4bc0  
 R6:  c0000002e633d540    R7:  0000000000000001    R8:  035ffffc00000101  
 R9:  c0000002d22a0b40    R10: c0000002cb2c6540    R11: d000000002503fe8  
 R12: c00000000095a190    R13: c000000007f78200    R14: 0000000000000001  
 R15: 0000000010056820    R16: 0000000000000000    R17: 00007fff8cc80788  
 R18: 0000000000000003    R19: 0000000010056818    R20: 5deadbeef0000100  
 R21: 5deadbeef0000200    R22: 0000000000000000    R23: d0000000025073b8  
 R24: c0000002825dd810    R25: c0000002cb1a4bc0    R26: 0000000000000070  
 R27: d000000002507f80    R28: 0000000000000000    R29: d000000002507268  
 R30: d000000002507398    R31: c0000002cb2c6540  
 NIP: c00000000095a1b4    MSR: 800000000280b033    OR3: c000000000008934
 CTR: c00000000095a190    LR:  d00000000250285c    XER: 0000000020000000
 CCR: 0000000028004244    MQ:  0000000000000000    DAR: 0000000000000000
 DSISR: 0000000040000000     Syscall Result: 0000000000000000
 [NIP  : scsi_remove_device+36]
 [LR   : vtl_c_ioctl+660]
 #4 [c0000002c7f53c00] scsi_remove_device at c00000000095a1b4
 #5 [c0000002c7f53c30] vtl_c_ioctl at d00000000250285c [mhvtl]  (unreliable)
 #6 [c0000002c7f53d10] do_vfs_ioctl at c0000000005245f0
 #7 [c0000002c7f53de0] sys_ioctl at c000000000525184
 #8 [c0000002c7f53e30] system_call at c00000000000b388
 System Call [c00] exception frame:
 R0:  0000000000000036    R1:  00007fffc56584c0    R2:  00007fff8cbf7100  
 R3:  0000000000000003    R4:  0000000000000205    R5:  00007fffc5658748  
 R6:  0000ff00ffffffff    R7:  0000007470002074    R8:  0000000000000003  
 R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000  
 R12: 0000000000000000    R13: 00007fff8cd5ba20    R14: 0000000000000001  
 R15: 0000000010056820    R16: 0000000000000000    R17: 00007fff8cc80788  
 R18: 0000000000000003    R19: 0000000010056818    R20: 0000000000000010  
 R21: 0000000010053dd0    R22: 000000000000000f    R23: 0000000010055e80  
 R24: 0000000010056828    R25: 0000000010055e80    R26: 00000000000598d5  
 R27: 0000000010027eb0    R28: 0000000010056940    R29: 0000000010056d50  
 R30: 0000000010027ec0    R31: 0000000000000080  
 NIP: 00007fff8cb02ab0    MSR: 800000000000d033    OR3: 0000000000000003
 CTR: 0000000000000000    LR:  00000000100083b4    XER: 0000000000000000
 CCR: 0000000048004428    MQ:  0000000000000000    DAR: 00007fff8caf9860
 DSISR: 0000000040000000     Syscall Result: 0000000000000070

It looks like there may be an access of a NULL pointer (Unable to handle kernel paging request for data at address 0x00000000). Any clues as to what went wrong?
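
The exception frame seems consistent with that. Below is a small sketch that pulls the relevant fields out of the bt text. The function name summarize is made up, the lines are copied from the dump above, and treating R3 as the first argument to scsi_remove_device assumes the standard 64-bit PowerPC calling convention.

```python
import re

# Lines copied from the Data Access frame of the ppc64le dump.
BT = """
 R3:  0000000000000000    R4:  c0000002cb1a4bc0    R5:  c0000002cb1a4bc0
 CCR: 0000000028004244    MQ:  0000000000000000    DAR: 0000000000000000
 [NIP  : scsi_remove_device+36]
 [LR   : vtl_c_ioctl+660]
"""

def summarize(text):
    """Collect the faulting address, first argument, and NIP/LR symbols."""
    def hexval(name):
        # Matches e.g. 'DAR: 0000000000000000' and returns the integer value.
        return int(re.search(r"\b%s:\s+([0-9a-f]+)" % name, text).group(1), 16)

    def symbol(name):
        # Matches e.g. '[NIP  : scsi_remove_device+36]' and returns the symbol.
        return re.search(r"\[%s\s*:\s*(\S+?)\]" % name, text).group(1)

    return {"dar": hexval("DAR"), "r3": hexval("R3"),
            "nip": symbol("NIP"), "lr": symbol("LR")}

info = summarize(BT)
if info["dar"] == 0 and info["r3"] == 0:
    print("%s faulted at address 0 with a NULL first argument, called from %s"
          % (info["nip"], info["lr"]))
```

So the module appears to have handed a NULL device pointer to scsi_remove_device during the 0x205 request seen in R4 of the syscall frame. R20 and R21 in the Data Access frame hold 5deadbeef0000100 and 5deadbeef0000200, which resemble the kernel's list poison values on ppc64, so a device being removed twice is one possibility that may be worth checking.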

Since the updates earlier in the year for ppc64le, mhvtl has been very reliable. I think this is the first crash since then that I have noticed.

Thank you,
Peter