I received the following kernel crash while trying to write to an mhvtl tape with version 1.5.3:
KERNEL: vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 8
DATE: Fri Dec 11 17:24:09 2015
UPTIME: 1 days, 06:44:55
LOAD AVERAGE: 1.88, 1.45, 1.15
TASKS: 10897
NODENAME: ---------------
RELEASE: 2.6.32-573.el6.ppc64
VERSION: #1 SMP Wed Jul 1 18:21:11 EDT 2015
MACHINE: ppc64 (3550 Mhz)
MEMORY: 24 GB
PANIC: "Unable to handle kernel paging request for data at address 0x00100070"
PID: 30724
COMMAND: "vtltape"
TASK: c00000043794ce00 [THREAD_INFO: c0000005ea85c000]
CPU: 0
STATE: TASK_RUNNING (PANIC)

The backtrace showed:

crash> bt
PID: 30724 TASK: c00000043794ce00 CPU: 0 COMMAND: "vtltape"
#0 [c0000005ea85f4f0] .crash_kexec at c0000000000ec0e4
#1 [c0000005ea85f6f0] .die at c000000000031638
#2 [c0000005ea85f7a0] .bad_page_fault at c000000000044bd8
#3 [c0000005ea85f820] handle_page_fault at c000000000005228
Data Access error [300] exception frame:
R0: 0000000000000002 R1: c0000005ea85fb10 R2: d000000005fbcbf8
R3: c00000044c9fb780 R4: 0000000000000200 R5: 00000fffffffe868
R6: 00000fffffffe868 R7: 0000000000000000 R8: 0000000000000005
R9: 0000000000100100 R10: c0000000001e6660 R11: c0000005eabc9718
R12: d000000005fb2ea8 R13: c000000001072500 R14: 0000000000000003
R15: 0000000000000000 R16: 00000000100377a0 R17: 000000802d59a980
R18: 0000000010021e10 R19: 00000fffffffe8b0 R20: 00000fffffffea70
R21: 00000000100376e0 R22: 0000000010021e18 R23: 0000000010037868
R24: 0000000000000000 R25: 00000000100378b8 R26: 0000000000000000
R27: 00000fffffffe868 R28: 0000000000000200 R29: ffffffffffffffed
R30: d000000005fbcc08 R31: 0000000000100070
NIP: d000000005fb2310 MSR: 8000000000009032 OR3: c000000000f1cb10
CTR: c0000000005e7b80 LR: d000000005fb202c XER: 0000000000000000
CCR: 0000000022002248 MQ: 0000000000000001 DAR: 0000000000100070
DSISR: 0000000040000000 Syscall Result: 0000000000000000
#4 [c0000005ea85fb10] .vtl_c_ioctl at d000000005fb2310 [mhvtl]
[Link Register ] [c0000005ea85fb10] .vtl_c_ioctl at d000000005fb202c (unreliable)
#5 [c0000005ea85fc00] .vfs_ioctl at c0000000001e5ce4
#6 [c0000005ea85fc90] .do_vfs_ioctl at c0000000001e5f30
#7 [c0000005ea85fd80] .sys_ioctl at c0000000001e6714
#8 [c0000005ea85fe30] syscall_exit at c000000000008564
syscall [c00] exception frame:
R0: 0000000000000036 R1: 00000fffffffe760 R2: 00000080785332d8
R3: 0000000000000003 R4: 0000000000000200 R5: 00000fffffffe868
R6: 0000000000000000 R7: 0000000000000000 R8: 0000000000000005
R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000
R12: 0000000000000000 R13: 000000807834df80 R14: 0000000000000003
R15: 0000000000000000 R16: 00000000100377a0 R17: 000000802d59a980
R18: 0000000010021e10 R19: 00000fffffffe8b0 R20: 00000fffffffea70
R21: 00000000100376e0 R22: 0000000010021e18 R23: 0000000010037868
R24: 0000000000000000 R25: 00000000100378b8 R26: 00000000100376e0
R27: 0000000010037790 R28: 00000000100377a0 R29: 0000000010038258
R30: 00000fffffffe858 R31: 00000fffffffe868
NIP: 0000008078470270 MSR: 800000000000d032 OR3: 0000000000000003
CTR: 00000080784701d0 LR: 000000001000cbf0 XER: 0000000000000000
CCR: 0000000048002248 MQ: 0000000000000001 DAR: 0000008078467d70
DSISR: 0000000040000000 Syscall Result: 00000000014a8000

Is this a known issue and is a fix available?

Thank you, Peter
Administrator
Hello Peter,
Unfortunately, this is a new bug report. It's also the first report I've seen of the vtl running on PPC :) I have no method to troubleshoot or diagnose this; analyzing a kernel oops (unfortunately) exceeds my debug skills.

Hopefully the syslog will show what ioctl() was being utilised at the time of the crash. Do you have the syslog (typically /var/log/messages) leading up to this crash? Enabling kernel debugging may throw more light on what was occurring at the time.

Note: I would dearly love to move away from this custom kernel module (a hacked scsi_debug) to the newer SCSI target driver now shipped with the Linux kernel. I've not found the time to make the changes. With Christmas/New Year fast approaching, I cannot see any free time to do this until February at the earliest. I wish I had better news for you.
Regards from Australia
Mark Harvey
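On the question of which ioctl() was in flight: the syscall exception frame at the bottom of the backtrace already records it. On Linux/powerpc the syscall number is passed in R0 and the arguments in R3, R4, R5 and so on, and 0x36 (54) is the ioctl syscall number there. The following is a minimal sketch, not part of mhvtl, that pulls the file descriptor, command and argument out of the frame text posted above. The helper names `parse_registers` and `decode_ioctl` are made up for illustration.

```python
import re

# First two register rows of the "syscall [c00] exception frame" from the
# December 2015 crash report, copied verbatim.
SYSCALL_FRAME = """
R0: 0000000000000036 R1: 00000fffffffe760 R2: 00000080785332d8
R3: 0000000000000003 R4: 0000000000000200 R5: 00000fffffffe868
"""

POWERPC_NR_IOCTL = 54  # ioctl syscall number on Linux/powerpc (0x36)


def parse_registers(text):
    """Map 'R<n>' names to integers from crash-style 'R<n>: <16 hex digits>' pairs."""
    pairs = re.findall(r"\b(R\d+):\s*([0-9a-fA-F]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def decode_ioctl(frame_text):
    """Return fd, cmd and arg of an ioctl syscall frame (hypothetical helper)."""
    regs = parse_registers(frame_text)
    if regs["R0"] != POWERPC_NR_IOCTL:
        raise ValueError("frame does not describe an ioctl syscall")
    # powerpc syscall convention: arguments start at R3.
    return {"fd": regs["R3"], "cmd": regs["R4"], "arg": regs["R5"]}


call = decode_ioctl(SYSCALL_FRAME)
print(f"ioctl(fd={call['fd']}, cmd={call['cmd']:#x}, arg={call['arg']:#x})")
```

The command value recovered this way (0x200 for this dump) can then be compared with the ioctl definitions in the mhvtl headers to see which request vtltape was issuing.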
Mark, it is interesting that this is the first time you have seen vtl running on PPC. Our PPC system stores words in big-endian format. I think the kernel crash only occurs when we try to create files over 4 GB in size (that is, sizes that no longer fit in 32 bits). Maybe that combination is what causes the crash. Could you briefly look through the code to see whether this mix could be responsible?

Thank you, Peter
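To make the suspected combination concrete, here is a small sketch of how a size above 4 GB behaves when only 32 bits of it are used, and why the outcome differs between big-endian and little-endian machines. It is purely illustrative: the 5 GiB value is an arbitrary example, and nothing here is taken from the mhvtl source or shows that mhvtl actually does this.

```python
import struct

size = 5 * 1024**3  # 5 GiB: hypothetical file size, needs more than 32 bits

# Case 1: the value is assigned to a 32-bit field, so only the low bits survive.
low32 = size & 0xFFFFFFFF

# Case 2: code reads a 32-bit word through a pointer to the 64-bit value.
# Which half it sees depends on the byte order of the machine.
as_big_endian = struct.pack(">Q", size)
as_little_endian = struct.pack("<Q", size)
first_word_be = struct.unpack(">I", as_big_endian[:4])[0]
first_word_le = struct.unpack("<I", as_little_endian[:4])[0]

print(f"64-bit size:              {size:#x}")
print(f"truncated to 32 bits:     {low32:#x}")
print(f"first word, big-endian:   {first_word_be:#x}")
print(f"first word, little-endian: {first_word_le:#x}")
```

On a little-endian machine the mistaken 32-bit read still returns the low half of the value, so small sizes appear to work; on a big-endian machine it returns the high half, which is zero until the size exceeds 4 GB. A bug of this class would therefore only show up on big-endian systems with large files, which is the pattern described above.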
Hi, Mark. We are still running mhvtl on PPC64 systems and are still getting periodic kernel crashes. Yesterday a kernel crash occurred, and running crash vmcore /usr/lib/debug/lib/modules/2.6.32-573.el6.ppc64/vmlinux showed:
This GDB was configured as "powerpc64-unknown-linux-gnu"...
KERNEL: /usr/lib/debug/lib/modules/2.6.32-573.el6.ppc64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 80
DATE: Wed Oct 5 20:23:03 2016
UPTIME: 05:00:15
LOAD AVERAGE: 4.87, 1.74, 1.24
TASKS: 4549
NODENAME: ................................
RELEASE: 2.6.32-573.el6.ppc64
VERSION: #1 SMP Wed Jul 1 18:21:11 EDT 2015
MACHINE: ppc64 (3000 Mhz)
MEMORY: 30 GB
PANIC: "Unable to handle kernel paging request for data at address 0x5bc020000fffe8"
PID: 8512
COMMAND: "vtltape"
TASK: c00000075ce5e5c0 [THREAD_INFO: c00000076c230000]
CPU: 40
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 8512 TASK: c00000075ce5e5c0 CPU: 40 COMMAND: "vtltape"
#0 [c00000076c2334f0] .crash_kexec at c0000000000ec0e4
#1 [c00000076c2336f0] .die at c000000000031638
#2 [c00000076c2337a0] .bad_page_fault at c000000000044bd8
#3 [c00000076c233820] handle_page_fault at c000000000005228
Data Access error [300] exception frame:
R0: 0000000000000000 R1: c00000076c233b10 R2: d000000005d8cbf8
R3: c000000001041a00 R4: 0000000000000200 R5: 00000000008881f8
R6: 00000ffffb35c638 R7: 0000000000000000 R8: 0000000000000005
R9: 005bc02000100078 R10: c000000000d92000 R11: c00000075d183718
R12: d000000005d82ea8 R13: c000000001078900 R14: 0000000000000003
R15: 0000000000000000 R16: 00000000100377a0 R17: 000000801b38a980
R18: 0000000010021e10 R19: 00000ffffb35c680 R20: 00000ffffb35c840
R21: 00000000100376e0 R22: 0000000010021e18 R23: 0000000010037868
R24: 0000000000000000 R25: 00000000100378b8 R26: 0000000000000000
R27: 00000ffffb35c638 R28: 0000000000000200 R29: ffffffffffffffed
R30: d000000005d8cc08 R31: 005bc020000fffe8
NIP: d000000005d82310 MSR: 8000000000009032 OR3: c000000000f1cb10
CTR: c0000000005e7b80 LR: d000000005d8202c XER: 0000000000000000
CCR: 0000000022002248 MQ: 0000000000000001 DAR: 005bc020000fffe8
DSISR: 0000000040000000 Syscall Result: 0000000000000000
#4 [c00000076c233b10] .vtl_c_ioctl at d000000005d82310 [mhvtl]
[Link Register ] [c00000076c233b10] .vtl_c_ioctl at d000000005d8202c (unreliable)
#5 [c00000076c233c00] .vfs_ioctl at c0000000001e5ce4
#6 [c00000076c233c90] .do_vfs_ioctl at c0000000001e5f30
#7 [c00000076c233d80] .sys_ioctl at c0000000001e6714
#8 [c00000076c233e30] syscall_exit at c000000000008564
syscall [c00] exception frame:
R0: 0000000000000036 R1: 00000ffffb35c530 R2: 00000080227132d8
R3: 0000000000000003 R4: 0000000000000200 R5: 00000ffffb35c638
R6: 0000000000000000 R7: 0000000000000000 R8: 0000000000000005
R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000
R12: 0000000000000000 R13: 000000802252dfa0 R14: 0000000000000003
R15: 0000000000000000 R16: 00000000100377a0 R17: 000000801b38a980
R18: 0000000010021e10 R19: 00000ffffb35c680 R20: 00000ffffb35c840
R21: 00000000100376e0 R22: 0000000010021e18 R23: 0000000010037868
R24: 0000000000000000 R25: 00000000100378b8 R26: 00000000100376e0
R27: 0000000010037790 R28: 00000000100377a0 R29: 0000000010038258
R30: 00000ffffb35c628 R31: 00000ffffb35c638
NIP: 0000008022650270 MSR: 800000000000d032 OR3: 0000000000000003
CTR: 00000080226501d0 LR: 000000001000cbf0 XER: 0000000000000000
CCR: 0000000048002248 MQ: 0000000000000001 DAR: 0000010011900000
DSISR: 0000000042000000 Syscall Result: 0000000000000000

I am using the mhvtl from mhvtl-2015-04-14.tgz. Do you think the above problem was resolved in the most current version of mhvtl?

Thank you, Peter
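One detail in the two ppc64 dumps may help narrow this down. Both faults occur at the same offset inside vtl_c_ioctl (NIP ends in ...2310), and in both the faulting address (DAR) sits a fixed distance below the value in R9. In the 2015 dump, R9 is exactly 0x00100100, which is the value the kernel's include/linux/poison.h defines as LIST_POISON1 when no pointer delta is configured (assumed here for this 2.6.32 kernel). The sketch below only does the arithmetic on the register values posted above; the interpretation is a guess and has not been checked against the mhvtl source.

```python
# LIST_POISON1 from include/linux/poison.h, assuming POISON_POINTER_DELTA == 0
# on this 2.6.32 ppc64 kernel (an assumption, not confirmed from its config).
LIST_POISON1 = 0x00100100

# R9 and DAR copied from the "Data Access error [300]" frames of the two dumps.
crashes = {
    "2015-12-11": {"R9": 0x0000000000100100, "DAR": 0x0000000000100070},
    "2016-10-05": {"R9": 0x005BC02000100078, "DAR": 0x005BC020000FFFE8},
}

offsets = {}
for date, regs in crashes.items():
    offsets[date] = regs["R9"] - regs["DAR"]
    poisoned = regs["R9"] == LIST_POISON1
    print(f"{date}: R9 - DAR = {offsets[date]:#x}, R9 is LIST_POISON1: {poisoned}")
```

The constant 0x90 gap would be consistent with the code turning a list pointer into its containing structure and then dereferencing it. A LIST_POISON1 value in that pointer typically means the entry had already been removed with list_del(), so a list being modified while vtl_c_ioctl walks it is one possibility worth checking. The 2016 dump has the same gap but a garbage pointer rather than the poison value, which could be the same problem seen after the memory was reused.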
This bug has probably already been fixed in mhvtl 1.6.2 with kernel module 0.18.28. I have verified that the problem no longer happens on ppc64le, but not on ppc64. We might not use mhvtl on ppc64 again, so it may be hard to verify that the problem no longer occurs in that environment.
Mark, I noticed a new crash with mhvtl 1.6.3 and mhvtl kernel module version 0.18.28 (dated 20200303-0) on a RHEL 8.2 machine. I ran 'crash vmcore' on the crash dump file and saw:
crash> sys
KERNEL: /usr/lib/debug/lib/modules/4.18.0-193.el8.ppc64le/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 160
DATE: Sun Nov 29 03:34:14 2020
UPTIME: 4 days, 08:45:23
LOAD AVERAGE: 27.03, 22.54, 12.25
TASKS: 1665
NODENAME: bain <domain omitted>
RELEASE: 4.18.0-193.el8.ppc64le
VERSION: #1 SMP Fri Mar 27 14:40:12 UTC 2020
MACHINE: ppc64le (3425 Mhz)
MEMORY: 12 GB
PANIC: "Unable to handle kernel paging request for data at address 0x00000000"

crash> bt
PID: 2088558 TASK: c0000002e9363400 CPU: 27 COMMAND: "vtltape"
#0 [c0000002c7f537e0] crash_kexec at c000000000261fd0
#1 [c0000002c7f53820] oops_end at c00000000002b918
#2 [c0000002c7f538a0] bad_page_fault at c00000000007f42c
#3 [c0000002c7f53910] handle_page_fault at c00000000000a720
Data Access [300] exception frame:
R0: d00000000250285c R1: c0000002c7f53c00 R2: c000000001920a00
R3: 0000000000000000 R4: c0000002cb1a4bc0 R5: c0000002cb1a4bc0
R6: c0000002e633d540 R7: 0000000000000001 R8: 035ffffc00000101
R9: c0000002d22a0b40 R10: c0000002cb2c6540 R11: d000000002503fe8
R12: c00000000095a190 R13: c000000007f78200 R14: 0000000000000001
R15: 0000000010056820 R16: 0000000000000000 R17: 00007fff8cc80788
R18: 0000000000000003 R19: 0000000010056818 R20: 5deadbeef0000100
R21: 5deadbeef0000200 R22: 0000000000000000 R23: d0000000025073b8
R24: c0000002825dd810 R25: c0000002cb1a4bc0 R26: 0000000000000070
R27: d000000002507f80 R28: 0000000000000000 R29: d000000002507268
R30: d000000002507398 R31: c0000002cb2c6540
NIP: c00000000095a1b4 MSR: 800000000280b033 OR3: c000000000008934
CTR: c00000000095a190 LR: d00000000250285c XER: 0000000020000000
CCR: 0000000028004244 MQ: 0000000000000000 DAR: 0000000000000000
DSISR: 0000000040000000 Syscall Result: 0000000000000000
[NIP : scsi_remove_device+36] [LR : vtl_c_ioctl+660]
#4 [c0000002c7f53c00] scsi_remove_device at c00000000095a1b4
#5 [c0000002c7f53c30] vtl_c_ioctl at d00000000250285c [mhvtl] (unreliable)
#6 [c0000002c7f53d10] do_vfs_ioctl at c0000000005245f0
#7 [c0000002c7f53de0] sys_ioctl at c000000000525184
#8 [c0000002c7f53e30] system_call at c00000000000b388
System Call [c00] exception frame:
R0: 0000000000000036 R1: 00007fffc56584c0 R2: 00007fff8cbf7100
R3: 0000000000000003 R4: 0000000000000205 R5: 00007fffc5658748
R6: 0000ff00ffffffff R7: 0000007470002074 R8: 0000000000000003
R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000
R12: 0000000000000000 R13: 00007fff8cd5ba20 R14: 0000000000000001
R15: 0000000010056820 R16: 0000000000000000 R17: 00007fff8cc80788
R18: 0000000000000003 R19: 0000000010056818 R20: 0000000000000010
R21: 0000000010053dd0 R22: 000000000000000f R23: 0000000010055e80
R24: 0000000010056828 R25: 0000000010055e80 R26: 00000000000598d5
R27: 0000000010027eb0 R28: 0000000010056940 R29: 0000000010056d50
R30: 0000000010027ec0 R31: 0000000000000080
NIP: 00007fff8cb02ab0 MSR: 800000000000d033 OR3: 0000000000000003
CTR: 0000000000000000 LR: 00000000100083b4 XER: 0000000000000000
CCR: 0000000048004428 MQ: 0000000000000000 DAR: 00007fff8caf9860
DSISR: 0000000040000000 Syscall Result: 0000000000000070

It looks like a NULL pointer is being accessed ("Unable to handle kernel paging request for data at address 0x00000000"). Any clues as to what went wrong? Since the ppc64le updates earlier in the year, mhvtl has been very reliable. I think this is the first crash I have noticed since then.

Thank you, Peter
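A few register values in this dump can be checked mechanically. The sketch below uses only numbers from the fault frame above, plus two assumptions that are labeled in the code: that this ppc64le kernel uses 0x5deadbeef0000000 as its illegal-pointer delta, and that LIST_POISON1 and LIST_POISON2 are 0x100 and 0x200 above that delta, as in mainline kernels of that era. It is a reading aid, not a diagnosis.

```python
# Assumptions (not confirmed from this kernel's config or headers):
POISON_POINTER_DELTA = 0x5DEADBEEF0000000   # CONFIG_ILLEGAL_POINTER_VALUE on ppc64
LIST_POISON1 = 0x100 + POISON_POINTER_DELTA  # written to entry->next by list_del()
LIST_POISON2 = 0x200 + POISON_POINTER_DELTA  # written to entry->prev by list_del()

# Values copied from the "Data Access [300]" exception frame.
frame = {
    "R3": 0x0000000000000000,
    "R20": 0x5DEADBEEF0000100,
    "R21": 0x5DEADBEEF0000200,
    "NIP": 0xC00000000095A1B4,
    "CTR": 0xC00000000095A190,
    "DAR": 0x0000000000000000,
}

# CTR still holds the address used for the call into scsi_remove_device, so the
# difference should match the "+36" that crash prints next to NIP.
nip_offset = frame["NIP"] - frame["CTR"]

poison_names = {LIST_POISON1: "LIST_POISON1", LIST_POISON2: "LIST_POISON2"}
poison_regs = {
    reg: poison_names[value]
    for reg, value in frame.items()
    if value in poison_names
}

print(f"fault at scsi_remove_device+{nip_offset}, DAR = {frame['DAR']:#x}")
print(f"first-argument register R3 = {frame['R3']:#x}")
print(f"registers holding list-poison values: {poison_regs}")
```

R3 carries the first argument on entry under the ppc64 ABI and is zero at the fault, which would be consistent with scsi_remove_device() having been handed a NULL device pointer, although R3 could also have been reused by that point in the function. The poison values in R20 and R21 suggest the caller had list_del() constants live at the time, so a device removed twice, or removed while another path was still tearing it down, is one scenario to look at. The ioctl command in the syscall frame is 0x205 (R4), which can be matched against the mhvtl headers to confirm which request triggered it.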