An ad-hoc collection of notes on IA64 MCA and INIT processing

Feel free to update it with notes about any area that is not clear.

MCA/INIT are completely asynchronous. They can occur at any time, when the OS is in any state. Including when one of the cpus is already holding a spinlock. Trying to get any lock from MCA/INIT state is asking for deadlock. Also the state of structures that are protected by locks is indeterminate, including linked lists.
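
For illustration only, here is a minimal sketch, not taken from the ia64 code and using a hypothetical lock, of the only pattern that is even arguably safe in this context: try the lock, never spin on it, and treat the protected data as suspect either way.

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(example_lock);           /* hypothetical lock */

    static void mca_peek_at_shared_state(void)
    {
            if (!spin_trylock(&example_lock))
                    return;  /* owner may be the cpu that took the MCA */
            /* ... read the shared state, treating it as possibly stale ... */
            spin_unlock(&example_lock);
    }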

The ia64 MCA process is complicated, but all of it is mandated by Intel’s specification for ia64 SAL, error recovery and unwind; it is not as if we have a choice here.

  • MCA occurs on one cpu, usually due to a double bit memory error. This is the monarch cpu.
  • SAL sends an MCA rendezvous interrupt (which is a normal interrupt) to all the other cpus, the slaves.
  • Slave cpus that receive the MCA interrupt call down into SAL; they end up spinning disabled while the MCA is being serviced.
  • If any slave cpu was already spinning disabled when the MCA occurred then it cannot service the MCA interrupt. SAL waits ~20 seconds then sends an unmaskable INIT event to the slave cpus that have not already rendezvoused.
  • Because MCA/INIT can be delivered at any time, including when the cpu is down in PAL in physical mode, the registers at the time of the event are _completely_ undefined. In particular the MCA/INIT handlers cannot rely on the thread pointer; PAL physical mode can (and does) modify TP. It is allowed to do that as long as it resets TP on return. However MCA/INIT events expose us to these PAL internal TP changes. Hence curr_task().
  • If an MCA/INIT event occurs while the kernel was running (not user space) and the kernel has called PAL then the MCA/INIT handler cannot assume that the kernel stack is in a fit state to be used. Mainly because PAL may or may not maintain the stack pointer internally. Because the MCA/INIT handlers cannot trust the kernel stack, they have to use their own, per-cpu stacks. The MCA/INIT stacks are preformatted with just enough task state to let the relevant handlers do their job.
  • Unlike most other architectures, the ia64 struct task is embedded in the kernel stack[1]. So switching to a new kernel stack means that we switch to a new task as well. Because various bits of the kernel assume that current points into the struct task, switching to a new stack also means a new value for current.
  • Once all slaves have rendezvoused and are spinning disabled, the monarch is entered. The monarch now tries to diagnose the problem and decide if it can recover or not.
  • Part of the monarch’s job is to look at the state of all the other tasks. The only way to do that on ia64 is to call the unwinder, as mandated by Intel.
  • The starting point for the unwind depends on whether a task is running or not; that is, whether it is on a cpu or is blocked. The monarch has to determine whether or not a task is on a cpu before it knows how to start unwinding it. The tasks that received an MCA or INIT event are no longer running; they have been converted to blocked tasks. But (and it’s a big but), the cpus that received the MCA rendezvous interrupt are still running on their normal kernel stacks!
  • To distinguish between these two cases, the monarch must know which tasks are on a cpu and which are not. Hence each slave cpu that switches to an MCA/INIT stack registers its new stack using set_curr_task(), so the monarch can tell that the _original_ task is no longer running on that cpu (a sketch of this handover appears after this list). That gives us a decent chance of getting a valid backtrace of the _original_ task.
  • MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a nested error, we want diagnostics on the MCA/INIT handler that failed, not on the task that was originally running. Again this requires set_curr_task() so the MCA/INIT handlers can register their own stack as running on that cpu. Then a recursive error gets a trace of the failing handler’s “task”.
[1]
My (Keith Owens) original design called for ia64 to separate its struct task and the kernel stacks. Then the MCA/INIT data would be chained stacks like i386 interrupt stacks. But that required radical surgery on the rest of ia64, plus extra hard wired TLB entries with its associated performance degradation. David Mosberger vetoed that approach. Which meant that separate kernel stacks meant separate “tasks” for the MCA/INIT handlers.
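
As a rough sketch of that handover, assuming the curr_task(cpu) and set_curr_task(cpu, task) signatures of this era and the sos->prev_task field described later in this document; the helper below is illustrative, not the code in arch/ia64/kernel/mca.c:

    #include <linux/sched.h>
    #include <linux/smp.h>
    #include <asm/mca.h>            /* struct ia64_sal_os_state (assumed) */

    static void slave_switch_to_mca_stack(struct task_struct *mca_task,
                                          struct ia64_sal_os_state *sos)
    {
            int cpu = smp_processor_id();

            /* Remember the interrupted task so the monarch can unwind it. */
            sos->prev_task = curr_task(cpu);

            /* From now on, "current" on this cpu is the MCA/INIT handler. */
            set_curr_task(cpu, mca_task);
    }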

INIT is less complicated than MCA. Pressing the nmi button or using the equivalent command on the management console sends INIT to all cpus. SAL picks one of the cpus as the monarch and the rest are slaves. All the OS INIT handlers are entered at approximately the same time. The OS monarch prints the state of all tasks and returns, after which the slaves return and the system resumes.

At least that is what is supposed to happen. Alas there are broken versions of SAL out there. Some drive all the cpus as monarchs. Some drive them all as slaves. Some drive one cpu as monarch, wait for that cpu to return from the OS then drive the rest as slaves. Some versions of SAL cannot even cope with returning from the OS; they spin inside SAL on resume. The OS INIT code has workarounds for some of these broken SAL symptoms, but some simply cannot be fixed from the OS side.

The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer violations. Unfortunately MCA/INIT start off as massive layer violations (can occur at _any_ time) and they build from there.

At least ia64 makes an attempt at recovering from hardware errors, but it is a difficult problem because of the asynchronous nature of these errors. When processing an unmaskable interrupt we sometimes need special code to cope with our inability to take any locks.

How is ia64 MCA/INIT different from x86 NMI?

  • x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to all cpus.
  • x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2 per cpu.
  • x86 has a separate struct task which points to one of multiple kernel stacks. ia64 has the struct task embedded in the single kernel stack, so switching stack means switching task (see the conceptual sketch after this list).
  • x86 does not call the BIOS so the NMI handler does not have to worry about any registers having changed. MCA/INIT can occur while the cpu is in PAL in physical mode, with undefined registers and an undefined kernel stack.
  • i386 backtrace is not very sensitive to whether a process is running or not. ia64 unwind is very, very sensitive to whether a process is running or not.
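
The stack layout point is easiest to see in a conceptual sketch; the union below is illustrative only, not the real ia64 definitions, and the header holding KERNEL_STACK_SIZE is an assumption here:

    #include <linux/sched.h>
    #include <asm/ptrace.h>         /* KERNEL_STACK_SIZE (assumed) */
    #include <asm/thread_info.h>

    /* One allocation holds the task, its thread_info and its stacks, so
     * switching to a different stack region necessarily changes the task. */
    union ia64_style_stack {
            struct {
                    struct task_struct task;  /* at the base of the region */
                    struct thread_info info;  /* immediately above it      */
            } s;
            unsigned long space[KERNEL_STACK_SIZE / sizeof(unsigned long)];
    };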

What happens when MCA/INIT is delivered while a cpu is running user space code?

The user mode registers are stored in the RSE area of the MCA/INIT stack on entry to the OS and are restored from there on return to SAL, so user mode registers are preserved across a recoverable MCA/INIT. Since the OS has no idea what unwind data is available for the user space stack, MCA/INIT never tries to backtrace user space. Which means that the OS does not bother making the user space process look like a blocked task, i.e. the OS does not copy pt_regs and switch_stack to the user space stack. Also the OS has no idea how big the user space RSE and memory stacks are, which makes it too risky to copy the saved state to a user mode stack.

How do we get a backtrace on the tasks that were running when MCA/INIT was delivered?

mca.c:ia64_mca_modify_original_stack(). That identifies and verifies the original kernel stack, copies the dirty registers from the MCA/INIT stack’s RSE to the original stack’s RSE, copies the skeleton struct pt_regs and switch_stack to the original stack, fills in the skeleton structures from the PAL minstate area and updates the original stack’s thread.ksp. That makes the original stack look exactly like any other blocked task, i.e. it now appears to be sleeping. To get a backtrace, just start with thread.ksp for the original task and unwind like any other sleeping task.
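
A hedged sketch of what that unwind looks like with the ia64 unwind API (this mirrors the usual unw_init_from_blocked_task()/unw_unwind() loop; it is not a copy of the kernel's backtrace code):

    #include <linux/kernel.h>
    #include <linux/sched.h>
    #include <asm/unwind.h>

    static void backtrace_converted_task(struct task_struct *t)
    {
            struct unw_frame_info info;
            unsigned long ip;

            /* Starts from t->thread.ksp, exactly as for a sleeping task. */
            unw_init_from_blocked_task(&info, t);
            do {
                    unw_get_ip(&info, &ip);
                    if (ip == 0)
                            break;
                    printk(" [<%016lx>]\n", ip);
            } while (unw_unwind(&info) >= 0);
    }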

How do we identify the tasks that were running when MCA/INIT was delivered?

If the previous task has been verified and converted to a blocked state, then sos->prev_task on the MCA/INIT stack is updated to point to the previous task. You can look at that field in dumps or debuggers. To help distinguish between the handler and the original tasks, handlers have _TIF_MCA_INIT set in thread_info.flags.
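
So a debugger-style walk over the task list can separate the two, for example with a predicate like the sketch below (illustrative only; _TIF_MCA_INIT is the ia64-specific flag named above):

    #include <linux/sched.h>
    #include <linux/thread_info.h>

    static int is_mca_init_handler(struct task_struct *p)
    {
            /* Set only on the per-cpu MCA/INIT handler "tasks". */
            return (task_thread_info(p)->flags & _TIF_MCA_INIT) != 0;
    }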

The sos data is always in the MCA/INIT handler stack, at offset MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct ia64_sal_os_state), with 16 byte alignment for all structures.
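
A sketch of that calculation, rounding each structure down to a 16 byte boundary as described; ALIGN16_DOWN is local to the example and the header locations are assumptions, the authoritative value is MCA_SOS_OFFSET in mca_asm.h:

    #include <asm/ptrace.h>         /* struct pt_regs; KERNEL_STACK_SIZE assumed */
    #include <asm/mca.h>            /* struct ia64_sal_os_state (assumed) */

    #define ALIGN16_DOWN(x)         ((x) & ~15UL)

    static unsigned long mca_sos_offset(void)
    {
            unsigned long pt_regs_offset =
                    ALIGN16_DOWN(KERNEL_STACK_SIZE - sizeof(struct pt_regs));

            /* The sal_os_state area sits just below the pt_regs area. */
            return ALIGN16_DOWN(pt_regs_offset -
                                sizeof(struct ia64_sal_os_state));
    }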

Also the comm field of the MCA/INIT task is modified to include the pid of the original task, for humans to use. For example, a comm field of ‘MCA 12159’ means that pid 12159 was running when the MCA was delivered.
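
A sketch of how that comm field might be stamped, matching the ‘MCA 12159’ form described above; the helper name is made up for illustration and the real code lives in mca.c:

    #include <linux/kernel.h>
    #include <linux/sched.h>

    static void stamp_handler_comm(struct task_struct *handler,
                                   struct task_struct *previous,
                                   const char *type)
    {
            /* type is "MCA" or "INIT"; result looks like "MCA 12159". */
            snprintf(handler->comm, sizeof(handler->comm), "%s %d",
                     type, previous->pid);
    }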