Opened 13 years ago

Last modified 12 years ago

#118 new task

SMP: mprotect-based membars don't work well

Reported by: dmik
Owned by:
Priority: major
Milestone: Next
Component: general
Version: 1.6.0-b22 WSE
Severity: highest
Keywords:
Cc:

Description (last modified by dmik)

Investigation within #96 has shown that the newer mprotect-based membar technique used in recent Java versions doesn't work well on OS/2 under the SMP kernel for some reason.

The new technique was added to Java to reduce the overhead of executing membar instructions after each thread state transition; see the Java bug record http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5075546 for more information.

The main idea of the new technique is to use a special page in which each thread has its own word that it writes to whenever it changes its state. When the VM dispatcher thread wants to be sure about the thread states after some change, it sets the memory protection of this page to READONLY and then immediately switches it back to READWRITE, which (as I understand) should cause the CPU to flush caches and make sure all cores see the same values of the thread state variables. Since the protection change may happen while one of the threads is writing to the page, an access violation exception may be raised. This exception needs to be handled gracefully by waiting until the VM thread restores the READWRITE protection mode and then retrying the operation, and this is how it is done in Java.

When running under the SMP kernel on OS/2, this technique usually works. However, retrying the write operation to this special serialize page after the READWRITE mode is restored sometimes corrupts some registers (which are normally saved upon an exception and should be restored when the exception handler continues execution). As a result, random memory locations get written afterwards and/or random functions get called, which eventually leads to an application crash due to another memory access violation, a bad stack condition and so on.

This ticket is to attempt to solve the mentioned issues.

Note that in r297 the workaround from #96 has been applied that forces the -XX:+UseMembar option on all JVM invocations.

Attachments (2)

No_FS_abuse_Odin.diff (31.9 KB) - added by dmik 13 years ago.
No_FS_abuse_Java.diff (1.6 KB) - added by dmik 13 years ago.


Change History (8)

comment:1 Changed 13 years ago by dmik

I don't have any valid guess about the cause of these problems (or about why they happen in SMP mode only). In order to get one, we first need to write a simple test case that does not involve Odin at all and just implements the described membar technique:

  1. The main thread allocates a special serialize page in memory and sets its initial protection to READWRITE.
  2. The main thread starts a bunch of worker threads.
  3. Each worker thread writes to its own word on the serialize page in a loop with some random (but very short -- several dozen ms or so) interval. If an access violation exception is thrown when attempting to write to that word, the worker thread temporarily grabs a special mutex flag (also used by the main thread, see below) and retries.
  4. The main thread occasionally grabs the mutex flag, changes the protection of the serialize page from READWRITE to READONLY and then immediately back, and releases the mutex. It does it in a loop with some very short random interval as well.
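The steps above can be sketched roughly as follows. This is only a sketch under several assumptions: it uses POSIX threads, mprotect and a SIGSEGV/SIGBUS handler instead of the OS/2 API and exception handlers; it replaces the mutex from steps 3 and 4 with an atomic flag that the fault handler spins on (a mutex is not safe to take inside a POSIX signal handler); and it omits the random delays, so the workers write as fast as they can. All names (run_membar_test, on_fault and so on) are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define MAX_WORKERS 16

static volatile int *page;        /* serialize page, one word per worker */
static size_t page_len;
static atomic_int page_locked;    /* 1 while the page may be READONLY */
static atomic_int workers_left;
static atomic_int faults_seen;    /* number of handled write faults */
static int writes_per_worker;

/* Fault handler: wait until the main thread restores READWRITE, then
 * return so that the faulting store is re-executed. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)page;
    if (addr < base || addr >= base + page_len) {
        signal(sig, SIG_DFL);     /* unrelated crash: let it happen */
        return;
    }
    atomic_fetch_add(&faults_seen, 1);
    while (atomic_load(&page_locked))
        ;                         /* spin until the page is writable */
}

/* Step 3: each worker writes to its own word in a loop. */
static void *worker(void *arg)
{
    volatile int *slot = &page[(intptr_t)arg];
    for (int i = 1; i <= writes_per_worker; i++)
        *slot = i;
    atomic_fetch_sub(&workers_left, 1);
    return NULL;
}

/* Returns the number of workers whose final word is wrong (0 means OK),
 * or -1 on a setup error. */
int run_membar_test(int nworkers, int nwrites)
{
    if (nworkers < 1 || nworkers > MAX_WORKERS || nwrites < 0)
        return -1;
    page_len = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   /* step 1 */
    if (p == MAP_FAILED)
        return -1;
    page = p;
    writes_per_worker = nwrites;
    atomic_store(&page_locked, 0);
    atomic_store(&workers_left, nworkers);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);  /* some systems report SIGBUS instead */

    pthread_t tid[MAX_WORKERS];
    int started = 0;               /* step 2 */
    while (started < nworkers &&
           pthread_create(&tid[started], NULL, worker,
                          (void *)(intptr_t)started) == 0)
        started++;
    atomic_fetch_sub(&workers_left, nworkers - started);

    struct timespec pause = { 0, 100000 };   /* 100 microseconds */
    while (atomic_load(&workers_left) > 0) { /* step 4 */
        atomic_store(&page_locked, 1);
        mprotect(p, page_len, PROT_READ);
        mprotect(p, page_len, PROT_READ | PROT_WRITE);
        atomic_store(&page_locked, 0);
        nanosleep(&pause, NULL);
    }

    int bad = (started == nworkers) ? 0 : -1;
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        if (bad >= 0 && page[i] != nwrites)
            bad++;
    }
    munmap(p, page_len);
    return bad;
}
```

Calling, for example, `run_membar_test(4, 100000)` from main() and checking that it returns 0 exercises the scheme; the faults_seen counter shows how many writes actually hit the READONLY window. A real OS/2 test case would of course need the native memory and exception APIs to reproduce the SMP problem.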

This test case should emulate more or less what is going on in Java when the deprecated -XX:+UseMembar option is not used.

comment:2 Changed 13 years ago by dmik

Description: modified (diff)

Changed 13 years ago by dmik

Attachment: No_FS_abuse_Odin.diff added

Changed 13 years ago by dmik

Attachment: No_FS_abuse_Java.diff added

comment:3 Changed 13 years ago by dmik

Just for the record: I have attached some patches I created for Odin/Java to remove FS switching when I was trying to prove that this switching was the reason why the mprotect scheme fails. However, I couldn't prove that: it turned out that similar failures happen when no FS switching takes place and we only maintain a single OS/2 exception chain.

comment:4 Changed 13 years ago by Silvan Scherrer

Milestone: EnhancedGA2

comment:5 Changed 13 years ago by dmik

This ticket contains information about some other problems related to memory corruption on SMP during exception processing: http://svn.netlabs.org/odin32/ticket/37.

comment:6 Changed 12 years ago by Silvan Scherrer

Milestone: GA2Next