CPU Page Fault Handling In DOSBox: A Deep Dive

by Omar Yusuf

Introduction

Hey guys! Let's dive into a crucial topic for DOSBox Staging: CPU page fault handling. Right now, DOSBox Staging lacks a proper implementation of it, which is a major bummer for running Windows 98 smoothly. It's also causing crashes with some Windows 3.1x games, as highlighted in this issue. Without robust page fault handling, the emulator can't reliably run complex operating systems like Windows 98, and even older games become less dependable. A well-implemented mechanism would prevent those crashes, let the emulator replicate the memory behavior of legacy systems more faithfully, and make DOSBox Staging a more versatile emulation platform. The challenges are significant, but with careful planning and implementation they can be overcome.

The DOSBox-X Advantage

Now, here's some good news! DOSBox-X has a robust, battle-tested page fault implementation that's working like a charm. SarahPowers from the eXoDOS team put it to the test with a whopping 1864 Win9x games from 1994-1998, and only about 15 had issues. That's a success rate of over 99%, folks! So, we can confidently say that DOSBox-X's approach is fit for the job. Its proven methods give DOSBox Staging a valuable blueprint: adopting a similar strategy should speed up development and raise the odds of reaching a similar level of stability and compatibility. The DOSBox-X success rate also serves as a benchmark for the DOSBox Staging implementation to aim for.

Why is this important?

Think of page fault handling as the traffic controller of your computer's memory. When the CPU tries to access a page of memory that isn't currently mapped or present, a page fault occurs. The operating system's page fault handler then steps in, typically fetching the data from storage (like your hard drive) and loading it into memory so the CPU can continue its work. Without this, the program would crash whenever it touched data that wasn't resident. Windows 98 relies on the mechanism heavily because it frequently swaps data between memory and storage. When a page fault occurs, the operating system must determine the location of the requested data, allocate a free page in physical memory, load the data, and update the page tables to reflect the new mapping, all efficiently enough to keep overhead low. In DOSBox Staging the guest operating system does this work; the emulator's job is to raise the fault at the right moment and hand control to the guest's handler without corrupting the emulated CPU state. Getting that wrong leads to instability, data corruption, and a frustrating user experience, which is why a robust implementation is critical for any emulator that aims to replicate a real computer accurately.
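To make the four steps concrete (locate the data, allocate a free page, load it, update the page table), here is a toy pager. It is only an illustrative model: ToyPager, backing_store and every other name are hypothetical and don't come from DOSBox Staging, DOSBox-X, or any real operating system, and there is no eviction logic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <vector>

// Toy model of demand paging. All names here are hypothetical.
const uint32_t kPageSize = 4096;  // size of a small x86 page

struct ToyPager {
    std::map<uint32_t, std::vector<uint8_t>> backing_store;  // virtual page -> data "on disk"
    std::map<uint32_t, uint32_t> page_table;                 // virtual page -> physical frame
    std::vector<std::vector<uint8_t>> frames;                // "physical memory"
    std::vector<uint32_t> free_frames;                       // indices of unused frames
    int faults = 0;                                          // how many faults were serviced

    explicit ToyPager(uint32_t frame_count) : frames(frame_count) {
        for (uint32_t i = 0; i < frame_count; ++i) free_frames.push_back(i);
    }

    // Service a fault for one virtual page.
    void handle_fault(uint32_t vpage) {
        ++faults;
        auto it = backing_store.find(vpage);  // 1. locate the requested data
        if (it == backing_store.end()) throw std::runtime_error("invalid access");
        if (free_frames.empty()) throw std::runtime_error("no free frame (eviction not modeled)");
        uint32_t frame = free_frames.back();  // 2. allocate a free physical frame
        free_frames.pop_back();
        frames[frame] = it->second;           // 3. load the data into the frame
        page_table[vpage] = frame;            // 4. record the new mapping
    }

    // Read one byte, faulting the page in on first touch.
    uint8_t read(uint32_t vaddr) {
        uint32_t vpage = vaddr / kPageSize;
        uint32_t offset = vaddr % kPageSize;
        if (page_table.find(vpage) == page_table.end()) handle_fault(vpage);
        return frames[page_table.at(vpage)].at(offset);
    }
};
```

The first read of a page takes the fault path; later reads of the same page hit the page table directly.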

Diving Deeper: The Technical Challenges

Related Issues and Discussions

Before we get into the nitty-gritty, let's take a look at some related discussions and issues that highlight the importance of CPU page fault handling:

These links provide valuable context and insights into the problems users are facing and the complexities involved in implementing proper CPU page fault handling. Examining these discussions can reveal common pain points, specific scenarios where page faults occur, and potential strategies for addressing them. The exchange of ideas and experiences within the community can also lead to innovative solutions and a deeper understanding of the issues at hand. Moreover, tracking related issues and discussions helps ensure that the development efforts are aligned with the needs and expectations of the users. By actively engaging with the community and addressing their concerns, the DOSBox Staging team can create a more robust and user-friendly emulator. The challenges of CPU page fault handling are multifaceted, requiring a combination of technical expertise, careful planning, and effective communication with the user community.

References and Resources

To get a solid understanding of how to tackle this, let's check out some key references, which discuss the core concepts of CPU page fault handling:

These resources offer a deep dive into the technical aspects of CPU page fault handling, providing valuable insights into the strategies and challenges involved. The DOSBox-X pull request and issues provide specific examples of how CPU page fault handling has been implemented and addressed in that project, offering a potential blueprint for DOSBox Staging. The VOGONS discussion brings together a community of experts and enthusiasts to discuss the intricacies of CPU page fault handling in the dyn_x86 core, offering diverse perspectives and potential solutions. By studying these references, the DOSBox Staging team can gain a comprehensive understanding of the issues and best practices related to CPU page fault handling, enabling them to develop an effective and reliable implementation. The complexities of CPU page fault handling require a thorough understanding of the underlying hardware and software interactions, making these resources invaluable for the development process.

Jon Campbell's Insights: A Goldmine of Information

Jon Campbell from the DOSBox-X project has shared some incredibly helpful notes on Discord, giving us a peek into the challenges and solutions for CPU page fault handling. Let's break down some key takeaways:

The Problem with the Traditional DOSBox Approach

The normal DOSBox approach to page faults is to try to resolve it by recursing into another Normal() execution loop. That's fine for DOS games and Windows 3.1 that always return directly from the page fault, but falls apart in the preemptive multitasking world of Windows 95.

Jon highlights that the traditional recursive approach works well for DOS and Windows 3.1 but crumbles under the multitasking environment of Windows 95. In essence, the recursive method is akin to managing a busy intersection with a single traffic officer who can only handle one car at a time. In the simpler environments of DOS and Windows 3.1, this works because traffic (the flow of tasks) is light and predictable: the fault handler always returns directly to the code that faulted. Windows 95 is a surge of cars all trying to cross at once. Its preemptive multitasking means tasks can be interrupted and switched out at any moment, including while a page fault is being handled, so the emulator loses track of which task caused the fault and how to resume execution properly. The result is crashes and instability, and a more robust solution is needed.

Why Windows 95 Breaks the Mold

When a page fault occurs in Windows 95 it's often reflected to the application (Structured Exception Handling) which either handles it or falls to the default "crash" handler. A task switch can occur during the resolution of the page fault, and that's what screws up DOSBox.

The recursive mode can get confused because task A page faults, then during resolution, Windows 95 goes to task B, then C, then D, which page faults, then during that Windows 95 returns to task A and that's when DOSBox sees the page fault as "resolved". But a lot changed out from under the emulator during that time (including CPU state!) and that's where things go wrong.

This is the crux of the issue. Windows 95's preemptive multitasking can interrupt the CPU page fault handling process, leading to confusion and crashes. Imagine a juggling act where you're trying to keep multiple balls in the air, and someone keeps bumping into you, causing you to lose track of which ball you were about to catch. That's essentially what's happening with the recursive CPU page fault handling approach in Windows 95. The emulator gets interrupted mid-process, loses its context, and ends up in a state of disarray. The CPU page fault handling mechanism is further complicated by the fact that Windows 95 uses Structured Exception Handling (SEH), which allows applications to handle page faults themselves. This means that a page fault can trigger a complex series of events, potentially involving multiple tasks and handlers. The recursive approach struggles to manage this complexity, often leading to inconsistencies and crashes. The key takeaway here is that the traditional CPU page fault handling method is simply not equipped to handle the dynamic and unpredictable nature of a multitasking operating system like Windows 95. A more sophisticated and non-recursive approach is needed to ensure stability and reliability.

The DOSBox-X Solution: A C++ Exception Approach

The way DOSBox-X handles it is to make guest page faults (when not running a callback instruction) a C++ exception that throws emulator execution back up the stack to the Normal() function which can then non-recursively execute the guest page fault handler.

DOSBox-X cleverly uses C++ exceptions to handle page faults non-recursively, providing a much more robust solution. Think of it as a dedicated emergency hotline that connects you straight to a specialist instead of leaving you on hold behind other calls. When a guest memory access faults, the exception unwinds the host stack back to Normal(), which then runs the guest's page fault handler as ordinary guest code rather than inside a nested loop. Because no nested loop is waiting on the host stack for the fault to be "resolved", a task switch in the middle of the guest's handler no longer confuses the emulator about where it was. That property is what makes the approach suitable for the preemptive multitasking of Windows 95. Note the restriction in Jon's description: the exception is only used when the fault does not occur while running a callback instruction.
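A minimal sketch of the idea follows, assuming a drastically simplified CPU with a single page that is either present or not. GuestPageFault, ToyCpu and run_loop are hypothetical names for illustration and are not taken from the DOSBox-X source.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical types for illustration; not taken from DOSBox-X.
struct GuestPageFault {
    uint32_t linear_address;  // address the guest tried to touch
};

struct ToyCpu {
    uint32_t eip = 0;
    int retired = 0;            // instructions completed
    int faults_delivered = 0;   // faults handed to the guest
    bool page_present = false;  // is the one page we model mapped?
    std::vector<std::string> log;
};

// Stand-in for a memory access buried deep inside an instruction.
void read_guest_memory(const ToyCpu& cpu, uint32_t addr) {
    if (!cpu.page_present) throw GuestPageFault{addr};
}

void execute_one_instruction(ToyCpu& cpu) {
    read_guest_memory(cpu, 0x1000);  // may throw before any state changes
    ++cpu.eip;
    ++cpu.retired;
    cpu.log.push_back("retired");
}

// One flat loop: a fault unwinds back here instead of starting a nested loop.
void run_loop(ToyCpu& cpu, int instructions) {
    while (cpu.retired < instructions) {
        try {
            execute_one_instruction(cpu);
        } catch (const GuestPageFault& pf) {
            // A real emulator would now enter the guest's fault handler as
            // ordinary guest code. Here we only simulate its effect.
            cpu.log.push_back("fault@" + std::to_string(pf.linear_address));
            cpu.page_present = true;
            ++cpu.faults_delivered;
        }
    }
}
```

Because EIP is only advanced after the memory access succeeds, the faulting instruction is simply retried on the next pass through the loop.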

The Importance of Instruction Integrity

Then, normal core instructions need to be re-written so that they either complete, or if interrupted by a page fault, leaves the CPU in the state it was before the instruction started.

This is a crucial point. To handle page faults correctly, an instruction must either complete fully or be rolled back to the state the CPU was in before it started. Imagine building a house and losing power halfway through: without a plan for that, the house is left partially built and unstable. Similarly, if a page fault interrupts an instruction that has already updated some registers, and nothing rolls those updates back, the CPU state becomes inconsistent and re-running the instruction after the fault is resolved produces wrong results. From the guest's point of view, every instruction should behave atomically with respect to faults: it either completes entirely or appears never to have started. Achieving this means reworking the instruction implementations in the normal core, which shows how deep the technical challenges go.
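One common way to get this all-or-nothing behavior is to do the work on a copy of the registers and commit only at the end. The sketch below contrasts that with a naive version for a made-up two-push instruction. Regs, ToyMemory and both functions are hypothetical, and this is not necessarily how DOSBox-X structures its normal core.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register file and fault type, for illustration only.
struct Regs {
    uint32_t eax = 0;
    uint32_t esp = 0x1000;
    uint32_t eip = 0;
};

struct PageFault {
    uint32_t addr;
};

// Toy memory: any write below `unmapped_below` faults.
struct ToyMemory {
    uint32_t unmapped_below = 0;
    void write32(uint32_t addr, uint32_t /*value*/) {
        if (addr < unmapped_below) throw PageFault{addr};
    }
};

// Naive: ESP is modified as we go, so a fault on the second
// write leaves the stack pointer half-updated.
void push_two_naive(Regs& r, ToyMemory& m) {
    r.esp -= 4; m.write32(r.esp, r.eax);
    r.esp -= 4; m.write32(r.esp, r.eip);
    r.eip += 1;
}

// Restartable: work on a copy and commit only after every
// memory access has succeeded. On a fault, `r` is untouched.
void push_two_restartable(Regs& r, ToyMemory& m) {
    Regs work = r;
    work.esp -= 4; m.write32(work.esp, work.eax);
    work.esp -= 4; m.write32(work.esp, work.eip);
    work.eip += 1;
    r = work;
}
```

In this sketch a memory write that succeeded before the fault is not undone, but since the registers are unchanged the retried instruction writes the same values to the same addresses.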

C++ Exceptions and Dynamic Core: A Tricky Combination

Dynamic core as written has a problem with the C++ exception trick. Mostly because on anything other than 32-bit Windows, the C++ runtime cannot identify which exception handler frame to run when it stack traces into the dynamically generated code. Here on Linux, the result is that whether or not you had any try ... catch blocks set up the C++ runtime will always reflect to the default handler which aborts the emulator. This is why the reset handling logic warns the way it does.

32-bit Windows is an exception because the structured exception handling is directed by a DWORD in segment FS: and that continues to work no matter where the instruction pointer existed. If you've ever wondered what those .eh_data segments in binaries on Linux are, that's the structured exception handling stuff.

Here's a tricky bit. The C++ exception approach works well with the normal core but has issues with the dynamic core, especially on Linux. It's like having a powerful engine that doesn't quite fit into the car's chassis. The dynamic core uses Just-In-Time (JIT) compilation to improve performance, generating host code on the fly. When an exception is thrown from beneath that generated code, the C++ runtime has to unwind through stack frames it knows nothing about. On 32-bit Windows this still works, because exception handling is directed by a DWORD in segment FS: that stays valid no matter where the instruction pointer is. On other platforms such as Linux, the runtime instead looks up unwind tables stored in the binary (the quote calls these .eh_data; on Linux/ELF the section is named .eh_frame). Dynamically generated code has no entries in those tables by default, so the unwinder cannot get past it and the emulator aborts. DOSBox-X works around this by deferring to the normal core when 386 paging is enabled, at the cost of performance. Integrating C++ exceptions with the dynamic core on all platforms remains an open problem for DOSBox Staging.

Dynamic Recompilation for ARM: More Insights

Let's dig into some more insights, this time focusing on dynamic recompilation for ARM architectures. These notes further illustrate the complexities of CPU page fault handling in different environments.

The Original DOSBox SVN Approach: Revisited

The original way DOSBox SVN handled page faults is to push a page fault frame onto the stack and recurse into another execution loop to handle the page fault. When the page fault returns, it exits the sub-loop and continues where it left off.

That's perfectly fine for DOS games and Windows 3.1

This reinforces the point that the original recursive approach is adequate for simpler environments but falls short in more complex scenarios. It’s like using a bicycle to commute in a small town – it works perfectly fine until you try to navigate a bustling city highway. In the context of CPU page fault handling, the recursive approach is sufficient for the relatively straightforward memory management of DOS and Windows 3.1. However, the preemptive multitasking and virtual memory management of Windows 95 and later operating systems introduce a level of complexity that the recursive approach cannot handle. The core issue is that the recursive approach relies on the assumption that the page fault will be resolved and control will return to the original task in a predictable manner. This assumption breaks down in a multitasking environment where task switches can occur at any time, including during the handling of a page fault. The limitations of the original DOSBox SVN approach highlight the need for a more sophisticated and adaptable solution for CPU page fault handling.

The Multitasking Nightmare: Out-of-Order Page Fault Handling

The problem is Windows 95, NT, and Linux, where a page fault is handled in user-space and can be interrupted by the kernel when it decides it's time to switch tasks (to resume handling the page fault when the task is revived). Windows 95 doesn't always return to the task that faulted in the same order it left it, and if another task page faults, then DOSBox will recurse again to handle it.

For example, say Windows 95 is running task A, which page faults (to pull in swap) but the kernel decides instead to switch to task B. Task B during its run page faults (to pull in swap) but Windows 95 decides to switch back to task A, which then completes its page fault handler and returns, causing DOSBox to exit the loop and resume whatever it was doing. Problem is, DOSBox is resuming whatever it was doing for task B in task A. That isn't enough to cause problems usually, though eventually it does cause something to get missed and Windows 95 starts to run a little funny and eventually crash. However this can happen in a way that DOSBox never returns from the sub-loop it spawns, and continues running a sub-loop on another page fault. Left to run long enough this can fill up the stack and crash DOSBox.

This vividly illustrates the core problem with the recursive approach in multitasking environments. Imagine a chaotic call center where calls are answered out of order and operators end up handling the wrong customer's issue. That is what happens when task switches occur while page faults are being resolved: the emulator loses track of which task triggered which fault and resumes execution in the wrong context. The resulting errors are subtle at first and accumulate over time until the guest becomes unstable and crashes. In the worst case, sub-loops are never exited and keep nesting on each new page fault until the host stack is exhausted and DOSBox itself crashes. A non-recursive approach avoids both failure modes because there is no nested loop whose exit has to be matched to a particular fault.
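The task A / task B scenario can be replayed with a very small model. It assumes, as a simplification of the description above, that each fault starts one nested sub-loop and that the innermost sub-loop exits whenever any fault handler returns. RecursiveModel and replay are hypothetical names; the real DOSBox logic is more involved.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Simplified model of recursive fault handling. Hypothetical names.
struct RecursiveModel {
    std::vector<char> host_stack;  // task each nested sub-loop was started for
    int mismatches = 0;            // sub-loop exits that resumed the wrong task
    std::size_t max_depth = 0;     // deepest nesting seen

    // A fault in `task` starts a new nested sub-loop.
    void guest_faults(char task) {
        host_stack.push_back(task);
        if (host_stack.size() > max_depth) max_depth = host_stack.size();
    }

    // The guest's handler for `task` returns. The innermost sub-loop
    // exits, whichever task it was actually started for.
    void guest_handler_returns(char task) {
        if (host_stack.empty()) return;
        char resumed_for = host_stack.back();
        host_stack.pop_back();
        if (resumed_for != task) ++mismatches;
    }
};

// Replay an event string: uppercase letter = that task faults,
// lowercase letter = that task's fault handler returns.
RecursiveModel replay(const std::string& events) {
    RecursiveModel model;
    for (char e : events) {
        unsigned char u = static_cast<unsigned char>(e);
        if (std::isupper(u)) {
            model.guest_faults(e);
        } else {
            model.guest_handler_returns(static_cast<char>(std::toupper(u)));
        }
    }
    return model;
}
```

Replaying "ABba" (handlers return in strictly nested order) produces no mismatches, while "ABab" (task A's handler returns first) resumes the wrong context twice. A sequence such as "ABCD", where no handler ever returns, just keeps growing the host stack.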

The Importance of Consistent CPU State

In the original DOSBox SVN project the CPU cores were written in such a way that if an instruction was interrupted by a page fault, and Windows 95 task switches during a page fault such that they are completed out of order, the CPU state becomes inconsistent and things in Windows 95 start crashing. One of the first things I did in DOSBox-X around 2013 was fix up the normal core so that CPU state is consistent: it either completes and continues, or a page fault happens and the CPU state is rolled back to what it was when the instruction started. Doing that made Windows 95 a lot more stable in DOSBox-X using the normal core!

This highlights the importance of maintaining a consistent CPU state during CPU page fault handling. It's like ensuring that all the pieces of a puzzle are in their correct positions before trying to put them together. If the CPU state is inconsistent, the emulator will be operating on incorrect data, leading to unpredictable behavior and crashes. The key to maintaining a consistent CPU state is to ensure that instructions are atomic, meaning that they either complete fully or leave no trace of their execution if interrupted. This requires careful design of the CPU emulation core and the CPU page fault handling mechanism. The DOSBox-X solution, as described by Jon Campbell, involves rolling back the CPU state to its previous state if a page fault occurs, ensuring that the emulator can resume execution from a known good state. This approach has proven to be highly effective in improving the stability of Windows 95 emulation. The effort to ensure consistent CPU state during CPU page fault handling is a critical aspect of building a robust and reliable emulator.

Non-Recursive Page Fault Handling: The Key to Stability

I made substantial modifications to the normal core to throw a guest page fault as a C++ exception so that the CPU core is interrupted immediately and given to the Normal loop as a page fault to handle. This is the "non recursive page fault" handling I implemented.

This reiterates the importance of a non-recursive approach to CPU page fault handling. It's like having a dedicated team of firefighters who can respond to emergencies without getting caught up in other tasks. The C++ exception mechanism provides a clean and efficient way to interrupt the CPU core and transfer control to the CPU page fault handling routine. This avoids the complexities and potential pitfalls of the recursive approach, ensuring that page faults are handled in a timely and consistent manner. The non-recursive nature of this solution is particularly crucial for handling the preemptive multitasking environment of Windows 95, where task switches can occur at any time. By avoiding recursion, the emulator can maintain a clear understanding of the system state and prevent the out-of-order CPU page fault handling issues that plague the recursive approach. The DOSBox-X implementation of non-recursive CPU page fault handling serves as a valuable example for DOSBox Staging to follow.

The Dynamic Core Challenge: Revisited

It even happens to work with the dynamic core... sort of, at least on Windows. On Linux unfortunately the dynamic core doesn't mesh well with the GCC exception handling frame data, so if a C++ exception is thrown while within dynamic core the C++ runtime acts as if nobody were there to catch it (because it can't trace the stack) and DOSBox-X crashes with an uncaught C++ exception even though one or more callers up the stack were prepared to catch it using try...except. This is why the code was brought back with the restriction to defer to normal core if 386 paging is enabled. Without 386 paging, no C++ exceptions are thrown and it's safe to use.

This is a recurring theme: the dynamic core presents a significant challenge for C++ exception-based CPU page fault handling, especially on Linux. It’s like having a high-performance sports car that can't be driven on certain roads due to its suspension system. The incompatibility between the dynamic core and the C++ exception mechanism stems from the way that dynamically generated code interacts with the C++ runtime's exception handling system. On Linux, the C++ runtime relies on stack tracing to identify exception handler frames, but this mechanism doesn't work reliably with dynamically generated code. This means that when a C++ exception is thrown within the dynamic core, the runtime may not be able to find the appropriate exception handler, leading to a crash. The DOSBox-X solution to this problem is to defer to the normal core when 386 paging is enabled, but this comes at the cost of performance. Finding a way to overcome this limitation and effectively integrate C++ exceptions with the dynamic core on all platforms is a key challenge for DOSBox Staging. The complexities of this issue highlight the need for careful consideration of the tradeoffs between performance and compatibility when implementing CPU page fault handling.
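The "defer to normal core if 386 paging is enabled" restriction boils down to a small policy check. The sketch below is an assumption about its shape, not the DOSBox-X code: choose_core and the Core enum are hypothetical, while the paging flag being bit 31 of CR0 is standard x86.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical core identifiers for this sketch.
enum class Core { Normal, Dynamic };

// On x86, paging is enabled when bit 31 (PG) of CR0 is set.
const uint32_t kCr0PagingBit = 1u << 31;

// Pick the core to run with: the dynamic core is only allowed while
// paging is off, because only then can no page-fault exception be thrown.
inline Core choose_core(Core preferred, uint32_t cr0) {
    const bool paging_enabled = (cr0 & kCr0PagingBit) != 0;
    if (preferred == Core::Dynamic && paging_enabled) {
        return Core::Normal;
    }
    return preferred;
}
```

A check like this keeps the fast path for DOS programs that never enable paging and only pays the performance cost for guests that do.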

More Insights on