
Building an Operating System, Chapter 1

If you have not already done so, you are going to want to download a copy of FEMOS to look at as we explore that OS.

Now, to decompress the archive on Unix it should be a simple matter of typing:

gzip -d rtos.tar.gz
tar -xvf rtos.tar
assuming you did not change the file name from the default value. If you are using Windows you will need to find yourself a copy of Winzip (or some other decompression utility that can handle tarballs) and... well... uhmmmm... good luck decompressing! (I will try to put out a zip file copy as soon as I get around to it.)

Alright, so we have this set of files and directories; now what? Well, let's start by taking a quick high-level view of things. Basically the OS code, using C-style syntax, looks sort of like this:

main() {
  int request;
  initialize_kernel_data_structures();
  initialize_devices();
  while(1) {			/* Want to loop forever */
    request = getNextRequest();
    if (is_unhandled_interrupt(request)) break;
    serviceRequest(request);
  }
}

Alright a quick line by line interpretation of the above:

main() {

If you have never programmed in C/C++/Java before this might hurt a bit, but basically this is where we tell the C compiler: start executing HERE. Yes, that's right, an Operating System is just another C program; it just happens that OSs are typically really big and complex C programs.

In short, what happens when we compile this code? Well, the compiler puts the first line of code from main() at the address of the first instruction the CPU (Central Processing Unit, usually an Intel chip like the Pentium) is to execute. Thus when the machine warms up the first thing the CPU will do is start running whatever is in main().

Okay, all you C/assembly buffs, I know: really, when the CPU starts ticking it will first run some code that lives in the Read Only Memory (ROM) area, and after a while it will wind its way around to the code generated by the C compiler, in my case GCC. At that point the CPU will run a short assembly startup program (often loosely called the loader), usually in a file named something like crt0. crt0 will either do a call main or a jmp main; either way, at that point we are in the main routine.


int request;

Here we tell the compiler to set aside some space for an integer (int as in integer, not int as in interrupt; there is a big difference!). Actually the real code has lots of variables, but I don't want to confuse anyone too quickly.


initialize_kernel_data_structures();

If only there were an actual instruction or standard library call named initialize_kernel_data_structures. Since Operating Systems are usually pretty big there is usually a lot of stuff that has to be initialised, like the list of tasks that are ready to run. (Initially empty, because no tasks have been created yet.) This is where we call some other routine, one we wrote or will write later, that sets up some stuff for us; again, we do it elsewhere to keep the complexity under control.
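Just to make the idea concrete, here is a minimal sketch in C of what such a routine might look like. This is not the actual FEMOS code; the names MAX_TASKS, readyQueue and numReadyTasks are invented for this example.

#define MAX_TASKS 64

struct TaskDescriptor;                       /* defined elsewhere; holds saved registers etc. */

static struct TaskDescriptor *readyQueue[MAX_TASKS];  /* the list of tasks ready to run */
static int numReadyTasks;                             /* how many entries are in use    */

void initialize_kernel_data_structures(void)
{
    int i;
    numReadyTasks = 0;                       /* no tasks have been created yet */
    for (i = 0; i < MAX_TASKS; i++)
        readyQueue[i] = 0;                   /* mark every slot empty */
}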


initialize_devices();

We are going to build a micro-kernel Operating System today, so some of you might be thinking: shouldn't device initialisation occur outside the kernel? And you would be very wise to ask such a question, except the sort of devices I am talking about here are things like the interrupt vectors and the various tables the CPU relies on, such as the Global Descriptor Table (GDT). These sorts of things must be looked after before we try to context switch.

Why not initialise devices at the same time that we initialise data structures? Well since we are going to write both routines it makes some sense to try to keep each routine as simple as possible and there is a logical division between the two tasks.


while(1) {                                      /* Want to loop forever */

If you have come from Java or C++ you are probably thinking while(true), except that C does not have any clue what "true" means. Rather, in C, 0 evaluates to false and all non-zero values are true. For the non-programmers here, what this line says is: take the next bunch of instructions, do each in turn, and when you reach the matching close brace "}" come back to the top and repeat, as long as 1 is still true. (And in C 1 is always true, so basically: repeat the next bunch of instructions forever.)
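For the C novices, a tiny, self-contained illustration of the "non-zero is true" rule (just an example program, nothing to do with the OS):

#include <stdio.h>

int main(void)
{
    if (1)  printf("1 is true\n");               /* any non-zero value counts as true */
    if (-5) printf("-5 is true as well\n");
    if (0)  printf("this line never prints\n");  /* 0 is the one and only false value */
    return 0;
}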


request = getNextRequest();

This routine, getNextRequest(), I will talk about in greater detail later. This is where the magic, or damn near the closest thing you will ever see in computing to magic, of context switching takes place. Basically getNextRequest() gets the next task to run from the list of ready to run tasks. If there is no task ready to run a Microkernel runs the idle task which will be covered later on.

At this point the immortal question comes up: alright, we initialise the system and we are ready to context switch for the first time. Now tell me, besides the idle task, how do we start up regular tasks? Well, this is where I tell you I lied a little before: somewhere in the data structure initialisation code we would probably create space for a special process. Unix users will be familiar with this concept; they call this process init.

Here you have a lot of freedom: typically init will fire up some system services and maybe an interface (perhaps a shell), and then init will either die or become a system service, depending on the number of Task Descriptors or PCBs you want to support. (I will explain Task Descriptors and PCBs in more detail than you could ever care to know later on.) Anyway, after init fires up for the first time it does some stuff, and eventually init will make a request of the OS.

The way a process makes a request is to trigger a software interrupt, which I will also talk about later. But the key is: an interrupt is triggered, which causes the CPU to halt execution and run code pointed to by the interrupt vectors. This code, pointed to by the interrupt vectors, is code inside the operating system, often labelled the interrupt handler (for obvious reasons). The interrupt handler in our case would simply convert the interrupt back into a number, which we assign to the variable 'request' by returning that number from the function getNextRequest().
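A crude sketch of the idea, with names invented for this example (not the FEMOS code): the interrupt handler stashes the interrupt number somewhere, and getNextRequest() hands it back to the main loop.

static int pendingRequest;            /* filled in by the (assembly) interrupt handler */

/* Called from the interrupt handler once the CPU is back in kernel mode. */
void record_request(int interruptNumber)
{
    pendingRequest = interruptNumber;
}

/* Called from the kernel's main loop. The real routine would also pick the
   next ready task and context switch into it; that part comes later. */
int getNextRequest(void)
{
    return pendingRequest;
}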


if (is_unhandled_interrupt(request)) break;

This is where we deal with unhandled interrupts, a sort of unhandled interrupt handler. This code is purely optional and probably ought to be integrated into serviceRequest() except I wanted to emphasise two points:

  1. There should be something that handles unhandled interrupts.
  2. The unhandled interrupt handler should force the OS to halt (before things really get messed up).
With respect to point 2: in C a break statement forces code to exit the innermost loop. Thus if that if statement were ever true, the OS would stop spinning around that getNextRequest() / serviceRequest() code and exit to... nothing. Yes, that's right, the OS would have no place to go. (Maybe after that while loop someone ought to insert BlueScreenOfDeath() or perhaps just ShutDownNow() - depending on your mood.) But this is an example of a place where there is no really effective solution, besides crashing, available to the OS designer.


serviceRequest(request);

serviceRequest() is a sort of euphemism for a case statement here. In other words, requests come in, and will probably be numbers between 0 and 255 (hint: in the Intel x86 world interrupts are numbered between 0 and 255, with interrupts 32 through 47 being the familiar IRQs). Anyway, interrupt lines are triggered and the result is processed by the serviceRequest() routine, which will probably move some tasks to the Ready Queue and/or possibly move tasks to the Blocked Queue.

I will talk a great deal about the Ready and Blocked Queues later; for now the one point I will make is that a Ready Queue is basically a list of tasks that are ready to run.

Experienced programmers ought to cringe at the thought of having a case statement with 256 cases that must be processed at every context switch. Not to worry, consider an array of size 256 that is made of 256 function pointers. Now when an interrupt number comes in, simply jump to the function pointed to by the array element. Isn't that tidier?
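In C that dispatch table might look something like this (a sketch, not the actual FEMOS code; the names are mine):

#define NUM_INTERRUPTS 256

typedef void (*RequestHandler)(int request);

static void unhandledInterrupt(int request)
{
    /* the "unhandled interrupt handler" discussed above would live here */
}

static RequestHandler handlers[NUM_INTERRUPTS];

void init_handlers(void)
{
    int i;
    for (i = 0; i < NUM_INTERRUPTS; i++)
        handlers[i] = unhandledInterrupt;    /* default every slot               */
    /* handlers[69] = softwareInterrupt;       ...then fill in the real handlers */
}

void serviceRequest(int request)
{
    handlers[request](request);              /* one indexed call, no 256-way case statement */
}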


}

This is the part where we mark the end of the loop. From here the C compiler would say, now re-evaluate: while(1) { and if it is still true start over from the line under the while, otherwise go to the line under this line.

Memory

Alright, now lets have a look at what the system memory ought to look like.

Did you ever use Micro$oft DOS? Remember the bad old days when you had 4 Megabytes of memory but had to ensure that you kept the lowest 640 KB (KiloBytes) as empty as possible? The original Intel x86 chips were a bit of a quirky design in that the lowest region of memory was treated differently than everything else: early Intel chips could not address more than 1 MB of memory, and the original PC design reserved the top 384 KB of that for things like video memory and ROM, leaving programs the famous 640 KB. Intel added more and more memory support; the bad part was that when Intel started to migrate, in the 80286, towards one solid monolith of memory, Micro$oft held Intel back because Microsoft did not want their OS to work on only one type of chip. (Alright, to be fair, Intel also wanted the 80286 to support legacy software, so Intel kept many annoying features from early chips lying around.)

The result of all of this is that the lowest 640 KB of memory are still pretty hard to play around in - mostly because you cannot fit much in 640 KB anymore. The next 384 KB are also pretty useless for the same reason. Of course, even if 640 KB were still enough memory for anybody's needs it would still be difficult to access, because you need to do direct addressing, which I don't really want to talk about. Only when you clear that legacy region and enter the second Megabyte does the memory start to become a little more sane. Depending on your mood you can use the lowest Megabyte for various data structures the OS uses or you can ignore it completely.

In the above diagram I shove the GDT (Global Descriptor Table) and IDT (Interrupt Descriptor Table - the Interrupt Vectors) into the lowest 640 KB. This is actually odd, as it is much easier to put both data structures in the kernel data segment - and that is what Insop and I did in FEMOS. (For more on GDTs and IDTs please see Chapter 2.)

So what goes on in that lowest megabyte? Not very much; typically your boot loader - the thing that loads the OS - will use that low region. And you might elect to leave the GDT and IDT down there if you really, and I mean really, want to save space. The reason it's probably not the greatest idea in the World to conserve that space is that memory is so cheap and those tables are pretty small. The theoretical maximum size of a GDT is 64 KB, which works out to about $0.06 worth of memory (64 KB is one sixteenth of a megabyte) assuming you are paying $1 per Megabyte of memory.

Now unless you are working with a system that has very little memory - in which case it is probably not Intel, so this whole discussion means nothing to you - it's probably not worth your time to save that tiny speck of space.

Continuing where I left off, chances are you are going to want to run in protected mode; see the Intel documents for more details. But what about this Kernel Data Segment I spoke of earlier? Well, let's take a look at a random application's memory area.

Now let's consider a really, and I mean really, simple program.

#include <stdio.h>
main() {
  printf("Hello World!\n");
}
Chances are, if you have actually read this far, you know exactly what that code will do when you sit down and compile and run it. (Hint: You see the text "Hello World" appear on your screen, followed by a new line.)

But how does the computer know that the user wants to display anything? Well, when the code gets compiled something like the following code will be generated. (Note: This code was actually generated on an Ultra Sparc using a really ancient version of the GCC cross compiler, but it works for me. :) )

.file   "hello.c"
gcc2_compiled.:
___gnu_compiled_c:
.data
LC0:
        .ascii "Hello World\12\0"
.text
        .align 4
.globl _main
_main:
        pushl %ebp
        movl %esp,%ebp
        call ___main
        pushl $LC0
        call _printf
        addl $4,%esp
L1:
        movl %ebp,%esp
        popl %ebp
        ret
Alright, so what does all this have to do with data segments? Well, have a look at the top few lines. First there is a sort of credits: this is the result of compiling the file hello.c. Notice how once the file is compiled we have data and text? Don't worry about that .globl; really the important thing for us today is the .data and .text.

Let's see, what is the data? Well it is encoded in the American Standard Code for Information Interchange, also known as ASCII, hence the .ascii. Then there is the actual data the computer is supposed to display. But what about the \12\0? Well, \12 is an octal escape: octal 12 is decimal 10, the ASCII line feed character, which is what the \n in the C source became and is what moves the cursor down to the start of a new line. The \0 is the null byte that C tacks onto the end of every string to mark where the string stops.

Now let's suppose you want to impress some old-time systems designer. There are two basic things a computer stores: data and instructions. The instructions are the code part of the above C program and are often referred to as text. Data comes in a variety of forms, but usually there are three types of data: initialised data, BSS (or uninitialised data) and read-only data. What does BSS stand for? Tracking that down is surprisingly hard; the usual story is that it stands for Block Started by Symbol, an ancient assembler directive. Why is code called text and not, say, code or instructions? Maybe because data can be binary or it can be text but code is always text? Except that when the above instructions are converted into something the computer will understand they become binary anyway, so go figure! So how do you impress the system designer? Tell him about your BSS segment and your text segment - not too many non system designers know about BSS data.

Back in the land of answering the question about data segments. Well, as the compiled version of the C program illustrates, all programs have two components - well, actually a whole bunch of components, but only two of those components concern us - data and text. Now the text is typically slapped down any old place in memory; the data is different. As far as the user program, like that hello world program, is concerned, the data must follow in the logical addresses right after the last address used by the code.

Of course there are hacks that let you get around statements like the one I just made. For example, by modifying the linker it is possible to allow the data to go in the first free page after the last page used by code; thus you can have page-aligned segments, which you pretty much have to do if you are making a general purpose OS, since such an OS will require paging.

The moral of my little ramble is that basically all programs have a text segment and a data segment. The operating system is no exception; you need space to store things like the integer request. Remember int request;? Well, it has to go somewhere, and it should logically be stored in the data segment.

What about the proverbial stack? Well, to novice programmers, a stack is basically a LIFO, Last In First Out, data structure that supports two operations, Push and Pop. You push data onto the stack and pop data off the stack. (Think of a stack of plates at a buffet: some clean plates get pushed onto the stack of plates already there, and some hungry people pop the hot, clean plates off and make them dirty.) Stacks are critical for computer programs; every program needs a stack (unless the program is written in Fortran, but we won't go there), so where is this stack? Well, let's take our hello world example. "Hello World\12\0" would end up at the beginning of the data segment. The stack would start at the end of the data segment and slowly work through the data segment towards "Hello World\12\0". What happens if the stack overruns your data? Well, then you are in trouble.(1) What if the stack gets past all the data and starts overwriting text? Well, hopefully you marked your text as read-only to the Pentium, or whatever, chip. So at that point the CPU would raise an interrupt and your OS would, hopefully, kill the process before the process tried to convince the CPU to print fifty thousand pages of the letter 'q' on the ink jet by your desk.
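If the plate analogy is too greasy, here is a toy stack in C - just an illustration of push and pop, nothing to do with the hardware stack the CPU manages for you:

#include <stdio.h>

#define STACK_SIZE 16

static int stack[STACK_SIZE];
static int top = 0;                     /* number of items currently on the stack */

void push(int value) { stack[top++] = value; }  /* no overflow check: see note 1! */
int  pop(void)       { return stack[--top]; }

int main(void)
{
    push(1); push(2); push(3);
    printf("%d\n", pop());              /* prints 3: last in, first out */
    printf("%d\n", pop());              /* prints 2 */
    printf("%d\n", pop());              /* prints 1 */
    return 0;
}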

Let's have another look at that data segment, but we will remove the code segment.

Notice how the DS and the SS are in one segment? This is more or less typical. Usually what you do, if you are allocating a DS, is look at the program's header. For example, Windows would look at the EXE file header; Linux, and newer versions of Unix, would look at the ELF header. Old versions of Unix might use an a.out header format. Anyway, inside that file header you will find a data structure which tells the operating system the size of the code, the size of the data (BSS, initialised and read-only data) and most likely a pointer to the first instruction to execute. What the OS would do is look at the size of the data (all types of data), sum up the numbers, allocate some extra space for the stack, and then say: okay, the size of the DS will be the result of that sum.
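Roughly, the loader arithmetic looks like this. The field names below are invented for the sketch; real formats (ELF, EXE, a.out) differ in the details.

/* Hypothetical header fields; not any real ELF or EXE layout. */
struct ProgramHeader {
    unsigned long textSize;            /* size of the code (text)      */
    unsigned long dataSize;            /* initialised data             */
    unsigned long bssSize;             /* uninitialised (BSS) data     */
    unsigned long roDataSize;          /* read-only data               */
    unsigned long entryPoint;          /* first instruction to execute */
};

#define DEFAULT_STACK_SIZE (16 * 1024) /* deciding how much is "enough" is the hard part */

unsigned long data_segment_size(const struct ProgramHeader *h)
{
    /* sum all the data, then tack on room for the stack */
    return h->dataSize + h->bssSize + h->roDataSize + DEFAULT_STACK_SIZE;
}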

How much space should the OS give to the stack? Well, that's a rather difficult question - it really depends on things like: are you paging? Is this a general purpose OS? Is the code highly recursive in nature? (More recursion requires a bigger stack.)

So we know how to load an application into memory, but what of the OS itself? Well, the OS is loaded into memory by the boot loader; Linux users would use a handy little program called the LInux LOader, or LILO. Windows users have my sympathy (you guys are on your own figuring out what your boot loader is called). The boot loader allocates space for text and data. Then the boot loader copies down the text and copies down the initialised data. Finally the boot loader jumps to the very first instruction in the OS text area; in the above example that would be initialize_kernel_data_structures(); Well, actually the first instruction would be slightly different, and would probably have more to do with giving space to the integer request, but who cares, it's just a simple example! So as you can see, although the OS is a special case in that the OS is loaded by the Boot Loader, whereas everything else is loaded by the OS, the OS has all the same memory allocation schemes as any other application.

That last point is really important: an OS might have a fancy graphical user interface. It might not have a user interface at all! It might be 15 million lines of buggy C code, it might be a few thousand lines of obtuse assembly, but an OS is just another computer program. As far as your computer's CPU is concerned, the only difference between an OS and anything else is that the OS has access to the GDT and the IDT, which I will look at in more detail shortly.

Before we go on to GDTs and IDTs there are a few loose ends. First, what is this paging I speak of? Well, how about: it's complicated and beyond the scope of FEMOS. Well, actually, in theory it's not very hard, it's just that embedded systems like FEMOS do not really need paging anyway, so Insop and I never implemented it. If you are, as I hope, reading this for general interest then obviously an answer like that should be less than satisfactory. The problem is I don't really feel like making all the diagrams entailed in describing a paging system, at least not now. So what I will suggest is you go look at a good book and learn about it. I suggested a few books back on the OS home page.

Context Switching On Task Descriptors

Some time ago, before the World was turned upside down by a lunatic in the desert, the New York Times magazine ran an article about how to do things. And one of the things they listed was 'how to multitask.' There was a line in that article that went something to the effect of, "humans, unlike computers can not do multiple things at once".

The author of that article got it all wrong; it should have read something like "humans and computers, contrary to appearances, can not multitask." Both the thing between your ears and the thing on your desk are up to something and you probably are not even aware of it. Both devices are task switching. Which raises the question: what is task switching?

Task Switching is the concept of running one task for a given period of time and then switching to a new task for a little while. Example: I run my brain's little process that says "gee wiz, I think I have been looking at this screen too long, my eyes hurt", then after a couple of thousandths of a second I switch to another task that says "gee wiz, I should go to the washroom", then I switch to the task of writing up this web page, then I switch back to the rub-my-eyes task. If I switch between each task fast enough, to the outside observer it looks like I am writing this web page whilst rubbing my eyes and ruining my pants. Isn't task switching great?

I'll grant you this: that piece of grey matter is something like 90% parallel, that is, the brain really is doing several things at once. For example, how often does a little process go "gee wiz, maybe I ought to breathe now"? That first 90% is called the automatic system. The remaining 10%, the part that answers the cell phone whilst driving - it's not multitasking, sorry to say. Now stop reading this web page on your car phone and drive to work before you cause an accident!

Returning to the subject of multitasking, in order for a computer to multitask a lot of stuff has to happen. First remember I said "Task Switching is the concept of running one task for a given period of time and then switching to a new task". Well that is not entirely true. I actually provided a typical example of a task switch, often called time slicing.

In Time Slicing the Operating System sets a clock inside the computer to raise an interrupt so many times a second. In most Intel Operating Systems the Programmable Interval Timer or PIT is the clock that is used. The PIT is an evil piece of hardware and if you are lucky you will never have to write code for it. Suppose for a moment we decide to use the PIT to raise the interrupt. Well, maybe we are executing user code (as opposed to kernel code), so interrupts are on, and the CPU is ticking away in user code and then BANG, the PIT counts down and reaches the point where it decides to flag the World that a time slice ended. The PIT sends current down a wire to Interrupt Control Unit 0 or ICU; the ICU sends a signal to a second ICU, which sends the signal to the CPU. (To electrical engineers: the ICUs are binary coders that convert a signal on any one of eight wires to a binary number from zero to seven; they are staggered, with one ICU feeding into the other ICU, thus providing 15 discrete Interrupt levels.)

The signal sent by the ICU is called an Interrupt ReQuest or IRQ. The IRQ number (between 0 and 15) is added to 32 and then labelled an Interrupt level. The PIT has Interrupt level 32, hence the PIT is wired onto IRQ 0 - stop and think about that one for a second, make sure you understand. IRQs start at 0 but are actually Interrupt levels 32 - 47 inclusive. Incidentally, what of the other IRQs, like 1, 2, 3 and so on? Well, all IRQs are for hardware: the PIT, your hard drive controller, that sort of thing, so IRQ 10 for example might go to your sound card or perhaps your video card. What of the other Interrupt numbers, like 0 to 31 and 48 up to 255? (There are 256 Interrupts supported by the Intel x86 processor.) Well, Interrupts are prioritised; that is, an Interrupt coming in on level 0 is higher priority than the PIT on level 32 (IRQ 0). Hence the Interrupt coming in on level 0 is going to be serviced first. (Incidentally, the lowest levels are reserved by Intel for processor exceptions - level 0, for example, is the divide-by-zero fault, and level 2 is the Non-Maskable Interrupt or NMI, which is used, typically, for debugging OSs during the implementation, or development, stage.) Most of the interrupt levels are used by the Operating System for passing various data, as we shall see later.
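The IRQ-to-interrupt-level arithmetic is worth writing down once. The macro name here is mine; the constant 32 is the remapped base described above.

#define IRQ_BASE 32                              /* IRQs are remapped to interrupt levels 32..47 */

#define IRQ_TO_INTERRUPT_LEVEL(irq)  ((irq) + IRQ_BASE)

/* IRQ_TO_INTERRUPT_LEVEL(0)  == 32   -- the PIT      */
/* IRQ_TO_INTERRUPT_LEVEL(15) == 47   -- the last IRQ */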

Back to time slicing, we had this interrupt line go from low voltage to a comparatively high voltage. This change in relative voltages triggered the ICU which sent an IRQ to the CPU. The IRQ got converted into an interrupt level which caused the CPU to halt execution of the code it was running, plough into the Interrupt Descriptor Table (IDT) and find the address of the code that the CPU is to run in place of the code it was running. The CPU then begins to run, hopefully code in the Operating system that will:

  1. Save the state of the CPU so that it can resume running the user program again later on.
  2. Process the interrupt. That is service any hardware that may need servicing and then find the next task that is ready to run and run it.

Note that saving the state of the CPU is an awfully big deal. It's made worse because I am assuming the reason for this context switch is a simple time slice. The fact is, every time a different request is made of the OS - be it for a letter to be displayed on the printer or for a new process to be started - a different type of context switch is going to take place. Hence the interrupts from 48 to 255; hopefully one can squeeze all of their system service calls into 208 interrupt levels.

Okay, qualifier to the last paragraph: the truth is you do not need to squeeze all the system calls into 208 interrupt levels. You can if you want but you do not need to; Insop and I did not, Linux does not and Windows does not either. All you need do is use one standard interrupt - we used 69, which I believe is the Windows software interrupt; Linux uses 0x80 I think. Anyway, use one standard software interrupt, and just before calling the interrupt instruction, int, push a description of the system call (read: a constant associated with the system call) onto the stack. Then let the kernel unscramble which system call was made. See below for the gory details.

But, before we can really talk in a civilized way about what happens when we want to save context, we have to talk about what a process really is. That is, what is it that lurks inside the OS that allows user processes to work and to context switch.

Remember I spoke of PCBs or Task Descriptors? Well, PCBs (Process Control Blocks) or Task Descriptors store some very basic information about the state of the processor (Pentium chip or whatever else you are using) when it is to run a process. If you look down at the diagram in note 1, notice how all those processes have adjacent Data Segments? Well, here is the $64,000 question: how does the OS know which data segment belongs to each process? More to the point, when it comes time to switch from Kernel mode to user mode, where is the first instruction the OS should execute? Oh sure, we know that the ELF header, or EXE header for Windows users, contains the first instruction the OS should execute the very first time. But where is the first instruction to execute after we ran for some length of time, context switched out and context switched back into the process?

An observant reader might point out that if we execute code for a fixed length of time and then context switch we know that we will be at a certain line of code right? Wrong! Remember, different chips run at different speeds, time slicing is only one of many reasons to context switch, the very quartz crystal that synchronizes timing varies from computer to computer and most significantly is this the first or the hundred and first time we are context switching into a task? No, we need to store the line of code we want the system to execute next. We store this data in the process control block.

Alright, let's have a look at the FEMOS source code; take a close look at the file task.h. Up at the top we have some of Insop's comments, then we have some defines - these are pre-processor directives.(3) After the pre-processor stuff we get some type definitions; I want to look at the class definition for the Task class. Now I know, here I am going on about C and then bang! C++ in your face. Actually Insop wrote this code. I wrote the original version, which was pure C, but Insop is a big C++ guy, so what was a typedef struct becomes class Task. Now look really closely at the public integers in the Task class, in particular the ones under the label context: userCS, userDS, userSP, userSS. Remarkable, imagine, what are the odds: userCS and the CS register, which points to the code segment; userDS and the DS register, which points to the data segment; userSP and the SP register (well, ESP), which points into the stack; and of course SS for Stack Segment. For now this is the really critical stuff: you store all your basic task information in a massive table of these values. Of course FEMOS does more than just context switch; that is why there is more 'stuff' in the Task Descriptor.
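If the C++ puts you off, the same idea in plain C would look roughly like this. This is a sketch, not the real FEMOS Task class from task.h, which carries more fields than this.

struct TaskDescriptor {
    /* the "context": just enough CPU state to resume the task later */
    unsigned long userCS;              /* code segment selector     */
    unsigned long userDS;              /* data segment selector     */
    unsigned long userSS;              /* stack segment selector    */
    unsigned long userSP;              /* saved stack pointer (ESP) */
    /* ...plus scheduling state, queue links, and so on             */
};

#define MAX_TASKS 64                   /* FEMOS supports up to 64 tasks */

static struct TaskDescriptor taskTable[MAX_TASKS];   /* the Task Descriptor Table */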

Let's consider this: you have an array of Process Control Blocks (PCBs, or as I like to call it a TDT, Task Descriptor Table), and in this array you store all the information you need to context switch into a task, right? Well, experienced assembly programmers know that you need more than just the code segment. You see, the code segment only marks the area where all the text (remember, text is code) associated with this program resides. Don't you need to store the EIP (Extended Instruction Pointer)? Actually no: when a user program executes the assembly instruction int (int as in interrupt, not integer), or when an IRQ is triggered, the CPU halts the execution of the user code and pushes the current value of the EIP onto the stack. When you are about to resume execution you execute the instruction iret, which pops that value off the stack and back into the EIP.

Context Switching The Details

Okay so what are the steps in a context switch?

First let's look at another trivial program; this one would work in FEMOS and not on a normal OS, unlike the Hello World program, which would work on a normal OS and not in FEMOS. This program we will call pass.c. At this point do not worry what the pass.c program does; as will become clear later, pass.c does not do much at all.

#include "syscall.h"
main() {
   Pass();
}
and when compiled would look like this:
        .file   "pass.c"
        .text
        .align 2
.globl main
        .type   main,@function
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        andl    $-16, %esp
        movl    $0, %eax
        subl    %eax, %esp
        call    Pass
        leave
        ret
.Lfe1:
        .size   main,.Lfe1-main
        .ident  "GCC: (GNU) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)"
Notice that I compiled it on a different compiler? (For our purposes it does not really matter where I compile it.) Also note the line call Pass; this is where we go and context switch. Now what happens in that one line?
  1. The user program makes a system call, in this case, Pass();
  2. This call invokes code, from the files syscall.c and syscall.h, that was linked into the user code at compile time.
  3. This linked in code throws a software interrupt which causes,
  4. the CPU to save EIP onto the stack, switch into kernel mode and start executing code pointed to in the interrupt vectors.
  5. Now in kernel mode the kernel code saves all the other registers onto the stack (except for the SS and ESP register which have to go in the TD or PCB).
  6. Then the kernel code restores the kernel registers SS and ESP from static variables.
  7. The kernel code then restores the kernel stack.
  8. The kernel code now runs an event handler.
Before we take a look at a real live example, in FEMOS, let's just examine the code for resuming user level code; that way we can consider exiting user land and returning to user land in one sitting. That is, what are the steps to go from kernel mode back into user mode?
  1. The kernel code saves all of its registers on the kernel stack.
  2. Very likely (at least in FEMOS this is what we did) the kernel will also save ESP and SS into a static variable that will already contain the same data from previous context switches.
  3. The kernel will write the values from userSS and userSP from the TD (or PCB) into the SS and ESP register respectively.
  4. The kernel will pop the top entries off of the stack and into the general purpose registers that the entries came from.
  5. The kernel will execute the return from interrupt instruction, iret, which will cause the CPU to be back in user mode with the EIP pointing to the user instruction right after the instruction in step 3 (in the previous 8 step process), the instruction that caused the interrupt in the first place.
Note that you can see the kernel code in the file kernelMain.cc, look at the routine getNextRequest().

Pretty confusing, huh? Let's see if some diagrams help out a bit.


What is this stuff about Exit Kernel and Enter Kernel? Well, Enter Kernel is the eight step algorithm I listed first and Exit Kernel is the five step algorithm under Enter Kernel. Notice how the CPU starts by running code in Program A, switches into the Kernel, then into Program B, and then back into the Kernel before returning to Program A? That is always, always the case.(2) The CPU must run in kernel mode for at least a little while in order to sort out which user program to run next, what to do with the program that just ran, and so on. Now I want to stress a point: the above diagram is what the outside observer sees. As time passes, first the CPU runs one program, then the CPU runs Kernel code, then the CPU runs another program.

Now how does the CPU feel about things?


Well, holy smokes! What a difference. Notice how, as far as Program A is concerned, Program A will execute a system call and then return from the system call? No lines of code are skipped, nothing crazy happened; it's as if Program A calls a routine, Pass();, and then Program A just goes on as if nothing happened. Oh sure, any observer with super fast vision would see that Program A has stopped running, the CPU has started to run Kernel code, then Program B starts to run, then the CPU runs more Kernel code, and then magically Program A runs again as if the Kernel and Program B had never been there. But to Program A nothing magical is going on at all.

Now also notice that different processes context switch into different portions of kernel code, why? Because different processes call different Interrupts (or at least the processes make different system calls). Also these diagrams assume only two user programs, FEMOS supports up to 64 user programs (if we had more than 4 MB of memory FEMOS could have had thousands of user programs).

Who wants to see some magical computer code? Let's have a look at the FEMOS source code, in particular the files task.h, syscall.h, syscall.c and kernelMain.cc.

So now let's consider yet another really trivial program which would work in FEMOS:
  1. #include "syscall.h"
  2. main() {
  3. int i;
  4. for (i = 0; i < 10; i++) {
  5. Pass();
  6. }
  7. }
For those of you who like looking at assembly, the compiled listing for the above program appears under Assembly listings at the end of this page.

Really quickly, what this code says is: create an integer i, line 3. Note that in C int means integer, as in a number; in Assembly int means interrupt. Now set i to zero, line 4. Then, as long as "i < 10" is true, execute the code between the braces "{" on line 4 and "}" on line 6, adding 1 to i after each pass; once "i < 10" is no longer true, go to the next line of code after the "}", line 7.

What of the stuff between the braces (line 5)? That Pass() is a FEMOS system call. What does it do? It basically does nothing. Really. Pass() just says, give up the rest of my CPU time slice to somebody else. It is a useless system call but it helps when you are testing your context switches.

So what does this program do? Ten times in a row it gets to be the active process and ten times in a row it quickly gives up its time slice to another process.

Now let's consider what happens in the system call code. First we should note that the line #include "syscall.h" means that the preprocessor(3) should copy the file syscall.h right into the beginning of your user program. This file, syscall.h, in turn invokes user code in syscall.c, so let's take a closer look at syscall.c.

Near the top of syscall.c is the line void Pass( void ); that is the source code for the Pass routine. Except, take a closer look at syscall.h: notice how every single line in Pass() is defined to be something else? What does the compiler see after the text preprocessor is done?

PUSH_MAGIC;
PUSH_VAR(PASS_SYS_CALL);
INTERRUPT_SOFT;
CHECK_MAGIC;
becomes:
__asm volatile("pushl %%eax":: "a" (MAGIC1) : "%eax" );
__asm volatile("pushl %%eax":: "a" (PASS_SYS_CALL) : "%eax" );
__asm volatile("int $69\n\t"
               "popl _null\n\t");
__asm volatile("popl _magicCheck");
if( magicCheck != MAGIC1 ) {
   printf("CHECK MAGIC NUMBER ERROR (%d)\n", magicCheck );
   asm ("cli;hlt\n");
}
So what does that code do? Well, it starts by putting some data onto the stack; the first bit of data is a magic number, MAGIC1, which is actually also defined to be 0x12345678, but I left it unchanged to improve readability (we will take a closer look at magic numbers in a second). Then the code puts a flag on the stack: this system call that I am about to do, it is a Pass() system call. Then the code calls interrupt 69, which Insop used for all software interrupts. Now remember, we call the interrupt and then the next thing we know, we are back from a context switch. So we verify that the magic number is valid and then we return from the system call by ending the Pass() routine.

So what was the magic number for? Well, what if we pushed or popped too much stuff from the stack during the context switching? Our magic number would be all mangled up, and then we would know our context switching code is bad, so it's time to halt execution before we really mess things up.

In the above diagrams where would this code be? Well using the diagram that measures instruction against context (the diagram right under the line "Now how does the CPU feel about things?") everything up to, and including the line

__asm volatile("int $69\n\t" 
would be that red dot under User Program A. Everything after that line would be the stuff under the red dot.

From the kernel perspective what happens? Well, first we better start by searching the file kernelMain.cc for the string "xexitkernel"; kernelMain.cc is so huge that Insop put the name of every function, preceded by an "x", at the start of each function. This makes searching for function headers much easier. Now, let's start with enterKernel, which is just below exitKernel. Note that enterKernel and exitKernel are also listed at the end of this page, so you can see them without decompressing anything.

Now when the Interrupt Vectors were initialised the Interrupt Descriptor Table would have something set inside it so that when interrupt 69 (the software interrupt) was triggered the CPU would jump to the line in kernelMain.cc that read: "entryPtEnterKernel". Now what does that code do?

First the code CLears the Interrupt flag, then it saves the program context by PUSHing All the registers onto the stack. Then the code notes, in a register, that this was a software interrupt (as opposed to, say, a clock tick). Then the code PUSHes all the data segment registers onto the stack. Then the code copies a static value from memory into the Data Segment register so that the kernel can restore its own stack. Then the kernel copies critical data from the stack into variables that will later be copied into the PCB (or TD). And then we are in the realm of C code and things start to become sane again. Except, we went from running user code to running kernel code.

Now, what happens when we want to exit the kernel and return to user code? Well lets look at exitKernel just above enterKernel.

What happens in Exit Kernel? Well first we CLear Interrupts. Next we save Kernel Context by pushing all the registers on to the stack. Then we set the stack pointer and stack segment to point to the user stack, by obtaining the value for the ESP and SS from the PCB or TD. And then we iret which means Return from Interrupt.

Note: this code was written for GCC, which uses AT&T syntax, so some people used to Intel syntax may be slightly confused. movw %ax,%ds means: MOVe Word from ax to ds. See Brennan's Guide to Inline Assembly for more details. The main thing is that gcc demands the source come before the destination. The other thing I want to mention is that since ds is only a 16 bit register we only need the low half of eax, hence that mov was a movw instead of a movl (MOVe Long).

End Of Chapter

Wow, I guess that about covers the basics of context switching. It's pretty brutal stuff - in fact I would have to say that context switching is the hardest thing a programmer can ever take on, on a scale of time to write fully debugged code divided by number of lines of code. In fact, once you get context switching working the rest of it is pretty trivial stuff. Right now I do not have a Linux machine handy to go through the src/ tree, but the context switching code is in the file called entry.S, I think. So if you would like to see another example also written for GCC on the x86, take a look at Linux.

[BACK] Take me back to the table of contents.

1
I skipped a really big point here. What happens if the stack keeps growing beyond the expectations created at load time, that is, what do you do if the stack is overwriting data? Well, basically you have to kill the offending application, stack data and all - since the data is corrupted and is useless anyway. Does this problem happen in the Real World? YES! In a paging system this should never come up if the system is designed properly; but in a non paging system it is easy to try to save memory by keeping the stack as small as possible, especially during application development when stack size requirements are uncertain, and then Kernel Panics (Blue Screens of Death) or Segmentation Violations are frequent and difficult to pin down. Yet all too often in the development of FEMOS Insop and I would spend hours studying code only to find that our stack for some silly application just was not big enough, and as soon as we increased our stack by a factor of ten everything worked beautifully.

How do you avoid a data overrun? Use a "canary": a data value you put right at the start of your data segment. If the stack from the application adjacent to this application grows out of control, your canary gets wiped out. Then when you check your canary (something that should be done on each context switch) and find that the canary has been wiped out, you kill the application with the corrupted data segment - and possibly the application with the runaway stack.
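A minimal sketch of the canary idea, with names invented for this example (not FEMOS code): plant a known value at the very bottom of the data segment and look at it on every context switch.

#define CANARY_VALUE 0xDEADBEEFul

/* Planted at the very start of a task's data segment when the task is created. */
void plant_canary(unsigned long *dataSegmentBase)
{
    *dataSegmentBase = CANARY_VALUE;
}

/* Checked on every context switch; non-zero means a neighbouring task's stack
   has grown down into this task's data segment. */
int canary_trampled(const unsigned long *dataSegmentBase)
{
    return *dataSegmentBase != CANARY_VALUE;
}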

Alright, where in that pretty diagram is the text (code)? Well, it's somewhere else; typically in an embedded operating system, when the code (Kernel and user code) gets loaded, some space is set aside for the Kernel data and stack and then the Kernel starts to run. Only after the Kernel starts to run do we create data spaces for applications. Thus all the data spaces are adjacent.

2
Sorry, I should say, it is always the case that when a context switch occurs the Kernel executes for a little while between user programs.

3
C is a funny language in this respect. When you compile code, first a text processor sweeps over your code looking for lines that begin with a pound '#' sign. These lines are directives to the text processor, which removes the actual line and takes action based on what the line reads.

Assembly listings
The assembly listing for the Pass() loop program, followed by exitKernel and enterKernel from kernelMain.cc, are below.

#include "syscall.h"
main() {
   int i;
   for (i = 0; i < 10; i++) {
      Pass();
   }
}
becomes:
        .file   "syscall.c"
        .text
        .align 2
.globl main
        .type   main,@function
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        andl    $-16, %esp
        movl    $0, %eax
        subl    %eax, %esp
        movl    $0, -4(%ebp)
.L2:
        cmpl    $9, -4(%ebp)
        jle     .L5
        jmp     .L3
.L5:
        call    Pass
        leal    -4(%ebp), %eax
        incl    (%eax)
        jmp     .L2
.L3:
        leave
        ret
.Lfe1:
        .size   main,.Lfe1-main
        .ident  "GCC: (GNU) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)"
exitKernel and enterKernel
Note that exitKernel comes first. The first program to run is the OS itself, then the OS exits to a user program (context switch) and starts right up again in enterKernel.
/********************************************************************
*
* exitKernel
*
*******************************************************************
*xexitkernel
*/
interruptTestVal++;
asm volatile("
      .text
      .global _entryPtExitKernel
_entryPtExitKernel:
      cli
      pushal
      pushw %gs
      pushw %fs
      pushw %es
      pushw %ds

# store kernel's sp 
      movl %esp,_kernelSP

# get user process's sp and ss
      movw _userDS,%ss
      movl _userSP,%esp

      movw _userDS, %ax
      movw  %ax,%ds

      popw %ds
      popw %es
      popw %fs
      popw %gs
      popal
      iretl

      ");

/*************************************************************
*
* enterKernel
*
**************************************************************
*xenterkernel
*/
asm volatile("
      .text
      .global _entryPtEnterKernel
      .global _entryPtClockTick
      .global _entryPtSerial0
      .global _entryPtSerial1
_entryPtEnterKernel:
      cli
      pushal
      movl  $0, %ebx
      jmp CommonPartHandler

# entry point for clock tick (PIT) type 0
_entryPtClockTick:
      cli
      pushal
      movl  $1, %ebx
      jmp CommonPartHandler

      nop

# entry point for Serial 0 int type 2
_entryPtSerial0:
      cli
      pushal
      movl  $2, %ebx
      jmp CommonPartHandler

      nop

# entry point for Serial 1  type 3
_entryPtSerial1:
      cli
      pushal
      movl  $3, %ebx
      jmp CommonPartHandler

      nop

CommonPartHandler:
      pushw %gs
      pushw %fs
      pushw %es
      pushw %ds

      pushw %ds

# set kernel DS
      movl  $0x38,%eax
      movw  %ax,%ds
      movw  %ebx,_InterruptType

# set user DS
      popw  _userDS

# save user SS, SP
      movw  %ss,_userSS
      movl  %esp,_userSP

# set kernel SS SP
      movw  %ds,%ax
      movw  %ax,%ss
      movl  _kernelSP,%esp

# restor kernel's registers set
      popw  %ds
      popw  %es
      popw  %fs
      popw  %gs
      popal
      ");