Mixing Assembly and C-code


Why mix programming languages?

After the last tutorial, you now feel like king of the world! =) You're eager to jump into the action, but there's one problem. Even though assembly is a powerful language, it takes time to read, write and understand. This is the main reason there ARE more programming languages than just assembly =).
Now that we have a working 32-bit boot sector, we want to be able to continue our development in a higher-level language whenever possible. C is my main choice, because it's common and powerful. If you think C is old and want to use C++ instead, I'm not stopping you. The choice is yours to make.
Say that we want a print() function instead of addressing the video memory directly. Also, we want a clrscr() to clear the screen. This could easily be done with a for-loop in C. We can't call functions in other files from a flat binary file (e.g. our boot sector), so we create another file, from which we will operate after the boot sector is done. This file is called 'main.c'. It will contain the main() function - yes, even operating systems can't escape main() =). Since the boot sector can't call main() directly, we instead read the following sector(s) from the boot disk, load them into memory and finally jump to that memory address. We can do this the hard way using ports or the easy way using the BIOS interrupts (while we're still in Real mode). I choose the easy way, as always.

How do I do this?

We start as always, by creating a file (tutor3.asm) and typing:
[BITS 16]

[ORG 0x7C00]
When the BIOS jumps to our boot sector, it doesn't leave us empty handed. For example, to read a sector from the disk, we have to know what disk we are resident on. Probably a floppy disk, but it could as well be one of the hard drives. To let us know this, the BIOS is kind enough to leave that information in the DL register.
To read a sector, the INT 13h is used. First of all, we have to 'reset' the drive for some reason. This is just for security. Just put 0 in AH for the RESET-command. DL specifies the drive and this is already filled in by our friend, the BIOS. The INT 13h returns an error code in the AH register. This code is 0 if everything went OK. We assume that the only thing that can go wrong, is that the drive was not ready. So if something went wrong, just try again.
reset_drive:
mov ah, 0
   int 13h
   or ah, ah
   jnz reset_drive
INT 13h takes quite a few parameters when it comes to reading a sector from the disk into memory. This table should clarify them a bit.
Register   Function
ah         Command - 02h for 'Read sector from disk'
al         Number of sectors to read
ch         Disk cylinder
cl         Disk sector (starts with 1, not 0)
dh         Disk head
dl         Drive (same as the RESET-command)
Now, where shall we put our kernel? We have the whole memory to ourselves. Well, not the reserved parts, but almost the whole memory. Remember, we placed our stack in 090000h-09FFFFh. I choose 01000h for our 'kernel code'. In real mode (we haven't switched yet), this is represented by 0000:1000. INT 13h takes this address from es:bx. We read two sectors, just in case our code happens to get bigger than 512 bytes (likely).
mov ax, 0
mov es, ax
mov bx, 0x1000
Followed by the INT 13h parameters and the interrupt call itself.
mov ah, 02h
mov al, 02h
mov ch, 0
mov cl, 02h
mov dh, 0
int 13h
or ah, ah
jnz reset_drive
Now, we should have the next two sectors on the disk at memory address 01000h. Just continue with the code from tutorial 2 with two little adjustments. First, now that we're going to clear the screen, we don't need our 'P' at the top left corner anymore. And instead of hanging the computer, we will now jump to our new C-code.
cli
xor ax, ax
.
.
.
mov ss, ax
mov esp, 090000h
Now, we want to jump to our code segment (08h) and offset 01000h. Remember, we didn't want our 'P' either. Change the following four lines:
mov byte [ds:0B8000h], 'P'
   mov byte [ds:0B8001h], 1Bh

hang:
   jmp hang
To:
jmp 08h:01000h
Don't forget to fill the rest of the file...
gdt:
gdt_null:
.
.
.
times 510-($-$$) db 0
   dw 0AA55h
Moving on to actually writing the second sector =). This should be our main(). Our main() function should be declared as void and not as int. What should it return the integer to? We must declare the constant message string here, because I don't know how to relocate constant strings within a file (anyone know how to do this?). This works, but it's kind of ugly...
const char *tutorial3;
I always put the word const in, whenever possible. That's because it keeps me from making mistakes. Sometimes, it's good and some times it ain't. Most of the time it's good to have it.
First of all, we want to clear the screen, then we print our message and go into an infinite loop (hang). Simple as that.
void main()
{
   clrscr();
   print(tutorial3);
   for(;;);
}
But wait a minute?! You haven't declared clrscr() or print() anywhere? What's up with that? No, that's true. Because of my lack of knowledge of the linker, I don't know how to do that. This way, if we spelled everything right, the linker finds the appropriate function. If not, our OS will triple fault and die/reset. Ideas are welcome here...
After main(), we place our string. After that, main.c is complete!
const char *tutorial3 = "MuOS Tutorial 3";
Now for our other functions. We place them in a file called 'video.c'. clrscr() is the easy one, so let's start with that.
void clrscr()
{
We know that the video memory is resident at 0xB8000. So we start by assigning a pointer to that location.
unsigned char *vidmem = (unsigned char *)0xB8000;
To clear the screen, we just set the ASCII character at each position in the video memory to 0. A standard VGA console is initialized to 80x25 characters. As I told you in tutorial 2, the even memory addresses contain the ASCII code and the odd addresses the color attribute. By default, our color attributes should be 0Fh: white on black background, non-blinking. All we have to do is make a simple for-loop.
const long size = 80*25;
   long loop;

   for (loop=0; loop<size; loop++) {
      *vidmem++ = 0;
      *vidmem++ = 0xF;
   }
Now for the cursor position. If we cleared the screen, we also want our cursor to be in the top left corner. To change the cursor position, we have to use two assembly commands: in and out. The computer has ports, which are a way to communicate with the hardware. If you want to learn more, have a look at Chapter 9 in Intel's first manual(1.1MB PDF).
It's a little tricky to change the cursor position. We have two ports: 0x3D4 and 0x3D5. The first one is an index register and the second a data register. This means that we specify what we want to read/write with 0x3D4 and then do the actual reading and/or writing from/to 0x3D5. This register set is called CRTC and contains functions to move the cursor position, scroll the screen and some other things.
The cursor position is divided into two registers, 14 and 15 (0xE and 0xF in hex). This is because one register is just 8 bits long and with that, you could only specify 256 different positions. 80x25 is larger than that, so the position was divided into two registers. Register 14 is the MSB of the cursor offset (from the start of the video memory) and 15 the LSB. We call a function out(unsigned short _port, unsigned char _data). This doesn't exist yet, but we'll write it later.
out(0x3D4, 14);
   out(0x3D5, 0);
   out(0x3D4, 15);
   out(0x3D5, 0);
}
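To make the register split a bit more concrete, here is a minimal sketch in C. The helper names (cursor_offset, cursor_high, cursor_low) are made up for this illustration and are not part of the tutorial's source. The sketch only computes the values that would go to registers 14 and 15 and leaves the out() calls aside.

```c
/* Sketch only: helper names are hypothetical, not from the tutorial's source. */
#define SCREEN_COLS 80

/* Offset of a character cell, counted in cells from the top left corner. */
unsigned short cursor_offset(unsigned row, unsigned col)
{
    return (unsigned short)(row * SCREEN_COLS + col);
}

/* Value for CRTC register 14 (bits 8-15 of the offset). */
unsigned char cursor_high(unsigned short offset)
{
    return (unsigned char)(offset >> 8);
}

/* Value for CRTC register 15 (bits 0-7 of the offset). */
unsigned char cursor_low(unsigned short offset)
{
    return (unsigned char)(offset & 0xFF);
}
```

The last cell of an 80x25 screen has offset 1999, which does not fit in one byte, so both registers are needed.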
Now, to write the out() and in() functions, we need some assembly again. This time, we can stick to C and use inline assembly. We put them in a separate file called 'ports.c'. First, we have the in() function.
unsigned char in(unsigned short _port)
{
This is just one assembly line, so if you want to know more about the in command, look in Intel's second manual(2.6MB PDF). Inline assembly is kind of special in GCC. First you program all your assembly stuff and then you specify outputs and inputs. We have one input and one output. The input is our port and the output is the value received from in.
unsigned char result;
   __asm__ ("in %%dx, %%al" : "=a" (result) : "d" (_port));
   return result;
}
This looks rather messy, but I'll try to explain. The two %% say that this is a register. If we don't have any inputs or outputs, only one % is required. After the first ':', the outputs are lined up. The "=a" (result) tells the compiler to put result = EAX. If I'd write "=b" instead, then result = EBX. You get the point. If you want more than one output, just put a ',' and write the next and so on. Now to the inputs, after the second ':'. "d" specifies that EDX = _port. Same as output, but without the '='. Plain and simple =).
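Port I/O can't be tried out in an ordinary user program, so here is a small sketch of the same constraint syntax using a harmless add instruction instead. The function name add_regs is made up for this example, and the sketch assumes GCC (or a compatible compiler) on x86. On anything else it falls back to plain C.

```c
/* Sketch only: add_regs is a hypothetical name used to show the constraints. */
unsigned int add_regs(unsigned int x, unsigned int y)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned int sum;
    /* Inputs: "a" puts x in EAX, "c" puts y in ECX.
       Output: "=a" reads the result back from EAX.
       "cc" tells the compiler that the flags are changed. */
    __asm__ ("addl %%ecx, %%eax" : "=a" (sum) : "a" (x), "c" (y) : "cc");
    return sum;
#else
    return x + y;   /* fallback so the sketch still builds elsewhere */
#endif
}
```

Calling add_regs(2, 3) gives 5 either way. The point is only to show where outputs and inputs go relative to the ':' separators.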
Now to the out(). Same as for in(), but with no outputs and two inputs instead. I hope this speaks for itself.
void out(unsigned short _port, unsigned char _data)
{
   __asm__ ("out %%al, %%dx" : : "a" (_data), "d" (_port));
}

Then we have the print(). Three variables are needed. One pointer to the videomemory, one to hold the offset of the cursor position and one to use in our print-loop.
void print(const char *_message)
{
   unsigned char *vidmem = (unsigned char *)0xB8000;
   unsigned short offset;
   unsigned long i;
We want print() to write at the cursor position. This is read from the CRTC registers with the in() function. Remember that register 14 holds bits 8-15, so there we need to left shift the bits we read. We increase the vidmem pointer by two times offset, because every character has both ASCII code and a color attribute.
out(0x3D4, 14);
   offset = in(0x3D5) << 8;
   out(0x3D4, 15);
   offset |= in(0x3D5);

   vidmem += offset*2;
With a correct vidmem pointer, we're all set to start printing our message. First we initialize our loop variable i. The loop should execute as long as the value we are next to print, is non-zero. Then we simply copy the value into vidmem and increase vidmem by two (we don't want to change the color attribute).
i = 0;
   while (_message[i] != 0) {
      *vidmem = _message[i++];
      vidmem += 2;
   }
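Our message is printed and all that is left to do is to change the cursor position. Again, this is done with out() calls. The CRTC index register still points at register 15 from the read above, so the low byte can be written right away.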
offset += i;
   out(0x3D5, (unsigned char)(offset));
   out(0x3D4, 14);
   out(0x3D5, (unsigned char)(offset >> 8));
}
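If you want to check the character/attribute layout without touching real video memory, the print loop can be sketched against an ordinary buffer. The function put_string and its parameters are invented for this illustration. It is not the tutorial's print(), just the same loop with the CRTC accesses left out.

```c
/* Sketch only: writes into a caller-supplied buffer instead of 0xB8000. */
unsigned short put_string(unsigned char *vidmem, unsigned short offset,
                          const char *message)
{
    unsigned char *cell = vidmem + offset * 2;  /* two bytes per character cell */
    unsigned short i = 0;

    while (message[i] != 0) {
        *cell = (unsigned char)message[i++];    /* even address: ASCII code */
        cell += 2;                              /* skip the attribute byte */
    }
    return (unsigned short)(offset + i);        /* new cursor offset */
}
```

The returned value is what would be split into the two cursor registers afterwards.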
To compile, we start with the boot sector.
nasmw -f bin tutor3.asm -o bootsect.bin
For the rest of the C-files, we first compile each file separately and then link them together.
gcc -ffreestanding -c main.c -o main.o
gcc -c video.c -o video.o
gcc -c ports.c -o ports.o
ld -e _main -Ttext 0x1000 -o kernel.o main.o video.o ports.o
ld -i -e _main -Ttext 0x1000 -o kernel.o main.o video.o ports.o
objcopy -R .note -R .comment -S -O binary kernel.o kernel.bin
'-i' says that the build should be incremental. First link without it, because when '-i' is used, the linker doesn't report unresolved symbols (misspelled function names for example). When it links without errors, add '-i' to reduce the size. '-e _main' specifies the entry symbol. '-Ttext 0x1000' tells the linker that we are running this code at memory address 0x1000. Then we just specify what output format we want, the output file name and list our .o-files, starting with main.o (important!). The objcopy line turns the .o-file into a plain binary file by removing some sections.
We're not done yet. We have our boot sector and our kernel. The boot sector assumes that the kernel is resident in the two following sectors on the same disk. So, we need to make them into one file. For this, I've made a special program in C. I'm not going into any details about it, but I'll include the source code.
The program is called 'makeboot' and takes at least three parameters. The first one is the output file name. This can be 'a.img' in our case. The rest of the parameters are input files, read in order. We want our boot sector to be placed first and then our kernel.
makeboot a.img bootsect.bin kernel.bin
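Since the boot sector only reads two sectors, it is worth checking that kernel.bin actually fits. The following is a small sketch of that arithmetic. It is not the makeboot source, and the names sectors_needed and kernel_fits are made up for the example.

```c
/* Sketch only: not the tutorial's makeboot; names are hypothetical. */
#define SECTOR_SIZE 512UL

/* Number of whole sectors needed to hold size bytes. */
unsigned long sectors_needed(unsigned long size)
{
    return (size + SECTOR_SIZE - 1) / SECTOR_SIZE;
}

/* Non-zero if an image of size bytes fits in the sectors the boot sector loads. */
int kernel_fits(unsigned long size, unsigned long sectors_loaded)
{
    return sectors_needed(size) <= sectors_loaded;
}
```

With the boot sector from this tutorial, sectors_loaded would be 2, so a kernel.bin of up to 1024 bytes is loaded completely. If it grows beyond that, the value in al has to grow too.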
Just run bochs with a.img and you should get a cleared screen with 'MuOS Tutorial 3' printed in the top left corner.
Download the complete source for this tutorial, including makeboot and a .bat-file for compiling.

LBA to CHS


This is a (very) simple LBA to CHS tutorial. I assume that you know how to add, subtract, divide and multiply, and that you know what Assembly is and possibly the basics of it (though I go over the basics to remind you in case you have forgotten).

Brief idea of what the physical drive is like:

A normal floppy drive (which I will use for this example) is addressed in terms of 3 main parts.
Sector: A fixed-size chunk of a track (normally 512 bytes).
Cylinder: aka Track, one circle at the same radius from the center.
Head: The top or bottom side of the disk. (In hard disks you have multiple magnetic disks.) It is also the head that contains the mechanism to read and write to the disk.
So to address any part of the drive you have to say:
1. Top or bottom
2. How far to move the head from the center
3. How far to move the disk round.

Ok so what is LBA?

LBA is logical addressing of your physical drive.
Or in simple terms you refer to sectors on the floppy or hard disk as 0,1,2,3,4,5,.....

Ok so what is CHS?

CHS is the method the drive uses to load sectors from the floppy or hard disk.
It is composed of a Cylinder, a Head and a Sector.

Maths!!!

If you are anything like me you will shudder at the notion of maths. In order to understand this you will have to know at least basic division, which is not too hard.
If you look at most sites they will give you this set of formulas:
Sector = (LBA mod SectorsPerTrack)+1
Cylinder = (LBA/SectorsPerTrack)/NumHeads
Head = (LBA/SectorsPerTrack) mod NumHeads

Is there a different way?

Well this is the only formula which works, so don't go looking for any other way to do this, as an easier one doesn't exist.

So what does it all mean?

mod stands for modulus in the formulas I presented.
/ stands for divide.
An easy way I've found to look at the mod's is as a note to take the remainder value rather than the quotient value.
Let's re-write the formulas with that in mind:
Sector = (LBA/SectorsPerTrack) Remainder value + 1
Cylinder = (LBA/SectorsPerTrack)/NumHeads (Take quotient value)
Head = (LBA/SectorsPerTrack)/NumHeads (Take remainder value)
(The plus 1 on the sector is there because sectors are numbered from 1 rather than 0, unlike cylinders and heads. Miss it out and you will have interesting things happen.)
Ok still confused?
LBA/Sectors Per Track:
Quotient - NumberOfTracks     Remainder + 1 - Sector
But you can't just use the number of tracks as it is: it has to be broken down into cylinders and heads!
Number Of Tracks/Number Of Heads:
Quotient - Cylinder     Remainder - Head
Cylinder is half the number of tracks, head value will alternate (0,1 on a floppy) with each increase of the NumberOfTracks value.
Ok I don't think I can make the formulas any simpler I'm afraid. So I'm going to move on to the next section.
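For those who find C easier to read than formulas, the two divisions can be sketched like this. The struct and function names are made up for the illustration. The arithmetic is exactly the three formulas given above.

```c
#include <assert.h>

/* Sketch only: struct and function names are hypothetical. */
struct chs { unsigned cylinder, head, sector; };

struct chs lba_to_chs(unsigned lba, unsigned spt, unsigned num_heads)
{
    struct chs r;
    unsigned num_tracks;

    assert(spt != 0 && num_heads != 0);     /* dividing by zero makes no sense */
    num_tracks = lba / spt;                 /* quotient of the first divide */
    r.sector   = lba % spt + 1;             /* remainder + 1 */
    r.cylinder = num_tracks / num_heads;    /* quotient of the second divide */
    r.head     = num_tracks % num_heads;    /* remainder of the second divide */
    return r;
}
```

On a 1.44MB floppy (18 sectors per track, 2 heads), LBA 0 comes out as cylinder 0, head 0, sector 1, and LBA 18 as cylinder 0, head 1, sector 1.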

The vague algorithm:

(SPT = Sectors Per Track)
(All divides return: Quotient, Remainder in this algorithm)
LBA / SPT -> NumTracks, Remainder
Sector = Remainder + 1
NumTracks / NumHeads -> Cylinder, Head

Assembly Language introduction:

ASM terms:
Registers - In simple terms memory/variables of a fixed size in the CPU
   - In the code I'm writing the size is 16 bits for each register.
ASM commands used:
The DIV command is very important in this function.
DIV reg ; Divides DX:AX (just AX when DX is zero) by the register you enter as reg
   ; Output value AX = Quotient DX = Remainder.
The MOV command is simple but used often.
MOV dest, source ; Move the source to the destination.
The INC command is again simple but is needed.
INC reg ; Adds 1 to the register you pass to the command.
The XOR command is mostly used for turning a register to zeros. It compares each bit of two registers and outputs zero for every bit where they are set the same.
XOR ax, ax ; This will zero ax (will work with any register)
The RET command is used to return to the main program.
RET ; Returns to the main program
PUSH reg ; Puts the register's value onto the stack
POP reg ; Restores a register's value using the value on the stack
In order to learn more about Asm I suggest looking at the Art Of Asm website (the location changes and this may not be updated regularly, so it's best to look for it in a search engine). The book 'Assembly Language Step By Step' by Jeff Duntemann is also very useful for beginners to Assembly Language programming. To find out more about the commands I put up, look at the Intel Reference manual; the NASM documentation also has a reference section. (Quite likely similar documentation is also around elsewhere.)
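One detail of DIV is easy to trip over: with a 16-bit operand the CPU divides the 32-bit value DX:AX, not just AX, so whatever is left in DX counts as the upper half of the dividend. The sketch below models that in plain C. The names div16 and div_result are made up for the illustration.

```c
#include <assert.h>

/* Sketch only: a C model of the 16-bit DIV instruction, names are hypothetical. */
struct div_result { unsigned short ax, dx; };

/* (DX:AX) / reg -> AX = quotient, DX = remainder */
struct div_result div16(unsigned short dx, unsigned short ax, unsigned short reg)
{
    unsigned long dividend = ((unsigned long)dx << 16) | ax;
    struct div_result r;

    assert(reg != 0);                       /* the CPU raises a divide error here */
    assert(dividend / reg <= 0xFFFFUL);     /* ...and when the quotient overflows AX */
    r.ax = (unsigned short)(dividend / reg);
    r.dx = (unsigned short)(dividend % reg);
    return r;
}
```

With DX zeroed, 37 / 18 gives quotient 2 and remainder 1 as expected. With a stale 1 in DX and 0 in AX, the dividend is really 65536 and the result is 3640 remainder 16, which is why DX is usually cleared with XOR dx, dx before dividing.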

Simple Assembly example:

(This is not likely to work if you cut and paste it and is here to show the principle only, look at the next section for a complete example)
LBACHS:
; Set up the registers ready for the divide
MOV ax, [LBAvalue] ; []'s means value at memory location LBAvalue.
XOR dx, dx         ; DIV divides DX:AX, so the upper half must be zero
; Make the divide
DIV WORD [SectorsPerTrack] ; Carry out the division of ax.
; Put the returned Number Of Tracks some where
MOV [NumTracks], ax   ; Put the quotient into a memory variable
; Sort out the sector value
INC dx        ; Add 1 to the remainder
MOV [Sector], dx     ; Put the remainder into a memory variable

; Set up the registers ready for the divide
MOV ax, [NumTracks]    ; Put the number of tracks in to ax
XOR dx, dx             ; Zero the upper half again
; Make the divide
DIV WORD [NumHeads]  ; Divide NumTracks (ax) by NumHeads
; Stash the results in some memory locations
MOV [Cylinder], ax     ; Quotient value, the cylinder value to be moved from ax
MOV [Head], dx        ; Remainder value, the head value to be moved from dx

Advanced Assembly example:

(This will most likely work, but like any of my code I can't say for certain, being too lazy to test it at the time of writing this tutorial)
; Compile using NASM compiler (Again look for it using a search engine)
; Input: ax - LBA value
; Output: ax - Sector
;   bx - Head
;   cx - Cylinder

LBACHS:
 PUSH dx   ; Save the value in dx
 XOR dx,dx  ; Zero dx
 MOV bx, [SectorsPerTrack] ; Move into place SPT (LBA already in place)
 DIV bx   ; Make the divide (ax/bx -> ax,dx)
 inc dx   ; Add one to the remainder (sector value)
 push dx   ; Save the sector value on the stack

 XOR dx,dx  ; Zero dx
 MOV bx, [NumHeads] ; Move NumHeads into place (NumTracks already in place)
 DIV bx   ; Make the divide (ax/bx -> ax,dx)

 MOV cx,ax  ; Move ax to cx (Cylinder)
 MOV bx,dx  ; Move dx to bx (Head)
 POP ax   ; Take the last value entered on the stack off.
   ; It doesn't need to go into the same register.
   ; (Sector)
 POP dx   ; Restore dx, just in case something important was
   ; originally in there before running this.
 RET   ; Return to the main function
I hope this has helped anyone with LBA to CHS translation.

Loading Sectors


This tutorial is to show you how to load sectors from a floppy disk. Variations will be needed to get this code to work on most other disks.
In the bootloader code you can use a specific BIOS interrupt to load the sectors from the floppy disk for you. But unfortunately it addresses the drive in the form of heads, cylinders and sectors.
Ideally we want to be able to refer to the floppy disk in terms of sectors only (chunks of 512 bytes). The code to do that is in the LBA to CHS tutorial, though to use that you first need to understand this tutorial.
Explanation of the CHS addressing system:
CHS simply stands for Cylinder, Head, Sector. This is the addressing system used by the low level BIOS functions and the likes.
The definitions of each are:
Sector = Chunk of data on the disk (normally 512 bytes) - Segment of a cylinder (aka track)
Cylinder = This is a track on the disk (normally contains 18 sectors)
Head = Which side of the disk (most floppies now are double sided, 2 heads)
So the first 18 sectors are on cylinder 0, head 0 (the first track on the first side; the BIOS counts cylinders and heads from 0 and sectors from 1). After that it gets complicated: the following sectors are on varying cylinders, and the head number alternates from one track to the next.
This would be better explained with a diagram but that is yet to come I'm afraid though I may include one at a later date.
Explanation of the interrupt function we will be using:
We will be using interrupt 0x13 (anything starting with 0x or ending with h is a hexadecimal number)
Function number passed in ah (Most int's use ah to define a specific function) = 2 (Read sector)
The values to be passed in the rest of the registers:
al = Number of sectors to read (To be safe I wouldn't cross the cylinder boundary)
Addressing registers (think of a postal address):
dh = Head to read from (aka side)       - Town/City
dl = Drive to read from                 - Country
ch = Cylinder (aka track) to read from  - Street name
cl = Sector to read                     - House number
Output address:
es = Segment to put loaded data into    - Street name
bx = Offset to put loaded data into     - House number in the street
An example to load the first sector of a floppy disk would be:
ah=2(Function number),al=1(1 sector to read),dh=0(First head)
dl=0(default for floppy drive),ch=0(First cylinder),cl=1(First sector)
es=1000h(Put the output in segment 1000h, i.e. at 10000h in memory), bx=0(Offset of 0)
To get beyond the 18th sector though you would have to change the head and cylinder values as appropriate.
Example of the code working:
; Code to load the second sector on the disk into memory location 0x2000:0x0000
 mov bx, 0x2000 ; Segment location to read into (remember can't load direct to segment register)
 mov es, bx
 mov bx, 0 ; Offset to read into
 mov ah, 02 ; BIOS read sector function
 mov al, 01 ; read one sector
 mov ch, 00 ; Track to read (tracks count from 0)
 mov cl, 02 ; Sector to read (sectors count from 1)
 mov dh, 00 ; Head to read (heads count from 0)
 mov dl, 00 ; Drive to read
 int 0x13  ; Make the BIOS call (int 13h contains mainly BIOS drive functions)
I recommend using the LBA to CHS code from one of my other tutorials to get past the cylinder and head addressing problems. In order to use that you put the code into a loop and read one sector at a time like so:
Get and set the output location in memory, get the start location, get the number of sectors to load.
Loop 'number of sectors to load' times:
Run LBA to CHS (to convert the sector number into a cylinder, head and sector)
Run int 0x13 to load the sector using the data output by LBA to CHS.
Increase bx by the number of bytes per sector (512) ready for the next sector.
This code is often best put into a procedure and called as needed to load sectors off a floppy disk.
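The loop described above can be sketched in C without any BIOS calls, just to show which CHS values and buffer offsets each pass would use. Everything here (read_step, plan_reads, the parameter names) is made up for the illustration. A real loader would issue int 0x13 where this sketch merely records the values.

```c
#include <assert.h>

#define BYTES_PER_SECTOR 512u

/* Sketch only: one entry per int 0x13 call the loop would make. */
struct read_step { unsigned cylinder, head, sector, buffer_offset; };

/* Fills steps[] with one entry per sector and returns the number of entries. */
unsigned plan_reads(unsigned start_lba, unsigned count,
                    unsigned spt, unsigned num_heads,
                    struct read_step *steps)
{
    unsigned i;

    assert(spt != 0 && num_heads != 0);
    for (i = 0; i < count; i++) {
        unsigned lba = start_lba + i;
        unsigned track = lba / spt;

        steps[i].cylinder = track / num_heads;
        steps[i].head = track % num_heads;
        steps[i].sector = lba % spt + 1;
        steps[i].buffer_offset = i * BYTES_PER_SECTOR;  /* bx grows by 512 each pass */
    }
    return count;
}
```

Reading three sectors starting at LBA 17 on a 1.44MB floppy crosses a track boundary: the first read is head 0, sector 18, and the next two are head 1, sectors 1 and 2.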
A complete example of such a procedure is:
; Load kernel procedure
LoadKern:
        mov bx, 0x2000  ; Segment 0x2000
        mov es, bx      ;  again remember segments must be loaded from non immediate data
        mov bx, 0x0000  ; Start of segment - offset value
.readsector:
        mov ah, 0x02    ; Read Disk Sectors (set again on every try, the BIOS returns its status in ah)
        mov al, 0x01    ; Read one sector only (512 bytes per sector)
        mov ch, 0x00    ; Track 0
        mov cl, 0x02    ; Sector 2
        mov dh, 0x00    ; Head 0
        mov dl, 0x00    ; Drive 0 (Floppy 1) (This can be replaced with the value in BootDrv)
        int 0x13        ; Call BIOS Read Disk Sectors function
        jc .readsector  ; If there was an error, try again

        mov ax, 0x2000  ; Set the data segment register
        mov ds, ax      ;  to point to the kernel location in memory

        jmp 0x2000:0x0000       ; Jump to the kernel

A complete example of a procedure that includes the LBA to CHS conversion (see the LBA to CHS tutorial for details, though this uses a different version of that procedure):
; Procedure ReadSectors - Reads sectors from the disk.
;  Input: cx - Number of sectors; ax - Start position
;  Output: Loaded file into: es:bx

ReadSectors:
.MAIN:                          ; Main Label
        mov di, 5               ; Loop 5 times max!!!
.SECTORLOOP:
        push ax                 ; Save register values on the stack
        push bx
        push cx
        call LBAtoCHS             ; Change the LBA addressing to CHS addressing
        ; The code to read a sector from the floppy drive
        mov ah, 02              ; BIOS read sector function
        mov al, 01              ; read one sector
        mov ch, BYTE [absoluteTrack]    ; Track to read
        mov cl, BYTE [absoluteSector]   ; Sector to read
        mov dh, BYTE [absoluteHead]     ; Head to read
        mov dl, BYTE [BootDrv]          ; Drive to read
        int 0x13                ; Make the BIOS call
        jnc .SUCCESS
        dec di                  ; Decrease the counter
        pop cx                  ; Restore the register values
        pop bx
        pop ax
        jnz .SECTORLOOP         ; Try the command again in case the floppy drive is being annoying (pop leaves the flags from dec di alone)
        call ReadError          ; Call the error command in case all else fails
.SUCCESS:
        pop cx                  ; Restore the register values
        pop bx
        pop ax
        add bx, WORD [BytesPerSector]   ; Queue next buffer (Adjust output location so as to not over write the same area again with the next set of data)
        inc ax                          ; Queue next sector (Start at the next sector along from last time)
        ; I think I may add a status bar thing also. A # for each sector loaded or something.
        loop .MAIN                      ; One less sector left to read: loop decrements cx itself and jumps back while it is non-zero
.ENDREAD:                       ; End of the read procedure
        ret                     ; Return to main program
I have loads of variations of this code as I slowly improved it over various versions of my bootloader. I suggest looking at some of my source if you want a more detailed explanation of the source and to see it in context.
Once you have the data loaded you have to transfer control to it. As you should know if you know asm well, you can't modify the value in the IP register directly, so you have to set up the data segment registers and then jump to the new location.
The jump command needed is normally:
jmp segment:offset
eg: jmp 0x1000:0x0000
Normally for simple kernels you will leave the second part as 0x0000 and the first address should be equal to where you loaded the kernel in memory.
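If the segment:offset notation is new to you, the physical address it refers to in real mode is the segment times 16 plus the offset. A tiny sketch in C (the function name is made up for the example):

```c
/* Sketch only: physical address of a real-mode segment:offset pair. */
unsigned long real_mode_address(unsigned short segment, unsigned short offset)
{
    return ((unsigned long)segment << 4) + offset;
}
```

So 0x1000:0x0000 is physical address 0x10000, and 0x2000:0x0000 is 0x20000.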

That is all I have had time to write I am afraid. Not being paid for this :( I will hopefully come back and write this in a more legible form but that's in the future some time.
All of my examples are cuts and pastes from various versions of my bootloader. Those are probably not ideal examples as I have implemented some things in odd ways. But my bootloaders do work, which is the important thing. If you do use any of my code (no matter how small) I would appreciate being notified and my name mentioned with the source next to my code with a link to my website/details. To use any of my code in a commercial product requires my permission however!
I hope this has helped you with loading sectors directly.
If this has helped you please send me an e-mail saying so. (I like compliments)
If you want to see new things in here please say. If you want to translate this into another language please send me the new version so I can host that as an alternative. (I can translate copies of this if requested but the altavista/google/etc translators aren't quite perfected for large documents like this, and I would rather spend my time working on something else)

LBA HDD Access via PIO


Every operating system will eventually find a need for reliable, long-term storage. There are only a handful of commonly used storage devices:
  • Floppy
  • Flash media
  • CD-ROM
  • Hard drive
Hard drives are by far the most widely used mechanism for data storage, and this tutorial will familiarize you with a practical method for accessing them. In the past, a method known as CHS was used. With CHS, you specified the cylinder, head, and sector where your data was located. The problem with this method is that the number of cylinders that could be addressed was rather limited. To solve this problem, a new method for accessing hard drives was created: Logical Block Addressing (LBA). With LBA, you simply specify the address of the block you want to access. Blocks are 512-byte chunks of data, so the first 512 bytes of data on the disk are in block 0, the next 512 bytes are in block 1, etc. This is clearly superior to having to calculate and specify three separate bits of information, as with CHS. However, there is one hitch with LBA. There are two forms of LBA, which are slightly different: LBA28 and LBA48. LBA28 uses 28 bits to specify the block address, and LBA48 uses 48 bits. Most drives support LBA28, but not all drives support LBA48. In particular, the Bochs emulator supports LBA28, and not LBA48. This isn't a serious problem, but something to be aware of. Now that you know how LBA works, it's time to see the actual methods involved.
To read a sector using LBA28:
  1. Send a NULL byte to port 0x1F1: outb(0x1F1, 0x00);
  2. Send a sector count to port 0x1F2: outb(0x1F2, 0x01);
  3. Send the low 8 bits of the block address to port 0x1F3: outb(0x1F3, (unsigned char)addr);
  4. Send the next 8 bits of the block address to port 0x1F4: outb(0x1F4, (unsigned char)(addr >> 8));
  5. Send the next 8 bits of the block address to port 0x1F5: outb(0x1F5, (unsigned char)(addr >> 16));
  6. Send the drive indicator, some magic bits, and highest 4 bits of the block address to port 0x1F6: outb(0x1F6, 0xE0 | (drive << 4) | ((addr >> 24) & 0x0F));
  7. Send the command (0x20) to port 0x1F7: outb(0x1F7, 0x20);
To write a sector using LBA28:
Do all the same as above, but send 0x30 for the command byte instead of 0x20: outb(0x1F7, 0x30);
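To see how a 28-bit block address is spread over the registers, the steps above can be sketched as a pure function that only computes the byte for each port. The struct and function names are made up for this illustration. The actual outb() calls are left out so the sketch can run anywhere.

```c
/* Sketch only: computes the register values, performs no port I/O. */
struct lba28_regs {
    unsigned char sector_count; /* port 0x1F2 */
    unsigned char lba_low;      /* port 0x1F3, bits 0-7 */
    unsigned char lba_mid;      /* port 0x1F4, bits 8-15 */
    unsigned char lba_high;     /* port 0x1F5, bits 16-23 */
    unsigned char drive_head;   /* port 0x1F6, magic bits, drive, bits 24-27 */
};

struct lba28_regs lba28_pack(unsigned long addr, unsigned drive, unsigned char count)
{
    struct lba28_regs r;

    r.sector_count = count;
    r.lba_low  = (unsigned char)(addr & 0xFF);
    r.lba_mid  = (unsigned char)((addr >> 8) & 0xFF);
    r.lba_high = (unsigned char)((addr >> 16) & 0xFF);
    r.drive_head = (unsigned char)(0xE0 | ((drive & 1) << 4) | ((addr >> 24) & 0x0F));
    return r;
}
```

For block 0x0ABCDEF1 on the second drive this gives 0xF1, 0xDE, 0xBC for ports 0x1F3-0x1F5 and 0xFA for port 0x1F6.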
To read a sector using LBA48:
  1. Send two NULL bytes to port 0x1F1: outb(0x1F1, 0x00); outb(0x1F1, 0x00);
  2. Send a 16-bit sector count to port 0x1F2: outb(0x1F2, 0x00); outb(0x1F2, 0x01);
  3. Send bits 24-31 to port 0x1F3: outb(0x1F3, (unsigned char)(addr >> 24));
  4. Send bits 0-7 to port 0x1F3: outb(0x1F3, (unsigned char)addr);
  5. Send bits 32-39 to port 0x1F4: outb(0x1F4, (unsigned char)(addr >> 32));
  6. Send bits 8-15 to port 0x1F4: outb(0x1F4, (unsigned char)(addr >> 8));
  7. Send bits 40-47 to port 0x1F5: outb(0x1F5, (unsigned char)(addr >> 40));
  8. Send bits 16-23 to port 0x1F5: outb(0x1F5, (unsigned char)(addr >> 16));
  9. Send the drive indicator and some magic bits to port 0x1F6: outb(0x1F6, 0x40 | (drive << 4));
  10. Send the command (0x24) to port 0x1F7: outb(0x1F7, 0x24);
To write a sector using LBA48:
Do all the same as above, but send 0x34 for the command byte, instead of 0x24: outb(0x1F7, 0x34);
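The LBA48 sequence writes each address port twice, high-order byte first. As a sketch, the following function only works out which byte goes where. The names are made up for this illustration, and like the tutorial's own LBA48 steps it has not been tried against real hardware.

```c
#include <stdint.h>

/* Sketch only: byte values for ports 0x1F3, 0x1F4, 0x1F5 (in that order). */
struct lba48_regs {
    uint8_t first[3];   /* first write:  bits 24-31, 32-39, 40-47 */
    uint8_t second[3];  /* second write: bits 0-7, 8-15, 16-23 */
    uint8_t drive;      /* port 0x1F6 */
};

struct lba48_regs lba48_pack(uint64_t addr, unsigned drive)
{
    struct lba48_regs r;

    r.first[0]  = (uint8_t)(addr >> 24);
    r.first[1]  = (uint8_t)(addr >> 32);
    r.first[2]  = (uint8_t)(addr >> 40);
    r.second[0] = (uint8_t)addr;
    r.second[1] = (uint8_t)(addr >> 8);
    r.second[2] = (uint8_t)(addr >> 16);
    r.drive     = (uint8_t)(0x40 | ((drive & 1) << 4));
    return r;
}
```

Note that addr has to be a 64-bit type here, otherwise the shifts by 32 and 40 would not be valid.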



Once you've done all this, you just have to wait for the drive to signal that it's ready:
while (!(inb(0x1F7) & 0x08)) {}
And then read/write your data from/to port 0x1F0:
// for read:
for (idx = 0; idx < 256; idx++)
{
    tmpword = inw(0x1F0);                                 // one 16-bit word per access
    buffer[idx * 2] = (unsigned char)tmpword;             // low byte first
    buffer[idx * 2 + 1] = (unsigned char)(tmpword >> 8);  // then high byte
}
// for write:
for (idx = 0; idx < 256; idx++)
{
    tmpword = buffer[idx * 2] | (buffer[idx * 2 + 1] << 8);
    outw(0x1F0, tmpword);
}
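The byte order used when going between 16-bit words and the byte buffer can be checked without any hardware. In this sketch the function names are made up for the illustration, and the words are passed in as an array instead of coming from port 0x1F0.

```c
#include <stddef.h>

/* Sketch only: split 16-bit words into bytes, low byte first. */
void words_to_bytes(const unsigned short *words, unsigned char *buffer, size_t nwords)
{
    size_t idx;

    for (idx = 0; idx < nwords; idx++) {
        buffer[idx * 2]     = (unsigned char)(words[idx] & 0xFF);
        buffer[idx * 2 + 1] = (unsigned char)(words[idx] >> 8);
    }
}

/* Sketch only: combine the bytes back into 16-bit words. */
void bytes_to_words(const unsigned char *buffer, unsigned short *words, size_t nwords)
{
    size_t idx;

    for (idx = 0; idx < nwords; idx++)
        words[idx] = (unsigned short)(buffer[idx * 2] | (buffer[idx * 2 + 1] << 8));
}
```

A full sector is 256 such words, which is where the loop count of 256 comes from.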
Of course, all of this is useless if you don't know what drives you actually have hooked up. Each IDE controller can handle 2 drives, and most computers have 2 IDE controllers. The primary controller, which is the one I have been dealing with thus far, has its registers located from port 0x1F0 to port 0x1F7. The secondary controller has its registers in ports 0x170-0x177. Detecting whether controllers are present is fairly easy:
  1. Write a magic value to the low LBA port for that controller (0x1F3 for the primary controller, 0x173 for the secondary): outb(0x1F3, 0x88);
  2. Read back from the same port, and see if what you read is what you wrote. If it is, that controller exists.
Now, you have to detect which drives are present on each controller. To do this, you select the appropriate drive with the drive/head select register (0x1F6 for the primary controller, 0x176 for the secondary controller), wait a small amount of time (I wait 1/250th of a second), and then read the status register and see if the drive-ready bit (0x40) is set:
outb(0x1F6, 0xA0); // use 0xB0 instead of 0xA0 to test the second drive on the controller
sleep(1); // wait 1/250th of a second
tmpword = inb(0x1F7); // read the status port
if (tmpword & 0x40) // see if the drive-ready bit is set (the busy bit is 0x80)
{
printf("Primary master exists\n");
}
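The status test can be pulled out into small helpers that are easy to check without real hardware. This is only a sketch: the names are made up, and treating a status of 0xFF as "nothing attached" is an extra assumption of mine (a commonly used heuristic), not something the detection code above does.

```c
#include <assert.h>
#include <stdint.h>

#define ATA_STATUS_BSY  0x80    /* drive is busy */
#define ATA_STATUS_DRDY 0x40    /* drive is ready */

/* Value for the drive/head select register: 0xA0 or 0xB0. */
uint8_t ata_select_byte(int drive)
{
    assert(drive == 0 || drive == 1);
    return (uint8_t)(0xA0 | (drive << 4));
}

/* Returns 1 if the status byte suggests that a drive answered. */
int ata_status_suggests_drive(uint8_t status)
{
    if (status == 0xFF)         /* assumption: nothing drives the bus */
        return 0;
    return (status & ATA_STATUS_DRDY) != 0;
}
```

You would write ata_select_byte(n) to port 0x1F6 (or 0x176), wait, and pass the byte read from the status port to ata_status_suggests_drive().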
And that about wraps it up. Note that I haven't actually tested my LBA48 code, because I'm stuck with Bochs, which only supports LBA28. It should work, according to the ATA specification.

Multitasking Howto


This little HowTo is intended to show how to set up a multitasking environment for a custom OS. If you have questions, don't hesitate to drop me a mail.
To begin with: you will of course need some background knowledge, such as how to handle linked lists or binary trees, which I won't cover in this text. I will show you the basics of a stack-based multitasking subsystem.
Multitasking on a single-processor machine means switching quickly between a bunch of processes: each of them possesses the CPU either for a certain time or until it gives it up.
First, let's define the process structure: it holds the essential information about a process and is the MAIN element an operating system uses to handle processes.
typedef struct {
          uint_t prozess_esp;    //actual position of esp
          uint_t prozess_ss;     //actual stack segment.
          uint_t prozess_kstack; //stacktop of kernel stack
          uint_t prozess_ustack; //stacktop of user stack
          uint_t prozess_cr3;
          uint_t prozess_number;
          uint_t prozess_parent;
          uint_t prozess_owner;
          uint_t prozess_group;
          uint_t prozess_timetorun;
          uint_t prozess_sleep;
          uint_t prozess_priority;
          uint_t prozess_filehandle;
          console_t *prozess_console;
          memtree_t *vmm_alloc; //memory management info concerning the process 
                         //- all its allocated virtual memory
          uchar_t prozess_name[32];
        } prozess_t;
You see in this structure five very important fields: esp, ss, kstack, ustack and cr3. These, especially esp, are accessed by the low-level asm routines. esp holds the current position of the KERNEL ESP of the current process, which has been interrupted by an ISR or system call. The asm stub stores the esp address in this field.
We also need a TSS: in this system-wide structure, the processor finds the kernel stack of the interrupted process in the esp0 field. Upon each task switch this field has to be updated to the kernel-stack address of the NEXT process. This may even be the process that has just been interrupted (the LAST process); the stack switching method doesn't care about it. Here is a definition of a TSS.
TSS in C:
typedef struct {
       ushort_t backlink, __blh;
       uint_t esp0;
       ushort_t ss0, __ss0h;
       uint_t esp1;
       ushort_t ss1, __ss1h;
       uint_t esp2;
       ushort_t ss2, __ss2h;
       uint_t cr3;
       uint_t eip;
       uint_t eflags;
       uint_t eax, ecx, edx, ebx;
       uint_t esp, ebp, esi, edi;
       ushort_t es, __esh;
       ushort_t cs, __csh;
       ushort_t ss, __ssh;
       ushort_t ds, __dsh;
       ushort_t fs, __fsh;
       ushort_t gs, __gsh;
       ushort_t ldt, __ldth;
       ushort_t trace, bitmap;
      } tss_t;
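Since the processor dictates where each TSS field lives, it can be worth checking that the compiler lays the structure out as expected. The sketch below mirrors the field order of tss_t with fixed-width stand-ins (uint16_t for ushort_t, uint32_t for uint_t); the names tss_layout_t and check_tss_layout are invented for this example. A 32-bit TSS is 104 bytes long, with esp0 at offset 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as tss_t, using fixed-width types. */
typedef struct {
    uint16_t backlink, blh;
    uint32_t esp0;
    uint16_t ss0, ss0h;
    uint32_t esp1;
    uint16_t ss1, ss1h;
    uint32_t esp2;
    uint16_t ss2, ss2h;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx;
    uint32_t esp, ebp, esi, edi;
    uint16_t es, esh, cs, csh, ss, ssh;
    uint16_t ds, dsh, fs, fsh, gs, gsh;
    uint16_t ldt, ldth;
    uint16_t trace, bitmap;
} tss_layout_t;

/* Fails loudly if the compiler inserted padding anywhere. */
void check_tss_layout(void)
{
    assert(offsetof(tss_layout_t, esp0) == 4);
    assert(offsetof(tss_layout_t, cr3) == 28);
    assert(sizeof(tss_layout_t) == 104);
}
```

Calling check_tss_layout() once during early kernel start-up is enough.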
Next, we need the proper routines for the ISR's:
%macro REG_SAVE 0
      ;save all registers in the kernel-level stack of the process and switch to the kernel stack
        cld
        pushad
        push ds
        push es
        push fs
        push gs
        mov eax,[p] ;put the adress of the struct of CURRENT PROCESS in eax.(the CONTENT of pointer p)
        mov [eax],esp ;save esp in the location of esp in the CURRENT PROCESS-STRUCT.
        lea eax,[kstackend] ; switch to the kernel's own stack.
        mov esp,eax
        %endmacro

        %macro REG_RESTORE_MASTER 0
        mov eax,[p] ;put adress of struct of current process in eax.
        mov esp,[eax] ;restore adress of esp.
        mov ebx,[eax+8];put content of the k-stack field into ebx.
        mov [sys_tss+4],ebx ;update system tss.
        mov al,0x20
        out 0x20,al
        pop gs
        pop fs
        pop es
        pop ds
        popad
        iretd
        %endmacro

 ;and here is an example isr.
 [EXTERN timer_handler]
 [GLOBAL hwint00]
        hwint00:
          REG_SAVE
          call timer_handler
          REG_RESTORE_MASTER
As the comments should show:
  • You take the current process' structure, put its address into eax, and access/update the corresponding fields: esp is the current esp. It is in the struct's first position, so no offset is needed. For the other fields you have to add the offset to the address in eax.
  • You also have to update the esp0 field in the system TSS.
  • This done, you have told the CPU where the kernel stack for the process it has to run next is located.
  • Upon an interrupt, the CPU takes this address as its esp and every further push/pop is done on this stack. After having pushed all the relevant hardware state onto the kernel stack of the process, you have to save the position of esp after the last pushed register into the structure's esp field.
  • Upon leaving the ISR, the stub takes the new ESP value from the current-process structure, replaces the value in the esp0 field of the TSS with the address of the top of the new kstack, and then pops off all the registers and returns to the process. If it is a user process, the last two items popped off upon iret are the user stack pointer and the user stack segment.
It is really not that difficult to get started with this stack-based task switching. But of course, you have to fill in the kstack of a new process with the proper values, for it starts off with exactly what is located on the kstack to be popped off when the process is first scheduled for execution.
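The asm stubs rely on fixed offsets ([eax] for esp, [eax+8] for kstack, sys_tss+4 for esp0), so the C structures and the assembly have to agree. One way to guard against an accidental reordering of fields is a small check like the following sketch. It only mirrors the first five fields of prozess_t with fixed-width types; prozess_head_t, the constants and check_stub_offsets are names I made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets hard-coded in the assembly macros. */
enum {
    PROZESS_ESP_OFFSET    = 0,  /* mov [eax],esp   */
    PROZESS_KSTACK_OFFSET = 8   /* mov ebx,[eax+8] */
};

/* First fields of prozess_t, in the same order. */
typedef struct {
    uint32_t prozess_esp;
    uint32_t prozess_ss;
    uint32_t prozess_kstack;
    uint32_t prozess_ustack;
    uint32_t prozess_cr3;
} prozess_head_t;

void check_stub_offsets(void)
{
    assert(offsetof(prozess_head_t, prozess_esp) == PROZESS_ESP_OFFSET);
    assert(offsetof(prozess_head_t, prozess_kstack) == PROZESS_KSTACK_OFFSET);
}
```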
Why do you need to fill in the esp0 field of the TSS at every task switch? Mainly because a user process, which runs at DPL 3 (user level), can enter kernel space (DPL 0, system level) by issuing an interrupt or accessing a call gate. I prefer the software interrupt approach. The thing looks like this:
  • intXX: dpl3->dpl0 - cpu switches to the stack indicated in tss.esp0.
  • iret: dpl0->dpl3 - the register values are popped off before iret.
    iret itself restores eip, cs, eflags, user stack address and user stack segment.
These userspace-kernelspace-userspace transitions happen at every interrupt that arrives while a user process is running, be it a software or a hardware interrupt.
So, let's have a look at how to fill in at least the process structure and the kstack.
prozess_t *task_anlegen(entry_t entry,int priority){
        //various initialization stuff may be done before reaching the below described operations.

        //filling in the kstack for start up:
        uint_t *stacksetup; //temporary pointer - will be set to the
                          //kstacktop and afterwards saved in esp of proc structure.

      ...
        stacksetup=&kstacks[d][KSTACKTOP];
        stacksetup--;
        *stacksetup--=0x0202; //eflags (interrupts enabled)
        *stacksetup--=0x08;   //cs
        *stacksetup--=(uint_t)entry; //eip: the address of the process' entry point (e.g. main())
        *stacksetup--=0;    //eax
        *stacksetup--=0;    //ecx
        *stacksetup--=0;    //edx
        *stacksetup--=0;    //ebx
        *stacksetup--=0;    //esp (ignored by popad)
        *stacksetup--=0;    //ebp
        *stacksetup--=0;    //esi
        *stacksetup--=0;    //edi
        *stacksetup--=0x10; //ds
        *stacksetup--=0x10; //es
        *stacksetup--=0x10; //fs
        *stacksetup=  0x10; //gs

 //filling in the struct.
        processes[f].prozess_esp=(uint_t)stacksetup;
        processes[f].prozess_ss=0x10;
        processes[f].prozess_kstack=(uint_t)&kstacks[d][KSTACKTOP];
        processes[f].prozess_ustack=(uint_t)&stacks[d][USTACKTOP];
You put the values onto the kstack in the order in which they would be pushed onto it by an ISR. The last position of stacksetup is what you store in the esp field. This is really straightforward.
The entry point of your process is the address of the process' main function. This is the address to which eip is set upon iret to this process after it has been scheduled for its first execution.
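To make the layout easier to see, here is a stand-alone sketch that builds the same 15-word start-up frame in an ordinary array, so it can be tried outside a kernel. build_initial_frame is a hypothetical name, stack_end is assumed to point one past the highest stack word (so no slot is skipped), and the selectors 0x08 and 0x10 are the ones used in the code above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Builds the frame REG_RESTORE_MASTER expects and returns the
   value to store in the esp field of the process structure. */
uint32_t *build_initial_frame(uint32_t *stack_end, uint32_t entry)
{
    uint32_t *sp = stack_end;
    assert(stack_end != NULL);

    *--sp = 0x0202;             /* eflags: interrupts enabled */
    *--sp = 0x08;               /* cs */
    *--sp = entry;              /* eip */
    for (int i = 0; i < 8; i++)
        *--sp = 0;              /* eax, ecx, edx, ebx, esp, ebp, esi, edi */
    *--sp = 0x10;               /* ds */
    *--sp = 0x10;               /* es */
    *--sp = 0x10;               /* fs */
    *--sp = 0x10;               /* gs: popped first */
    return sp;
}
```

The returned pointer sits 15 words below stack_end; word 0 is gs and word 14 is eflags.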
Now, I'll show you a little example of scheduling to round it off a little. Basically, this scheduler just moves the processes around in the queue in round robin manner. Also I'll show a little isr.
//global pointer to current task:
      prozess_t *p;

      //the isr:
      void timer_handler(void){
        if(task_to_bill->prozess_timetorun>0){
          task_to_bill->prozess_timetorun--;
          if(task_to_bill->prozess_timetorun<1){
            task_to_bill->prozess_timetorun=10; //refill timeslice.
            //remove process from the head of the queue.
            proz=remove_first_element_from_queue(&roundrobin_prozesse,0);
            //put the element to the rear of the queue.
            add_element_to_queue_rear(&roundrobin_prozesse,proz);
          }
        }
        //pick the process at the head of any queue. (f. ex. round robin queue)
        //and put it in p.
        choose_prozess(irq);
      }
So, it fits together: at each timer interrupt, the state of the current process p is saved. Then the ISR decrements the process' timeslice by one. If there isn't any timeslice left, the scheduler puts the process at the end of the queue and the next one is to be started: it is then referenced by p, from where the ISR stub takes the relevant values, restores the process' state and starts/restarts it.
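The queue handling itself is easy to try out in isolation. The following sketch is not the queue code used above; it is a minimal ring-buffer version with invented names (rr_sched_t, rr_add, rr_tick) that shows the same idea: decrement the timeslice on every tick, and when it runs out, refill it and move the task from the head to the rear.

```c
#include <assert.h>

#define MAX_TASKS 8
#define TIMESLICE 10

typedef struct {
    int queue[MAX_TASKS];       /* task ids; queue[head] is running */
    int head;
    int count;
    int timeslice[MAX_TASKS];   /* remaining ticks, indexed by task id */
} rr_sched_t;

void rr_init(rr_sched_t *s)
{
    s->head = 0;
    s->count = 0;
}

void rr_add(rr_sched_t *s, int id)
{
    assert(s->count < MAX_TASKS && id >= 0 && id < MAX_TASKS);
    s->queue[(s->head + s->count) % MAX_TASKS] = id;
    s->count++;
    s->timeslice[id] = TIMESLICE;
}

/* One timer tick; returns the id of the task to run next. */
int rr_tick(rr_sched_t *s)
{
    assert(s->count > 0);
    int cur = s->queue[s->head];
    if (--s->timeslice[cur] < 1) {
        s->timeslice[cur] = TIMESLICE;          /* refill timeslice */
        s->head = (s->head + 1) % MAX_TASKS;    /* remove from head */
        s->queue[(s->head + s->count - 1) % MAX_TASKS] = cur; /* add to rear */
    }
    return s->queue[s->head];
}
```

With two tasks queued, the first nine ticks keep returning the first task and the tenth tick hands over to the second one.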
You have to keep in mind at all times that the operating system is event driven. Interrupts are events, as are system calls. The operating system reacts to them and carries out the requested operations or the methods necessary to satisfy the devices which have triggered an event. You can also perform task switches upon receipt of an event (IRQ, software interrupt or exception).
Below, you'll find some definitions.
convenient definitions:
typedef unsigned char  uchar_t;  // -->Length: 8 bit
      typedef unsigned short ushort_t; // -->Length: 16 bit
      typedef unsigned int   uint_t;   // -->Length: 32 bit
      typedef unsigned long  ulong_t;  // -->Length: 32 bit with 32-bit x86 compilers (unsigned long long is 64 bit)
      typedef void (*entry_t)(void);
And remember: Keep it simple!

How to program the DMA


Introduction

What is the DMA?
The DMA is another chip on your motherboard (usually an Intel 8237 chip) that allows you (the programmer) to offload data transfers between memory and I/O boards. DMA actually stands for 'Direct Memory Access'.
An example of DMA usage would be the Sound Blaster's ability to play samples in the background. The CPU sets up the sound card and the DMA. When the DMA is told to 'go', it simply shovels the data from RAM to the card. Since this is done off-CPU, the CPU can do other things while the data is being transferred.
Lastly, if you're interested in what I know about programming the DMA to do memory to memory transfers, you might want to refer to Appendix B. This section is by no means complete, and it will probably be added to in the future as I learn more about this particular type of transfer.
All right, here's how you program the DMA chip.

DMA Basics

When you want to start a DMA transfer, you need to know three things:
  • Where the memory is located (what page)
  • The offset into the page
  • How much you want to transfer
Since the DMA can work in both directions (memory to I/O card, and I/O card to memory), you can see how the Sound Blaster can record as well as play by using DMA.
The DMA has two restrictions which you must abide by:
  • You cannot transfer more than 64K of data in one shot
  • You cannot cross a page boundary
Restriction #1 is rather easy to get around. Simply transfer the first block, and when the transfer is done, send the next block.
For those of you not familiar with pages, I'll try to explain.
Picture the first 1MB region of memory in your system. It is divided into 16 pages of 64K a piece like so:
Page    Segment:Offset address
0       0000:0000 - 0000:FFFF
1       1000:0000 - 1000:FFFF
2       2000:0000 - 2000:FFFF
3       3000:0000 - 3000:FFFF
4       4000:0000 - 4000:FFFF
5       5000:0000 - 5000:FFFF
6       6000:0000 - 6000:FFFF
7       7000:0000 - 7000:FFFF
8       8000:0000 - 8000:FFFF
9       9000:0000 - 9000:FFFF
A       A000:0000 - A000:FFFF
B       B000:0000 - B000:FFFF
C       C000:0000 - C000:FFFF
D       D000:0000 - D000:FFFF
E       E000:0000 - E000:FFFF
F       F000:0000 - F000:FFFF
This might look a bit overwhelming. Not to worry if you're a C programmer, as I'm going to assume you know the C language for the examples in this text. All the code in here will compile with Turbo C 2.0.
Okay, remember the three things needed by the DMA? Look back if you need to. We can stuff this data into a structure for easy accessing:
typedef struct
{
    char page;
    unsigned int offset;
    unsigned int length;
} DMA_block;
Now, how do we find a memory pointer's page and offset? Easy. Use the following code:
void LoadPageAndOffset(DMA_block *blk, char *data)
{
    unsigned int temp, segment, offset;
    unsigned long foo;
    segment = FP_SEG(data);
    offset  = FP_OFF(data);
    blk->page = (segment & 0xF000) >> 12;
    temp = (segment & 0x0FFF) << 4;
    /* Add in 32 bits so a carry out of the low 16 bits is not lost. */
    foo = (unsigned long)offset + temp;
    if (foo > 0xFFFF)
        blk->page++;
    blk->offset = (unsigned int)foo;
}
Most (if not all) of you are probably thinking, "What the heck is he doing there?" I'll explain.
The FP_SEG and FP_OFF macros find the segment and the offset of the data block in memory. Since we only need the page (look back at the table above), we can take the upper 4 bits of the segment to create our page.
The rest of the code takes the segment, adds the offset, and sees if the page needs to be advanced or not. (Note that a memory region can be located at 2FFF:F000: the top 4 bits of the segment say page 2, but the absolute address is 3EFF0h, so the data actually lies in page 3.)
In plain English, the page is the highest 4 bits of the absolute 20-bit address of our memory location. The offset is the lower 16 bits of that absolute address, that is, the lower 12 bits of the segment shifted left by 4, plus our offset.
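The same calculation can be written against the absolute address directly, which makes it easy to try on any compiler. This is only a sketch with invented names (dma_location_t, dma_split_address); it assumes a real-mode segment:offset pair and should give the same page and offset as the routine above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  page;      /* bits 16 and up of the absolute address */
    uint16_t offset;    /* bits 0-15 of the absolute address */
} dma_location_t;

void dma_split_address(uint16_t segment, uint16_t offset, dma_location_t *out)
{
    uint32_t linear;

    assert(out != NULL);
    linear = ((uint32_t)segment << 4) + offset;   /* absolute address */
    out->page   = (uint8_t)(linear >> 16);
    out->offset = (uint16_t)(linear & 0xFFFF);
}
```

For 2FFF:F000 this gives page 3 and offset EFF0h; for 2FFF:000F it gives page 2 and offset FFFFh, the last byte of that page.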
Now that we know where our data is, we need to find the length.
The DMA has a little quirk on length. The true length sent to the DMA is actually length + 1. So if you send a zero length to the DMA, it actually transfers one byte, whereas if you send 0xFFFF, it transfers 64K. I guess they made it this way because it would be pretty senseless to program the DMA to do nothing (a length of zero), and in doing it this way, it allowed a full 64K span of data to be transferred.
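Putting the two restrictions and the length quirk together, a large transfer can be cut into pieces that each stay inside one 64K page. The helpers below are a sketch with made-up names: dma_next_chunk returns how many bytes can be sent starting at a given absolute address, and dma_count_value turns a byte count into the value the DMA expects (length minus one).

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that may be transferred from `linear` without crossing a page. */
uint32_t dma_next_chunk(uint32_t linear, uint32_t remaining)
{
    uint32_t room = 0x10000u - (linear & 0xFFFFu);  /* left in this page */
    assert(remaining > 0);
    return remaining < room ? remaining : room;
}

/* Value for the length field: the DMA transfers this value plus one. */
uint16_t dma_count_value(uint32_t nbytes)
{
    assert(nbytes >= 1 && nbytes <= 0x10000u);
    return (uint16_t)(nbytes - 1);
}
```

A caller would loop: take a chunk, program the DMA with dma_count_value(chunk), wait for it to finish, then advance the address and reduce the remaining count by the chunk size.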
Now that you know what to send to the DMA, how do you actually start it? This enters us into the different DMA channels.

DMA channels

The DMA has 4 different channels to send 8-bit data. These channels are 0, 1, 2, and 3, respectively. You can use any channel you want, but if you're transferring to an I/O card, you need to use the same channel as the card. (ie: Sound Blaster uses DMA channel 1 as a default.)
There are 3 ports that are used to set the DMA channel:
  • The page register
  • The address (or offset) register
  • The word count (or length) register
The following chart shows each channel and its corresponding port numbers:
DMA Channel    Page    Address    Count
0              87h     0h         1h
1              83h     2h         3h
2              81h     4h         5h
3              82h     6h         7h
4              8Fh     C0h        C2h
5              8Bh     C4h        C6h
6              89h     C8h        CAh
7              8Ah     CCh        CEh
(Note: Channels 4-7 are 16-bit DMA channels. See below for more info.)
Since you need to send a two-byte value to the DMA (the offset and the length are both two bytes), the DMA requests you send the low byte of data first, then the high byte. I'll give a thorough example of how this is done momentarily.
The DMA has 3 registers for controlling its state. Here is the bitmap layout of how they are accessed:
Mask Register (0Ah):
MSB                             LSB
      x   x   x   x     x   x   x   x
      -------------------   -   -----
               |            |     |     00 - Select channel 0 mask bit
               |            |     \---- 01 - Select channel 1 mask bit
               |            |           10 - Select channel 2 mask bit
               |            |           11 - Select channel 3 mask bit
               |            |
               |            \----------  0 - Clear mask bit
               |                         1 - Set mask bit
               |
               \----------------------- xx - Don't care
Mode Register (0Bh):
MSB                             LSB
      x   x   x   x     x   x   x   x
      \---/   -   -     -----   -----
        |     |   |       |       |     00 - Channel 0 select
        |     |   |       |       \---- 01 - Channel 1 select
        |     |   |       |             10 - Channel 2 select
        |     |   |       |             11 - Channel 3 select
        |     |   |       |
        |     |   |       |             00 - Verify transfer
        |     |   |       \------------ 01 - Write transfer
        |     |   |                     10 - Read transfer
        |     |   |
        |     |   \--------------------  0 - Non-autoinitialized
        |     |                          1 - Autoinitialized
        |     |
        |     \------------------------  0 - Address increment select
        |
        |                               00 - Demand mode
        \------------------------------ 01 - Single mode
                                        10 - Block mode
                                        11 - Cascade mode
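DMA clear (byte pointer flip-flop) register (0Ch):
Writing any value to this port resets the controller's byte-pointer flip-flop, so that the next write to an address or count port is taken as the low byte. Do this after masking a channel with the mask register (0Ah) and before reprogramming it.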
Some of the most common modes to program the mode register are:
  • 45h: Write transfer (I/O card to memory)
  • 49h: Read transfer (memory to I/O card)
Both of these assume DMA channel 1 for all transfers.
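To see where 45h and 49h come from, the mode byte can be assembled from its fields. The sketch below only covers the transfer mode, the transfer type and the channel; the constant and function names are invented for this example, and the values follow the mode register layout shown above.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 7-6: transfer mode */
enum { DMA_DEMAND = 0x00, DMA_SINGLE = 0x40, DMA_BLOCK = 0x80, DMA_CASCADE = 0xC0 };
/* Bits 3-2: transfer type */
enum { DMA_VERIFY = 0x00, DMA_WRITE = 0x04, DMA_READ = 0x08 };

uint8_t dma_mode_byte(uint8_t mode, uint8_t transfer, uint8_t channel)
{
    assert((mode & 0x3F) == 0);         /* only bits 7-6 may be set */
    assert((transfer & 0xF3) == 0);     /* only bits 3-2 may be set */
    assert(channel < 4);                /* channel within its controller */
    return (uint8_t)(mode | transfer | channel);
}
```

So dma_mode_byte(DMA_SINGLE, DMA_WRITE, 1) is 45h, and dma_mode_byte(DMA_SINGLE, DMA_READ, 1) is 49h.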
There are also the 16-bit DMA channels. These shove two bytes of data at a time. That's how the Sound Blaster 16 works in 16-bit mode.
Programming the DMA for 16-bit transfers is nearly as easy as for 8-bit transfers. You send the data to different I/O ports, and on these channels the address and count are expressed in 16-bit words rather than bytes. The 16-bit DMA also uses 3 other control registers:
Mask Register (D4h):
MSB                             LSB
      x   x   x   x     x   x   x   x
      -------------------   -   -----
               |            |     |     00 - Select channel 4 mask bit
               |            |     \---- 01 - Select channel 5 mask bit
               |            |           10 - Select channel 6 mask bit
               |            |           11 - Select channel 7 mask bit
               |            |
               |            \----------  0 - Clear mask bit
               |                         1 - Set mask bit
               |
               \----------------------- xx - Don't care
Mode Register (D6h):
MSB                             LSB
      x   x   x   x     x   x   x   x
      -----   -   -     -----   -----
        |     |   |       |       |     00 - Channel 4 select
        |     |   |       |       \---- 01 - Channel 5 select
        |     |   |       |             10 - Channel 6 select
        |     |   |       |             11 - Channel 7 select
        |     |   |       |
        |     |   |       |             00 - Verify transfer
        |     |   |       \------------ 01 - Write transfer
        |     |   |                     10 - Read transfer
        |     |   |
        |     |   \--------------------  0 - Non-autoinitialized
        |     |                          1 - Autoinitialized
        |     |
        |     \------------------------  0 - Address increment select
        |
        |                               00 - Demand mode
        \------------------------------ 01 - Single mode
                                        10 - Block mode
                                        11 - Cascade mode
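DMA clear (byte pointer flip-flop) register (D8h):
Writing any value to this port resets the byte-pointer flip-flop of the 16-bit controller, so that the next write to an address or count port is taken as the low byte. Do this after masking a channel with the mask register (D4h) and before reprogramming it.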
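Note that the channel-select bits in these registers are only two bits wide, so channels 4-7 are selected as 0-3 on the second controller. A small sketch of that mapping for the mask register, with invented helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Port of the mask register that serves the given channel (0-7). */
uint8_t dma_mask_port(unsigned channel)
{
    assert(channel < 8);
    return channel < 4 ? 0x0A : 0xD4;
}

/* Byte to write there: bit 2 sets the mask, bits 1-0 pick the channel. */
uint8_t dma_mask_byte(unsigned channel, int masked)
{
    assert(channel < 8);
    return (uint8_t)((masked ? 0x04 : 0x00) | (channel & 0x03));
}
```

For channel 5, masking means writing 05h to port D4h and unmasking means writing 01h to the same port.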
Now that you know all of this, how do you actually use it? Here is sample code to program the DMA using our DMA_block structure we defined before.
/* Just helps in making things look cleaner.  :) */
typedef unsigned char   uchar;
typedef unsigned int    uint;

/* Defines for accessing the upper and lower byte of an integer. */
#define LOW_BYTE(x)         ((x) & 0x00FF)
#define HI_BYTE(x)          (((x) & 0xFF00) >> 8)

/* Quick-access registers and ports for each DMA channel. */
uchar MaskReg[8]   = { 0x0A, 0x0A, 0x0A, 0x0A, 0xD4, 0xD4, 0xD4, 0xD4 };
uchar ModeReg[8]   = { 0x0B, 0x0B, 0x0B, 0x0B, 0xD6, 0xD6, 0xD6, 0xD6 };
uchar ClearReg[8]  = { 0x0C, 0x0C, 0x0C, 0x0C, 0xD8, 0xD8, 0xD8, 0xD8 };

uchar PagePort[8]  = { 0x87, 0x83, 0x81, 0x82, 0x8F, 0x8B, 0x89, 0x8A };
uchar AddrPort[8]  = { 0x00, 0x02, 0x04, 0x06, 0xC0, 0xC4, 0xC8, 0xCC };
uchar CountPort[8] = { 0x01, 0x03, 0x05, 0x07, 0xC2, 0xC6, 0xCA, 0xCE };

void StartDMA(uchar DMA_channel, DMA_block *blk, uchar mode)
{
    /* First, make sure our 'mode' is using the DMA channel specified. */
    /* Only the low two bits select the channel within its controller. */
    mode |= (DMA_channel & 0x03);

    /* Don't let anyone else mess up what we're doing. */
    disable();

    /* Set up the DMA channel so we can use it.  This tells the DMA */
    /* that we're going to be using this channel.  (It's masked) */
    outportb(MaskReg[DMA_channel], 0x04 | DMA_channel);

    /* Reset the byte-pointer flip-flop so the low byte goes in first. */
    outportb(ClearReg[DMA_channel], 0x00);

    /* Send the specified mode to the DMA. */
    outportb(ModeReg[DMA_channel], mode);

    /* Send the offset address.  The first byte is the low base offset, the */
    /* second byte is the high offset. */
    outportb(AddrPort[DMA_channel], LOW_BYTE(blk->offset));
    outportb(AddrPort[DMA_channel], HI_BYTE(blk->offset));

    /* Send the physical page that the data lies on. */
    outportb(PagePort[DMA_channel], blk->page);

    /* Send the length of the data.  Again, low byte first. */
    outportb(CountPort[DMA_channel], LOW_BYTE(blk->length));
    outportb(CountPort[DMA_channel], HI_BYTE(blk->length));

    /* Ok, we're done.  Enable the DMA channel (clear the mask). */
    outportb(MaskReg[DMA_channel], DMA_channel & 0x03);

    /* Re-enable interrupts before we leave. */
    enable();
}

void PauseDMA(uchar DMA_channel)
{
    /* All we have to do is mask the DMA channel's bit on. */
    outportb(MaskReg[DMA_channel], 0x04 | DMA_channel);
}

void UnpauseDMA(uchar DMA_channel)
{
    /* Simply clear the mask, and the DMA continues where it left off. */
    outportb(MaskReg[DMA_channel], DMA_channel & 0x03);
}

void StopDMA(uchar DMA_channel)
{
    /* We need to set the mask bit for this channel, and then clear the */
    /* selected channel.  Then we can clear the mask. */
    outportb(MaskReg[DMA_channel], 0x04 | DMA_channel);

    /* Send the clear command. */
    outportb(ClearReg[DMA_channel], 0x00);

    /* And clear the mask. */
    outportb(MaskReg[DMA_channel], DMA_channel & 0x03);
}

uint DMAComplete(uchar DMA_channel)
{
    /* Register variables are compiled to use registers in C, not memory. */
    register int z;

    z = CountPort[DMA_channel];
    outportb(0x0C, 0xFF);

    /* This *MUST* be coded in Assembly!  I've tried my hardest to get it */
    /* into C, and I've had no success.  :(  (Well, at least under Borland.) */
redo:
        asm {
                mov  dx,z
                in   al,dx
                mov bl,al
                in al,dx
                mov  bh,al
                in al,dx
                mov ah,al
                in al,dx
                xchg ah,al
                sub  bx,ax
                cmp  bx,40h
                jg redo
                cmp bx,0FFC0h
                jl redo
        }
    return _AX;
}
I think all the above functions are self explanatory except for the last one. The last function returns the number of bytes that the DMA has transferred to (or read from) the device. I really don't know how it works as it's not my code. I found it laying on my drive, and I thought it might be somewhat useful to those of you out there. You can find out when a DMA transfer is complete this way if the I/O card doesn't raise an interrupt. DMAComplete() will return -1 (or 0xFFFF) if there is no DMA in progress.
Don't forget to load the length into your DMA_block structure as well before you call StartDMA(). (When I was writing these routines, I forgot to do that myself... I was wondering why it was transferring garbage.. )

Conclusion

I hope you all have caught on to how the DMA works by now. Basically it keeps a list of DMA channels that are running or not. If you need to change something in one of these channels, you mask the channel, and reprogram. When you're done, you simply clear the mask, and the DMA starts up again.
If anyone has problems getting this to work, I'll be happy to help. Send us mail at the address below, and either I or another Tank member will fix your problem(s).

Appendix A - Programming the DMA in 32-bit protected mode

Programming the DMA in 32-bit mode is a little trickier than in 16-bit mode. One restriction you have to comply with is the 1 Mb DOS barrier. Although the DMA can access memory up to the 16 Mb limit, most I/O devices can't go above the 1 Mb area. Knowing this, we simply default to living with the 1 Mb limit.
Since your data you want to transfer is probably somewhere near the end of your RAM (Watcom allocates memory top-down), you won't have to worry about not having room in the 1 Mb area.
So, how do you actually allocate a block of RAM in the 1 Mb area? Simple. Make a DPMI call -- or better yet, use the following functions to do it for you. :)
typedef struct
{
    unsigned int segment;
    unsigned int offset;
    unsigned int selector;
} RMptr;

RMptr getmem(int size)
{
    union REGS regs;
    struct SREGS sregs;
    RMptr foo;

    segread(&sregs);
    regs.w.ax = 0x0100;
    regs.w.bx = (size+15) >> 4;
    int386x(0x31, &regs, &regs, &sregs);

    foo.segment = regs.w.ax;
    foo.offset = 0;
    foo.selector = regs.w.dx;
    return foo;
}

void freemem(RMptr foo)
{
    union REGS regs;
    struct SREGS sregs;

    segread(&sregs);
    regs.w.ax = 0x0101;
    regs.w.dx = foo.selector;
    int386x(0x31, &regs, &regs, &sregs);
}

void rm2pmcpy(RMptr from, char *to, int length)
{
    char far *pfrom;

    pfrom = (char far *)MK_FP(from.selector, 0);
    while (length--)
        *to++ = *pfrom++;
}

void pm2rmcpy(char *from, RMptr to, int length)
{
    char far *pto;

    pto = (char far *)MK_FP(to.selector, 0);
    while (length--)
        *pto++ = *from++;
}
Take note on a couple of things here. First of all, the getmem() function does exactly what it says, along with freemem(). But remember, you're not tossing around a pointer anymore. It's just a data structure with a segment and an offset stored in it.
You've allocated your memory, and now you need to put something into it. You need to use pm2rmcpy() to copy protected mode memory to real mode memory. If you want to go the other way, rm2pmcpy() is there to help you.
Now we need to load the DMA_block with our information since we now have data that the DMA can access. The function is technically the same, but it just handles different variables:
void LoadPageAndOffset(DMA_block *blk, RMptr data)
{
    unsigned int temp, segment, offset;
    unsigned long foo;

    segment = data.segment;
    offset  = data.offset;

    blk->page = (segment & 0xF000) >> 12;
    temp = (segment & 0x0FFF) << 4;
    /* Add the offset and check for a carry into the next page. */
    foo = (unsigned long)offset + temp;
    if (foo > 0xFFFF)
        blk->page++;
    blk->offset = (unsigned int)foo;
}
That's about it. Since you've now loaded your DMA_block structure with the data you need, the rest of the functions should work fine without any problems. The only thing you'll need to concern yourself with is using '_enable()' instead of 'enable()', '_disable()' instead of 'disable()', and 'outp()' instead of 'outportb()'.

Appendix B - Doing memory to memory DMA transfers

All information contained in this area is mostly theory and results of tests I have done in this area. This is not a very well documented area, and it is probably even less portable from machine to machine.
Welcome to the undocumented world of memory to memory DMA transfers! This area has given me many headaches, so as a warning (and maybe preventive medicine), you might want to take an aspirin or two before proceeding. :)
I will be writing on a level of medium intelligence. You should understand the basics of DMA transfers, and at least understand 90%, if not all, of the information contained in this document (except for this area, of course). You won't find any source code here, however, I plan to release full source code once I get the DMA to transfer a full block of memory to the video card (if it's possible)...
Anyways, let's get started.
I recently set out on the task of figuring out how to transfer a single area of memory to the video screen by using DMA.
When you sit down to think about it, it really does not seem to be too difficult. You might think, 'All I need to do is use 2 DMA channels. One set to read and one set to write. My video buffer will need to be aligned onto a segment so the DMA can transfer the data without stopping.' This is a good theory, but, unfortunately, it doesn't work. I'll show you (sort of) why it doesn't work.
I originally started out with the idea that DMA channel 0 would read from my video buffer aligned on a segment, and DMA channel 1 would write to the video memory (at 0xA000).
In testing this simple idea, I wasn't surprised that nothing happened when I enabled the DMA. After playing around with some of the registers for a little bit, I opened the Undocumented DOS book and scanned the ports. Here's a snippet of what I found:
0008    w   DMA channel 0-3 command register
  bit 7 = 1  DACK sense active high
         = 0  DACK sense active low
  bit 6 = 1  DREQ sense active high
         = 0  DREQ sense active low
  bit 5 = 1  extended write selection
         = 0  late write selection
  bit 4 = 1  rotating priority
         = 0  fixed priority
  bit 3 = 1  compressed timing
         = 0  normal timing
  bit 2 = 1  enable controller
         = 0  enable memory-to-memory
Seeing bit 2 at port 0x08 made me realize that the DMA might possibly NOT default to being able to handle memory to memory transfers.
Again, I tried my test program, and I still wasn't surprised that nothing happened. I opened Undocumented DOS again, and found another port that I skipped over:
0009       DMA write request register
After thinking for a little, I realized that even though the DMA is enabled, the I/O card that you are usually transferring to must communicate with the bus to tell the DMA it's ready to receive data. Since we have no I/O card to say 'Go!', we need to set the DMA to 'Go!' manually.
Undocumented DOS had no bit flags defined for port 0x09, so here is what I've been able to come up with thus far:
DMA Write Request Register (09h)
MSB                             LSB
      x   x   x   x     x   x   x   x
      -------------------   -   -----
               |            |     |     00 - Select channel 0
               |            |     \---- 01 - Select channel 1
               |            |           10 - Select channel 2
               |            |           11 - Select channel 3
               |            |
               |            \----------  0 - ???
               |                         1 - Needs to be turned on
               |
               \----------------------- xx - Don't care
After adding a couple of lines of code, and running the test program once again, I was amazed to see that my screen cleared! I didn't get a buffer copy, I got a screen clear. I went back into the code to make sure my buffer had data, and sure enough, it did.
Wondering what color my screen had cleared, I added some more code and found that the screen was cleared with value 0xFF.
Pondering on this one, I made the assumption that the DMA is NOT receiving data from itself, but from the bus! Since there are no I/O cards to send data down the bus, I assumed that 0xFF was a default value.
But then again, maybe DMA channel 0 wasn't working right. I took the lines of code to initialize DMA channel 0 and the code to start the DMA transfer for channel 0 out of the code and reran the test code. Much to my surprise, the screen cleared twice as fast as before.
As for timing, my results aren't too accurate. In fact, don't even take these as being true. The first test (with both DMA 0 and 1 enabled), cranked out around 8.03 frames per second on my 486DX-33 VLB Cirrus Logic. The second test (with just DMA 1 enabled), cranked out 18.23 fps.
This is about as far as I've gotten with memory to memory DMA transfers. I'm going to be trying other DMA channels, and maybe even the 16-bit ones to get a faster dump.
If anyone can contribute any information, please let me know. You will be credited for any little tiny piece of help you can give. Maybe if we all pull together, we might actually be able to do frame dumps in the background while we're rendering our next frame... could prove to be useful!

GUI Development

Why use this document?

The purpose of this tutorial is to try to explain how to create a simple Graphical User Interface Program for use in a DOS environment or for use in a home-brew type of Operating System. I was actually asked to write this tutorial, which will some day be posted on the internet.

Requirements for this tutorial

For this tutorial on GUI development, I HIGHLY recommend reading the following:
You will also need:
  • A C/C++ Compiler (I use BorlandC v3.0, although DJGPP could definitely do it).
  • A VGA compatible display adapter and monitor (At least 64KBytes video RAM for mode 0x13).
  • A 286 based PC (That's the minimum that BorlandC allows), with 2MBytes of RAM.

OS STEP: Make sure you can handle a GUI

To run a GUI using these methods, your Operating System must provide a memory manager of some sort and, if it runs in protected mode, a V86 handler so that you can call real mode interrupts. The memory manager MUST include malloc( ), realloc( ), and free( ), or their equivalents: we use binary trees to store the window and control lists, so memory has to be allocated dynamically. The V86 handler is needed to make real mode BIOS calls, which are ONLY used to set the graphics mode. You may skip the V86 handler if you have your own set-mode functions that program the video registers directly. If your Operating System runs in real mode, you can call the BIOS interrupts without any V86 handler.

GUI STEP: Set up your GUI environment

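This is the step where you start coding your GUI. You will need to make a graphics library that uses bitmaps and double buffers. Example routines are posted here:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <conio.h>    /* getch, kbhit */
#include <dos.h>      /* inportb */

/* Mode 0x13 is 320x200 with one byte per pixel. The constants are
   unsigned so that 320 * 200 does not overflow a 16-bit int. */
#define SCREEN_WIDTH  320u
#define SCREEN_HEIGHT 200u

unsigned char *VGA = (unsigned char *)0xA0000000L;
unsigned char *dbl_buffer;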

typedef struct tagBITMAP              /* the structure for a bitmap. */
{
    unsigned int width;
    unsigned int height;
    unsigned char *data;
} BITMAP;

typedef struct tagRECT
{
    long x1;
    long y1;
    long x2;
    long y2;
} RECT;

void init_dbl_buffer(void)
{
    dbl_buffer = (unsigned char *) malloc (SCREEN_WIDTH * SCREEN_HEIGHT);
    if (dbl_buffer == NULL)
    {
 printf("Not enough memory for double buffer.\n");
 getch();
 exit(1);
    }
}

void update_screen(void)
{
    #ifdef VERTICAL_RETRACE
      while ((inportb(0x3DA) & 0x08));
      while (!(inportb(0x3DA) & 0x08));
    #endif
    memcpy(VGA, dbl_buffer, (unsigned int)(SCREEN_WIDTH * SCREEN_HEIGHT));
}

void setpixel (BITMAP *bmp, int x, int y, unsigned char color)
{
    bmp->data[y * bmp->width + x] = color;
}

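/* Draws a filled in rectangle IN A BITMAP. To fill a full bitmap call as
drawrect (&bmp, 0, 0, bmp.width, bmp.height, color); */
void drawrect(BITMAP *bmp, unsigned short x, unsigned short y,
                     unsigned short x2, unsigned short y2,
                     unsigned char color)
{
    unsigned short tx, ty;
    for (ty = y; ty < y2; ty++)
        for (tx = x; tx < x2; tx++)
            setpixel(bmp, tx, ty, color);
}

/* Copies a bitmap into the double buffer with its top left corner at
   (x, y). No clipping is done: the bitmap must fit on the screen. */
void draw_bitmap(BITMAP *bmp, int x, int y)
{
    unsigned int j;
    unsigned int screen_offset = (unsigned int)y * SCREEN_WIDTH + x;
    unsigned int bitmap_offset = 0;

    for (j = 0; j < bmp->height; j++)
    {
 memcpy(&dbl_buffer[screen_offset], &bmp->data[bitmap_offset], bmp->width);

 bitmap_offset += bmp->width;
 screen_offset += SCREEN_WIDTH;
    }
}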

void main()
{
    unsigned char key;

    init_dbl_buffer();   /* allocate the double buffer before it is used */
    do
    {
        key = 0;
        if (kbhit()) key = getch();

        /* You must clear the double buffer every time to avoid evil messes
            (go ahead and try without this, you will see) */
        memset (dbl_buffer, 0, SCREEN_WIDTH * SCREEN_HEIGHT);

        /* DRAW ALL BITMAPS AND DO GUI CODE HERE */

        /* Draws the double buffer */
        update_screen();

    } while (key != 27); /* keep going until escape */
}
Now, how does this code work? I first set up a pointer to the video memory at address 0xA0000000, which a real mode compiler treats as the far pointer A000:0000. If you are running in a protected mode OS (except for one that emulates DOS), you must set this variable to the linear address 0xA0000 instead. The BITMAP and RECT structures should be fairly straightforward: RECT defines an area on the screen ([x1,y1][x2,y2]) and BITMAP defines a bitmap in memory. The width of the bitmap tells you how many bytes make up a single horizontal line, so to draw the bitmap you copy one line at a time, looping from 0 until height. Do not forget to advance the screen offset by the screen width for each line (see the draw_bitmap function).
To make the GUI or program run smoothly (you can use the above code for nearly ANY graphical program), you can do something called double buffering. This means that you allocate an area in memory that is the size of the screen (use malloc or equivalent) and draw to it instead of to the video memory. When you have finished drawing, you copy the double buffer to the video memory and your image is shown. Please note that you should wait for what is called a "Vertical Retrace" before copying. This is the period after the video card has finished drawing one frame and before it starts the next. If you copy the double buffer to the screen during this period, the screen will not flicker. It is also a lot faster for the computer to draw to system memory than to go out to the video card every time you want to write to the screen.
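The two addresses name the same place through two addressing schemes. A 16-bit real mode compiler reads 0xA0000000L as a far pointer with segment 0xA000 and offset 0x0000, while a flat protected mode pointer holds the linear address directly. A small sketch of the conversion, where far_to_linear is a name made up for this example:

```c
#include <stdint.h>

/* Converts a real mode far pointer, packed as segment:offset in one
   32-bit value (segment in the high word), to a linear address:
   linear = segment * 16 + offset. */
static uint32_t far_to_linear(uint32_t far_ptr)
{
    uint32_t segment = far_ptr >> 16;
    uint32_t offset  = far_ptr & 0xFFFFu;
    return (segment << 4) + offset;
}
```

For the VGA pointer above, far_to_linear(0xA0000000UL) gives 0xA0000, which is the value a protected mode OS would use.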

GUI STEP: GUI Theory - binary trees

Binary tree structures are critical to my GUI development method. The structure is like so: a parent window which supports child windows. The basic window structure should look like so:
Or as seen in code, every box from above looks like:
struct WINDLIST
{
    RECT position;
    unsigned long handle;
    unsigned char *caption;
    unsigned long flags;
    unsigned char needs_repaint;

    BITMAP wbmp;

    struct WINDLIST *prev, *next;
    struct WINDLIST *first_child, *last_child;
    struct WINDLIST *parent;
};
The RECT position contains the X and Y coordinates as well as the X2 and Y2 coordinates, just as you would pass x1, y1, x2, y2 to a drawbox function. As a rule of safety, DO NOT make x2 or y2 less than x1 or y1. The parent pointer in the above structure points to the window that owns this one, so wnd.parent will give you its parent... most likely the desktop, unless you decide to implement MDI forms (I have not done so yet). The children of a window are kept in z-order: parent->first_child is the lowest window and is drawn first, while parent->last_child is the topmost window and is drawn last, on top of everything else. You can also reach the next and previous windows through the next and prev pointers. This is how you would cycle through the windows for drawing: call once like this: "repaint_children(0);". This will draw the parent window (window 0), and then cycle through all the children and their children too. If you change a window (for example, change the titlebar color), set its "needs_repaint" variable to 1. The next time you call repaint_children, it will redraw the window's bitmap.
static void repaint_children(unsigned long parent)
{
    struct WINDLIST *wnd, *child;
    if (parent >= wm_num_windows)
 return;

    wnd = wm_handles[parent];
    if (wnd == NULL)
 return;

    if (wnd->needs_repaint)
    {
 windowborder(&wnd->wbmp, 0, 0, wnd->wbmp.width, wnd->wbmp.height);
 wnd->needs_repaint = 0;
    }

    draw_bitmap(&wnd->wbmp, wnd->position.x1, wnd->position.y1);
    for (child = wnd->first_child; child != NULL; child = child->next)
 repaint_children(child->handle);
}
Now, you will notice that the windows all have bitmaps. You need to fill in the bitmap structure for each window upon its creation. This means setting wnd->wbmp.width = wnd->position.x2 - wnd->position.x1 and so on, as well as allocating the bitmap's data field (this is where all the window's viewable stuff is drawn). If the window's bitmap is not allocated correctly, DO NOT ALLOW IT TO DRAW, as it will crash your GUI and possibly your whole machine. Instead, you should shut the GUI down and report that there is no more memory. You may notice the "wm_handles" variable as well. Declare it as struct WINDLIST **wm_handles. It's an array of pointers to windows. Also declare a window counter "wm_num_windows" as an unsigned long. Absolutely nasty... Here's how you work it:
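As a rough sketch of that bitmap setup, the helper below sizes a window bitmap from its rectangle and allocates the pixel data. The name alloc_window_bitmap and the convention of returning 0 on failure are assumptions made for this example, not part of the original GUI code.

```c
#include <stdlib.h>
#include <string.h>

typedef struct tagRECT { long x1, y1, x2, y2; } RECT;
typedef struct tagBITMAP
{
    unsigned int width;
    unsigned int height;
    unsigned char *data;
} BITMAP;

/* Hypothetical helper: fills in bmp from the window rectangle and
   allocates zeroed pixel storage, one byte per pixel as in mode 0x13.
   Returns 1 on success, 0 if the rectangle is empty or inverted or if
   there is not enough memory. On failure bmp->data is NULL. */
static int alloc_window_bitmap(BITMAP *bmp, const RECT *pos)
{
    size_t size;

    bmp->width = 0;
    bmp->height = 0;
    bmp->data = NULL;
    if (pos->x2 <= pos->x1 || pos->y2 <= pos->y1)
        return 0;

    size = (size_t)(pos->x2 - pos->x1) * (size_t)(pos->y2 - pos->y1);
    bmp->data = (unsigned char *)malloc(size);
    if (bmp->data == NULL)
        return 0;

    memset(bmp->data, 0, size);
    bmp->width  = (unsigned int)(pos->x2 - pos->x1);
    bmp->height = (unsigned int)(pos->y2 - pos->y1);
    return 1;
}
```

A window whose bitmap could not be allocated can then be detected by checking for a 0 return value (or a NULL data pointer) before it is ever drawn.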
On GUI init, you must initialize the list of windows, wm_handles, as well as create its first window... like a desktop window:
wm_handles = malloc(sizeof(struct WINDLIST *));
    wm_handles[0] = &wm_system_parent_window;
    wm_num_windows = 1;
On window creation (the createwin function) you must resize the wm_handles array to accommodate more windows:
struct WINDLIST *wnd;

    wnd = malloc(sizeof(*wnd));
    wm_handles = realloc(wm_handles, sizeof(struct WINDLIST *) * (wm_num_windows + 1));
    wm_handles[wm_num_windows] = wnd;

    memset(wnd, 0, sizeof(*wnd));
    wnd->handle = wm_num_windows++;

    /* set window variables here... Fill them ALL IN */
To move a window to the front of a list, you must FIRST unlink the window from the list by pointing the next and previous windows to each other:
/* Remove window from the parent's list of children */
    if (wnd->prev != NULL)
 wnd->prev->next = wnd->next;
    if (wnd->next != NULL)
 wnd->next->prev = wnd->prev;
    if (wnd == wnd->parent->first_child)
 wnd->parent->first_child = wnd->next;
    if (wnd == wnd->parent->last_child)
 wnd->parent->last_child = wnd->prev;
...and THEN add it to the end of the list, by modifying the last window in the chain so that it points to "this" window (wnd). Change wnd so that it points to the previous last window:
/* Add window to end of parent's list of children */
    wnd->prev = wnd->parent->last_child;
    wnd->next = NULL;
    if (wnd->parent->last_child != NULL)
 wnd->parent->last_child->next = wnd;
    wnd->parent->last_child = wnd;
    if (wnd->parent->first_child == NULL)
 wnd->parent->first_child = wnd;
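Put together, the two fragments above could be wrapped in a single function. This is only a sketch: the name bring_to_front is made up here, and the structure is trimmed down to the link fields of WINDLIST so that the example stands on its own.

```c
#include <stddef.h>

/* Trimmed-down window node: only the link fields from WINDLIST. */
struct WINDLIST
{
    struct WINDLIST *prev, *next;
    struct WINDLIST *first_child, *last_child;
    struct WINDLIST *parent;
};

/* Hypothetical helper: unlinks wnd from its parent's list of children
   and appends it at the end of the list, which is the topmost spot. */
static void bring_to_front(struct WINDLIST *wnd)
{
    struct WINDLIST *parent = wnd->parent;

    if (parent == NULL)
        return;                      /* the desktop has no siblings */

    /* Remove window from the parent's list of children */
    if (wnd->prev != NULL)
        wnd->prev->next = wnd->next;
    if (wnd->next != NULL)
        wnd->next->prev = wnd->prev;
    if (wnd == parent->first_child)
        parent->first_child = wnd->next;
    if (wnd == parent->last_child)
        parent->last_child = wnd->prev;

    /* Add window to end of parent's list of children */
    wnd->prev = parent->last_child;
    wnd->next = NULL;
    if (parent->last_child != NULL)
        parent->last_child->next = wnd;
    parent->last_child = wnd;
    if (parent->first_child == NULL)
        parent->first_child = wnd;
}
```

Because a freshly created window has NULL prev and next pointers, the same function can also be used to link a new window into its parent for the first time.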
The last difficult part of the GUI list manipulation is checking what window the point (x, y) is in. The function checks the topmost window first, then the next topmost, and so on, returning the first window that contains the point (x, y). You may notice that it calls itself as well: it calls itself on the child windows so that they are checked too, before the parent itself:
struct WINDLIST *inwhatwin(struct WINDLIST *parent, int x, int y)
{
    struct WINDLIST *child, *hit;

    for (child = parent->last_child; child != NULL; child = child->prev)
    {
 hit = inwhatwin(child, x, y);
 if (hit != NULL) return hit;
    }

    if (pt_inrect(parent->position, x, y)) return parent;
    return NULL;
}
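The pt_inrect helper used above is not listed in this tutorial. One possible version is sketched below. It assumes that x1 and y1 are inclusive while x2 and y2 are exclusive, which matches the width = x2 - x1 rule used for the window bitmaps, and that the point and the rectangle are in the same coordinate space.

```c
typedef struct tagRECT { long x1, y1, x2, y2; } RECT;

/* One possible pt_inrect: returns nonzero if the point (x, y) lies
   inside r. Assumption: the left/top edges are inclusive and the
   right/bottom edges are exclusive. */
static int pt_inrect(RECT r, int x, int y)
{
    return x >= r.x1 && x < r.x2 && y >= r.y1 && y < r.y2;
}
```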
Using this basic information, you should have the basic framework of a very simple GUI, using simple filled boxes as windows. All you have to do is create simple wrapper code to get mouse or keyboard input to move the windows. Simply move the x1, y1, x2, y2 coordinates based on the cursor coords...

GUI STEP: BEYOND THIS GUI

Now with this basic GUI, you can add things like control support (buttons, textboxes, labels, you get the idea) and menu support. I added control support in about 2-3 hours of coding while being tired. Resizing is not difficult: you just need to compare the window's width and height with those of its bitmap, resize the bitmap accordingly, and set its "needs_repaint" to 1. That way, the newly resized window will be redrawn and your machine will not die from bad color values. As for menus, I am still experimenting with them. They are going to be a real pain (I think). Here's a hint that I found out: you need a structure for a menu and a structure for each menu selection. For example, "file" is a menu, but "new", "save", "open", and "exit" are selections. Each selection should be given the opportunity to drop its own menu (give it a pointer to a menu). This creates something like a "start->programs->accessories" type of thing for you Windows users. Textboxes will also be a pain: they need a text variable that can be resized dynamically, every KByte or so, and you need a BITMAP for the control's visible part.
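The menu hint could be laid out along the following lines. All of the names here (MENU, MENUITEM, menu_depth) are invented for this sketch, since the menu code is still experimental; the point is only that every selection carries an optional pointer to a menu of its own.

```c
#include <stddef.h>

struct MENU;

/* One selection in a menu, e.g. "new" or "save". */
struct MENUITEM
{
    const char *caption;
    struct MENU *submenu;       /* NULL if this selection drops no menu */
    struct MENUITEM *next;      /* next selection in the same menu */
};

/* A menu such as "file": a caption plus a list of selections. */
struct MENU
{
    const char *caption;
    struct MENUITEM *first_item;
};

/* Counts how many levels of menus hang below m, including m itself. */
static int menu_depth(const struct MENU *m)
{
    const struct MENUITEM *it;
    int deepest = 0;

    if (m == NULL)
        return 0;
    for (it = m->first_item; it != NULL; it = it->next)
    {
        int d = menu_depth(it->submenu);
        if (d > deepest)
            deepest = d;
    }
    return 1 + deepest;
}
```

A "start->programs->accessories" chain is then three MENU structures, each reached through the submenu pointer of a selection in the one before it.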

Multiprocessing Support for Hobby OSes Explained

Reference Materials

  • Intel Multiprocessing Specification
  • Intel Software Developer's Manual Volume 3
  • Intel 82093AA I/O APIC Manual

Introduction

Many hobby operating system projects start out with very modest goals of being able to boot off of a floppy and load a kernel written in a high level language like C or C++. Some progress further, to the point that they can manage virtual memory and multiple processes, but very few of these operating systems ever get to the point that they support multi-processing with more than one CPU. The reason for this is a general lack of good information on how to accomplish the necessary steps of detecting and initializing other processors in the system.
The design of a multi-processing operating system must be made very carefully and many situations must be taken into account to avoid race conditions that undermine the stability and correctness of a multi-processing OS. Basic locking primitives are needed that protect kernel data structures from concurrent access in situations that can result in corruption, which inevitably lead to instability in the OS kernel itself. This document touches briefly on locking mechanisms, but does not go deeply into the design decisions of a multi-processing operating system. It is meant for the hobby OS developer that understands virtual memory and multithreading and would like to take their OS project to the next level by beginning to add multiprocessing support.

1 - Multiprocessing in a Nutshell

How does multiprocessing work? The most basic simplification is that multiple processors can execute code simultaneously and independent of each other. Instead of one processor in a system, there are more than one, from as few as two up to thousands. These processors can either share the same system memory or have separate private memories that only they can access. There can also be configurations in which processors are "clustered" where there may be many physical memories with several processors each.
Systems that share system memory between all processors, so that all processors see the same physical memory, are called UMA systems, for Uniform Memory Access. They are more often called SMP systems, for Symmetric Multiprocessing. Systems that have separate, private physical memories are called NUMA systems, for Non-Uniform Memory Access. SMP architectures are generally used where the number of processors accessing the same physical memory is at most a dozen or a few dozen. This is because of the law of diminishing returns: each added processor has to compete with the other processors in the system for memory bandwidth, so the speed increase from adding more processors becomes much less than linear. NUMA architectures, where there is no central memory for the processors to contend over, offer much greater scalability, often into the thousands of processors. NUMA has the disadvantages of larger memory requirements (because the OS and applications are duplicated in many separate memories) and of the extra communication overhead needed to coordinate the system's execution. SMP and NUMA each have their specific uses. NUMA is used for systems on the scale of supercomputers and for tasks that have a high degree of parallel data that is not interdependent. SMP is more useful in smaller systems that operate on interdependent data, such as a PC workstation or a server.
This document only focuses on one uniform memory access architecture, that of the Intel Pentium family of processors, since the Intel platform is the most common among hobby OSes, and SMP multiprocessing machines with Intel architecture processors are relatively commonplace.

1.1 - Basics of an SMP System

SMP systems share the same physical memory between all the processors in the system. There is one copy of the OS kernel that manages resources such as memory and devices. The OS kernel can schedule processes to run on different CPUs without the need to copy any of the process's state from one part of physical memory to the next. Since all CPUs see identical physical memory, they are all equally capable of running any particular process or interacting with the hardware devices. They are also equally capable of running the OS kernel code.

1.2 - Communication in an SMP System via Shared Memory

Processors in the system can communicate with each other by one of two methods. The first is to communicate by reading and writing the same addresses in physical memory to signal that some condition has been met or that one processor should perform some task. An example of two processors communicating by reading and writing the same address in memory is as follows:
processor 1:
volatile int *what_to_do = SHARED_ADDRESS; // point to some memory
*what_to_do = DO_NOTHING;         // default to do nothing

// wait for other processor to set *what_to_do
while ( *what_to_do == DO_NOTHING ) ;
switch ( *what_to_do )
{
  ...
}
processor 2:
volatile int *what_to_do = SHARED_ADDRESS; // point to some memory
*what_to_do = DO_SOMETHING_ELSE;  // notify other processor
In this example, processor 1 and processor 2 communicate by reading and writing from address SHARED_ADDRESS, which we assume is some constant, previously agreed upon address. The first processor sets this integer in memory to the constant DO_NOTHING and waits in a loop until that integer becomes any other value. The second processor simply writes a value into that shared memory address which causes the first to break out of the while loop and enter the switch statement. The second processor could tell the first to do one of several possible things based on what value it wrote to SHARED_ADDRESS.

Cache Coherency and SMP

What about processor caches? What if the shared memory is cached in one of the processors' caches? This would cause massive problems for communicating via shared memory, because the memory in question would have to be uncached to ensure that changes made by one processor are seen by the other processors interested in the same memory range. This problem is solved by a coherency protocol implemented in hardware, which ensures that changes made by one processor are seen by the others. The details of this scheme aren't particularly interesting for this document, and since they make the processor caches appear transparent to software, they are not discussed further.

1.3 - Communicating Better with Interprocessor Interrupts

The above example is a rather clumsy and particularly inefficient way to communicate with other processors. First, the processor "listening" in the while loop isn't doing anything useful while it is waiting for the other processor to signal it. The other problem is that there may in fact be more than two processors in the system (remember that there can be dozens in some SMP machines). If more than one of these processors is listening and one processor tries to signal one of them to do something, then they will all wake up, not just one.
We can reduce these problems by having the listening processor only check the flag periodically and do something useful between checks, but then the processor is less responsive. We could solve the problem of multiple listening processors with a flag for each processor, but the latency and busy polling problems still remain. If you are an intermediate OS developer, chances are you understand this problem and know the solution already: interrupts. In multiprocessor systems, communication can be made through interprocessor interrupts (IPIs) that allow one processor to send an interrupt to another specified processor or range of processors. The ability to interrupt another processor solves both the latency and polling problems. The processor can be doing useful work, but still stay responsive to interrupts from the other processors in the system.

2 - Intel Multiprocessing Specification

Now that we have discussed the differences between polling and interrupts on SMP systems, it is time to consider the more practical questions of how they work and how to use them. To standardize how Intel processors work in an SMP setting, Intel developed the Intel Multiprocessing Specification, which sets standards for the interface between the BIOS/firmware level and the system software (OS) level. It is strongly recommended that you download this specification (see Reference Materials), as it covers some specifics that are important. It was introduced with the 486 line of processors, which supported multiprocessing. The 386 processors also supported multiprocessing, but saw almost no use as a multiprocessing platform because there were no standards.

2.1 - The APIC module

The centerpiece of the Intel Multiprocessing Specification is the APIC device, which stands for Advanced Programmable Interrupt Controller. Even beginning OS developers have probably heard of the PIC (Programmable Interrupt Controller) which delivers IRQs to the processor. The APIC module is similar in function to the PIC, but it accepts and directs interrupts among multiple processors. In Intel multiprocessing systems, there is one local APIC module for each processor and at least one IO APIC that routes interrupt requests among multiple processors. The local APIC module is built into the processor die itself for the Pentium family of processors, but is a separate chip for 486 processors. This local module for 486s was a different model (the 82489DX) and had slightly fewer features than the later modules built into the Pentium line of processors. For that reason it is not discussed, and we focus on multiprocessing with the Pentium and higher lines of processors.
The local APIC module serves as the only input of interrupts to the processor. The external PIC and IO APICs send their interrupts to the local APIC of the destination processor, and that local APIC interrupts the processor. The local APIC can be programmed to mask these interrupts, which may use any vector from 0 to 255. However, it cannot mask the exceptions 0-21, which are generated inside the processor itself.
Each local APIC module has a unique ID that is initialized by the BIOS, firmware, or hardware. The OS is guaranteed that the local APIC IDs are unique. Local APICs are also capable of sending IPIs (inter-processor interrupts) to other processors in the system using the local APIC IDs of the destinations. This is primarily how the OS communicates with other processors: by programming the local APIC of the current processor (whichever processor the OS is running on) to send an IPI to a destination APIC ID.
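To give an idea of what "programming the local APIC to send an IPI" looks like, here is a minimal sketch. It relies on details from the Intel Software Developer's Manual that this tutorial has not covered: the Interrupt Command Register is two 32-bit registers at byte offsets 0x300 (low) and 0x310 (high) from the local APIC base, the destination APIC ID goes in bits 24-31 of the high half, and writing the low half sends the interrupt. The function names are invented, and the base address is passed in so the code can be tried against an ordinary array.

```c
#include <stdint.h>

#define APIC_ICR_LOW      0x300u      /* byte offsets from the APIC base */
#define APIC_ICR_HIGH     0x310u
#define ICR_LEVEL_ASSERT  (1u << 14)  /* set for ordinary (fixed) IPIs */

/* Writes one 32-bit local APIC register. */
static void apic_write(volatile uint32_t *base, uint32_t offset, uint32_t value)
{
    base[offset / 4] = value;
}

/* Sends a fixed-delivery IPI with the given vector to the processor
   whose local APIC ID is dest_id. The high half is written first,
   because the write to the low half is what triggers the send. */
static void send_ipi(volatile uint32_t *base, uint8_t dest_id, uint8_t vector)
{
    apic_write(base, APIC_ICR_HIGH, (uint32_t)dest_id << 24);
    apic_write(base, APIC_ICR_LOW, ICR_LEVEL_ASSERT | vector);
}
```

In a kernel, base would point at the memory-mapped local APIC. A real implementation would also wait for the delivery status bit to clear before sending another IPI, which this sketch leaves out.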

2.2 - Bootup Sequence

The Specification not only defined the APIC as the basic building block of multiple processor systems, but also had to define some standards for booting the system so that multiple processor systems could remain backwards compatible. Some guarantee as to the state of the other processors in the system was needed so that a uniprocessor OS could function correctly on one processor.
The Multiprocessing Specification defines a standard boot sequence that guarantees the OS that the system is in a state ready for multiprocessor detection and initialization. The specification states that in the standard boot sequence the BIOS, hardware or firmware (not the OS) will select one of the processors to be designated the BSP, or Bootstrap Processor. The BSP can be hardwired to a physical location, chosen randomly, or selected by some other means. The only restriction the specification enforces is that one and only one processor is selected as the BSP, and that the other processors, called APs for Application Processors, are initialized to Real Mode and put into a halted state. The APs' local APICs are initialized such that they will not service any interrupts. The system is initialized so that all interrupts are directed to the BSP. The BSP then boots normally, exactly as if the system were a uniprocessor machine.

2.3 - Multiple Processor Detection

The resulting initialization and loading of the OS in uniprocessor mode should be familiar to even beginning OS developers and is not the aim of this document. The aim of this document is to describe the steps the operating system must now take to detect and initialize the APs, which are still in a halted state. In order for the OS to detect the presence of multiple processors, the specification requires that the BIOS or firmware construct two tables in physical memory that describe the configuration of the system, including information about processors, IO APIC modules, IRQ assignments, busses present in the system, and other useful data for the OS. The OS must find these structures and parse them in order to determine what initialization needs to be done. If the OS does not find these tables, then the OS can assume that the system is not multiprocessor capable and it can continue with uniprocessor initialization. This allows an OS compiled for SMP operation to fall back on default, uniprocessor behavior on a uniprocessor system.

Finding the MP Floating Pointer Structure

The first structure the OS must search for is called the MP Floating Pointer Structure. This table contains some information pertaining to the multiprocessing configuration and indicates that the system is multiprocessing compliant. This structure has the following format:
MP Floating Pointer Structure
Field             Offset  Length   Description/Use
----------------  ------  -------  ---------------
Signature         0       4 bytes  The ASCII string "_MP_", which the OS should use to find this structure.
MPConfig Pointer  4       4 bytes  Pointer to the MP configuration structure, which contains information about the multiprocessor configuration.
Length            8       1 byte   The length of this structure in 16 byte paragraphs. This should be 1.
Version           9       1 byte   The version of the multiprocessing specification. Either 1 denoting version 1.1, or 4 denoting version 1.4.
Checksum          10      1 byte   The sum of all bytes in this floating pointer structure, including this checksum byte, should be zero.
MP Features 1     11      1 byte   A byte containing feature flags.
MP Features 2     12      1 byte   A byte containing feature flags. Bit 7 reflects the presence of the IMCR, which is used in configuring the IO APIC.
MP Features 3-5   13      3 bytes  Reserved for future use.
The MP Floating Pointer Structure is in one of three memory areas: (1) The first kilobyte of the Extended BIOS Data Area (EBDA). (2) The last kilobyte of base memory (639-640k). (3) The BIOS ROM address space (0xF0000-0xFFFFF). The OS should search for the ASCII string "_MP_" in these three areas. If the OS finds this structure, this indicates the system is multiprocessing compliant and multiple processor initialization should continue. If this structure is not present in any of these three areas, then uniprocessor initialization should continue.
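The search itself can be sketched as a scan over one memory area at a time. The function names below are made up for illustration, and the code assumes that the structure starts on a 16-byte boundary, which is how common implementations scan for it. It checks the signature, the length field and the checksum described in the table above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns nonzero if the len bytes at p sum to zero modulo 256. */
static int mp_checksum_ok(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    while (len--)
        sum = (uint8_t)(sum + *p++);
    return sum == 0;
}

/* Scans [start, start + len) in 16-byte steps for a valid MP Floating
   Pointer Structure. Returns a pointer to it, or NULL if none is found. */
static const uint8_t *mp_find_floating(const uint8_t *start, size_t len)
{
    size_t off;

    for (off = 0; off + 16 <= len; off += 16)
    {
        const uint8_t *p = start + off;
        size_t bytes = (size_t)p[8] * 16;   /* Length is in paragraphs */

        if (memcmp(p, "_MP_", 4) != 0)
            continue;
        if (bytes == 0 || bytes > len - off)
            continue;
        if (mp_checksum_ok(p, bytes))
            return p;
    }
    return NULL;
}
```

An OS would call this once for each of the three areas listed above and stop at the first non-NULL result. In a protected mode kernel, start must be an address through which the physical range is actually mapped.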

Parsing the MP Configuration Table

The MP Floating Pointer Structure indicates whether the MP Configuration Table exists by the value in MP Features 1. If this byte is zero, then the value in MPConfig Pointer is a valid pointer to the physical address of the MP Configuration Table. If MP Features 1 is non-zero, this indicates that the system is one of the default configurations as described in the Intel Multiprocessing Specification Chapter 5. These default configurations are concisely described in that chapter of the specification and we will not discuss them fully here except to say that these default configurations have only two processors and the local APICs have IDs 0 and 1, among a few other nice properties. If one of these default implementations is specified in the MP Floating Pointer Structure, then the OS need not parse the MP Configuration Table, and can initialize the system based on the information in the specification.
The MP Configuration Table contains information regarding the processors, APICs, and busses in the system. It has a header (called the base table) and a series of variable length entries immediately following it in increasing address. The base table has the following format:
MP Configuration Table
Field                    Offset  Length    Description/Use
-----------------------  ------  --------  ---------------
Signature                0       4 bytes   The ASCII string "PCMP", which confirms that this table is present.
Base Table Length        4       2 bytes   The length of the base table in bytes, including the header, starting from offset 0.
Specification Revision   6       1 byte    The revision of the specification which the system complies with. A value of 1 indicates version 1.1, a value of 4 indicates version 1.4.
Checksum                 7       1 byte    The sum of all bytes in the base table, including this checksum and reserved bytes, must add to zero.
OEM ID                   8       8 bytes   An ASCII string that identifies the manufacturer of the system. This string is not null terminated.
Product ID               16      12 bytes  An ASCII string that identifies the product family of the system. This string is not null terminated.
OEM Table Pointer        28      4 bytes   An optional pointer to an OEM-defined configuration table. If no OEM table is present, this field is zero.
OEM Table Size           32      2 bytes   The size of the OEM table. If the OEM table does not exist, this field is zero.
Entry Count              34      2 bytes   The number of entries following this base header table in memory. This allows software to find the end of the table when parsing the entries.
Address of Local APIC    36      4 bytes   The physical address where each processor's local APIC is mapped. Each processor memory maps its own local APIC into this address range.
Extended Table Length    40      2 bytes   The total size of the extended table (entries) in bytes. If there are no extended entries, this field is zero.
Extended Table Checksum  42      1 byte    A checksum of all the bytes in the extended table. All of the bytes in the extended table must sum to this value. If there are no extended entries, this field is zero.
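Before trusting the table, the OS can check the signature and checksum from the header. The sketch below uses invented function names and reads the fields byte by byte at the offsets listed above. It assumes a 44-byte header (the fields above end at offset 42 and, per the specification, are followed by one reserved byte).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads a little-endian 16-bit value without assuming alignment. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns nonzero if table points at a plausible MP Configuration
   Table: "PCMP" signature and base table bytes that sum to zero. */
static int mp_config_valid(const uint8_t *table)
{
    uint16_t length;
    uint16_t i;
    uint8_t sum = 0;

    if (memcmp(table, "PCMP", 4) != 0)
        return 0;
    length = rd16(table + 4);            /* Base Table Length */
    if (length < 44)                     /* shorter than the header itself */
        return 0;
    for (i = 0; i < length; i++)
        sum = (uint8_t)(sum + table[i]);
    return sum == 0;
}

/* Entry Count lives at offset 34 of the header. */
static uint16_t mp_entry_count(const uint8_t *table)
{
    return rd16(table + 34);
}
```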
The MP Configuration Table is immediately followed by Entry Count entries that describe the configuration of processors, busses and IO APICs in the system. The first byte of each entry denotes the entry type, e.g. a processor entry or a bus entry. The entries are sorted by entry type in ascending order. The table entry types are summarized as follows:
MP Configuration Table Entries
Entry Description           Entry Type Code  Length    Comments
--------------------------  ---------------  --------  --------
Processor                   0                20 bytes  An entry describing a processor in the system. One entry per processor.
Bus                         1                8 bytes   An entry describing a bus in the system. One entry per bus.
IO APIC                     2                8 bytes   An entry describing an IO APIC present in the system. One entry per IO APIC.
IO Interrupt Assignment     3                8 bytes   An entry describing the assignment of an interrupt source to an IO APIC. One per bus interrupt source.
Local Interrupt Assignment  4                8 bytes   An entry describing a local interrupt assignment in the system. One entry per system interrupt source.
Since the entries of the MP Configuration Table are sorted by entry type in ascending order, the first entries will be all the processor entries, followed by all the bus entries, followed by the IO APIC entries, and so on. The OS should parse these entries in order to discover how many processors, how many IO APICs, and other information it will need to initialize the system. The processor entries have the format:
Processor Entry
Field Offset (in bytes:bits) Length Description/Use
Entry Type 0 1 byte Since this is a processor entry, this field is set to 0.
Local APIC ID 1 1 byte This is the unique APIC ID number for the processor.
Local APIC Version 2 1 byte This is bits 0-7 of the Local APIC version number register.
CPU Enabled Bit 3:0 1 bit This bit indicates whether the processor is enabled. If this bit is zero, the OS should not attempt to initialize this processor.
CPU Bootstrap Processor Bit 3:1 1 bit This bit indicates that the processor entry refers to the bootstrap processor if set.
CPU Signature 4 4 bytes This is the CPU signature as would be returned by the CPUID instruction. If the processor does not support the CPUID instruction, the BIOS fills this value according to the values in the specification.
CPU Feature flags 8 4 bytes This is the feature flags as would be returned by the CPUID instruction. If the processor does not support the CPUID instruction, the BIOS fills this value according to values in the specification.
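The processor entry maps naturally onto a C structure. The following is a minimal sketch, not code from the specification: the structure, field and macro names are made up for illustration, and the last 8 bytes (which the table above does not describe) are simply treated as reserved padding so that the entry comes out at the stated 20 bytes.

```c
#include <stdint.h>

/* One 20-byte processor entry, laid out as in the table above.
   All names here are illustrative assumptions. */
struct mp_processor_entry {
    uint8_t  entry_type;          /* offset 0: always 0 for a processor  */
    uint8_t  local_apic_id;       /* offset 1                            */
    uint8_t  local_apic_version;  /* offset 2                            */
    uint8_t  cpu_flags;           /* offset 3: bit 0 = enabled, bit 1 = BSP */
    uint32_t cpu_signature;       /* offset 4                            */
    uint32_t feature_flags;       /* offset 8                            */
    uint32_t reserved[2];         /* offsets 12-19: assumed reserved     */
};

#define MP_CPU_ENABLED 0x01u      /* byte 3, bit 0 */
#define MP_CPU_BSP     0x02u      /* byte 3, bit 1 */

/* Nonzero if the OS may try to initialize this processor. */
static int mp_cpu_is_enabled(const struct mp_processor_entry *e)
{
    return (e->cpu_flags & MP_CPU_ENABLED) != 0;
}

/* Nonzero if this entry describes the bootstrap processor. */
static int mp_cpu_is_bsp(const struct mp_processor_entry *e)
{
    return (e->cpu_flags & MP_CPU_BSP) != 0;
}
```

Every member sits at its natural alignment, so typical x86 compilers should not insert padding, but it is worth checking that sizeof gives 20 before overlaying the structure on the real table.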
Bus entries identify the kinds of buses in the system. The BIOS is responsible for assigning them each a unique ID number. The entries allow the BIOS to communicate to the OS the buses in the system. The format of the entries is described in the Intel Multiprocessing Specification Chapter 5. Because using and initializing buses is beyond the scope of this document, bus entries are not discussed.
The configuration table contains at least one IO APIC entry which provides to the OS the base address for communicating with the IO APIC and its ID. The entry for an IO APIC has the following format:
IO APIC Entry
Field Offset (in bytes:bits) Length Description/Use
Entry Type 0 1 byte Since this is an IO APIC entry, this field is set to 2.
IO APIC ID 1 1 byte This is the ID of this IO APIC.
IO APIC Version 2 1 byte This is bits 0-7 of the IO APIC's version register.
IO APIC Enabled 3:0 1 bit This bit indicates whether this IO APIC is enabled. If this bit is zero, the OS should not attempt to access this IO APIC.
IO APIC Address 4 4 bytes This contains the physical base address where this IO APIC is mapped.
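With the entry types and lengths known, walking the entries is a simple loop over a byte pointer. The sketch below counts the enabled processors and IO APICs and remembers the first IO APIC address. The function and structure names are assumptions made for this example, and `entries` is assumed to point at the first entry, right after the table header.

```c
#include <stddef.h>
#include <stdint.h>

/* Entry type codes from the "MP Configuration Table Entries" table. */
enum { MP_PROCESSOR = 0, MP_BUS = 1, MP_IOAPIC = 2, MP_IOINT = 3, MP_LINT = 4 };

/* Illustrative result structure, not part of any specification. */
struct mp_summary {
    unsigned cpus_enabled;
    unsigned ioapics_enabled;
    uint32_t first_ioapic_addr;   /* 0 if no enabled IO APIC was found */
};

/* Walks entry_count entries. Returns 0 on success, -1 on an unknown type. */
static int mp_scan_entries(const uint8_t *entries, unsigned entry_count,
                           struct mp_summary *out)
{
    const uint8_t *p = entries;

    out->cpus_enabled = 0;
    out->ioapics_enabled = 0;
    out->first_ioapic_addr = 0;

    for (unsigned i = 0; i < entry_count; i++) {
        switch (p[0]) {                    /* first byte is the entry type */
        case MP_PROCESSOR:
            if (p[3] & 0x01)               /* CPU Enabled bit */
                out->cpus_enabled++;
            p += 20;                       /* processor entries are 20 bytes */
            break;
        case MP_IOAPIC:
            if (p[3] & 0x01) {             /* IO APIC Enabled bit */
                if (out->ioapics_enabled == 0)
                    out->first_ioapic_addr = (uint32_t)p[4]
                                           | ((uint32_t)p[5] << 8)
                                           | ((uint32_t)p[6] << 16)
                                           | ((uint32_t)p[7] << 24);
                out->ioapics_enabled++;
            }
            p += 8;
            break;
        case MP_BUS:
        case MP_IOINT:
        case MP_LINT:
            p += 8;                        /* skipped, but still 8 bytes each */
            break;
        default:
            return -1;                     /* unknown type: length unknown, stop */
        }
    }
    return 0;
}
```

The address is assembled byte by byte so the sketch does not depend on alignment. A real kernel would also record each processor's local APIC ID, since it is needed later to address IPIs.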

3 - Initializing and Using the local APIC

Now that we are able to detect the processors and IO APICs in a system, it is necessary to initialize and configure the bootstrap processor's local APIC so that it can begin to send interrupts to the other processors in the system. Interprocessor interrupts are the best way to communicate between processors in certain situations, and as we will see, they are used by the bootstrap processor to awaken the other processors in the system.

3.1 - Memory Mappings of APIC Modules

Each local APIC module is memory mapped into the address space of its corresponding processor. They are all mapped to their local processor's address space at the same address so that when a processor accesses this address range it is accessing its own local APIC. An IO APIC, however, is mapped into the address space of all processors at the same address so that all processors can address the same IO APIC through the same address range. Multiple IO APICs each have their own address range in which they are mapped, but are, again, mapped globally and accessible from all processors. The address ranges of the APICs are given as follows:
APIC Memory Mappings
APIC Type Default address Alternate Address
Local APIC 0xFEE00000 If specified, the value of the Address of Local APIC field in the MP Configuration Table.
First IO APIC 0xFEC00000 If specified, the value of the IO APIC Address field in the IO APIC entry in the MP Configuration Table.
Additional IO APICs - The value of the IO APIC Address field in the IO APIC entry in the MP Configuration Table.

3.2 - The Local APIC's Register Set

In order for the OS to begin to communicate with the other processors present in the system, it must first initialize its own local APIC module. The local APIC module is the means by which the local processor can send interrupts to the other processors and is memory mapped into the address space of the processor at the addresses in the previous table. The APIC uses no IO ports and is configured by writing the appropriate settings into the APIC's registers at the correct memory offsets. The registers' offsets are summarized in the following table:
Local APIC Register Addresses
Offset Register Name Software Read/Write
0x0000h - 0x0010h reserved -
0x0020h Local APIC ID Register Read/Write
0x0030h Local APIC Version Register Read only
0x0040h - 0x0070h reserved -
0x0080h Task Priority Register Read/Write
0x0090h Arbitration Priority Register Read only
0x00A0h Processor Priority Register Read only
0x00B0h EOI Register Write only
0x00C0h reserved -
0x00D0h Logical Destination Register Read/Write
0x00E0h Destination Format Register Bits 0-27 Read only, Bits 28-31 Read/Write
0x00F0h Spurious-Interrupt Vector Register Bits 0-3 Read only, Bits 4-9 Read/Write
0x0100h - 0x0170h ISR 0-255 Read only
0x0180h - 0x01F0h TMR 0-255 Read only
0x0200h - 0x0270h IRR 0-255 Read only
0x0280h Error Status Register Read only
0x0290h - 0x02F0h reserved -
0x0300h Interrupt Command Register 0-31 Read/Write
0x0310h Interrupt Command Register 32-63 Read/Write
0x0320h Local Vector Table (Timer) Read/Write
0x0330h reserved -
0x0340h Performance Counter LVT Read/Write
0x0350h Local Vector Table (LINT0) Read/Write
0x0360h Local Vector Table (LINT1) Read/Write
0x0370h Local Vector Table (Error) Read/Write
0x0380h Initial Count Register for Timer Read/Write
0x0390h Current Count Register for Timer Read only
0x03A0h - 0x03D0h reserved -
0x03E0h Timer Divide Configuration Register Read/Write
0x03F0h reserved -
Note that the local APIC's registers are divided into 32 bit words that are aligned on 16 byte boundaries. Registers that are larger than 32 bits are split into multiple 32 bit words, aligned on successive 16 byte boundaries. The Intel Multiprocessing Specification states that all local APIC registers must be accessed with 32 bit reads and writes.
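Because every register is a 32-bit word on a 16-byte boundary, two small helpers are enough to reach all of them. This is only a sketch: the helper and macro names are invented here, and `base` is whatever pointer the local APIC range has been mapped to (0xFEE00000 by default on real hardware, an ordinary array when testing on a host).

```c
#include <assert.h>
#include <stdint.h>

/* A few offsets from the register table above (macro names are illustrative). */
#define LAPIC_REG_ID        0x0020u
#define LAPIC_REG_VERSION   0x0030u
#define LAPIC_REG_SVR       0x00F0u
#define LAPIC_REG_ICR_LOW   0x0300u
#define LAPIC_REG_ICR_HIGH  0x0310u

/* Reads one register with a single 32-bit access. */
static uint32_t lapic_read(volatile uint32_t *base, uint32_t offset)
{
    assert((offset & 0xFu) == 0);   /* registers sit on 16-byte boundaries */
    return base[offset / 4];        /* base is indexed in 32-bit words */
}

/* Writes one register with a single 32-bit access. */
static void lapic_write(volatile uint32_t *base, uint32_t offset, uint32_t value)
{
    assert((offset & 0xFu) == 0);
    base[offset / 4] = value;
}
```

The `volatile` qualifier keeps the compiler from merging or dropping the accesses, which matters because reading or writing an APIC register has side effects.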

3.3 - Initializing the BSP's Local APIC

In order for the OS to communicate with the other processors in the system, it must first enable and configure its local APIC. Software enables the local APIC by setting a bit in the Spurious-Interrupt Vector Register, and then programs other registers with the vectors used to handle bus and inter-processor interrupts.

Spurious-Interrupt Vector Register

The Spurious-Interrupt Vector Register contains the bit to enable and disable the local APIC. It also has a field to specify the interrupt vector number to be delivered to the processor in the event of a spurious interrupt. This register is 32 bits and has the following format:
32-bit Spurious-Interrupt Vector Register
Bits Field
31-10 reserved
9 FC
8 EN
7-4 VECTOR (programmable)
3-0 VECTOR (hard-wired to 1111b)
  • EN bit - This allows software to enable or disable the APIC module at any time. Writing a value of 1 to this bit enables the APIC module, and writing a value of 0 disables it.
  • FC Bit - This bit indicates whether focus checking is enabled for the current processor. A value of 0 indicates focus checking is enabled, and a value of 1 indicates it is disabled. For our purposes, this bit can be ignored.
  • VECTOR - This field of the Spurious-Interrupt Vector Register specifies which interrupt vector is delivered to the processor in the event of a spurious interrupt. Bits 0-3 of this vector field are hard-wired to 1111b, or 15. Bits 4-7 of this field are programmable by software.
A spurious interrupt can happen when all pending interrupts are masked or there are no pending interrupts during an internal interrupt acknowledge cycle of the APIC. The APIC module delivers an interrupt vector to its local processor specified by the value in the VECTOR field of this register. The processor then transfers control to the interrupt handler in the IDT, at the vector number delivered to it by the APIC. Basically, the VECTOR field specifies which interrupt handler to transfer control to in the event of a spurious interrupt. Spurious interrupts happen because of certain interactions within the APIC's hardware itself, and do not reflect any meaningful information. Software can safely ignore these interrupts, and should program this vector to refer to an interrupt handler that ignores the interrupt.
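Enabling the APIC then comes down to computing one register value. The sketch below assumes EN is bit 8 and FC is bit 9, with VECTOR in bits 0-7. The function and macro names are hypothetical.

```c
#include <stdint.h>

#define SVR_ENABLE         (1u << 8)   /* EN bit: 1 enables the APIC module     */
#define SVR_FOCUS_DISABLE  (1u << 9)   /* FC bit: 1 disables focus checking     */

/* Returns the value to write to the Spurious-Interrupt Vector Register so that
   the APIC is enabled and spurious interrupts arrive at `vector`.
   The low four bits of the vector are forced to 1, matching the hardware. */
static uint32_t svr_enable_value(uint32_t old_svr, uint8_t vector)
{
    uint32_t v = old_svr & ~0xFFu;        /* keep the upper bits, clear VECTOR */
    v |= (uint32_t)(vector | 0x0Fu);      /* bits 0-3 are hard-wired to 1111b  */
    v |= SVR_ENABLE;
    return v;
}
```

The result would be written to offset 0x00F0 of the local APIC, and the chosen vector's handler in the IDT should simply return, since spurious interrupts carry no information.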

Local APIC Version and Local APIC ID Registers

The Local APIC Version Register is a read-only register through which the APIC reports its version information to software. It also specifies the maximum number of entries in the Local Vector Table (LVT). The Local APIC ID Register stores the ID of the local APIC.
Local APIC Version Register
Bits Field
31-24 reserved
23-16 MAXIMUM LVT ENTRY
15-8 reserved
7-0 VERSION
  • Maximum LVT Entry - Indicates the number of the Maximum LVT entry. For Pentium processors, this number is 3 (4 entries total) and for P6 family, this is 4 (5 entries total).
  • Version - Indicates the version number of the local APIC module. For 82489DX APICs, this number is 0Xh. For integrated APICs of the Pentium family and higher, this number is 1Xh, where X is a 4-bit hexadecimal digit.
Local APIC ID Register
Bits Field
31-28 reserved
27-24 APIC ID
23-0 reserved
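Pulling the fields out of these two registers is just shifting and masking. The sketch below assumes the version is in bits 0-7 and the Maximum LVT Entry in bits 16-23 of the version register, and that the APIC ID is the 4-bit field in bits 24-27 of the ID register (this matches the 4-bit destination used for physical addressing on these processors). The function names are made up for the example.

```c
#include <stdint.h>

/* Version field (bits 0-7) of the Local APIC Version Register. */
static uint8_t lapic_version(uint32_t version_reg)
{
    return (uint8_t)(version_reg & 0xFFu);
}

/* Number of LVT entries: the Maximum LVT Entry field (bits 16-23) plus one. */
static unsigned lapic_lvt_entries(uint32_t version_reg)
{
    return ((version_reg >> 16) & 0xFFu) + 1u;
}

/* APIC ID (bits 24-27, assumed 4 bits wide) of the Local APIC ID Register. */
static uint8_t lapic_id(uint32_t id_reg)
{
    return (uint8_t)((id_reg >> 24) & 0x0Fu);
}
```

The BSP can compare its own `lapic_id()` against the Local APIC ID fields of the processor entries to find out which entry describes itself.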

Local Vector Table

The Local Vector Table allows software to program the interrupt vectors that are delivered to the processor in the event of errors, timer events, and LINT0 and LINT1 interrupt inputs. It also allows software to specify status and mode information to the APIC module for the local interrupts.
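Local Vector Table
Entry Field layout (bit positions, high to low)
Timer 31-18 reserved, 17 TP, 16 M, 15-13 reserved, 12 DS, 11-8 reserved, 7-0 VECTOR
LINT0 31-17 reserved, 16 M, 15 TM, 14 RI, 13 IP, 12 DS, 11 reserved, 10-8 DMODE, 7-0 VECTOR
LINT1 31-17 reserved, 16 M, 15 TM, 14 RI, 13 IP, 12 DS, 11 reserved, 10-8 DMODE, 7-0 VECTOR
ERROR 31-17 reserved, 16 M, 15-13 reserved, 12 DS, 11-8 reserved, 7-0 VECTOR
PCINT 31-17 reserved, 16 M, 15-13 reserved, 12 DS, 11 reserved, 10-8 DMODE, 7-0 VECTOR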
  • Vector: The interrupt vector number.
  • DMODE (Delivery Mode): Defined only for the local interrupts LINT0, LINT1, and PCINT (the performance monitoring counter). It can be one of three defined values:
    • 000 (Fixed) - Delivers the interrupt to the processor as specified in the corresponding LVT entry.
    • 100 (NMI) - The interrupt is delivered to the local processor as an NMI (non-maskable interrupt) and the vector information is ignored. The interrupt is treated as an edge-triggered interrupt regardless of how software had programmed it.
    • 111 (ExtINT) - Delivers the interrupt to the processor as if it had originated in an external controller such as an 8259A PIC. The external controller is expected to supply the vector information. The interrupt is always treated as level-triggered, regardless of how the software had programmed the entry.
  • DS (Delivery Status) - Read only to software. A value of 0 (idle) indicates that there are no pending interrupts for this interrupt or that the previous interrupt from this source has completed. A value of 1 (send pending) indicates that the interrupt transmission has begun but has not yet been completely accepted.
  • IP (Interrupt Polarity) - Specifies the interrupt polarity of the interrupt source. A value of 0 indicates active high and a value of 1 indicates active low.
  • RI (Remote Interrupt Request Register bit) - For level triggered interrupts, this bit is set when the APIC module accepts the interrupt and is cleared upon EOI. Undefined for edge triggered interrupts.
  • TM (Trigger Mode) - When the delivery mode is Fixed, (0) indicates edge-sensitivity and (1) indicates level-sensitivity.
  • M (Mask) - Indicates whether the interrupt is masked. A value of 1 indicates the interrupt is masked, while 0 indicates the interrupt is unmasked.
  • TP (Timer Periodic Mode) - Indicates whether the timer interrupt should be fired periodically (1) or only once (0).
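The LVT fields can be combined into register values with a handful of bit masks. This sketch assumes the bit positions VECTOR 0-7, DMODE 8-10, DS 12, IP 13, RI 14, TM 15, M 16 and TP 17. The macro and function names are invented for illustration.

```c
#include <stdint.h>

#define LVT_DMODE_FIXED     (0x0u << 8)
#define LVT_DMODE_NMI       (0x4u << 8)
#define LVT_DMODE_EXTINT    (0x7u << 8)
#define LVT_DS_PENDING      (1u << 12)   /* read only to software      */
#define LVT_IP_ACTIVE_LOW   (1u << 13)
#define LVT_TM_LEVEL        (1u << 15)
#define LVT_MASKED          (1u << 16)
#define LVT_TIMER_PERIODIC  (1u << 17)   /* timer entry only           */

/* Value for the timer LVT entry (offset 0x0320). */
static uint32_t lvt_timer(uint8_t vector, int periodic, int masked)
{
    uint32_t v = vector;
    if (periodic)
        v |= LVT_TIMER_PERIODIC;
    if (masked)
        v |= LVT_MASKED;
    return v;
}

/* Value for a LINT0/LINT1 LVT entry (offsets 0x0350 and 0x0360).
   `dmode` is one of the LVT_DMODE_* constants. */
static uint32_t lvt_lint(uint8_t vector, uint32_t dmode, int masked)
{
    uint32_t v = vector | dmode;
    if (masked)
        v |= LVT_MASKED;
    return v;
}
```

For example, `lvt_lint(0, LVT_DMODE_EXTINT, 0)` describes a LINT pin wired to a legacy PIC, and `lvt_lint(0, LVT_DMODE_NMI, 0)` describes a pin that delivers NMIs. Which pin carries which signal depends on the system and is not decided by this sketch.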

3.4 - Issuing Interrupt Commands

The local APIC module has a 64 bit register called the Interrupt Command Register that software can use to cause the APIC to issue interrupts to other processors. A write to the low 32 bits of the register causes the command specified in the write operation to be issued. The format of the Interrupt Command Register is as follows:
Interrupt Command Register
Bits Field
63-56 DESTINATION FIELD
55-32 reserved
31-20 reserved
19-18 DSH
17-16 reserved
15 TM
14 LV
13 reserved
12 DS
11 DM
10-8 DMODE
7-0 VECTOR
  • Vector - Indicates the vector number identifying the interrupt being sent.
  • DMODE (Delivery Mode) - Specifies how the APICs in the destination field should handle the interrupt being sent. All inter-processor interrupts are treated as edge-triggered, even if programmed otherwise.
    • 000 (Fixed) - Delivers the interrupt to the processors listed in the destination field according to the information in the ICR.
    • 001 (Lowest Priority) - Same as fixed mode, except the interrupt is delivered to the processor executing at the lowest priority among the set of processors specified in the destination field.
    • 010 (SMI) - Delivers a System Management Interrupt. Only the edge triggered mode is allowed. The vector field must be programmed to 00h.
    • 011 (Reserved)
    • 100 (NMI) - Delivers the interrupt as an NMI to all processors listed in the destination. The vector information is ignored.
    • 101 (INIT) - Delivers the interrupt as an INIT, causing all processors in the destination to assume their INIT state. Vector information is ignored.
    • 101 (INIT Level De-assert) - (Specified by setting Level to 0 and Trigger Mode to 1). The interrupt is delivered to all processors regardless of the destination field. Causes all the APICs to reset their arbitration IDs to the local APIC IDs.
    • 110 (Startup) - Sends a Startup message to the processors listed in the destination field. The 8-bit vector information is the physical page number of the address for the processors to begin executing from. This message is not automatically retried, and software may need to retry in the case of failure.
  • DM (Destination Mode) - Indicates whether the destination field contains a physical (0) or logical (1) address.
  • DS (Delivery Status) - Indicates idle (0), that there is no activity for this interrupt, or send pending (1), that the transmission has started, but has not yet been completely accepted.
  • LV (Level) - For the INIT De-assert mode, this is set to 0. For all other delivery modes, this is set to 1.
  • TM (Trigger Mode) - Used for the INIT De-assert mode only.
  • DSH (Destination Shorthand) - Indicates whether shorthand notation is being used to specify the destination of the interrupt. If destination shorthand is used, then the destination field is ignored. This field can have the values:
    • 00 (No shorthand) - Indicates no shorthand is being specified and that the destination field contains the destination.
    • 01 (Self) - Indicates that the current APIC is the only destination. Useful for self interrupts.
    • 10 (All) - Broadcasts the message to all APICs, including the processor sending the interrupt.
    • 11 (All excluding Self) - Broadcasts the message to all APICs, excluding the processor sending the interrupt.
  • Destination Field - When the destination shorthand field is set to 00 and the destination mode is physical, the destination field (bits 56-59) contains the APIC ID of the destination. When the mode is logical, the interpretation of this field is more complicated. See the Intel SDM Vol 3, Chap 7, for details.
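Putting the ICR fields together, sending an IPI means writing the destination into the high word first and the command into the low word last, since the write to the low 32 bits is what issues the command. The sketch below assumes physical destination mode with no shorthand, and the bit positions VECTOR 0-7, DMODE 8-10, DM 11, DS 12, LV 14, TM 15 and DSH 18-19. All names are hypothetical, and `lapic` is a pointer to the mapped local APIC registers.

```c
#include <stdint.h>

#define ICR_DMODE_FIXED    (0x0u << 8)
#define ICR_DMODE_INIT     (0x5u << 8)
#define ICR_DMODE_STARTUP  (0x6u << 8)
#define ICR_DM_LOGICAL     (1u << 11)
#define ICR_DS_PENDING     (1u << 12)
#define ICR_LV_ASSERT      (1u << 14)   /* 1 for everything but INIT De-assert */
#define ICR_TM_LEVEL       (1u << 15)
#define ICR_DSH_ALL_EXCL   (0x3u << 18) /* all excluding self */

/* High word: the destination APIC ID lives in ICR bits 56-59. */
static uint32_t icr_high(uint8_t dest_apic_id)
{
    return (uint32_t)(dest_apic_id & 0x0Fu) << 24;
}

/* Low word for an INIT IPI; the vector is ignored. */
static uint32_t icr_low_init(void)
{
    return ICR_DMODE_INIT | ICR_LV_ASSERT;
}

/* Low word for a Startup IPI. The vector is the physical page number of the
   code the target should start executing, so the address must be page aligned. */
static uint32_t icr_low_startup(uint32_t start_phys_addr)
{
    return ICR_DMODE_STARTUP | ICR_LV_ASSERT | ((start_phys_addr >> 12) & 0xFFu);
}

/* Issues the command and waits until the APIC reports it as no longer pending. */
static void lapic_send_ipi(volatile uint32_t *lapic, uint8_t dest, uint32_t low)
{
    lapic[0x0310 / 4] = icr_high(dest);   /* destination first          */
    lapic[0x0300 / 4] = low;              /* this write sends the IPI   */
    while (lapic[0x0300 / 4] & ICR_DS_PENDING)
        ;                                 /* spin while "send pending"  */
}
```

As an example, `lapic_send_ipi(lapic, 1, icr_low_startup(0x8000))` would ask the processor with APIC ID 1 to start at physical address 0x8000 (page number 08h). A real kernel would add a timeout to the polling loop and the delays between IPIs that the Intel documentation calls for.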

4 - Application Processor Startup

5 - MP Detection and Initialization Recap

  1. The BIOS selects the BSP and begins uniprocessor startup, initializing the APs to Real Mode and halting them.
  2. The OS code (either bootstrap or kernel) searches for the MP Floating Pointer structure.
  3. The OS uses the MP Floating Pointer structure to select a default configuration or to find the MP Configuration Table.
  4. The OS parses the MP Configuration Table to determine how many processors and IO APICs are in the system.
  5. The OS initializes the bootstrap processor's local APIC.
  6. The OS sends Startup IPIs to each of the other processors with the address of trampoline code.
  7. The trampoline code initializes the APs to protected mode and enters the OS code to begin further initialization.
  8. When the APs have been awakened and initialized, the BSP can initialize the IO APIC into Symmetric IO mode, to allow the APs to begin to handle interrupts.
  9. The OS continues further initialization, using locking primitives as necessary.

6 - Locks and IPIs

 
