
== Example code ==
The following 8086 [[assembly language|assembly]] source code is for a subroutine named <code>_strtolower</code> that copies a null-terminated ([[ASCIIZ]]) character string from one location to another, converting all uppercase alphabetic characters to lower case. The string is copied one byte (8-bit character) at a time.

<!--NOTE: The hex codes were assembled by hand, so there may be errors-->
{| style="font-size:70%"
|
<!--NOTE: DO NOT REMOVE BLANK LINES, 0000 line sets block width--><pre>









0000            
0000  55
0001  89 E5
0003  56
0004  57
0005  8B 76 06
0008  8B 7E 04
000B  FC

000C  AC
000D  3C 41
000F  7C 06
0011  3C 5A
0013  7F 02
0015  04 20
0017  AA
0018  08 C0
001A  75 F0

001C  5F
001D  5E
001E  5D
001F  C3
0020
</pre>
|
<syntaxhighlight lang="nasm">
; _strtolower:
; Copy a null-terminated ASCII string, converting
; all alphabetic characters to lower case.
; ES=DS
; Entry stack parameters
;   [SP+4] = src, Address of source string
;   [SP+2] = dst, Address of target string
;   [SP+0] = Return address
;
_strtolower proc
            push    bp              ;Set up the call frame
            mov     bp,sp
            push    si
            push    di
            mov     si,[bp+6]       ;Set si = src (+2 due to push bp)
            mov     di,[bp+4]       ;Set di = dst
            cld                     ;string direction ascending
            
next:       lodsb                   ;Load al from [si], inc si
            cmp     al,'A'          ;If al < 'A',
            jl      copy            ; skip conversion
            cmp     al,'Z'          ;If al > 'Z',
            jg      copy            ; skip conversion
            add     al,'a'-'A'      ;Convert al to lowercase
copy:       stosb                   ;Store al to es:[di], inc di
            or      al,al           ;If al <> 0,
            jne     next            ; repeat the loop
            
done:       pop     di              ;restore di and si
            pop     si
            pop     bp              ;Restore the prev call frame
            ret                     ;Return to caller
_strtolower endp
</syntaxhighlight>
|}

The example code uses the BP (base pointer) register to establish a [[call frame]], an area on the stack that contains all of the parameters and local variables for the execution of the subroutine. This kind of [[calling convention]] supports [[reentrancy (computing)|reentrant]] and [[recursion (computer science)|recursive]] code and has been used by Algol-like languages since the late 1950s. The code assumes a flat memory model, specifically that the DS and ES segments address the same region of memory, because <code>lodsb</code> reads through DS:SI while <code>stosb</code> writes through ES:DI.
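
The behaviour of the routine can also be modelled in a high-level language. The following Python sketch is an illustration only, not part of the original listing: the name <code>strtolower</code>, the flat <code>bytearray</code> standing in for memory, and the sample string are assumptions made for this example.

<syntaxhighlight lang="python">
def strtolower(memory: bytearray, src: int, dst: int) -> None:
    """Copy a null-terminated string from src to dst, lowercasing A-Z.

    memory is a flat byte array; src and dst are offsets into it
    (hypothetical stand-ins for the two stack parameters).
    """
    si, di = src, dst                      # mov si,[bp+6] / mov di,[bp+4]
    while True:
        al = memory[si]                    # lodsb: load al, advance si
        si += 1
        if ord('A') <= al <= ord('Z'):     # the two cmp / conditional jumps
            al += ord('a') - ord('A')      # add al,'a'-'A'
        memory[di] = al                    # stosb: store al, advance di
        di += 1
        if al == 0:                        # or al,al / jne: stop after the
            break                          # terminator has been copied


mem = bytearray(32)
mem[0:7] = b"Hi, PC\x00"                   # source string at offset 0
strtolower(mem, 0, 16)                     # destination at offset 16
print(bytes(mem[16:23]))
</syntaxhighlight>

As in the assembly version, the terminating zero byte is stored before the loop condition is tested, so the copy is itself null-terminated.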

===Performance===
[[File:Intel 8086 block scheme.svg|thumb|405px|''Simplified block diagram of the Intel 8088 (a variant of the 8086); 1=main & index registers; 2=segment registers and IP; 3=address adder; 4=internal address bus; 5=instruction queue; 6=control unit (very simplified!); 7=bus interface; 8=internal databus; 9=ALU; 10/11/12=external address/data/control bus.'']]

Although partly shadowed by other design choices in this particular chip, the [[multiplexed]] address and [[Bus (computing)|data buses]] limit performance slightly. Transfers of 16-bit or 8-bit quantities are done in a four-clock memory access cycle, which is faster for 16-bit quantities but slower for 8-bit quantities than on many contemporary 8-bit CPUs. Because instructions vary from one to six bytes in length, fetch and execution are made [[Concurrency (computer science)|concurrent]] and decoupled into separate units (as remains the case in today's x86 processors): the ''bus interface unit'' feeds the instruction stream to the ''execution unit'' through a 6-byte prefetch queue (a form of loosely coupled [[Pipeline (computing)|pipelining]]). This sped up operations on [[Processor register|register]]s and [[Operand|immediate]]s, while memory operations became slower; four years later, this performance problem was fixed with the [[80186]] and [[80286]]. However, the full (instead of partial) 16-bit architecture with a full-width [[Arithmetic logic unit|ALU]] meant that 16-bit arithmetic instructions could now be performed in a single ALU cycle (instead of two, via internal carry, as in the 8080 and 8085), speeding up such instructions considerably. Combined with [[orthogonalization]]s of operations versus [[operand]] types and [[addressing mode]]s, as well as other enhancements, this made the performance gain over the 8080 or 8085 fairly significant, despite cases where the older chips may be faster (see below).
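
The difference between one full-width ALU pass and two 8-bit passes linked by a carry can be illustrated arithmetically. The Python sketch below models only the arithmetic, not the actual microarchitecture of either chip; both function names are hypothetical.

<syntaxhighlight lang="python">
def add16_two_passes(a: int, b: int) -> int:
    """16-bit add built from two 8-bit additions linked by a carry."""
    low = (a & 0xFF) + (b & 0xFF)          # first pass: low bytes
    carry = low >> 8                       # carry out of the low byte
    high = (a >> 8) + (b >> 8) + carry     # second pass: high bytes + carry
    return ((high & 0xFF) << 8) | (low & 0xFF)


def add16_single_pass(a: int, b: int) -> int:
    """16-bit add done in one step by a full-width adder."""
    return (a + b) & 0xFFFF


# Both give the same result; the two-pass version simply needs two steps.
print(hex(add16_two_passes(0x12FF, 0x0001)))
print(hex(add16_single_pass(0x12FF, 0x0001)))
</syntaxhighlight>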

{| class="wikitable" style="text-align: center; width: 100px; height: 50px;"
|+ Execution times for typical instructions (in clock cycles)<ref>{{cite book|title=Microsoft Macro Assembler 5.0 Reference Manual|year=1987|publisher=Microsoft Corporation| quote=Timings and encodings in this manual are used with permission of Intel and come from the following publications: Intel Corporation. iAPX 86, 88, 186 and 188 User's Manual, Programmer's Reference, Santa Clara, Calif. 1986.|title-link=MASM}} (Similarly for iAPX 286, 80386, 80387.)</ref>
|-  style="vertical-align:bottom; border-bottom:3px double #999;"
!align=left | instruction
!align=left | register-register
!align=left | register-immediate
!align=left | register-memory
!align=left | memory-register
!align=left | memory-immediate
|-  style="vertical-align:top; border-bottom:1px solid #999;"
|mov || 2 || 4|| 8+EA || 9+EA || 10+EA
|-  style="vertical-align:top; border-bottom:1px solid #999;"
|ALU || 3 || 4 || 9+EA || 16+EA || 17+EA
|-  style="vertical-align:top; border-bottom:1px solid #999;"
|jump || colspan="5" | ''register'' ≥ 11 ; ''label'' ≥ 15 ; ''condition,label'' ≥ 16
|-  style="vertical-align:top; border-bottom:1px solid #999;"
|integer multiply || colspan="5" | 70~160 (depending on operand ''data'' as well as size) ''including'' any EA
|-  style="vertical-align:top; border-bottom:1px solid #999;"
|integer divide || colspan="5" | 80~190 (depending on operand ''data'' as well as size) ''including'' any EA
|}
* EA = time to compute effective address, ranging from 5 to 12 cycles.
* Timings are best case, depending on prefetch status, instruction alignment, and other factors.
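
The table entries can be combined with an effective-address cost to estimate best-case timings. The following Python sketch copies the <code>mov</code> and ALU rows from the table above; the names <code>BASE_CYCLES</code> and <code>best_case_cycles</code> are hypothetical, and the result ignores prefetch status and alignment, as the table itself does.

<syntaxhighlight lang="python">
# Base cycle counts copied from the mov and ALU rows of the table.
BASE_CYCLES = {
    "mov": {"register-register": 2, "register-immediate": 4,
            "register-memory": 8, "memory-register": 9,
            "memory-immediate": 10},
    "alu": {"register-register": 3, "register-immediate": 4,
            "register-memory": 9, "memory-register": 16,
            "memory-immediate": 17},
}
MEMORY_FORMS = {"register-memory", "memory-register", "memory-immediate"}


def best_case_cycles(instruction: str, form: str, ea: int = 0) -> int:
    """Return base cycles, plus the EA cost for memory-operand forms."""
    base = BASE_CYCLES[instruction][form]
    if form not in MEMORY_FORMS:
        return base                        # no effective address needed
    if not 5 <= ea <= 12:                  # EA range stated below the table
        raise ValueError("EA must be between 5 and 12 cycles")
    return base + ea


print(best_case_cycles("mov", "register-register"))        # 2
print(best_case_cycles("alu", "memory-register", ea=5))    # 21
</syntaxhighlight>

Even with the cheapest effective address, the memory-destination ALU form costs seven times as many cycles as the register-register form.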

As can be seen from the table, operations on registers and immediates were fast (between 2 and 4 cycles), while memory-operand instructions and jumps were quite slow; jumps took more cycles than on the simple [[Intel 8080|8080]] and [[Intel 8085|8085]], and the 8088 (used in the IBM PC) was additionally hampered by its narrower bus. Most memory-related instructions were slow for three reasons:
* Loosely coupled fetch and execution units are efficient for instruction prefetch, but not for jumps and random data access (without special measures).
* No dedicated address calculation adder was afforded; the microcode routines had to use the main ALU for this (although there was a dedicated ''segment'' + ''offset'' adder).
* The address and data buses were [[multiplexing|multiplex]]ed, forcing a slightly longer (33~50%) bus cycle than in typical contemporary 8-bit processors.{{Dubious|1=Multiplexed bus|reason=The multiplexed bus is unlikely to slow things by "33~50%." The address was only delayed by the 18 nanosecond max propagation delay of the 74LS373 transparent latch.|date=May 2023}}

However, memory access performance was drastically enhanced with Intel's next generation of 8086 family CPUs. The [[Intel 80186|80186]] and [[Intel 80286|80286]] both had dedicated address calculation hardware, saving many cycles, and the 80286 also had separate (non-multiplexed) address and data buses.
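
The dedicated ''segment'' + ''offset'' adder mentioned in the list above forms a 20-bit physical address from a 16-bit segment and a 16-bit offset. A minimal Python sketch of that calculation follows; the function name <code>physical_address</code> is hypothetical.

<syntaxhighlight lang="python">
def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and 16-bit offset into a 20-bit address."""
    # The segment is shifted left by four bits (multiplied by 16) and the
    # offset is added; the result wraps at 20 bits on the 8086.
    return ((segment << 4) + offset) & 0xFFFFF


# Different segment:offset pairs can name the same physical location.
print(hex(physical_address(0x1234, 0x0010)))   # 0x12350
print(hex(physical_address(0x1235, 0x0000)))   # 0x12350
</syntaxhighlight>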

===Floating point===
The 8086/8088 could be connected to a mathematical coprocessor to add hardware/microcode-based [[floating-point]] capability. The [[Intel 8087]] was the standard math coprocessor for the 8086 and 8088, operating on 80-bit numbers. Manufacturers like [[Cyrix]] (8087-compatible) and [[Weitek]] (''not'' 8087-compatible) eventually came up with high-performance floating-point coprocessors that competed with the 8087.
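
The 80-bit format is usually documented as one sign bit, a 15-bit exponent biased by 16383, and a 64-bit significand with an explicit integer bit. The Python sketch below decodes such a value under that assumed layout; <code>decode_extended80</code> is a hypothetical name, infinities and NaNs are not handled, and the result is rounded to Python's 64-bit float.

<syntaxhighlight lang="python">
import math

EXPONENT_BIAS = 16383


def decode_extended80(bits: int) -> float:
    """Decode an 80-bit extended-precision value given as an integer."""
    sign = (bits >> 79) & 1
    exponent = (bits >> 64) & 0x7FFF
    significand = bits & 0xFFFFFFFFFFFFFFFF    # includes the integer bit
    if exponent == 0x7FFF:
        raise ValueError("infinities and NaNs are not handled in this sketch")
    # A stored exponent of zero (zero or denormal) is treated as one.
    unbiased = (exponent or 1) - EXPONENT_BIAS
    # The significand is a fixed-point number with 63 fraction bits.
    magnitude = math.ldexp(significand, unbiased - 63)
    return -magnitude if sign else magnitude


print(decode_extended80(0x3FFF8000000000000000))   # 1.0
print(decode_extended80(0xC000A000000000000000))   # -2.5
</syntaxhighlight>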