Not-so-ultra fast division routine

Here you can ask questions or provide insights about how to use efficiently 6502 assembly code on the Oric.
User avatar
Chema
Game master
Posts: 3013
Joined: Tue Jan 17, 2006 10:55 am
Location: Gijón, SPAIN
Contact:

Not-so-ultra fast division routine

Post by Chema »

Greetings.

I have also adapted S. Judd's division routine (just copied it, really... little modifications needed).

It can divide two 16-bit numbers and generate a result of 8-bit integer division and the remainder as a fraction of 256 (so .5 is 128).

He is using a predictor/corrector method, first estimating the integer result with a table of logs, then adjusting it with two tables, one of exp and another of negative exp.

He also employs the fast multiplication routine I also posted before.

Some figures now. Running a loop like:

Code: Select all

for (i=10;i<1000;i+=10) 
     for (j=10;j<1000;j+=10)
        k=(int)(i/j);
Takes 1130 100ths of a second. Using this routine (also from C) it takes 388. Not sure how division is implemented in the C library, though.

Here is the code:

Code: Select all

.zero
DIVXLO   .byt 00           ;Division: DIVX/DIVY
DIVXHI   .byt 00
DIVY     .byt 00
DIVTEMP  .byt 00           ;High byte of DY
TEMP1    .byt 00
TEMP2    .byt 00

.text
DIVSHIFT
.( 
         LSR DIVTEMP
         ROR DIVY
         LSR DIVXHI
         ROR DIVXLO
.)
DIVXY
.(   
         LDA DIVTEMP      ;Div by 2 if dy>255
         BNE DIVSHIFT

         LDA DIVXHI
         CMP #2
         BCS DIVSHIFT     ;Or if dx>511

         LSR              ;Compute dx/2
         LDA DIVXLO
         ROR
         TAX
         LDA DIVY
         LSR              ;dy/2
         BEQ TWOSTEP     ;If Y=1 then handle special

         TAY
         LDA tab_log,X        ;This is the division part
         SEC
         SBC tab_log,Y
         BCC NEG
         TAX
         LDA tab_exp,X
         TAX              ;Now we have int estimate

         STA MultLo1
         STA MultHi1
         EOR #$FF
         ADC #00          ;Carry is guaranteed set
         STA MultLo2
         STA MultHi2
         LDY DIVY
         LDA (MultLo1),Y
         SEC
         SBC (MultLo2),Y  ;a=N*dy
         STA TEMP1
         LDA (MultHi1),Y
         SBC (MultHi2),Y
         STA TEMP2

         LDA DIVXLO       ;R=dx-a
         SBC TEMP1        ;C set
         STA TEMP1
         LDA DIVXHI
         SBC TEMP2
         LDA TEMP1        ;A=remainder
         BCC RNEG
                          ;If R>0 then assume R<255
                          ;(true unless dx>500 or so)

RPOS     CMP DIVY         ;If R>=dy then
         BCC DONE
L1       INX              ;a=a+1
         SBC DIVY         ;R=R-dy
         CMP DIVY
         BCS L1
DONE                      ;Now X contains integer, A rem
                          ;y=dy
;
; Compute remainder as a fraction of 256, i.e.
; 256*r/dy
;
; Currently, a small error may occur for large r
; (cumulative error of 1-2 pixels, up to 4 in rare cases)
;
FRACREM 
         STX TEMP1
         TAX
         BEQ ZERO
         LDA tab_log,X
         SEC
         SBC tab_log,Y
         TAX
         LDA tab_negexp,X
ZERO     LDX TEMP1
         RTS
                          ;And, if R<0 then assume
                          ;R>-255
RNEG     DEX
         ADC DIVY
         BCC RNEG
         JMP FRACREM

NEG      LDX #00          ;Since log is monotonic, and
         LDA DIVXLO       ;we /2, there is no chance
         LDY DIVY
         JMP FRACREM     ;of undershooting.

TWOSTEP  LDA DIVXHI       ;If Y=1
         LSR
         LDA DIVXLO       ;then just two steps of size
         ROR              ;dx/2
         TAX
         LDA #255
         RTS
.)
And the tables:

Code: Select all

tab_log
    .byt  $00,$00,$1f,$32,$3f,$4a,$52,$59,$5f,$65,$69,$6e,$72,$76,$79,$7c
    .byt  $7f,$82,$85,$87,$89,$8c,$8e,$90,$92,$94,$95,$97,$99,$9a,$9c,$9e
    .byt  $9f,$a0,$a2,$a3,$a4,$a6,$a7,$a8,$a9,$aa,$ac,$ad,$ae,$af,$b0,$b1
    .byt  $b2,$b3,$b4,$b4,$b5,$b6,$b7,$b8,$b9,$ba,$ba,$bb,$bc,$bd,$bd,$be
    .byt  $bf,$c0,$c0,$c1,$c2,$c2,$c3,$c4,$c4,$c5,$c6,$c6,$c7,$c7,$c8,$c9
    .byt  $c9,$ca,$ca,$cb,$cb,$cc,$cc,$cd,$ce,$ce,$cf,$cf,$d0,$d0,$d1,$d1
    .byt  $d2,$d2,$d2,$d3,$d3,$d4,$d4,$d5,$d5,$d6,$d6,$d7,$d7,$d7,$d8,$d8
    .byt  $d9,$d9,$d9,$da,$da,$db,$db,$db,$dc,$dc,$dd,$dd,$dd,$de,$de,$de
    .byt  $df,$df,$df,$e0,$e0,$e1,$e1,$e1,$e2,$e2,$e2,$e3,$e3,$e3,$e4,$e4
    .byt  $e4,$e5,$e5,$e5,$e5,$e6,$e6,$e6,$e7,$e7,$e7,$e8,$e8,$e8,$e8,$e9
    .byt  $e9,$e9,$ea,$ea,$ea,$ea,$eb,$eb,$eb,$ec,$ec,$ec,$ec,$ed,$ed,$ed
    .byt  $ed,$ee,$ee,$ee,$ee,$ef,$ef,$ef,$ef,$f0,$f0,$f0,$f0,$f1,$f1,$f1
    .byt  $f1,$f2,$f2,$f2,$f2,$f3,$f3,$f3,$f3,$f4,$f4,$f4,$f4,$f4,$f5,$f5
    .byt  $f5,$f5,$f6,$f6,$f6,$f6,$f6,$f7,$f7,$f7,$f7,$f7,$f8,$f8,$f8,$f8
    .byt  $f9,$f9,$f9,$f9,$f9,$fa,$fa,$fa,$fa,$fa,$fb,$fb,$fb,$fb,$fb,$fc
    .byt  $fc,$fc,$fc,$fc,$fc,$fd,$fd,$fd,$fd,$fd,$fe,$fe,$fe,$fe,$fe,$ff

tab_exp
    .byt  $01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01
    .byt  $01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01
    .byt  $02,$02,$02,$02,$02,$02,$02,$02,$02,$02,$02,$02,$02,$02,$02,$02
    .byt  $02,$02,$02,$03,$03,$03,$03,$03,$03,$03,$03,$03,$03,$03,$03,$03
    .byt  $04,$04,$04,$04,$04,$04,$04,$04,$04,$04,$04,$05,$05,$05,$05,$05
    .byt  $05,$05,$05,$06,$06,$06,$06,$06,$06,$06,$07,$07,$07,$07,$07,$07
    .byt  $08,$08,$08,$08,$08,$08,$09,$09,$09,$09,$0a,$0a,$0a,$0a,$0a,$0b
    .byt  $0b,$0b,$0b,$0c,$0c,$0c,$0c,$0d,$0d,$0d,$0e,$0e,$0e,$0f,$0f,$0f
    .byt  $10,$10,$10,$11,$11,$11,$12,$12,$13,$13,$14,$14,$14,$15,$15,$16
    .byt  $16,$17,$17,$18,$18,$19,$1a,$1a,$1b,$1b,$1c,$1d,$1d,$1e,$1e,$1f
    .byt  $20,$21,$21,$22,$23,$24,$24,$25,$26,$27,$28,$29,$29,$2a,$2b,$2c
    .byt  $2d,$2e,$2f,$30,$31,$33,$34,$35,$36,$37,$38,$3a,$3b,$3c,$3e,$3f
    .byt  $40,$42,$43,$45,$46,$48,$49,$4b,$4d,$4e,$50,$52,$54,$56,$57,$59
    .byt  $5b,$5d,$5f,$62,$64,$66,$68,$6a,$6d,$6f,$72,$74,$77,$79,$7c,$7f
    .byt  $82,$84,$87,$8a,$8d,$90,$94,$97,$9a,$9e,$a1,$a5,$a8,$ac,$b0,$b4
    .byt  $b8,$bc,$c0,$c4,$c8,$cd,$d1,$d6,$db,$df,$e4,$e9,$ee,$f4,$f9,$fe


tab_negexp
    .byt  $fe,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01
    .byt  $01,$01,$01,$01,$02,$02,$02,$02,$02,$02,$02,$02,$02,$02,$02,$02
    .byt  $02,$02,$02,$02,$02,$02,$02,$02,$02,$02,$02,$03,$03,$03,$03,$03
    .byt  $03,$03,$03,$03,$03,$03,$03,$03,$03,$03,$03,$04,$04,$04,$04,$04
    .byt  $04,$04,$04,$04,$04,$04,$04,$05,$05,$05,$05,$05,$05,$05,$05,$05
    .byt  $06,$06,$06,$06,$06,$06,$06,$07,$07,$07,$07,$07,$07,$07,$08,$08
    .byt  $08,$08,$08,$08,$09,$09,$09,$09,$09,$0a,$0a,$0a,$0a,$0a,$0b,$0b
    .byt  $0b,$0b,$0c,$0c,$0c,$0c,$0d,$0d,$0d,$0e,$0e,$0e,$0f,$0f,$0f,$10
    .byt  $10,$10,$11,$11,$11,$12,$12,$12,$13,$13,$14,$14,$15,$15,$15,$16
    .byt  $16,$17,$17,$18,$18,$19,$1a,$1a,$1b,$1b,$1c,$1d,$1d,$1e,$1e,$1f
    .byt  $20,$20,$21,$22,$23,$23,$24,$25,$26,$27,$28,$28,$29,$2a,$2b,$2c
    .byt  $2d,$2e,$2f,$30,$31,$32,$33,$34,$36,$37,$38,$39,$3a,$3c,$3d,$3e
    .byt  $40,$41,$43,$44,$46,$47,$49,$4a,$4c,$4d,$4f,$51,$53,$55,$56,$58
    .byt  $5a,$5c,$5e,$60,$62,$65,$67,$69,$6b,$6e,$70,$73,$75,$78,$7a,$7d
    .byt  $80,$83,$85,$88,$8b,$8e,$92,$95,$98,$9b,$9f,$a2,$a6,$a9,$ad,$b1
    .byt  $b5,$b9,$bd,$c1,$c5,$ca,$ce,$d3,$d7,$dc,$e1,$e6,$eb,$f0,$f5,$fa
User avatar
Twilighte
Game master
Posts: 819
Joined: Sat Jan 07, 2006 12:07 am
Location: Luton, UK
Contact:

Post by Twilighte »

This looks similar to the division routine i used many years ago for converting 16 bit SID pitch to 12 bit AY pitch for the SID play project.
I remember its clever use of shifting to minimise the number of division cycles. Looks like the same thing done here.
It truelly reduced the time, from 100% cpu to produce a very slow SID conversion to an estimated 45% cpu and a very accurate SID conversion.
I was even able to do double speed tunes.
User avatar
Chema
Game master
Posts: 3013
Joined: Tue Jan 17, 2006 10:55 am
Location: Gijón, SPAIN
Contact:

Post by Chema »

Twilighte wrote:This looks similar to the division routine i used many years ago for converting 16 bit SID pitch to 12 bit AY pitch for the SID play project.
I remember its clever use of shifting to minimise the number of division cycles. Looks like the same thing done here.
It truelly reduced the time, from 100% cpu to produce a very slow SID conversion to an estimated 45% cpu and a very accurate SID conversion.
I was even able to do double speed tunes.
Yep. It is not *really* a huge gain (3 times faster than the routine in the C library), but it is quite adequate. I would love to have something more efficient.

BTW I forgot to mention that it was designed for divisions that might fit in 8-bit. In fact, originally, it worked only for 9-bit numbers divided by 8-bit numbers, but it can handle larger numbers now, as it shiftes them (divide both by 2) until X<512 and Y<256... So it is not general.

Judd uses it in its filled polygon routine, so it should serve at least for that objective, as well as for clipping lines (looks at Dbug).
Post Reply