16 bits linear interpolation, anyone ?

Here you can ask questions or provide insights about how to use efficiently 6502 assembly code on the Oric.
User avatar
Dbug
Site Admin
Posts: 4437
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway
Contact:

16 bits linear interpolation, anyone ?

Post by Dbug »

Hi.

This question is a bit hardcore, but if we can manage to find a solution to it, I will be able to finish my line and polygon clipping, and Chema will be able to continue is 3D game !

Basically, imagine you have two 16 bits (signed) coordinates (let's say in the -32000 to +32000 range), that represent a line or the border of a polygon.

Obviously, you need to clip that to something that fits into the screen.

There are a lot of well known algorithm, and I'm using one of these designed by Cohen & Sutherland.

The code is working fine, except that in the implementation I ran in an overflow issue when the lines are in certain positions, and this is not solvable in the current way, because the clipping involves doing some linear interpolation (X0+(Width*position)/Height), and unfortunately the result of the multiplication of two 16 bits numbers can easily require more than 16 bits.

Of course, both the multiplication and the divide could be replaced by 24 or 32 bits variants (16*16 to 32 and 32/16 to 16) but the existing 16 bits variants are already of killing the performance anyway, so it would be nice to get rid of them entirely.

This week end I will try something doing some dichotomy loop, but I wanted to ask if anyone would have an idea on how to implement efficiently (and accurately) this operation.

So, what I need, is the following.

Given X0,Y0,X1,Y1 (all 16 bits, the coordinates of both points of a line), if I give some value Xn (with Xn some value between X0 and X1) I need to get the corresponding Yn (and if given some Yn, need to find Xn).

It's the equivalent of these formula:

Code: Select all

Yn=Y0+((Y1-Y0)*(Xn-X0))/(X1-X0)
Xn=X0+((X1-X0)*(Yn-Y0))/(Y1-Y0)
Example:

With

Code: Select all

(X0,Y0)=(0,0)
(X1,Y1)=(200,200)
Then

Code: Select all

If Xn=50 -> get Yn=0+(200-0)*(50-0))/(200-0)=0+(200*50)/200=50
If Xn=75 -> get Yn=0+(200-0)*(75-0))/(200-0)=0+(200*75)/200=75
With

Code: Select all

(X0,Y0)=(30,90)
(X1,Y1)=(15,160)
Then

Code: Select all

If Xn=20  -> get Yn=90+(160-90)*(20-30))/(15-30)=90+(70*-10)/-15=136
If Yn=100 -> get Xn=30+(15-30)*(100-90))/(90-160)=30+(-15*10)/-70=27
I guess everybody can agree that the formula is simple, it's just a variant of the Cross-multiplication.

Thanks for any suggestion !
User avatar
waskol
Flight Lieutenant
Posts: 414
Joined: Wed Jun 13, 2007 8:20 pm
Location: FRANCE, Paris

Post by waskol »

What you are doing it, basically is to multiply A by B and then divide by C.
What I understand is that, under certain conditions, when you multiply by B you get and overflow, due to the fact that A is too big.

May be could you reduce that overflow by reducing the quotient B/C, computing for both B and C the GCD (Greater COmmon Divisor) ?

I.e., instead of multiplying by B/C=12/9 , you then multiply by B'/C'=4/3

I know it is not THE solution but perhaps a begining.

If I have something else, I come back around...

Step 2 :
Since we are dealing with integers, what can be done also is once you have your coordinate to compute with
AxB'/C'

You ten reduce the quotient A/C'by computing their GCD
Then you obtain this : A'xB'/C'', that should perform without overflow.

If you want to simplify a quotient in asm, the first thing to look between A and B, is if their last bit finishes by 0.

Let's say that we have A=111110b and B=1010b
They both finish by zero, which means they are both multiple of 2.
Follows that if you shift right both numbers, you simplify your operation A/B ( A/B=111110/1010=A'/B'=11111/101 ).
of course, you can simplify A/B as long as there are still leading zeros for their last bits.

I am pretty sure there exist some easy way in order to simplify a quotient of two binary numbers.
Last edited by waskol on Thu Sep 11, 2008 9:03 am, edited 1 time in total.
User avatar
waskol
Flight Lieutenant
Posts: 414
Joined: Wed Jun 13, 2007 8:20 pm
Location: FRANCE, Paris

Post by waskol »

I found this :
http://en.wikipedia.org/wiki/Binary_GCD_algorithm

among this, the source code provided and so forth, it is said :
Efficiency

Brigitte Vallée proved that binary GCD can be about 60% more efficient (in terms of the number of bit operations) on average than the Euclidean algorithm
and
With a 256-byte lookup table (8 bits), the implementation turned to be between 82.5% and 163% faster than the Euclidean algorithm, depending on CPU and compiler. Even with a small 16-byte lookup table (4 bits), the gains are in the range of 54% to 116%.
I think it is a way to dig out.
User avatar
waskol
Flight Lieutenant
Posts: 414
Joined: Wed Jun 13, 2007 8:20 pm
Location: FRANCE, Paris

Post by waskol »

May be another thing :
When you compute A.B/C, first start by division (i.e (A/C).B ) !

The thing is to compute the euclidian division (A/C) , so that you get a quotient (a1) and a remainder (a2). (Thus A=a1.C+a2)

Then you do the same thing with B (B=b1.C+b2)

And the result of A.B/C is :
a1.B+a2.b1+(a2.b2/C)

Check this example :
A=32
B=10
C=6

we want to compute A.B/C

if we first compute A.B, we obtain a "huge" value (320)

Following the method :
A=5*C+2 (a1=5, a2=2)
B=1*C+4 (b1=1, b2=4)

and A.B/C=a1*B+a2*b1+(a2*b2/C)=5*10+2*1+(2*4/6) (<-- binary simplification of last bits is possible here !)

Leading to A.B/C=50+2+1.3333=53.3333

The greatest value obtained was 50.



:P
User avatar
Chema
Game master
Posts: 3013
Joined: Tue Jan 17, 2006 10:55 am
Location: Gijón, SPAIN
Contact:

Post by Chema »

Greetings!

Glad to see that you are working on this DBug!

I don't know if this would be of any help but we already have a quick division routine, which makes use of the fast multiply and the predictor-corrector method (using logs and exps). It computes the division of two numbers and produces the result and the remainder modulo 256.

The real fact here is that the routine states that Both X and Y must be positive. X is assumed to be 9-bits and Y is 8-bits. But what it does with 16-bit signed numbers is storing the sign first and then rotate X *AND* Y until they fit the correct sizes and then perform the division.

I can make a few tests, because it will ovbiously NOT be accurate for large numbers. The question is if it would be accurate enough for our needs.

In this case you can use a similar trick to perform your calculations as waskol suggested (divisions first).

For the multiply, we could study the possibility of reducing them to 8-bit multiply (it would be FAST with our routine) if the division is performed first or use a 16-bit routine based on the fast 8-bit (I also have some code for this somewhere).

I will try to find some time to put the routines together and write an example to see how all this works.

Cheers.
User avatar
Chema
Game master
Posts: 3013
Joined: Tue Jan 17, 2006 10:55 am
Location: Gijón, SPAIN
Contact:

Post by Chema »

Ok. Made a few (quick) tests.

It seems the routine is quite accurate if the size of both numbers is comparable (as expected), otherwise the result is just an approximation.

For instance:

26/13 yields 2|0 (integer part|remainder) which is correct
26/3 yields 8|169 (169/256=0.66). The real result is 8.66666 so again correct

However
10001/101 yields 104|0 where it should be 99.019
10001/501 yields 20|206 (206/256=0.8) where it should be 19.92
10001/1001 yields 10|26 where it should be 9.99
10001/901 yields 11|36 thus 11.14 instead of 11.0998
10003/2503 yields 4|0 instead of 3.99 (nearly correct)
10003/2403 yields 4|41 thus 4.16 which is correct

The question is if we could live with this errors or not. Also notice that it only work for unsigned numbers (sign should be treated separately) and it does not work when dividing by 1 (not sure why, but it should be special-cased IMHO).

The routine is quite fast. With the C wrap and compiling this code:

Code: Select all

for (a=100;a<1000;a++){
     for(b=2;b<=a;b+=50){
      r=miDiv(a,b);
      d=r;
      d=d&255;
      r=r>>8;
      tot++;
     }
}
10314 divs are performed in nearly 8 seconds.

EDIT: Just noticed I posted this routine with sources in this forum before. See "Not so fast division routine". Maybe you are even using it!
User avatar
Dbug
Site Admin
Posts: 4437
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway
Contact:

Post by Dbug »

May be could you reduce that overflow by reducing the quotient B/C, computing for both B and C the GCD (Greater COmmon Divisor) ?

That's smart :)

Looks like I will have a lot of things to experiment.

Probably this week end I will work a bit on that, and publish the code source with the LineBench with a clipping version, people will be able to temper with the clippin routine.

Thanks !
User avatar
waskol
Flight Lieutenant
Posts: 414
Joined: Wed Jun 13, 2007 8:20 pm
Location: FRANCE, Paris

Post by waskol »

Another way without clipping (but a sort of instead), no multiplication, and no divisions !!!

Why don't you simply use the Bresenham algorithm and plot only the pixels when they are in the clipping rectangle ?

If I remember well you already developped such a routine and put in on the forum :wink:
User avatar
waskol
Flight Lieutenant
Posts: 414
Joined: Wed Jun 13, 2007 8:20 pm
Location: FRANCE, Paris

Post by waskol »

May be could you find the clipping points by dichotomy before drawing the line !

Thus you will be dealing only with SHR and ADD and you won't have no overflow at all.

Let's say that
(X0,Y0)=(23,0)
(X1,Y1)=(218,199)
Xn=50

And we are looking for Yn
Here is the algorithm (in pascal) :

Code: Select all

const x0=23; y0=0;
      x1=218; y1=199;
      xn=50; //We are looking for Yn
var xa,ya,xb,yb,xc,yc,n:integer;
    yy:real;
begin
  //initialize variables
  xa:=x0;
  ya:=y0;
  xb:=x1;
  yb:=y1;

  //Main dichotomy loop
  repeat
     Xc:=(Xa+Xb) shr 2; //(ADD followed by SHR)
     Yc:=(Ya+Yb) shr 2;
     
     if Xc>Xn then
     begin
       Xb:=Xc;
       Yb:=Yc;
     end
     else begin
       Xa:=Xc;
       Ya:=Yc;
     end;
  until (Xc=Xn);
 
  //We are done
  Yn=Yc  
end;
By computing with multiplications and divisions, the real value of Yn should be
Yn=Y0+((Y1-Y0)*(Xn-X0))/(X1-X0)=27,5538461538462
Watch this !
1 : Xc=120 Yc=99
2 : Xc=71 Yc=49
3 : Xc=47 Yc=24
4 : Xc=59 Yc=36
5 : Xc=53 Yc=30
6 : Xc=50 Yc=27

Result : Xc=50 Yc=27
Not bad, only 6 steps and very accurate ! :P

Let's play with bigger values !
x0=-25257; y0=-5555;
x1=2186; y1=199;
xn=50;
How fast will we be and how accurate ? (haha... :lol: )
Answer is straight :
1 : Xc=-11535 Yc=-2678
2 : Xc=-4674 Yc=-1239
3 : Xc=-1244 Yc=-520
4 : Xc=471 Yc=-160
5 : Xc=-386 Yc=-340
6 : Xc=42 Yc=-250
7 : Xc=256 Yc=-205
8 : Xc=149 Yc=-227
9 : Xc=95 Yc=-238
10 : Xc=68 Yc=-244
11 : Xc=55 Yc=-247
12 : Xc=48 Yc=-248
13 : Xc=51 Yc=-247
14 : Xc=49 Yc=-247
15 : Xc=50 Yc=-247
Xc=50 Yc=-247
True Yn=-248,857158473928
15 steps and 2 pixels close (who cares ? ) :P

You are just left with drawing the line :D
User avatar
Dbug
Site Admin
Posts: 4437
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway
Contact:

Post by Dbug »

waskol wrote:May be could you find the clipping points by dichotomy before drawing the line !
Dbug wrote:(...) This week end I will try something doing some dichotomy loop (...).
;)

Thanks for confirming at least that the idea seems correct :)

The cool thing is that you tried, and thus answered about accuracy. One or two pixels off is not bad for a static drawing, but I fear that the accuracy jumps off/on when the things are moving, which may look strange.

In that case, a solution for me would be to use three bytes and do some fixed point operations like 16.8 shift and add. The additional cost is very light, and the accuracy probably a lot better. (Still need to compare only one byte since the clipping window needs to fit in screen resolution).

Chema, considering that the "n" step dichotomy is replacing a multiplication and a division, how do you think both method compare performance/accuracy wise ?
User avatar
waskol
Flight Lieutenant
Posts: 414
Joined: Wed Jun 13, 2007 8:20 pm
Location: FRANCE, Paris

Post by waskol »

The thing is that you can improve accuracy a lot.

If the computed (Xb-Xa) and (Yb-Ya) at a certain step come to be small enough in order to provide a multiplication without overflow (wich means that if (Xb-Xa) counts n significant bits, while (Yb-Ya) should count at most (15-n) bits, wich guarantee no overflow for their 16 bits multiplication), you can then use the straight forward equation.

This occurs for sure when both (Xb-Xa) and (Yb-Ya) are both coded on 8 bits


LEt's take our preceeding example :
0 : Xm=Xb-Xa=27443 Ym=Yb-Ya=5754
Xc=0 Yc=-11535
1 : Xm=13721 Ym=2877
Xc=1 Yc=-4674
2 : Xm=6860 Ym=1438
Xc=2 Yc=-1244
3 : Xm=3430 Ym=719
Xc=3 Yc=471
4 : Xm=1715 Ym=360
Xc=4 Yc=-386
5 : Xm=857 Ym=180
Xc=5 Yc=42
6 : Xm=429 Ym=90 //We can stop there !!! (Xm*Ym<=65535)
Xc=6 Yc=256
Then we compute an approximate value of Yn (without possible overflow) :
Yn:=Ya+((Yb-Ya)*(Xn-Xa)) div (Xb-Xa);=-248,321678321678 (in case of an euclidian division, it gives -249 !)
Wich is the exact value we were looking for the "real" Yn should be -248,857158473928)

But for this, it is necessary to find a fast way for evaluating the number of bits involved in the multiplication.

What do you think about that ?
User avatar
Chema
Game master
Posts: 3013
Joined: Tue Jan 17, 2006 10:55 am
Location: Gijón, SPAIN
Contact:

Post by Chema »

Dbug wrote:
Chema, considering that the "n" step dichotomy is replacing a multiplication and a division, how do you think both method compare performance/accuracy wise ?
Not sure, really. I am terrible at counting cycles, but for sure that it should be much faster than 16-bit multiplications and divisions, and the DIVXY routine is not much accurate anyway (as you saw in my previous post) so I'd say we could give it a try.

The 8 bit unsigned multiplication can be made VERY fast (thirty-something cycles IIRC), 16-bit signed multiplications based on that is much slower (I think it involves 4 quick 8-bit multiplications, and around 150 cycles). Division should take (whithout the C wrap and test code) something below 400 cycles.

For this routine Judd's states: "cumulative error of 1-2 pixels, up to 4 in rare cases."

These figure are approximate and I can provide code if you wish, because it is difficult for me to count cycles. I could pack all these routines into a library and maybe do some testing... But that should be next week...

So I think it depends on how fast the n-step loop can be done.

EDIT

Made some quick tests. For:

Code: Select all

   X0=23;
    Y0=0; 
    X1=218;
    Y1=199;
    Xn=50; 
And this test code (doing the operation 1000 times):

Code: Select all

for (a=0;a<1000;a++){
     d=miDiv(mimul16((Y1-Y0),(Xn-X0)),(X1-X0));
     Yn=Y0+d&255;
    } 
The answer is correct (27) and it took 121 100ths of a second.

With:

Code: Select all

X0=3;Y0=90; 
   X1=15;Y1=160;
   Xn=20;
The answer is correct (189) and it took 154 100ths of a second.

Beware that the division I am using:
a/ Is unsigned. For signed version sign shall be saved and restored (as in multiplication routine).
b/ Assumes the result is 8-bit, else won't work.
c/ Even if the multiplication routine assumes 32-bit result, the C-wrap code does not, so the function returns an int, which may overflow.

In fact b/ is the reason why it seemed not to work when dividing by 1. Judd said it is a special case where the result can be more than 8-bit, so the remainder is set to 255, which is a value that does not appear otherwise.

Now I should be easier to compare. I don't know how many cycles can be wasted due to the C code and how it is compiled... :roll:[/code]
User avatar
waskol
Flight Lieutenant
Posts: 414
Joined: Wed Jun 13, 2007 8:20 pm
Location: FRANCE, Paris

Post by waskol »

Chema.... Dbug...
I think we can use dichotomy without a loss of precision....

Sorry Chema, bu I think that your fast div routine will be beaten up as if Manchester United was playing against Porstmouth :P

Let me explain....

The trick is to simulate the floating point division with the use of an extra byte for Xa,Xb,Ya,Yb,Xc and Yc. We will work with 16 bits for the integer part plus an extra 8 bits for fractional part.

If you think well, all we do is to divide by to, thus we will deal only with fractional parts such like 0.5, 0.25, 0.75, 0.125, 0.625.
If we represent the fractional part the way it is usually done,

1st bit is 0.5, 2nd bit is 0.25, 3rd bit 0.125, etc...

let's divide an odd number by two (let's say 5)
B1 B2 B3
1st byte 2nd byte 3rd byte
00000000 00000101 00000000 ...Then We divide by 2 (SHR 1)
00000000 00000010 10000000 <- result (2.5 !)
If you look closely, each time we will have an addition (ADD) or a right shift (SHR) will be as if we were dealing with integers.

For the final result (Yc), all we need is an integer, let's see how we can round our 24 bits "floating ppoint notation"

if the fractional part of Yc>=0.5, the 3rd byte will be in the form 1xxxxxxx,
if the frationnal part of Yc<0.5, theis 3rd Byte will be like 0xxxxxxx

Thus, if (B3 and #80)=#80 , then we add 1 to B1|B2
Other else, We leave the result B1|B2 as it is (simple Truncation)

For sure we will have the exact result, and no pixel shift.
Fast, straight, without any division or multiplication :D
User avatar
waskol
Flight Lieutenant
Posts: 414
Joined: Wed Jun 13, 2007 8:20 pm
Location: FRANCE, Paris

Post by waskol »

What I have to say is that wat I said is a bit different with negative numbers (0s and 1s are inversed for the fractionnal part)
User avatar
Dbug
Site Admin
Posts: 4437
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway
Contact:

Post by Dbug »

waskol wrote:Chema.... Dbug...
I think we can use dichotomy without a loss of precision....
(...)
Dbug wrote:In that case, a solution for me would be to use three bytes and do some fixed point operations like 16.8 shift and add. The additional cost is very light, and the accuracy probably a lot better.
The trick is to simulate the floating point division with the use of an extra byte for Xa,Xb,Ya,Yb,Xc and Yc. We will work with 16 bits for the integer part plus an extra 8 bits for fractional part.
(...)
For sure we will have the exact result, and no pixel shift.
Fast, straight, without any division or multiplication :D
:)

Looks like we are talking of the same thing.

24 bits based dichotomy, 16 bits integer part, 8 bits decimals.
Post Reply