That's not in the linebench code. Can you compile other projects?JamesD wrote:I'm getting an "Unresolved External : osdk_stack" message on the latest code so I'm dead in the water.

## Line routine - Second attempt

Another possibility is the same math I wanted to try for nearly horizontal lines so you can skip dealing with x until you know it will change. But I still haven't figured out the math (mostly because I'm procrastinating) so that's not going to help either.

What do you mean with 8 bit math? We are already doing 8 bit math.JamesD wrote:If you unroll the loop for the nearly vertical lines you should be able to do 8 bit math until you know a carry will take place. However, identifying the point where you start/end drawing like that may take more cycles than it saves or more table space than is practical.

Another possibility is the same math I wanted to try for nearly horizontal lines so you can skip dealing with x until you know it will change. But I still haven't figured out the math (mostly because I'm procrastinating) so that's not going to help either.

Unrolling would be about the only option (AFAIK) for further optimizing. But I thing Cheama would kill me then.

Have fun!

thrust26

thrust26

I think I simplified that too much to make sense. I'll explain both approaches a little.thrust26 wrote:What do you mean with 8 bit math? We are already doing 8 bit math.

In the section of code below, you have to check for a carry caused by X.

In the math approach you know how many times you step in Y before X changes and you simplify the inner loop at the cost of more in the setup and outer loop. This might work but I think it's yet another special case with only a minor speedup because the real time savings aren't in the line routines anymore.

The other approach lets you adjust Y by only updating the lower byte until you know the higher byte will change. You should be able to do this up to 6 times for certain lines. However, the complexity of finding where you can do that might negate any savings. Think of it as a vertical equivalent to the optimization I suggested for the nearly horizontal line. The difference is, you must add 40 to the low byte of the screen address each pass in the inner loop. At best, the code would be much larger due to the number of special cases. At worst it could be slower because of additional tests to find where you can do this.

I don't think it would be worth attempting this unless both ideas could be combined.

Code: Select all

```
loop
iny ; Step in y
__auto_adx
adc #00 ; +DX
bcc skip ; Time to step in x?
__auto_stepx
inx ; Step in x
__auto_dy
sbc #00 ; -DY
skip
; Set the new screen adress
sta save_a
lda _HiresAddrLow,y
sta tmp0+0
lda _HiresAddrHigh,y
sta tmp0+1
draw_pixel
; Draw the pixel
sty save_y
ldy _TableDiv6,x
lda _TableBit6Reverse,x
```

Ok, so someone else can mull over the math concept and possibly come up with a solution...

Here is the simple part of the math:

X1 - X2 = DX

Y1 - Y2 = DY

If DX > DY THEN

DX / DY = number of steps in Y per step in X

ELSE

DY / DX = number of steps in X per step in Y

However... that isn't an even number often so you have DIV and MOD results, plus a divide is slow.

So, somehow the steps must be adjusted for each inner loop based on the MOD. Perhaps the steps + a table lookup based on the MOD before each inner loop.

Then the inner loop just uses a counter based on that in the Y direction or sets that number of bits in the X direction. Or something along those lines.

Here is the simple part of the math:

X1 - X2 = DX

Y1 - Y2 = DY

If DX > DY THEN

DX / DY = number of steps in Y per step in X

ELSE

DY / DX = number of steps in X per step in Y

However... that isn't an even number often so you have DIV and MOD results, plus a divide is slow.

So, somehow the steps must be adjusted for each inner loop based on the MOD. Perhaps the steps + a table lookup based on the MOD before each inner loop.

Then the inner loop just uses a counter based on that in the Y direction or sets that number of bits in the X direction. Or something along those lines.

Last edited by JamesD on Mon Feb 08, 2010 7:42 pm, edited 1 time in total.

I took a look at that when I was messing with this stuff back in July but never got around to trying it. Short attention span.thrust26 wrote:There is another possible optimization, called "two-step" line drawing. It looks pretty interesting, but I haven't tried it yet.

Since it draws from both ends it should be similar to partially unrolling the loops and simpler to implement than the math idea. If you have a CPU with enough registers it could be much faster, but I don't see that being the case here. I think it would be eliminating up to half the loop testing and branching clock cycles but that's what? A guess would be less than 10 clock cycles * 1/2 line length per line? It should be faster though.

It would definitely be larger in size and you need more temporary variables, but then I think at this point you aren't going to be able to make any speed improvements without an increase in code size.

<edit>

BTW, I'm not sure how it would compare to the chunking optimization.

Since both ends can differ where in a chunk you start I'm not sure you'd do better on near horizontal lines.

Oricutrons timings should (in theory) be more accurate than euphoric. You shouldn't directly compare results from different emulators.JamesD wrote:Which emulator are you using? I'm not getting anywhere near that with Oricutron.

That said, i don't have a real oric to compare so if someone could try linebench on real hw, euphoric, and Oricutron and see which is nearest that would be a useful test.

Well, I'm getting 664 so I think the latest code isn't in the repository.Xeron wrote:Oricutrons timings should (in theory) be more accurate than euphoric. You shouldn't directly compare results from different emulators.JamesD wrote:Which emulator are you using? I'm not getting anywhere near that with Oricutron.

That said, i don't have a real oric to compare so if someone could try linebench on real hw, euphoric, and Oricutron and see which is nearest that would be a useful test.