## Line routine - Second attempt

Here you can ask questions or provide insights about how to use efficiently 6502 assembly code on the Oric.
JamesD
Flight Lieutenant
Posts: 354
Joined: Tue Nov 07, 2006 7:38 am
thrust26 wrote:
JamesD wrote:I think the only significant improvement will be in devising a way to calculate/write all pixels for a byte in one shot, possibly reducing the number of OR operations with screen memory.
Steve Judd has developed a quite brilliant routine (C=Hacking, ussue 10) for exactly this, which works without any tables in the inner loop.

Code: Select all

``````        initialize x, dx, etc.
xold = x
take a step in x: LSR X
have we hit the end of a column?  If so, then plot and check on y
is it time to take a step in y?
if not, take another step in x
if it is, then let a=x EOR xold
plot a into the buffer
let xold=x
keep on going until we're finished``````
It saves a lot of time for lines where dX > 2 * dY on the C64 the Atari 2600 (~33% saving there). Maybe it can somehow be adapted for the Oric?
Actually, I was thinking of something a little more radical.

If you have the starting X, you can load from the table with all bits set from that point in the byte on.

Then loop without the LSR until it's time to step in Y.

Use the new X in a 2nd table lookup with all the pixels set up to the last pixel and AND that with the first table lookup we used.

Where the two bitmasks cross you have set all the pixels you need to write to the screen for that byte.

Which is faster depends on up to 8 (or 6 with the Oric) LSRs vs 2 table lookups.

However, I was thinking there must be a way to skip the inner loop mathematically (hopefully simple math) to save clock cycles.
I need to sit down and play with it but I think it's possible.
That is what I was referring to.

So instead of the inner loop you would:
lookup1
math (table???)
AND lookup2
write to screen.
<edit>
Correction to my earlier edit.
IF I can come up with a way replace the loop with math/table logic I can move the column test to the bottom of that sequence which means it's not just eliminating LSR it's also cutting out that test/branch combo from the loop.

thrust26
Posts: 49
Joined: Wed Jan 27, 2010 7:34 pm
Location: Düsseldorf, Germany
JamesD wrote:Actually, I was thinking of something a little more radical.

If you have the starting X, you can load from the table with all bits set from that point in the byte on.

Then loop without the LSR until it's time to step in Y.
Like this?

Code: Select all

``````  MAC PLOT
IF XOR_MODE
eor     (buf_r),y   ; 5
ELSE
ora     (buf_r),y   ; 5
ENDIF
sta     (buf),y     ; 6 = 11
ENDM

MAC PLOTP_XC
dex                 ; 2
beq     Byte{1}     ; 2/3+50.38     12.5% taken
Cont{1}:
; average: 10.42 (4/55.38=87.5/12.5%)
ENDM

.loopXMajorC:
PLOTP_XC P_XC               ;10.42
bcc     .loopXMajorC        ; 2/3           >=50% taken
; average: 16.42
sbc     .dX                 ; 3
sta     .tmpSum             ; 3
lda     Pot2PCTbl-1,x       ; 4
eor     chunk               ; 3
PLOT                        ;11
lda     Pot2PCTbl-1,x       ; 4
sta     chunk               ; 3
lda     .tmpSum             ; 3
dey                         ; 2
bne     .loopXMajorCSub     ; 2/3=38/39(-3)``````
Have fun!
thrust26

JamesD
Flight Lieutenant
Posts: 354
Joined: Tue Nov 07, 2006 7:38 am
thrust26 wrote:
JamesD wrote:Actually, I was thinking of something a little more radical.

If you have the starting X, you can load from the table with all bits set from that point in the byte on.

Then loop without the LSR until it's time to step in Y.
Like this?
At first glance I don't see an AND operation in there so you are doing something different and at 3:30 a.m. I'm not up to figuring everything out. Ahem... some comments besides cycle times would have helped.

The idea is that you load a bitmask with all pixels set from the current bit that corresponds to x. (I'm going left to right for X in this example and bitmasks would be reversed to go right to left for X)
Something like this:

00111111
---^ starting bit

Then we find the ending x bit for the column and use an AND with a 2nd bitmask. The 2nd bitmask has all bits set in the opposite direction and where they cross is the bits for the current column.

11111100
-------^ ending bit

AND the two and you get:

00111100

That's all the bits set for this column and you just write it out with your PLOT macro.

CPU differences might make this less practical on the 6502 than 6803/6809.

JamesD
Flight Lieutenant
Posts: 354
Joined: Tue Nov 07, 2006 7:38 am
The code should read a little more like this but with actual code replacing a couple sentences.

Code: Select all

``````calculate first starting x here (this is done once)

.majorLoop
lda    table,x
sta    .tmp     ; .chunk?

loop to find last x in chunk here

lda    table2,x  ; you could use table+offset or something similar
and   .tmp     ; .chunk?
PLOT

update x and y here (note that x should be ready for next table load after this)

bne     .majorLoop
``````
<edit>
It's the inner loop I'd really like to do away with.
In theory there would be no best or worst case, it should always take the same time. But that depends on the required math.
Times might be slower than the hacking 64 version best case, but it should be significantly faster than worst case.
Last edited by JamesD on Fri Feb 05, 2010 2:45 pm, edited 1 time in total.

thrust26
Posts: 49
Joined: Wed Jan 27, 2010 7:34 pm
Location: Düsseldorf, Germany
JamesD wrote:Ahem... some comments besides cycle times would have helped.
That's how you code for the 2600. Cycle counting come first, second and third. And then maybe something else.
The idea is that you load a bitmask with all pixels set from the current bit that corresponds to x
The idea is done the same. In the setup code, you once load the remaining bits depending on the x-position. From there on you can use EOR to mask out the superfluous bits.

Examples:
chunk = 00xxxxxx (initial setup value)
Tbl, x = 0000xxxx
EOR => 00xx0000 (bits plotted)
chunk = 0000xxxx
...
Tbl, x = 000000xx
EOR => 0000xx00 (bits plotted)
...
Have fun!
thrust26

JamesD
Flight Lieutenant
Posts: 354
Joined: Tue Nov 07, 2006 7:38 am
Ok, I see what you changed. Then that should be it.
The question is, how do the cycle counts compare vs your/their previous version?
<edit>
Nice way of changing the logic to get rid of the 2nd table btw.

thrust26
Posts: 49
Joined: Wed Jan 27, 2010 7:34 pm
Location: Düsseldorf, Germany
JamesD wrote:Ok, I see what you changed. Then that should be it.
The question is, how do the cycle counts compare vs your/their previous version?
It saves ~3.8 cycles/pixel for a 4:1 line (which should be about the average line where it is used for).
Have fun!
thrust26

Dbug
Posts: 2881
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway
Contact:
You are welcome trying to optimize the existing line routine.

Just add a new "line_chunk.s", add a section in "main.c" to run it, and let's extend the test with your chunk code

thrust26
Posts: 49
Joined: Wed Jan 27, 2010 7:34 pm
Location: Düsseldorf, Germany
Dbug wrote:You are welcome trying to optimize the existing line routine.

Just add a new "line_chunk.s", add a section in "main.c" to run it, and let's extend the test with your chunk code
I have two problems:
1. no setup to test
2. not enough knowledge about the Oric hardware, especially the frame buffer used.

Maybe you can help me here.
Have fun!
thrust26

Dbug
Posts: 2881
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway
Contact:
thrust26 wrote:
Dbug wrote:You are welcome trying to optimize the existing line routine.

Just add a new "line_chunk.s", add a section in "main.c" to run it, and let's extend the test with your chunk code
I have two problems:
1. no setup to test
2. not enough knowledge about the Oric hardware, especially the frame buffer used.

Maybe you can help me here.
Sure, you can solve 1) by downloading and installing the OSDK (http://osdk.defence-force.org. It's just a matter of setting one environment variable, and most of the projects on the SVN repository should work.

You will also need to get the project from the SVN repository (
http://miniserve.defence-force.org/svn, the readme gives the global structure of the repository plus some links on where to get the programs to access it).

For the structure of the framebuffer, that's easy:
- 200 lines of 40 bytes each, starting in \$A000
- each byte contains 6 bits with the bitmap data, the two other bits control the display: top bit is for inverse video, bit 6 when set indicates the 6 lower bits are actual bitmap data, else they are a value that controls the display (change colors, etc...)

For the purpose of a line routine, just ignore all the attribute stuff, all you need is to be able to set the 6 lower bits to some values to draw the pixels, while keeping the bit 7 to 0, and the bit 6 to 1.

01000000 -> default "nothing drawn" value
01111111 -> "all pixels are drawn" value

01010101 |
01101010 +> dithering pattern

If for some reason the two top bits are not correctly set to 01, you're going to have screen corruption, but you can assume they are already correctly set in the framebuffer, assume that by default the screen is filled with the value 64 (all pixels black).

Have fun

thrust26
Posts: 49
Joined: Wed Jan 27, 2010 7:34 pm
Location: Düsseldorf, Germany
Dbug wrote:For the structure of the framebuffer, that's easy:
- 200 lines of 40 bytes each, starting in \$A000
- each byte contains 6 bits with the bitmap data, the two other bits control the display: top bit is for inverse video, bit 6 when set indicates the 6 lower bits are actual bitmap data, else they are a value that controls the display (change colors, etc...)
Seems pretty straightforward.

I think Y should be used for indexing in X direction.
Have fun!
thrust26

Dbug
Posts: 2881
Joined: Fri Jan 06, 2006 10:00 pm
Location: Oslo, Norway
Contact:
By the way, you can join us on IRC, chanel #oric on ircnet

JamesD
Flight Lieutenant
Posts: 354
Joined: Tue Nov 07, 2006 7:38 am
thrust26 wrote:
JamesD wrote:Ok, I see what you changed. Then that should be it.
The question is, how do the cycle counts compare vs your/their previous version?
It saves ~3.8 cycles/pixel for a 4:1 line (which should be about the average line where it is used for).
Glad to hear my idea wasn't totally crazy!
After looking at your logic change I like it even more!

JamesD
Flight Lieutenant
Posts: 354
Joined: Tue Nov 07, 2006 7:38 am
I just had a thought... unrolling the inner loop should make it faster.

The normal loop uses the branch taken over 50% of the time (3 clocks each) but an unrolled inner loop would use branch not taken over 50% of the time (2 clocks each) . And I'm thinking the last loop instruction is totally removed.

So... savings per byte would be:
1 pixel = +1 clock.
2 pixels = same clocks.
3 pixels = -1
4 pixels = -2
5 pixels = -3
6 pixels = -5 (on oric)

On an 8 pixel per byte machine it saves even more.
6 pixels = -4
7 pixels = -5
8 pixels = -7

That or I've had too much Pepsi tonight.

thrust26