Ok, so what I suggest is that you add comments on the various part of the code to indicate what they do and in which context they are used because right now it's pretty much impossible to follow what the code does since it's just macros calling macros and there are no comments other than the equivalent C code.
Regarding
The DDA is only used for very close walls that are taller than the screen view. In this case, the texture is oversampled.
Well, as it happens that's the case where the rendering is the slowest, so that would at least help on this specific case.
The relevant pointer is selected in the PREPARE_UNROLL_SAMPLE which is called before unrolling a column.
Ok, the fact PREPARE_UNROLL_SAMPLE calls DDA_STEP kind of made me believe it used DDA, if it does not, then maybe rename the macro, or alias it.
If you want people being able to help, the code has to be written so other people can understand it.
For example, PREPARE_UNROLL_SAMPLE does use "_ddaNbStep", which does not help.
Ok, so let's dig in the code (the unroll part, and tell me if I got it all right), I'm going to try to understand how all the macro stuff work together:
The core of the unroll is that simple snippet that uses three macros, with the UNROLL_SAMPLE taking two other macros that gets expanded in "prim_frst and prim_scnd):
Code: Select all
goUnroll
PREPARE_UNROLL_SAMPLE
DDA_STEP
UNROLL_SAMPLE(COLOR_LEFT_TEXEL_FIRST,COLOR_LEFT_TEXEL_SECOND)
PREPARE_UNROLL_SAMPLE and DDA_STEP are relatively simple and short:
Code: Select all
#define PREPARE_UNROLL_SAMPLE
lda _ddaNbStep: sta _ddaNbIter
tay
lda _adrTabIdxRd_low,Y
sta _ptrOffsetIndex
lda _adrTabIdxRd_high,Y
sta _ptrOffsetIndex+1
lda #0 : sta _nxtOffsetIndex
So ok, the "_ddaNbStep" and "_ddaNbIter" are used to access the _adrTabIdxRd_low/high to patch the _ptrOffsetIndex, which itself is read from DDA_STEP using the _nxtOffsetIndex.
Code: Select all
#define DDA_STEP
ldy _nxtOffsetIndex
lda (_ptrOffsetIndex), Y
sta _ddaCurrentValue
iny
sty _nxtOffsetIndex
Before proceeding to UNROLL_SAMPLE, let's check the two macros COLOR_LEFT_TEXEL_FIRST and COLOR_LEFT_TEXEL_SECOND
Code: Select all
; x contains _renCurrentColor
#define COLOR_LEFT_TEXEL_FIRST
lda _tabLeftRed,x
ldy #0
sta (_theAdr),y
lda _tabLeftGreen,x
ldy #40
sta (_theAdr),y
lda _tabLeftBlue,x
ldy #80
sta (_theAdr),y
bcc skip: inc _theAdr+1: skip
So what that does, is that X contains the texture pixel color, and using the three tables the texel is converted into values that can be displayed over three scanlines to generate a RGB triplet.
(There's an optimisation possibility there, you don't actually need the bcc skip/inc _theAdr+1 in the first macro.)
Code: Select all
; x contains _renCurrentColor
#define COLOR_LEFT_TEXEL_SECOND
lda _tabLeftRed,x
ldy #120
sta (_theAdr),y
lda _tabLeftGreen,x
ldy #160
sta (_theAdr),y
lda _tabLeftBlue,x
ldy #200
sta (_theAdr),y
clc
lda _theAdr
adc #240
sta _theAdr
bcc skip: inc _theAdr+1: skip
That one does the same thing for a second texel, the reason of having two dedicated code is that each scanline is 40 bytes long, 40*3=120, 120*2=240, so you can treat two pixels using the same Y indexing offset, after that you ned to change the pointer address.
Using my suggestion of the intermediate column buffer, you can just remove all the ldy #immediate as well as the indirect addressing, so instead of:
you just get
and you don't need a special case to treat them by blocks of two pixels, just iterate over the entire texture naturally, makes the code much easier.
Which brings us to the last big part, UNROLL_SAMPLE.
That one is massive, and get even larger with the expansion of the two COLOR_LEFT_TEXEL macros, so I'm splitting it in two parts.
Code: Select all
#define UNROLL_SAMPLE(prim_frst,prim_scnd)
lda #TEXTURE_SIZE: sta _ddaCurrentError
loop_000
lda _idxScreenLine
cmp #VIEWPORT_START_LINE
bpl end_loop_000
DDA_STEP
dec _ddaNbIter
inc _idxScreenLine
jmp loop_000
end_loop_000
The very first line is a remnant of your old code and seem useless, so you can remove the
lda #TEXTURE_SIZE: sta _ddaCurrentError.
Then I'm wondering why you do need to iterate at all: Can't you just precompute in a table all the possible parameters you need? With the resolution we have (which technically since we use three scanlines per pixels is no more than 200/3=66 pixels tall), that's not many possible combinations at all.
Also, since this loop contains DDA_STEP, it also means that for each iteration loop you recompute the _ddaCurrentValue... but then again I removed the DDA_STEP and it did not seem to change anything at all... maybe because DDA_STEP is also called in the main loop in the rest of the code after?
Code: Select all
ldy _idxScreenLine:lda _multi120_low,Y:clc:adc _baseAdr:sta _theAdr
lda _multi120_high,Y:adc _baseAdr+1:sta _theAdr+1
loop_001
DDA_STEP
ldy _ddaCurrentValue : lda (_ptrReadTexture),Y
tax
prim_frst
inc _idxScreenLine
dec _ddaNbIter
beq endloop_001
DDA_STEP
ldy _ddaCurrentValue : lda (_ptrReadTexture),Y
tax
prim_scnd
inc _idxScreenLine
lda _idxScreenLine
cmp #VIEWPORT_HEIGHT + VIEWPORT_START_LINE
bcs endloop_001
dec _ddaNbIter
beq endloop_001
jmp loop_001
endloop_001
So before the main loop we start by a computation of the base address using the multi120 table, the only two places where it's used have the exact same code, so a possible optimisation would be to have the table be a table of offsets instead of multiplication by 120, so you could skip the addition, transforming this:
Code: Select all
ldy _idxScreenLine
lda _multi120_low,Y:clc:adc _baseAdr:sta _theAdr
lda _multi120_high,Y:adc _baseAdr+1:sta _theAdr+1
into this:
Code: Select all
ldy _idxScreenLine
lda _multi120_low,Y:sta _theAdr
lda _multi120_high,Y:sta _theAdr+1
That's not a huge optimisation, but that can be helpful since now the carry is not modified, and technically the lda/sta could become ldx/stx which gives you some flexibility to use the table access in other locations.
When the code is expended, what follows transforms this:
Code: Select all
DDA_STEP
ldy _ddaCurrentValue : lda (_ptrReadTexture),Y
tax
into this:
Code: Select all
ldy _nxtOffsetIndex
lda (_ptrOffsetIndex), Y
sta _ddaCurrentValue
iny
sty _nxtOffsetIndex
ldy _ddaCurrentValue : lda (_ptrReadTexture),Y
tax
Because you are using many macros, it's hard to see the possible optimisations, but basically the
_ddaCurrentValue and
_nxtOffsetIndex variables are probably not necessary, you could just do that instead:
Code: Select all
ldy _nxtOffsetIndex
lda (_ptrOffsetIndex), Y
tay
lda (_ptrReadTexture),Y
inc _nxtOffsetIndex
tax
Not a huge optimisation, but that's 22 cycles instead of 24.
Further optimisations would probably require changing some table structures to get them page aligned.
The cool thing about this type of thing, is that it makes the code just short enough that you can avoid having to use the reverse branch + jmp idiom, so you can replace at the end this:
by this:
Not much, but that's done for every two pixels, so quite often in the end.
The final optimization I could see (without changing the code structure) would be to not have to test both the _idxScreenLine and _ddaNbIter to know if you are done, the number of iteration is known in advance, and one check should be enough, so you can just remove these sections of code:
and
Code: Select all
inc _idxScreenLine
lda _idxScreenLine
cmp #VIEWPORT_HEIGHT + VIEWPORT_START_LINE
bcs endloop_001
As far as I can see, the code still runs correctly.