I am a bit late but here are my attempts at accelerating the code that Symoon posted.
(This is the code posted on the forum, not the one from the archive, which I have not read yet.)
Not all of these attempts bring acceleration (but many do) and some have drawbacks.
I think it is worth looking at all of them, since they might give ideas to other coders.
The last one is, I think, the best, if it can be adapted to your system.
It saves 14 cycles total and costs a few bytes. \(ˆˆ)/
There are variants indicated by "Note1" and "Note2" mentions. You can skip them once you get the gist of what they mean.
Please tell me if you find any errors.
Now, go grab a coffee because that is a long read.
Hope this helps!
Original code for reference
(...)
0460 2 38 SEC
0461 2/3 B0 FE BCS -2 infinite loop (waiting for interrupt)
(...)
Interrupt code:
04D0 4 AE 00 03 LDX 0300 Reset flag on CB1
04D3 4 AE 08 03 LDX 0308 read timer (sinusoid duration) in X
04D6 4 8E 09 03 STX 0309 Reset timer counter (writing in #309 sets #308 with #F5 once instruction executed)
04D9 4 28 PLP Get the system flags saved by the interrupt
04DA 2 18 CLC Set C to 0 to leave the loop
04DB 3 08 PHP Save the system flags
04DC 6 40 RTI Back to the loop
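To make the baseline explicit, here is a small Python tally of the listing above, using the cycle column exactly as posted. It is only a sketch: the names `ORIGINAL_IRQ`, `LOOP_CYCLES` and `RETURN_BRANCH_CYCLES` are mine, and the 4 and 3 are the loop and branch figures used in the totals later in this post.

```python
# Cycle counts copied from the cycle column of the original listing.
ORIGINAL_IRQ = [
    ("LDX $0300", 4),  # reset flag on CB1
    ("LDX $0308", 4),  # read timer into X
    ("STX $0309", 4),  # reset timer counter
    ("PLP", 4),
    ("CLC", 2),
    ("PHP", 3),
    ("RTI", 6),
]

LOOP_CYCLES = 4           # one pass of the SEC/BCS waiting loop, as counted in the post
RETURN_BRANCH_CYCLES = 3  # branch test after RTI, as counted in the post

irq_cycles = sum(cycles for _, cycles in ORIGINAL_IRQ)
baseline = LOOP_CYCLES + irq_cycles + RETURN_BRANCH_CYCLES

print(irq_cycles)  # interrupt routine alone
print(baseline)    # reference figure that every variant is compared against
```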
Pre-calculated interrupt stack
; Setup code (runs only once, at the start of the loading routine).
; Prepares an interrupt return stack at the bottom of the stack = $100 to $102.
; It must never be overwritten during the tape loading routine.
xx00 2 A9 00 LDA #0 ; store the desired RTI flag value (=0)
xx02 4 8D 00 01 STA $0100
xx05 2 A9 6A LDA #$6A ; store the desired RTI return address (046A LSB)
xx07 4 8D 01 01 STA $0101
xx0A 2 A9 04 LDA #$04 ; store the desired RTI return address (046A MSB)
xx0C 4 8D 02 01 STA $0102
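To check what RTI will actually read from this prepared frame, here is a minimal Python model of the stack page. It assumes only the standard 6502 pull behaviour (S is incremented first, then $0100+S is read); `pull` and `rti` are hypothetical helper names, not part of any real tool.

```python
# Minimal model of the 6502 stack page ($0100-$01FF) and of RTI.
memory = bytearray(0x10000)

# Frame prepared by the setup code: flags, then return address (low byte first).
memory[0x0100] = 0x00  # status register value to restore
memory[0x0101] = 0x6A  # return address, low byte
memory[0x0102] = 0x04  # return address, high byte


def pull(s):
    """Pull one byte: the 6502 increments S first, then reads $0100+S."""
    s = (s + 1) & 0xFF
    return memory[0x0100 + s], s


def rti(s):
    """RTI pulls the flags, then the return address (low byte, high byte)."""
    flags, s = pull(s)
    lo, s = pull(s)
    hi, s = pull(s)
    return flags, (hi << 8) | lo, s


# S = $FF on entry makes the first pull wrap around and read $0100.
flags, pc, s_after = rti(0xFF)
print(hex(flags), hex(pc), hex(s_after))
```

Because of the pre-increment, a frame stored at $100 to $102 is consumed when S holds $FF just before the RTI.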
(...)
(...) ; Interrupt waiting loop : 16 cycles (previous = 4/5)
0461 2 BA TSX ; Save stack pointer into the operand of the second LDX below |
0462 4 8E 6B 04 STX $046B ; |-> cf. Note1
0465 2 A2 FF LDX #$FF ; Prepare the future RTI stack pointer (RTI increments S before reading, so $FF pulls from $100 to $102)
0467 2 38 SEC ; Make sure the carry is set (would be nice if the 6502 had a branch-always instruction)
0468 2 B0 FE BCS -2 ; infinite loop, waiting for interrupt; counted as 2c here, although a taken branch costs 3c
046A 2 A2 xx LDX #xx ; Reload the stack pointer |
046C 2 9A TXS ; |-> cf. Note2
(...)
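A quick way to check the address column and the target of the self-modifying STX is to derive both from the instruction lengths. The sketch below is mine (the `layout` helper is hypothetical); it assumes only that the LDX #xx sits at $046A, the return address that the setup code hands to RTI.

```python
# Instruction lengths in bytes for the waiting loop, in program order.
WAITING_LOOP = [
    ("TSX", 1),
    ("STX abs", 3),
    ("LDX #imm", 2),
    ("SEC", 1),
    ("BCS rel", 2),
    ("LDX #xx", 2),  # its operand byte is the one patched by STX
    ("TXS", 1),
]

RETURN_ADDRESS = 0x046A  # where RTI resumes: the LDX #xx instruction


def layout(instructions, anchor_name, anchor_address):
    """Return {name: address} so that anchor_name sits at anchor_address."""
    offsets, offset = {}, 0
    for name, length in instructions:
        offsets[name] = offset
        offset += length
    base = anchor_address - offsets[anchor_name]
    return {name: base + off for name, off in offsets.items()}


addresses = layout(WAITING_LOOP, "LDX #xx", RETURN_ADDRESS)
patch_target = addresses["LDX #xx"] + 1  # the operand follows the opcode byte

for name, address in addresses.items():
    print(f"{address:04X}  {name}")
print(f"STX target: ${patch_target:04X}")
```

With these lengths, the TSX lands one byte before the STX, and the byte to patch is the one right after the LDX opcode.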
(...) ; Interrupt routine : 20 cycles (previous = 27)
04D0 2 9A TXS ; Point the stack to the prepared "return of interrupt" stack
04D1 4 AE 00 03 LDX 0300 ; Reset flag on CB1
04D4 4 AE 08 03 LDX 0308 ; read timer (sinusoid duration) in X
04D7 4 8E 09 03 STX 0309 ; Reset timer counter (writing in #309 sets #308 with #F5 once instruction executed)
04DA 6 40 RTI ; Back to the loop
; Total cycles:
; waiting loop (one iteration)
; + IRQ routine
; + failed branch test when coming back from interrupt
; original code = 4 + 27 + 3 = 34
; alternative = 16 + 20 = 36
Conclusion: worse than the original, and this trashes X so the value of the timer is lost.
Note1:
If the stack pointer value is guaranteed to be constant when execution reaches the waiting loop,
then it can be computed once during the setup phase and stored directly in
the instruction LDX #xx.
In that case, the waiting loop does not need to store it and becomes shorter:
the TSX/STX pair moves out of the loop and into the setup phase.
This would bring the number of cycles to:
; original code = 4 + 27 + 3 = 34
; alternative = 10 + 20 = 30 (4 cycles gain)
Conclusion: 4 cycles faster than the original, but we still trash X to reload the stack pointer so we lose the value of the timer.
Note2:
If no JSR instructions are used after coming back from the interrupt and
before branching back to the waiting loop,
then there is no need to restore the stack pointer, and the LDX + TXS pair at $046A can be moved
to after the loop is done.
Moreover, this prevents the trashing of X/timer-value!
This would bring the number of cycles to:
; original code = 4 + 27 + 3 = 34
; alternative = 12 + 20 = 32 (2 cycles gain)
Conclusion: 2 cycles faster than the original and works without trashing X,
but does not work if a JSR is done before the next iteration of the loop
because the stack is at $100 and that would be messy.
If Note1 and Note2 apply together.
This would bring the number of cycles to:
; original code = 4 + 27 + 3 = 34
; alternative = 6 + 20 = 26 (8 cycles gain)
Conclusion: 8 cycles faster than the original and works without trashing X
but does not work if a JSR is done before the next iteration of the loop (stack at $100, messy).
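The four totals for this first variant can be reproduced with a short Python sketch. It uses the post's own per-instruction counts; the dictionary keys and the `total` function are my own naming.

```python
BASELINE = 4 + 27 + 3  # original code, as counted in the post

# Waiting loop of the prepared-stack variant, with the post's cycle counts.
LOOP = {
    "TSX": 2,
    "STX": 4,
    "LDX #imm": 2,
    "SEC": 2,
    "BCS": 2,
    "LDX #xx": 2,
    "TXS": 2,
}
IRQ_CYCLES = 2 + 4 + 4 + 4 + 6  # TXS, LDX, LDX, STX, RTI

NOTE1_REMOVES = ("TSX", "STX")      # stack pointer saved once, during setup
NOTE2_REMOVES = ("LDX #xx", "TXS")  # stack pointer restored after the loop


def total(note1=False, note2=False):
    """Loop + interrupt cycles with the chosen notes applied."""
    removed = set()
    if note1:
        removed.update(NOTE1_REMOVES)
    if note2:
        removed.update(NOTE2_REMOVES)
    loop = sum(c for name, c in LOOP.items() if name not in removed)
    return loop + IRQ_CYCLES


results = {(n1, n2): total(n1, n2) for n1 in (False, True) for n2 in (False, True)}
for (n1, n2), cycles in results.items():
    print(f"Note1={n1!s:5} Note2={n2!s:5} total={cycles} gain={BASELINE - cycles}")
```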
JMP variant
It still precomputes an interrupt stack at $100 to $102 and stores it during setup.
But now it never uses RTI; instead it uses JMP to continue after the waiting loop.
- this allows removing the TXS in the interrupt routine
- but it requires pulling the flags (PLP) to clear the I flag before jumping back
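The need for the PLP can be seen by following the status register through an interrupt. This is a simplified Python sketch, assuming the usual 6502 behaviour (the flags are pushed on entry, then I is set); it ignores the return address and the B flag, and the function names are mine.

```python
C_FLAG = 0x01  # carry
I_FLAG = 0x04  # interrupt disable


def take_irq(p, stack):
    """CPU side of an IRQ: push the current flags, then set I."""
    stack.append(p)  # return address omitted, only the flags matter here
    return p | I_FLAG


def plp(stack):
    """PLP: pull the flags that were saved on interrupt entry."""
    return stack.pop()


stack = []
p = C_FLAG                      # in the waiting loop: carry set, interrupts enabled
p = take_irq(p, stack)
masked_in_handler = bool(p & I_FLAG)

p = plp(stack)                  # without this, a plain JMP would leave I set
masked_after_plp = bool(p & I_FLAG)

print(masked_in_handler, masked_after_plp)
```

RTI would restore the flags as a side effect; since the JMP does not, the PLP has to do it, otherwise the next interrupt would stay masked.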
(...) setup phase is the same as previous attempt
(...) ; Interrupt waiting loop : 16 cycles (previous = 4/5)
0461 2 BA TSX ; Save stack pointer into the operand of the second LDX below |
0462 4 8E 6B 04 STX $046B ; |-> cf. Note1
0465 2 A2 FF LDX #$FF ; Prepare the future RTI stack pointer
0467 2 38 SEC
0468 2 B0 FE BCS -2 ; infinite loop, waiting for interrupt; counted as 2c here, although a taken branch costs 3c
046A 2 A2 xx LDX #xx ; Restore the stack pointer |
046C 2 9A TXS ; |-> cf. Note2
(...)
(...) ; Interrupt routine : 19 cycles (previous = 27)
04D0 4 AE 00 03 LDX 0300 ; Reset flag on CB1
04D3 4 AE 08 03 LDX 0308 ; read timer (sinusoid duration) in X
04D6 4 8E 09 03 STX 0309 ; Reset timer counter (writing in #309 sets #308 with #F5 once instruction executed)
04D9 4 28 PLP ; Restore the pre-interrupt system flags; this clears the interrupt disable flag
04DA 3 4C 6A 04 JMP $046A ; Jump right after the waiting loop
; Total cycles:
; waiting loop (one iteration)
; + IRQ routine
; + failed branch test when coming back from interrupt
; original code = 4 + 27 + 3 = 34
; this code = 16 + 19 = 35 (-1 cycles gain)
Conclusion: slower than the original, oops.
Note1 applies as well (save stack in setup).
This would bring the number of cycles to:
; original code = 4 + 27 + 3 = 34
; this code = 10 + 19 = 29 (5 cycles gain)
Conclusion: 5 cycles faster than the original, but we are still trashing X/timer-value.
Note2 applies as well (avoid restoring the stack in the loop).
This would bring the number of cycles to:
; original code = 4 + 27 + 3 = 34
; this code = 12 + 19 = 31 (3 cycles gain)
Conclusion: 3 cycles faster than the original
but does not work if a JSR is done before the next iteration of the loop
because the stack is at $100 and that would be messy.
If Note1 and Note2 are applied together.
This would bring the number of cycles to:
; original code = 4 + 27 + 3 = 34
; alternative = 6 + 19 = 25 (9 cycles gain)
Conclusion: 9 cycles faster than the original and works without trashing X
but does not work if a JSR is done before the next iteration of the loop (stack at $100, messy).
Merge loop and interrupt routine variant
The idea here is to:
- incorporate the interrupt routine directly in the waiting loop (oO)
- so after the interrupt is handled, there is no need to come back from the routine:
just pop the flags, restore the stack pointer and just continue without returning/branching
This has many advantages:
- there is no need for a dedicated interrupt stack anymore
- we can now use JSR since we are still on the regular stack
But we cannot use RTS to return from the current routine, since the address of the BCS is still on the stack.
(This is solved in the next variant.)
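The stack bookkeeping behind this can be sketched in a few lines of Python. Only the stack pointer is modelled, and the starting value `0xF0` is a made-up example, not a value taken from the loader.

```python
def irq_entry(s):
    """The CPU pushes PCH, PCL and the flags when it takes the interrupt."""
    return (s - 3) & 0xFF


def plp(s):
    """PLP pulls the flags back, leaving the return address on the stack."""
    return (s + 1) & 0xFF


saved_s = 0xF0           # hypothetical value captured by TSX
s = irq_entry(saved_s)   # interrupt taken while spinning on the BCS
s = plp(s)               # the merged routine restores the flags
leftover = (saved_s - s) & 0xFF
print(leftover)          # bytes still on the stack: the address of the BCS

s = saved_s              # LDX #xx / TXS puts the pointer back
print(hex(s))
```

The leftover bytes are why an RTS at this point would return to the BCS instead of the caller, and why reloading S with the saved value is enough to discard them.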
(...) ; Interrupt waiting loop : 10 cycles (previous = 4/5)
0461 2 BA TSX ; Save stack pointer into the operand of the LDX below |
0462 4 8E 73 04 STX $0473 ; |-> cf. Note1
0465 2 38 SEC
0466 2 B0 FE BCS -2 ; infinite loop, waiting for interrupt; counted as 2c here, although a taken branch costs 3c
; Interrupt routine : 20 cycles (previous = 27)
0468 4 AE 00 03 LDX 0300 ; Reset flag on CB1
046B 4 AE 08 03 LDX 0308 ; read timer (sinusoid duration) in X
046E 4 8E 09 03 STX 0309 ; Reset timer counter (writing in #309 sets #308 with #F5 once instruction executed)
0471 4 28 PLP ; Restore the pre-interrupt system flags; this clears the interrupt disable flag
0472 2 A2 xx LDX #xx ; Reload the stack pointer |
0474 2 9A TXS ; |-> cf. Note2
; here the code continues normally as if it was just after the waiting loop BCS
; ...
; Total cycles:
; original code = 4 + 27 + 3 = 34
; this code = 10 + 20 = 30 (4 cycles gain)
Conclusion: 4 cycles faster than the original, but we still trash the X/timer value to restore the stack pointer.
Note1 applies as well (save stack in setup).
This would bring the number of cycles to:
; original code = 4 + 27 + 3 = 34
; this code = 4 + 20 = 24 (10 cycles gain)
Conclusion: 10 cycles faster than the original, but we are still trashing X/timer-value.
Note2 applies as well (avoid restoring the stack in the loop).
This would bring the number of cycles to:
; original code = 4 + 27 + 3 = 34
; this code = 10 + 16 = 26 (8 cycles gain)
Conclusion: 8 cycles faster than the original and works without trashing X
we can do JSR because we are still on the original stack
we cannot do an RTS though because the address of the BCS is still on the stack
If Note1 and Note2 are applied together.
This would bring the number of cycles to:
; original code = 4 + 27 + 3 = 34
; alternative = 4 + 16 = 20 (14 cycles gain)
Conclusion: 14 cycles faster (YOOHOO) than the original and works without trashing X
we can do JSR because we are still on the original stack
we cannot do an RTS though because the address of the BCS is still on the stack
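For a side-by-side view of the three variants so far, this Python sketch collects the loop and interrupt figures from the totals above and recomputes the gains against the 34-cycle baseline. The variant labels and the table layout are mine.

```python
BASELINE = 4 + 27 + 3  # original code, as counted in the post

# (loop cycles, interrupt cycles) as posted, and with Note1 + Note2 applied.
VARIANTS = {
    "prepared RTI stack": {"as posted": (16, 20), "both notes": (6, 20)},
    "JMP": {"as posted": (16, 19), "both notes": (6, 19)},
    "merged loop": {"as posted": (10, 20), "both notes": (4, 16)},
}

gains = {
    name: {case: BASELINE - sum(parts) for case, parts in cases.items()}
    for name, cases in VARIANTS.items()
}

for name, by_case in gains.items():
    print(f"{name:20} as posted: {by_case['as posted']:+d}  both notes: {by_case['both notes']:+d}")

best = max(gains, key=lambda name: gains[name]["both notes"])
print("best:", best)
```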
Final version (for now)
This illustrates how to make everything work together with the optimizations of both Note1 and Note2.
It is the same as above, but it also shows how to restore the stack effortlessly once the loop is done.
(...) ; calling code
0xxx JSR setup_waiting_loop
; then do other stuff
(...)
setup_waiting_loop: ; This runs only once
0461 2 BA TSX ; Save stack pointer into the LDX of the restoration code
0462 4 8E bb aa STX aabb ; aabb = see below
waiting_loop: ; Interrupt waiting loop : 4 cycles (previous = 4)
0465 2 38 SEC
0466 2 B0 FE BCS -2 ; infinite loop, waiting for interrupt; counted as 2c here, although a taken branch costs 3c
; Interrupt routine : 16 cycles (previous = 27)
0468 4 AE 00 03 LDX 0300 ; Reset flag on CB1
046B 4 AE 08 03 LDX 0308 ; read timer (sinusoid duration) in X
046E 4 8E 09 03 STX 0309 ; Reset timer counter (writing in #309 sets #308 with #F5 once instruction executed)
0471 4 28 PLP ; Restore the pre-interrupt system flags; this clears the interrupt disable flag
; The address of the BCS is still on the stack but we do not care.
0472 ..... ; here regular code which handles the result of the interrupt.
04xx JSR some_ROM_code ; no problem, this works
04yy ...... ; Do some other stuff here
...
; Exit condition of the loop: are we finished?
xxyy-2 2/3 F0 ?? BEQ waiting_loop ; Could be any branch test.
aabb-1 2 A2 xx LDX #xx ; Restore the stack pointer
aabb+1 2 9A TXS
aabb+2 6 60 RTS ; back to calling code
; Total cycles:
; original code = 4 + 27 + 3 = 34
; this code = 4 + 16 = 20 (14 cycles gain)
Conclusion: 14 cycles gain. No disadvantages that I can see.
I am not counting the branching back as a penalty because you probably already have one.
If the branching site of the loop is too far away from waiting_loop we lose a few cycles:
; Exit condition of the loop: are we finished?
xxyy-2 2/3 D0 03 BNE exit_loop ; Could be any branch test.
xxyy 3 4C 65 04 JMP waiting_loop ; Loop back
exit_loop:
aabb-1 2 A2 xx LDX #xx ; Restore the stack pointer
aabb+1 2 9A TXS
aabb+2 6 60 RTS ; back to calling code
Conclusion: 14-3-3 = 8 cycles gain. Still worth it.
Congratulations to everyone who managed to read everything.