Experimental very fast tape loading

NekoNoNiaow · Post by **NekoNoNiaow** » Thu Apr 05, 2018 5:32 am

Symoon wrote: ↑Sun Apr 01, 2018 8:34 am Is it a 3 cycles or 2 cycles advantage, since I'm exiting from the BCS?
I think I tried to, but it took more time, or the same time with more bytes. Here the PLP/CLC/PHP/RTI is 15 cycles; I couldn't find a shorter sequence...

Problem here is that:
- sadly, registers A and Y must remain unaffected (or be restored)
- I can't use many more bytes (I almost fill the page), but let's forget that for now
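For reference, the 15-cycle figure quoted above for PLP/CLC/PHP/RTI can be checked by summing the documented 6502 cycle counts of each instruction. A minimal Python sketch; the table and function names are only illustrative and are not taken from the loader source:

```python
# Documented NMOS 6502 cycle counts for the four implied-mode
# instructions in the sequence under discussion.
CYCLES = {"PLP": 4, "CLC": 2, "PHP": 3, "RTI": 6}

def sequence_cycles(mnemonics):
    """Sum the cycle counts of a straight-line instruction sequence."""
    return sum(CYCLES[m] for m in mnemonics)

print(sequence_cycles(["PLP", "CLC", "PHP", "RTI"]))
```

Any candidate replacement sequence can be compared the same way, once its instructions are added to the table.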

I worked on that a bit over the weekend and I have a solution which saves 6 cycles but trashes the X register. From what I understand, X needs to contain the timer counter value after returning from the interrupt, but this can probably be avoided depending on what the code after the waiting loop does.
This solution also puts a constraint on the stack: its depth when the waiting loop runs must stay constant throughout loading.

But if you are not doing any JSR or RTS between leaving the waiting loop and coming back to it, then trashing X is not necessary and you would save 4 more cycles (so 10 in total).

I will post it tomorrow, as I also have another idea I want to explore first.

Symoon wrote: ↑Sun Apr 01, 2018 8:34 am Oh BTW it's not just the interrupt that has to be faster, that would be too easy. It's the whole loop decoding a byte (about 100 cycles + interrupt time). I need to save about 6 cycles I think.

Like DBug, I would love to see it.

Symoon wrote: ↑Mon Apr 02, 2018 1:57 pm Tested and working on my "slow" Atmos!

Holy kitty, this is fast! Great job!

Symoon · Post by **Symoon** » Thu Apr 05, 2018 6:26 am

Thanks for your efforts!

I'm currently working on 3 things in parallel:
1- cleaning up and optimising the working code a bit (saving 4 or 5 bytes)
2- a "slow" version that could work on ROM 1.0 (would be around 25% slower I think)
3- coding a WAV generator that would let you choose between the "slow" 1.0/1.1 version and the "fast" 1.1-only version, including the loader at F16 speed, etc.

I'll try to finish 1 first and post it this weekend!

Symoon · Post by **Symoon** » Fri Apr 06, 2018 1:13 pm

Here's the source code (well, in my usual unfriendly format) for the Oric part.
I cleaned and translated it quickly, sorry if some parts are awkward!

Attachment: Novalight_v1.1a_source_ENGLISH.txt.zip (4.41 KiB)

I think it could be optimized a bit more if it didn't check for the end of the program after each byte, but I'm not sure it would be enough to allow a faster speed; I need to check that. The last idea I have to go faster would be to save 23µs in total (one 44 kHz sample) on every main byte loop, so the start bytes could be 5, 6, 7 or 8 samples long instead of 6, 7, 8 or 9 now.
(EDIT: might be worth testing. Not checking for the end of the program, which would instead be indicated by a 9-sample sinusoid, also saves the need for STA 2F... And moving the LDY#$00 into the repeat loop (before RTS) saves 2 more µs. Total saved: 19µs...)
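To put numbers on the sample budget discussed above, here is a rough Python sketch. It assumes "44 kHz" means the standard 44.1 kHz WAV rate and that the CPU runs at 1 MHz, so one microsecond is one cycle; both constants and the function name are assumptions for illustration, not values taken from the Novalight source.

```python
SAMPLE_RATE_HZ = 44_100  # assumption: "44 kHz" is the usual 44.1 kHz WAV rate
# Assumption: 1 MHz 6502, so 1 microsecond corresponds to 1 CPU cycle.

def samples_to_us(n_samples):
    """Duration of n_samples audio samples, in microseconds."""
    return n_samples * 1_000_000 / SAMPLE_RATE_HZ

# Start-byte lengths: current scheme (6..9 samples) vs the hoped-for one (5..8).
for n in range(5, 10):
    print(f"{n} samples = {samples_to_us(n):.1f} us")

# Gap between one full sample and the 19 us saved so far.
print(f"still missing: {samples_to_us(1) - 19:.1f} us")
```

Under these assumptions, dropping one sample per byte means finding roughly 23 cycles in the byte loop, which is why a saving of 19µs is not quite enough on its own.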

BTW it's intended to be located in page 1 (stack) so it can't use many more bytes...

Post by **Dbug** » Fri Apr 06, 2018 7:48 pm

Adding labels would help: Having to locate where BNE -7 and BCC -27 go... is not super optimal.

Symoon · Post by **Symoon** » Fri Apr 06, 2018 8:17 pm

Aaaaah, but it helps me calculate the hex values, and know which jumps are relative or not and need to be changed when the program size changes... (yes, I'm doing everything using notepad.exe)
Give me some time to switch to better shared habits. Hey, I already tried to use $ and # correctly this time!
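As an aside on the hex calculation mentioned above: a 6502 relative branch stores its signed offset as a single two's-complement byte. A small Python sketch of that conversion follows; the function name is made up for illustration, and it assumes the offsets in the listing (such as BNE -7 and BCC -27) are already counted from the instruction following the branch, which the thread does not confirm.

```python
def branch_operand(offset):
    """Encode a signed relative-branch offset (-128..127) as its one-byte operand."""
    if not -128 <= offset <= 127:
        raise ValueError("branch offset out of range for a relative branch")
    return offset & 0xFF  # two's-complement wrap into one byte

for name, off in (("BNE", -7), ("BCC", -27)):
    print(f"{name} {off:+d} -> operand ${branch_operand(off):02X}")
```

With labels, an assembler does this conversion automatically, which is the main practical benefit Dbug is pointing at.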

Symoon · Post by **Symoon** » Sat Apr 07, 2018 7:54 am

Symoon wrote: ↑Fri Apr 06, 2018 1:13 pm I think it could be optimized a bit more if it didn't check for the end of the program after each byte, but I'm not sure it would be enough to allow a faster speed; I need to check that. The last idea I have to go faster would be to save 23µs in total (one 44 kHz sample) on every main byte loop, so the start bytes could be 5, 6, 7 or 8 samples long instead of 6, 7, 8 or 9 now.
(EDIT: might be worth testing. Not checking for the end of the program, which would instead be indicated by a 9-sample sinusoid, also saves the need for STA 2F... And moving the LDY#$00 into the repeat loop (before RTS) saves 2 more µs. Total saved: 19µs...)

OK, it's not working: the byte-reading loop is too slow.
It would have saved about 5% of the loading time (for instance, Zorgon would have loaded in 13.7 seconds instead of 14.5).
I can't see where to save time now, except by finding a way to get rid of the JSR used to read a byte when loading the program (in 043D). But the byte-reading routine is also called twice before that, to start and to load the header...
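The "about 5%" estimate can be checked from the two Zorgon timings quoted above. A tiny Python sketch, with a helper name chosen only for illustration:

```python
def percent_saved(before_s, after_s):
    """Relative reduction in loading time, as a percentage of the original."""
    return 100.0 * (before_s - after_s) / before_s

# Zorgon: 14.5 s with the current loop, 13.7 s if one sample per byte were saved.
print(f"{percent_saved(14.5, 13.7):.1f}% faster")
```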

I guess I'd rather spend time working on the signal generator tool now!

Symoon · Post by **Symoon** » Sat Apr 07, 2018 3:24 pm

Symoon wrote: ↑Sat Apr 07, 2018 7:54 am Ok, not working - reading byte loop is too slow.

Actually, it was bad programming on my part (assuming C=0 when that isn't actually guaranteed has consequences... as does exiting a program while inside a JSR). There is still hope, but it requires going back to work!

Symoon · Post by **Symoon** » Sat Apr 07, 2018 6:25 pm

Victory!

I'm saving a sample per byte.

Saving two almost seems to work, but a sinusoid is missed after a few hundred bytes, both on Euphoric and on real machines, so I'll forget that (until an idea pops up to save more cycles).

Another mystery: I noticed that with my "slow" Atmos, and only with it, I sometimes have to reboot the PC that plays the WAV file, otherwise there are errors with Novalight. But if I don't reboot and switch to another Atmos, everything is fine.
That puzzles me! And I wonder whether it interfered with the validity of some of my previous tests.

Chema · Post by **Chema** » Sat Apr 07, 2018 7:39 pm

Just wanted to pop in to express how impressed I am with your work!

When I saw the video I could hardly believe it was loading Xenon1 so quickly... Unbelievable! And now even faster.

Symoon · Post by **Symoon** » Sat Apr 07, 2018 7:47 pm

Thanks! Well, it's just me having fun, pushing Fabrice's wonderful tools to the limits.

You probably won't notice any change with Xenon: it goes (without loader) from 11.38 seconds to 10.80. That's about half a second faster, but with loading times this short, making significant progress becomes complicated.

iss · Post by **iss** » Sat Apr 07, 2018 7:58 pm

Congrats for victory, Symoon!
Your source looks super. I tried to test it with some quickly written encoding code based on the explanations in your file, but failed. Anyway, my tests are not relevant and only for fun; I'll wait until you release the encoding part.

Symoon · Post by **Symoon** » Sat Apr 07, 2018 9:24 pm

Thanks guys.
I'll have to put the loader in page 1 now, test it again, and write a decent WAV generator. With my rotten C memories, don't hold your breath.

I think by default for multipart programs, the loader will be loaded again for each part (holding the name of the following high-speed program, so this remains compatible with CLOAD orders in programs). I will have to check how Fabrice and Chema did this part.
Maybe a "no loader" option, and maybe another option for 1.0 ROMs, but I'll put this last one aside until I have first finished a complete 1.1 ROM version.

NekoNoNiaow · Post by **NekoNoNiaow** » Sun Apr 08, 2018 5:01 am

I am a bit late but here are my attempts at accelerating the code that Symoon posted.
(This is the code posted on the forum, not the one from the archive, which I have not read (yet).)

Not all of these attempts bring acceleration (but many do) and some have drawbacks.
I think it is worth looking at all of them, they might give ideas to other coders.

The last one is I think the best if it can be adapted with your system.
It saves 14 cycles total and costs a few bytes. \(ˆˆ)/

There are variants indicated by "Note1" and "Note2" mentions. You can skip them once you get the gist of what they mean.

Please tell me if you find any errors.
Now, go grab a coffee because that is a long read.

Hope this helps!

Original code for reference