NekoNoNiaow wrote: ↑Sat Mar 31, 2018 7:03 am Huhu.
Don't worry, I do the same. Just ask DBug about the length of my emails/posts.
I don't think I need to say anything, people can read what you wrote on the forum already.
Experimental very fast tape loading
Re: Experimental very fast tape loading
Ok, the problem is more complicated than I thought.
Actually, some Atmos run slower than others. Don't ask me why but I can clearly see it with the latest additions to the code:
- when decoding a byte that is in a dictionary, the Oric runs the longest loop in the code (something like 100 cycles)
- if four consecutive bytes have to be read using this loop, it runs fine on two of my Atmos, but fails on a third one
The symptom is that the 3rd Atmos misses a sinusoid. If I put some neutral (stop bit) delay between those bytes, the problem vanishes.
From there, I have several options:
- knowing that the loop will be even longer at the end of a filled RAM page (4 more cycles to go to the next page), I can try optimising the loop, but I don't think I can save enough cycles and I only have two spare bytes for code
- I could add an option for "old Atmos" that will generate a slower signal for those bytes (adding options is confusing, the user will not know if his Atmos is "old" or not, or if it fails because of an audio problem)
- I will have to slow down the speed to try being compatible with more Atmos
- old Atmos can die, it won't work with them !
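A rough way to picture the failure described above is a toy model in which the decode loop needs a fixed number of CPU cycles per byte while the tape delivers bytes at a fixed wall-clock rate. Only the "about 100 cycles" loop and the run of four bytes come from the post; the byte duration, the 1% clock deviation and the stop-bit length below are illustrative assumptions, not measurements.

```python
def lag_after_run(loop_cycles, byte_us, clock_hz, run_length, stop_bit_us=0.0):
    """Microseconds by which the decoder trails the tape after `run_length`
    consecutive slow-path bytes. Slack is never banked: lag floors at zero."""
    cycle_us = 1_000_000 / clock_hz      # duration of one CPU cycle
    work_us = loop_cycles * cycle_us     # time the decode loop needs per byte
    budget_us = byte_us + stop_bit_us    # time the tape allows per byte
    lag = 0.0
    for _ in range(run_length):
        lag = max(0.0, lag + work_us - budget_us)
    return lag

# Hypothetical figures: a signal sized exactly for a nominal 1 MHz machine.
nominal = lag_after_run(100, 100.0, 1_000_000, 4)
slow = lag_after_run(100, 100.0, 990_000, 4)            # assumed 1% slow clock
slow_padded = lag_after_run(100, 100.0, 990_000, 4, stop_bit_us=2.0)

print(nominal, round(slow, 2), slow_padded)
```

In this sketch the slow machine drifts a little further behind with every consecutive slow-path byte, and a small neutral delay between bytes absorbs the drift, which is at least consistent with the observation that the stop-bit padding makes the problem vanish.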
- NekoNoNiaow
- Flight Lieutenant
- Posts: 272
- Joined: Sun Jan 15, 2006 10:08 pm
- Location: Montreal, Canadia
Re: Experimental very fast tape loading
Interesting, any idea why this Atmos is slower?
I am asking because if this can be traced back to a particular hardware difference (like a different version of the video chip which steals more cycles from the CPU) maybe it can be detected at run time in order to change the parameters of the routine for this particular machine.
From what you say about the code, you may not have much to tweak but one never knows. Once you publish the code I am sure some kittens will find ways to optimize it even further.
Symoon wrote: ↑Sat Mar 31, 2018 11:37 am From there, I have several options:
- knowing that the loop will be even longer at the end of a filled RAM page (4 more cycles to go to the next page), I can try optimising the loop, but I don't think I can save enough cycles and I only have two spare bytes for code
- I could add an option for "old Atmos" that will generate a slower signal for those bytes (adding options is confusing, the user will not know if his Atmos is "old" or not, or if it fails because of an audio problem)
- I will have to slow down the speed to try being compatible with more Atmos
- old Atmos can die, it won't work with them !
I find that asking the question from the point of view of user experience is often worth it.
What would users prefer as an experience?
Very slightly (probably imperceptible) slower loading but full compatibility with their Atmos, "It just works".
Or full-speed loading that might randomly fail on their Atmos, leaving them to figure out whether it is because the tape is corrupted or because of their Atmos model, and then to test with the "slower speed" tape.
I am my own worst enemy.
Re: Experimental very fast tape loading
Considering the encoded data is at a constant speed, generated (I guess) assuming a standard 6502 running at 1 MHz (which in theory means 1 million clock cycles per second), and considering that the actual 1 MHz is just an approximation coming from a fast-vibrating quartz crystal, it happens that some machines do run faster than others; but since all the components in a machine use the same derived clock, we never see that as a problem.
That's my theory at least (and that's why in the thread about merging multiple oric outputs, the first step was to have them all use the same clock)
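To put a number on that theory: a timer driven by the CPU clock counts fewer ticks during the same wall-clock interval when the crystal runs slow, so a sinusoid of fixed duration "looks" shorter to a slow machine. The sketch below is only arithmetic; the 200 µs sinusoid and the 1% deviation are made-up values, and `timer_ticks` is a hypothetical helper.

```python
def timer_ticks(duration_us, clock_hz):
    """CPU clock ticks elapsed during a signal of fixed wall-clock duration."""
    return round(duration_us * clock_hz / 1_000_000)

def tick_drift(duration_us, clock_hz, nominal_hz=1_000_000):
    """Difference between what this machine counts and what a nominal one counts."""
    return timer_ticks(duration_us, clock_hz) - timer_ticks(duration_us, nominal_hz)

# Hypothetical 200 microsecond sinusoid, machine assumed to run 1% slow.
print(timer_ticks(200, 1_000_000), timer_ticks(200, 990_000), tick_drift(200, 990_000))
```

Inside one machine nothing notices the difference, because the CPU and the timer share the same clock; the mismatch only shows up against an external signal generated at a fixed rate.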
Re: Experimental very fast tape loading
Nope, but this Atmos was modified by the previous owner, so I guess it might affect something. I also noticed the keys sound is lower.
I don't dare open it since there are several switches (one to reboot, another to switch from ROM 1.1 to ROM 1.0) and I wouldn't like to damage the wires.
NekoNoNiaow wrote: ↑Sat Mar 31, 2018 8:05 pm I find that asking the question from the point of view of user experience is often worth it.
What would users prefer as an experience?
Very slightly (probably imperceptible) slower loading but full compatibility with their Atmos, "It just works".
Or full-speed loading that might randomly fail on their Atmos, leaving them to figure out whether it is because the tape is corrupted or because of their Atmos model, and then to test with the "slower speed" tape.
You're probably right. I'll have to convince myself, since in my mind so far, TAP2CD was the "fastest safe" option; I wanted first to push the limits as far as I could, even if only in "laboratory conditions".
I'm currently between 2 and 4 times faster than TAP2CD. Slowing down could mean up to a 25% speed loss, for an unknown number of machines :-/
Anyway I will try first to savage my code, 1st idea is to try removing a JSR / RTS; that could save the situation...
(not that I don't want to share the code, but I don't like the idea of publishing unfinished code as long as I still have ideas to try)
- NekoNoNiaow
- Flight Lieutenant
- Posts: 272
- Joined: Sun Jan 15, 2006 10:08 pm
- Location: Montreal, Canadia
Re: Experimental very fast tape loading
Symoon wrote: ↑Sat Mar 31, 2018 8:44 pm Nope, but this Atmos was modified by the previous owner, so I guess it might affect something. Also noticed the keys sound is lower.
I don't dare opening it since there are several switches (one to reboot, another to switch from ROM 1.1 to ROM 1.0) and wouldn't like to damage the wires.
Oh, if these modifications do indeed affect the speed as you indicate, then that would be worth knowing, because it would mean you can safely keep using your current setup, since real Atmos capturable in the wild would run fine with it.
That said, from an electronics standpoint, neither of these two mods would explain why the machine is slower, since they should both be passive.
It is puzzling though that the sound volume is lower, maybe there are other modifications on this machine which would deserve investigation.
DBug might be right that natural quartz frequency variations might push the machine too close to the edge where your system stops working but it would be interesting to run other timing tests to see if this affects other aspects of your machine (like timers/interrupts). Maybe you could run some of the timing tests used to validate emulators.
Symoon wrote: ↑Sat Mar 31, 2018 8:44 pm You're probably right. I'll have to convince myself, since in my mind so far, TAP2CD was the "fastest safe" option; I wanted first to push the limits as far as I could, would it be in "laboratory conditions"
I'm currently between 2 and 4 times faster than TAP2CD. Slowing down could be up to a 25% speed loss, for an unknown number of machines :-/
Well, as long as you are in the validation phase, you can still send the program out for testing by a larger number of people and see what the results are.
If your machine is the only one which suffers from the issue then you would be good to go.
I understand.
I cannot help too much with 6502 ASM yet but if you are stuck you could still release just part of the interrupt code to see if people have ideas to make it faster. If they do, then you could try it and continue experimenting without needing to do a full monty.
Re: Experimental very fast tape loading
I will run a few tests on other machines today (I've got 4 or 5 more to test, which I can only do on some weekends), but I think I'll have the problem on other machines. I also still have to set the threshold values correctly - I kept modifying them, not understanding that the problem was elsewhere... So it's a real mess now.
I also noticed that the previous tests I asked for in the forums gave rather consistent results... But when loading a longer program on my machines, it wasn't so clear... Then again, I think the threshold and Atmos speed problems interacted.
So now that I have this in mind, I'm busy trying to find a more visual and global way to look at the problem:
NekoNoNiaow wrote: ↑Sun Apr 01, 2018 5:33 am I cannot help too much with 6502 ASM yet but if you are stuck you could still release just part of the interrupt code to see if people have ideas to make it faster. If they do, then you could try it and continue experimenting without needing to do a full monty.
Good idea! Here we go:
The Oric waits for the interrupt in an infinite loop (this is Fabrice's TAP2CD idea!). This is mandatory for precision: the loop only lasts 3 cycles; if it lasted longer, there would not be enough precision to separate the different sinusoids on all machines.
It means that after the interrupt, I need to go back to the code after the infinite loop, which requires a trick that costs time.
Note: the 2nd column is the cycle cost.
(...)
0460 2 38 SEC
0461 2/3 B0 FE BCS -2 infinite loop (waiting for interrupt)
(...)
Interrupt code:
04D0 4 AE 00 03 LDX 0300 Reset flag on CB1
04D3 4 AE 08 03 LDX 0308 read timer (sinusoid duration) in X
04D6 4 8E 09 03 STX 0309 Reset timer counter (writing to #309 sets #308 to #F5 once the instruction has executed)
04D9 4 28 PLP Get the system flags saved by the interrupt
04DA 2 18 CLC Set C to 0 to leave the loop
04DB 3 08 PHP Save the system flags
04DC 6 40 RTI Back to the loop
If anyone has an idea to go faster, I'd be more than happy
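As a quick sanity check on the listing, the per-instruction cycle costs in its 2nd column can simply be added up. This is only a tally of the figures shown above; the 6502's own interrupt entry sequence and any vector jump are not in the listing and are deliberately left out. The names `HANDLER` and `EXIT_TAIL` are just labels for this sketch.

```python
# (instruction, cycles) pairs copied from the listing's 2nd column
HANDLER = [
    ("LDX $0300", 4),  # reset flag on CB1
    ("LDX $0308", 4),  # read timer (sinusoid duration)
    ("STX $0309", 4),  # reset timer counter
    ("PLP", 4),        # pull the flags saved by the interrupt
    ("CLC", 2),        # clear carry so the BCS loop exits
    ("PHP", 3),        # push the modified flags back
    ("RTI", 6),        # return to the waiting loop
]

EXIT_TAIL = {"PLP", "CLC", "PHP", "RTI"}

handler_cycles = sum(cycles for _, cycles in HANDLER)
exit_tail_cycles = sum(cycles for name, cycles in HANDLER if name in EXIT_TAIL)

print(handler_cycles, exit_tail_cycles)  # 27 15
```

So the listed handler costs 27 cycles, of which 15 are spent only on getting back out of the waiting loop, which is where a faster trick would pay off.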
Re: Experimental very fast tape loading
If the idea is to get out of the loop as fast as possible, then instead of PLP/CLC/PHP you could use code that takes about the same amount of time to patch the program counter return address stored on the stack, adding +2 to it.
Since you have full control over the location of your code and the initial stack value, you know exactly which byte of the stack to patch to achieve that.
The advantage is that instead of having RTI bring you back to the start of the BCS to detect that the carry is now cleared and then exit, you return directly after the BCS itself, which gives you a 3 clock cycle advantage.
Also, if you can afford to have A or Y containing a value before your waiting loop, the entire address change code becomes a simple STA or STY before the RTI, which takes only 4 clock cycles (instead of the 9 cycles taken by PLP/CLC/PHP), which means that ultimately you exit your waiting loop 8 clock cycles earlier than with your current code.
Admittedly, it's ugly
EDIT: You can maybe also move the CB1 reset (first LDX) later in the code so you can read the timer sine value 4 cycles earlier, that should not impact the IRQ behavior.
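The arithmetic behind that estimate can be laid out explicitly. The sketch assumes a register already holds the patched low byte of the return address, so that a single absolute-mode store into the stack page (4 cycles) replaces PLP/CLC/PHP; RTI costs the same in both variants and is left out. The cost of the final BCS re-test is kept as a parameter, because the listing gives it as "2/3" (2 cycles when the branch is not taken, 3 when it is).

```python
PLP, CLC, PHP = 4, 2, 3   # cycle costs from the listing
STA_ABS = 4               # assumed: one absolute-mode store into the stack page

def cycles_saved(bcs_retest):
    """Cycles gained by patching the stacked return address instead of the flags.

    bcs_retest: cost of the BCS that the current code executes once more after
    RTI, and that the patched version skips entirely.
    """
    current = PLP + CLC + PHP + bcs_retest
    patched = STA_ABS
    return current - patched

print(cycles_saved(3), cycles_saved(2))  # 8 7
```

Counting the skipped BCS as 3 cycles gives the 8 cycles quoted above; counting it as a not-taken branch (2 cycles) gives 7.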
Re: Experimental very fast tape loading
Dbug wrote: ↑Sun Apr 01, 2018 8:10 am If the idea is to go out of the loop as fast as possible, instead of PLP/CLC/PHP you could use code that takes about the same amount of time to instead patch the program counter return address stored in the stack to add +2 to it.
(...)
The advantage is that instead of having RTI bring you back at the start of the BCS to detect that now the carry is cleared and then exit, you directly return after the BCS itself, which gives you a 3 clock cycles advantage.
Is it a 3-cycle or a 2-cycle advantage, since I'm exiting from the BCS?
I think I tried to, but it took more time, or the same time with more bytes. Here the PLP/CLC/PHP/RTI is 15 cycles; I couldn't find a shorter sequence...
The problem here is that:
- sadly, registers A and Y must remain unaffected (or be restored)
- I can't use many more bytes (I almost fill the page), but let's forget that for now
Oh, BTW, it's not just the interrupt that has to be faster, that would be too easy. It's the whole loop decoding a byte (about 100 cycles + interrupt time). I need to save about 6 cycles, I think.
Re: Experimental very fast tape loading
And that's why having the entire code would be handy
Re: Experimental very fast tape loading
You're right Dbug, it's just that it's so messy, with comments in French or outdated, that it would take a while to clean it up into a proper presentation for everyone. For now, I'll use my spare time to try alternatives first.
That being said, I can send the rough code without comments at all if you guys wish me to.
I think I found something. I'm calling (JSR) a ROM routine at E56C, which is only something like 18 bytes long.
So I have to find 15 bytes and I can save 12 cycles per loop (JSR/RTS), which I hope will save the situation.
Trying...
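The trade-off behind that idea is simple arithmetic: JSR and RTS cost 6 cycles each on a 6502, and copying the routine inline replaces the 3-byte JSR instruction with the routine's body. The sketch below follows the post's own figures (a routine of roughly 18 bytes, hence 15 extra bytes); the exact size of the ROM routine is not confirmed here, and `inline_tradeoff` is a hypothetical helper.

```python
JSR_CYCLES = 6   # 6502 JSR absolute
RTS_CYCLES = 6   # 6502 RTS
JSR_BYTES = 3    # opcode + 2 address bytes, freed when the call is removed

def inline_tradeoff(routine_bytes):
    """Return (cycles saved per call, extra bytes needed) when inlining a routine."""
    cycles_saved = JSR_CYCLES + RTS_CYCLES
    extra_bytes = routine_bytes - JSR_BYTES
    return cycles_saved, extra_bytes

saved, extra = inline_tradeoff(18)   # 18 bytes is the post's estimate
print(saved, extra)  # 12 15
```

With about 6 cycles to recover per decoded byte, 12 saved cycles would leave some margin, provided the 15 bytes can actually be found in the page.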
Re: Experimental very fast tape loading
Anyone willing to do a test?
Haven't tested it on my "slow" machine yet, but it works on 3 other Atmos and fails on one (which already failed almost all the previous tests).
Just do HIRES first, then CLOAD""
Re: Experimental very fast tape loading
Tested and working on my "slow" Atmos!
- ibisum
- Wing Commander
- Posts: 1652
- Joined: Fri Apr 03, 2009 8:56 am
- Location: Vienna, Austria
Re: Experimental very fast tape loading
Wow! That is very fast indeed. What we need now: a time machine, to go back and demo this to us as 13-year-olds.
Re: Experimental very fast tape loading
Lol, we would have to bring our computers with us, WAV players didn't exist by then, did they?
I recall that as children, my brothers and I would start the Xenon-1 tape in SLOW mode before going to lunch, so it was loaded when we were back (about half an hour of loading, IIRC).