Ok, I coded a simple algorithm (I think it is similar to LZO) and made a quick test. The basic idea is that it should be able to decompress any string in the text with minimal overhead in memory and code.
The algo works as follows:
1. Scan the text for digrams (repeated pairs of symbols). If no digram repeats more than N times (N at least 3), or there is no more room in the table, we are done. Otherwise go to step 2.
2. Take the digram that repeats most often and add it to the token table. Substitute every occurrence of the digram in the text with the new token.
3. Go back to step 1.
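As a rough sketch, the loop above might look like this in C (my own illustration, not my actual code; the 128-rule limit and N=3 threshold match the test setup, and it assumes the whole input fits in memory):

```c
#include <string.h>

#define FIRST_TOKEN 128
#define MAX_TOKENS  128   /* 128 rules x 2 bytes = the 256-byte table */
#define MIN_COUNT   3     /* N: a digram must occur at least this often */

static unsigned char Grammar[MAX_TOKENS][2];
static int counts[256][256];           /* digram frequency table */

/* Repeatedly replace the most frequent digram with a fresh token.
   Compresses buf in place; returns the new length, sets *nrules. */
static int compress(unsigned char *buf, int len, int *nrules)
{
    *nrules = 0;
    while (*nrules < MAX_TOKENS) {
        /* step 1: count every digram in the current text */
        memset(counts, 0, sizeof counts);
        for (int i = 0; i + 1 < len; i++)
            counts[buf[i]][buf[i + 1]]++;

        /* find the digram that repeats most */
        int best = 0, bl = 0, br = 0;
        for (int l = 0; l < 256; l++)
            for (int r = 0; r < 256; r++)
                if (counts[l][r] > best) {
                    best = counts[l][r];
                    bl = l;
                    br = r;
                }

        if (best < MIN_COUNT)
            break;                     /* nothing repeats enough: done */

        /* step 2: record the rule and substitute the digram in place */
        int tok = FIRST_TOKEN + (*nrules);
        Grammar[*nrules][0] = (unsigned char)bl;
        Grammar[*nrules][1] = (unsigned char)br;
        (*nrules)++;

        int out = 0;
        for (int i = 0; i < len; ) {
            if (i + 1 < len && buf[i] == bl && buf[i + 1] == br) {
                buf[out++] = (unsigned char)tok;
                i += 2;
            } else {
                buf[out++] = buf[i++];
            }
        }
        len = out;                     /* step 3: loop back to step 1 */
    }
    return len;
}
```

Note that when a rule's right-hand side itself becomes frequent paired with another symbol, later iterations will build rules on top of tokens, which is where the recursive grammar below comes from.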
As you can see we end up with a kind of grammar which looks like:
<0> -> e, [space]
<1> -> d, [space]
<2> -> t, h
<3> -> s,
<4> -> i, n
...
<9> -> e, <1>
<10> -> [space], <2>
where <i> is a non-terminal symbol and the others are terminals (characters). Repetitions of several characters are then often coded as combinations of the above rules. The trigram "ed[space]" is coded by rule 1 and rule 9, and ends up in the text as token 9. The tetragram "had[space]" is represented by token 83, and coded by:
<1> -> d, [space]
<27> -> h, a
<83> -> <27>, <1>
There is normally no waste of additional space, as, for example, rule 27 also codes the digram "ha", which is used quite often on its own. More levels of recursion code pentagrams or hexagrams (if those are the correct terms)... as long as they repeat often enough.
The reason for this design is that the table is very simple to work with: every entry is only 2 bytes, and decompression should not require much 6502 code. I have a rough idea of how to do it using the processor stack and a fairly simple loop.
The problem is that the maximum number of tokens (128 for this test, with a table of 256 bytes) is not enough to achieve great compression ratios; what I get is in the order of 45-50%. For example, I compressed the tol_text.s file from the link above (not the actual text of the game, but the actual source file): it went from 30798 bytes to 15448, while RAR gets it down to 9.6K.
Not sure if this is the correct approach, but we can study it later, I suppose.
Regards,
EDIT: I had a small bug. The output for the test file is actually a few hundred bytes smaller, but that does not change the overall compression ratio much. Also, the C code for decompressing is, for each char in the compressed input:
void process_token(unsigned char token, FILE *textOut)
{
    unsigned char left, right;

    if (token < FIRST_TOKEN) {
        /* terminal: a plain character, emit it */
        fputc(token, textOut);
    } else {
        /* non-terminal: expand both sides of the rule recursively */
        left  = Grammar[token - FIRST_TOKEN][0];
        right = Grammar[token - FIRST_TOKEN][1];
        process_token(left, textOut);
        process_token(right, textOut);
    }
}
I know it is recursive, but it should be easy to implement compactly with a loop, using the stack to store tokens that still have to be processed. Access to the table is also immediate (well, not really "immediate"): load token-FIRST_TOKEN into the X register and issue
lda Grammar+1,x ; right symbol of the rule
pha             ; save it for later
lda Grammar,x   ; left symbol of the rule
pha
...
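The stack-and-loop idea can be sketched in C like this (my own translation of the scheme, with an explicit array standing in for the processor stack; the tiny Grammar table here is just hypothetical example data):

```c
#include <stdio.h>

#define FIRST_TOKEN 128
#define MAX_DEPTH   64    /* plenty for a 128-rule grammar */

/* same table layout as the recursive version: 2 bytes per rule */
static unsigned char Grammar[][2] = {
    { 'd', ' ' },   /* <0> (token 128) */
    { 'h', 'a' },   /* <1> (token 129) */
    { 129, 128 },   /* <2> (token 130): expands to "had " */
};

/* Iterative process_token: instead of recursing, push the right
   symbol of each rule on a small stack and keep expanding the left. */
void process_token_iter(unsigned char token, FILE *textOut)
{
    unsigned char stack[MAX_DEPTH];
    int sp = 0;

    for (;;) {
        while (token >= FIRST_TOKEN) {
            /* non-terminal: remember right side, descend into left */
            stack[sp++] = Grammar[token - FIRST_TOKEN][1];
            token = Grammar[token - FIRST_TOKEN][0];
        }
        fputc(token, textOut);   /* terminal reached: emit it */
        if (sp == 0)
            break;               /* nothing pending: fully expanded */
        token = stack[--sp];     /* resume with last saved right side */
    }
}
```

The pha/pla pair would play the role of the stack array, so the 6502 version should fit in a few dozen bytes.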