Symoon wrote: ↑Sun Mar 03, 2019 12:12 am
NekoNoNiaow wrote: ↑Sat Mar 02, 2019 11:39 pm
For every empty entry in the dictionary
Aaaah, this is the key that I missed! I kept wanting to have the right order before assigning to the dictionary... Sorry it took so long to understand, I feel stupid
It all makes sense now, thanks a lot
I'll give it a try
Ahem, the fault does not lie with you; the reason is probably that I did not explain myself very well the first time.
However, I am very ashamed to admit that, unfortunately, I was wrong when I said this method was optimal.
After writing my last post, I decided to illustrate visually why that method was optimal, and in the process I actually demonstrated to myself that it was not.
An important point, though, is that the method, although not optimal, is probably close to optimal overall, as the data below will show. A combinatorial approach would work better (although it could be prohibitively expensive), but maybe an intermediate approach between the two would work well enough. (I will explain that in more detail below.)
Now for the test data: I generated some random stats corresponding to a file to compress, selected a few compression candidates, decided on some arbitrary encoding values for the dictionary entries, and graphed the gain obtained by each candidate for each entry.
Here is the data itself; all gain values were computed using the formula given in previous posts:
The corresponding Google spreadsheet with the data and graph is available here for those who want to verify my computations:
https://docs.google.com/spreadsheets/d/ ... sp=sharing
Since the gain numbers are difficult to interpret on their own, I graphed the gains of each candidate for every entry of the dictionary.
Each line in the graph below corresponds to the gains of a given candidate and, logically, they all decrease as the entries in the dictionary increase in encoded size.
As you can see, the data proves me absolutely wrong: for several dictionary entries, there are cases where inverting the choices given by my method is the better option.
You will notice, however, that in most cases there is not much difference between the choice suggested by my method and the inverted choice, so the method is globally not too bad, but it is certainly not optimal.
This said, most of the lines of the graph decrease almost in parallel, which generally means that my method kinda works well; it is when they cross or come very close to one another that it makes the wrong choice.
What to do then?
What I would recommend is to mix my method with just one level of combinatorial exploration. The data seems to suggest that inversions usually concern only crossing or very close lines, and these do not seem to involve more than two points (at least in my sample), so that may be enough.
Obviously more real-life testing would be needed to compare that with other methods you experimented with so far.
Here is what the algorithm could look like (a rough code sketch follows the steps):
1 - For each candidate, compute its gain for the current empty dictionary entry, and write down the gains of the best and second-best candidates (respectively called entry1_c1 and entry1_c2).
2 - For each candidate, compute its gain for the next empty dictionary entry, and once again write down the gains of the best two candidates (respectively entry2_c1 and entry2_c2).
3 - Compute which of the two choices is the best by comparing the total gain over the two entries:
Code: Select all
normal_gain_over_two_entries = entry1_c1 + entry2_c2
inverted_gain_over_two_entries = entry1_c2 + entry2_c1
4 - If normal_gain_over_two_entries > inverted_gain_over_two_entries, then simply select the best candidate in this round, otherwise, select the second best candidate instead.
5 - Repeat from step 1 for the next dictionary entry.
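To make these steps a bit more concrete, here is a rough Python sketch of the idea. It is only an illustration, not working compression code: gain(candidate, entry) stands for whatever gain formula we used earlier in the thread, candidates is the list of compression candidates, and the dictionary entries are assumed to be processed in order of increasing encoded size. Note that entry2_c1 and entry2_c2 are the gains of the same two candidates evaluated on the next entry, which is how the comparison in step 3 should be read.
Code: Select all
def assign_candidates(candidates, num_entries, gain):
    """Greedy assignment of candidates to dictionary entries,
    with one level of look-ahead to decide whether to swap
    the top two candidates (steps 1-5 above)."""
    remaining = list(candidates)
    assignment = []  # assignment[i] = candidate chosen for dictionary entry i

    for entry in range(num_entries):
        if not remaining:
            break

        # Step 1: best and second-best candidates for the current entry.
        ranked = sorted(remaining, key=lambda c: gain(c, entry), reverse=True)
        best = ranked[0]
        second = ranked[1] if len(ranked) > 1 else None

        if second is not None and entry + 1 < num_entries:
            # Step 2: gains of the same two candidates for the next entry.
            entry1_c1 = gain(best, entry)
            entry1_c2 = gain(second, entry)
            entry2_c1 = gain(best, entry + 1)
            entry2_c2 = gain(second, entry + 1)

            # Step 3: total gain over the two entries, with and without inversion.
            normal_gain_over_two_entries = entry1_c1 + entry2_c2
            inverted_gain_over_two_entries = entry1_c2 + entry2_c1

            # Step 4: invert the choice when the swap pays off.
            if inverted_gain_over_two_entries > normal_gain_over_two_entries:
                best = second

        # Step 5: assign and move on to the next entry.
        assignment.append(best)
        remaining.remove(best)

    return assignment
Note that this only ever looks one entry ahead and only considers swapping the top two candidates, which matches the observation above that inversions seem to involve only crossing or very close lines; a deeper look-ahead would get closer to the full combinatorial search, at a higher cost.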
Here you go.
I must apologize again for insisting that my method was optimal when the data proves me completely wrong. I am not sure why I (wrongly) remembered that this greedy algorithm (Wikipedia) was optimal, but clearly that memory of mine was completely false.
So, sorry for that again, I will be more careful from now on.