Sunday 29 March 2015

more in less

Curiosity got the better of me again and now there's 3:30am night and early rise behind me and pretty much another whole weekend "down the drain".

I had another go trying to grok VCDIFF format but failed again: not sure if it's my encoder or decoder but I can't get them to agree on the address compression and so it fails on most data.

Putting that aside I experimented with some ideas on address compression from VCDIFF, and then ended up creating a new format based on what i learnt from it and other experiments. My work with VCDIFF demonstrated the utility of address compression and dual operation instructions.

Anyway I think it turned out quite neat and tidy so here's a summary. It might be useful if you're trying to understand VCDIFF - its very different but borrows similar mechanisms.

Opcodes

There are basically two families of opcodes: dual-operation and single operation. The MSb selects which so each has a full 7 bits of information to play with. This allows for quite a wide range of data lengths and operations to be encoded into a single byte.

Only 6 individual operations are defined together with 2 dual operations.

Integers are encoded like many formats as unsigned 7-bit big-endian with the MSb as a continue bit. Addresses are encoded independently using a separate mechanism as described below.

There are a couple of other format parameters:

smallest
The smallest copy size possible. Encoder dependent.
split
The split point for the single operation opcodes. Fixed at 100 but could be changed.

Dual operation instructions

The encoder tracks the last instruction and if possible can combine two into the same instruction. This is one of the things vcdiff uses to get more compact deltas.

0Taaabbb
  0 - selects a dual instruction
  T - the type of the first operation, 0=ADD 1=COPY
aaa - 3-bit length for first operation.  This allows a range of
      (1-8 ) byte appends or (0-7)+smallest copy.
bbb - 3-bit length for second operation.  It is always COPY.
      This allows a range of (0-7)+smallest copy.

3 bits allows for an ADD of 1-8 bytes and a COPY of (0-7)+smallest bytes. With a smallest size of 6 bytes this allows ADD(1-8)+COPY(6-13) or COPY(6-13)+COPY(6-13).

This is all bits exhausted so the only dual instructions possible. Any data or addresses required are encoded in order in the following bytes.

I also investigated a fixed function 3-bit ADD and 4-bit COPY but it wasn't as good (albeit with very limited investigation).

Single operation instructions

The single op instructions allow for longer immediate copies as well as extended instructions which encode the length as a separate parameter.

Rather than split the data in bit-selected blocks a single 128-value number is broken into several ranges which are interpreted differently.

1nnnnnnnn
  1 - selects single operation
  n - 7-bit number interpreted via inclusive ranges as follows:

000 - 099   copy (0-99)+smallest bytes
100 - 123   add (1-24) bytes
      124   read length.  copy length+100+smallest bytes.
      125   read length.  add length+24+1 bytes.
      126   read length.  run of length+3 bytes.
      127   eof/reserved or something

The split was based on some statistical analysis of a couple of files: copies cover a larger range than adds, and runs are rare. Well and 100 is easier to work with.

For a smallest of size 6, this allows a single instruction to encode a copy of 6-105 bytes or an append of 1-24 bytes. These cover the majority of cases for the data i've been testing with (not that it is a very big set) and the behaviour of the matcher.

The multi-byte encoding is such that the 6+ and 4+ bits already implied taken care of are removed from their lengths which can save the occasional `overflow' byte.

Addresses

This format generated slightly smaller deltas than the previous format but I knew from my experiments with VCDIFF that address compression could increase the gains. The problem here is now i've run out of bits to use so I had to come up with a solution which encoded addresses independently of any operation codes.

Setting aside even a few bits for each address would be too costly so after a bit of thought I came up with a solution based on the observation that most addresses will be >127 and require 2 bytes at least anyway, therefore if I just encode the rare all-7-bit addresses in 16 bits instead it leaves a full 7 bits to use for other encoding schemes whilst retaining the `natural use' of those bits for longer values.

The next problem is how to use those 7 bits. VCDIFF uses 3 bits to select from a near/same table to chose either a positive offset of the last 4 addresses or together with an octet to select a specific address from a table of 768 (using the default code table). I did some experiments to find out which aids more: 'next' is used more often but 'same' saves a lot each time it is used; but it needs 11 bits to do it. Too much here. I also found that adding a sign to the near addresses offset improved the results.

I chose a trade-off which has features of both but requires fewer bits. It combines the same and near table into the same array and adds a sign to the offsets of near addresses. Because I don't need the sign bit for 'same' addresses I can use that to increase the address space of the same table. This allows a full 6 bits to be used for the match table and 5 for the near table.

It is necessarily slower than VCDIFF because I perform a linear search over these values to find an exact match (64 elements) or the nearest match (32 elements). The tables could be overlapped: they are just the last 32 or 64 addresses encoded or decoded stored using a cyclic index. Like VCDIFF both encoder and decoder must maintain this in sync.

This is the encoding used:

00nnnnnn - This is an exact match and a complete address.
           n is an index into a 64-element table.
01Smmmmm - This is a near match.  S is a sign bit and m is
           an index into a 32-element table.  The offset follows.
1aaaaaaa* 0aaaaaaa - An absolute address.  To avoid clashing with
          the above it is forced to at least 2 bytes length.

Note that absolute addresses are just encoded as simple integers: with no `wasted bits' for the other information if their value is greater than 127; which is very likely in the common case.

The encoder is free to choose the smallest option of these for any address in the stream.

Results

These are with the same encoder settings of a 'smallest' of 6 bytes, as with the other benchmark data from the home page.

                       dez-1.2     dez-?.?     gzip -4
  GPL2 to GPL3         13 591      12 053
  jjmpeg.dll           10 809       8 770
  bible (compress)  1 731 250   1 539 501   1 550 998 

Runtime is a tiny bit longer for the shorter files due to the address table lookup, although i haven't optimised the address table scan yet. It's still 180x slower than gzip on the KJV.

This actually beats my current VCDIFF encoder but since that is still broken it's pretty much useless to compare to it. Even a single bug can radically alter the patch size.

But one (admittedly small) plus is that unlike VCDIFF this format is fully streamed and doesn't require staging. Actually another plus is that the code is quite a bit simpler due to a more orthogonal instruction set and few special cases, but it only has an implied-fixed code table so isn't as flexible.

Can it go smaller?

Because the string matcher performs an exhaustive search it may find multiple equivalent-length matches for a given target sub-string. This provides an obvious opportunity for selecting a source address that can be represented in fewer bytes than others. This could save some more bytes at the cost of encoding time.

Update: Boy, didn't even let the paint dry on this one. Tried the last idea.

                       dez-1.2     dez-?.?     addr opt    gzip -4
  GPL2 to GPL3         13 591      12 053      11 965
  jjmpeg.dll           10 809       8 770       8 725
  bible (compress)  1 731 250   1 539 501   1 507 072   1 550 998 

Well it's more than zero, so I guess that's a yes?

Saturday 28 March 2015

dez 1.2

So I came up with a couple more ideas to try ... actually they all pretty much amounted to nothing but here is a small update anyway. See the notes on the home page.

Some things I tried and discarded: incrementally chaining the hash table (avoiding the realloc+copy); inlining the hash function; calculating all hashes of source and target at the start; doing the same in parallel. Most of them amounted to no gain although the parallel hashing helped quite a bit in some cases but not often enough to be worth the memory required.

Thus this is mostly a post about the expanded homepage where i added a bit detail and more benchmarks.

I was a bit surprised with the memory use compared to the "competition" because dez amounts to a full exhaustive search, albeit index assisted, whereas the others do not.

Friday 27 March 2015

synthz 0.0

Decided to code-drop the audio synthesiser i hacked up a couple of weeks ago.

Probably of no use to anybody but what the hell. Maybe one day i'll make some better sfx for that space invaders I did.

It's got it's own home page up on the usual spot.

dez 1.0

I packaged up the best version of the delta generator into dez-1.0. It and some details are available on the dez home page.

A couple of nights ago I tried one of the hashtable ideas from a previous post. It made a big improvement so I kept whittling it away until there really wasn't much of a hash table left apart from a couple of arrays. Fastest version using the least memory. Open addressing just didn't handle some bad cases at all well.

I didn't see the point in keeping the others around so i only kept this one in the release.

And that should be me done with this for a while.

Tuesday 24 March 2015

Getting the runs.

As is usually the case I kept poking last night and added a runlength detector to the internal matcher loop of the delta processor.

To start with, any sequence which begins with 3 common bytes are not added to the hash table. For my problem file this reduces the hashtable load significantly.

Then when scanning the target data any location which starts with 3 common bytes are not looked up in the hash table and just converted to a byte run instead.

I had to add runs to the byte encoding and to do this and i noticed that ADD and RUN are typically tiny so just made the command+length byte always 1 byte with 6 bits of length. But I might change that to 5 bits of length so that it can still have a continue bit since most are under 32 bytes as well and that will protect against the occasional long length.

So something like this:

   00XXXXXX            - copy + 6 bit length
   10XXXXXX CXXXXXXX*  - copy + extended length
   011XXXXX            - add + 5 bit length
   111XXXXX CXXXXXXX*  - add + extended length
   010XXXXX            - run + 5 bit count
   110XXXXX CXXXXXXX*  - run + extended count

Expansion space? Why? If need be I could let $00 be an extended command byte or something.

The delta sizes are slightly larger but the runtime is ~2 orders better on this particular problem file (jjmpeg.dll) with similar settings. And it's like ~1s vs 10ms rather than 10ms vs 100us.

(6,1)      delta        size
 rev    orig     new    
  59  155323  155323    155323
  57   15509   15626    154093
  55    8168    8227    154093
  53   10425   10490    153839
  47    9942    9958    153095
  45   15879   16000    145425
  40   16584   16701    146274
  36   14561   14632    144658
  22   10751   10832    144145
  20      38      38    144145

           jjmpeg.dll

These are the deltas for jjmpeg.dll from the trunk of jjmpeg. It's a reverse delta so the top line is the latest version with no processing and each subsequent is an editing list which turns the previous version into it.

One probably obvious thing I hadn't realised is that reverse deltas work particularly well on source code - or anything that is added to over time. Since most edits are editions, a reverse delta is mostly just deletes (actually, just no commands at all).

 rev   delta    size
 147    3977    3977
 146      31    3801
 145      26    3671
  85     101    3177
  53      28    3180
  52      39    2591
  45      49    2488
  16      48    1645
  13      15    1556
  11      42    1300
   6      25    1194
   2      20    1051

     AVNative.java

I'm not working on a revision control system for source code, it just provides easy test data.

So with that i'm pretty well done with the that* ... unless i can do something about the hash table.

With short matching patterns required to get the best results certain data gets repeated a lot which is problematic for the open addressed hashing. Most solutions I can think of just blow the memory out since as soon as you start adding links it at least doubles the memory.

After a bit of thought a couple of possibly practical ideas spring to mind ...

Overflow and keep.
Once a search exceeds a certain range drop subsequent records into an overflow record. The record in the table can be a negative index to the overflow record.

Clashes still cause 'difficulty' because the match scan will have to deal with multiple overflow records. Although they could include the hash key to help filter them out.

Overflow all.
This is similar but rather than leave the entries there all values for the same key are extracted and placed into the overflow record and the rest are marked deleted. The first available slot is set to point to the overflow record.

The problem here is that the hash keys are not stored in the table and they are required to be able to filter the entries since they could include other hash keys. But! They can be regenerated quite cheaply.

Hybrid and keep.
Have a second smaller chained (or otherwise) hash table and if any search exceeds some limit store values in that instead.

This is the simplest but would require two lookups each time.

Hybrid all.
Combine the last two. Move all the other values of that key to the new table too.

This would only require two lookups for values which are not present in the overflow table, which by definition occur less often. It's not really the lookups which are the problem though.

Just thoughts, haven't tried anything.

The journey never ends

So while thinking about the next problem: the scalability of the brute force search of all the substrings with the same keys ... I thought of an exact solution that combines the string search with the index.

It's pretty obvious so probably has someone's name Update: Well sort of, it's a longest common prefix table, or LCP table. The below is definitely a naive approach with O(N log N) running time, but it is space efficient and simple to understand.

First, the whole file is pre-processed into a common prefix tree.

  1. Sort a list of every substring in the source;
  2. Iterate over the sorted list and count the length of the common prefix;
  3. (optional) build an index of the first row of each possible byte.

Because it just uses pointers/indices it doesn't take up any memory for the data and it only needs one entry for each location.

This data structure is basically a directed graph where each offset in 'starts' is a branch and each value in 'commons' is a link to the branch below.

 0 @ 15 c=1 '  Test.'
 1 @ 16 c=1 ' Test.'
 2 @ 7  c=1 ' a test.  Test.'
 3 @ 4  c=1 ' is a test.  Test.'
 4 @ 9  c=0 ' test.  Test.'
 5 @ 21 c=1 '.'
 6 @ 14 c=0 '.  Test.'
 7 @ 17 c=1 'Test.'
 8 @ 0  c=0 'This is a test.  Test.'
 9 @ 8  c=0 'a test.  Test.'
10 @ 18 c=4 'est.'
11 @ 11 c=0 'est.  Test.'
12 @ 1  c=0 'his is a test.  Test.'
13 @ 5  c=3 'is a test.  Test.'
14 @ 2  c=0 'is is a test.  Test.'
15 @ 6  c=2 's a test.  Test.'
16 @ 3  c=1 's is a test.  Test.'
17 @ 19 c=3 'st.'
18 @ 12 c=0 'st.  Test.'
19 @ 20 c=2 't.'
20 @ 13 c=1 't.  Test.'
21 @ 10 c=0 'test.  Test.'

This is the result for 'This is a test. Test.'. '@' is the position in the original string, and 'c=' is the number of characters of common prefix with the string below.

Using this table it is a simple walk to find the longest string from these which matches an arbitrary string.

And of course ... this is exactly the problem I was solving originally a few days ago.

Sorted Array Delta

So I started from scratch and wrote another delta builder based only on this technique. I only implemented the searches of the source buffer but it does do runs. I had to use a custom quick-sort as the java Arrays sort will not let me override the comparator for primitive arrays. I use an index for the first byte to jump to the right location to speed up the first level of the tree.

So how's it go?

Run-time wise it's something like 1.5-10x slower than the best hash based version I have depending on the data - the closest end is with jjmpeg.dll. The sorting stage can dominate the processing time, for jjmpeg.dll it's 5/6ths of the total processing time but for GPL2 it's only 1/2.

But it only needs half the memory of the best of the rest, so it's not all bad.

Rubbish

I continue to be surprised at how well java copes with lots of object allocation and garbage; despite using 10x the memory using a HashMap<Integer,ArrayList<<nteger>> is pretty much on par to the open addressed hash table even when it's working well and much faster compared to the failure cases for the dll file. It's tons faster than reference counting (and threads, java is dreams for threads) and I never found explicit allocation very fast to start with. Not quite as quick as alloca().

* - i'm never done with that.

Monday 23 March 2015

deltas and diffs again

I had a quick look at cleaning up the delta code I wrote yesterday and got sidetracked into making a more optimised version - one where the hash table and other bits are just in-lined into a single class.

Firstly, its about 20% faster so far: as i suspected simple micro-optimisations really count here.

Secondly, it's somewhat better at producing compact deltas. Oops, I guess I had a bug somewhere in the code - the original still produces working output but it doesn't find as many matches as it should.

With the new encoder the (6,1) case creates ~16K of delta compared to ~18K previously. And now using a smaller block size of (4,1) does even better at ~14K5 where previously it only hindered.

I also realised just how simple it was to add the currently-decoded target data into the search space of the encoder and copy space of the decoder. Literally a couple of lines in the decoder and under 10 in the encoder.

This drops the test case of GPL2 to GPL3 text to under 13 000 bytes for DEZ1(4,1) and about 13K5 for DEZ1(6,1).

This also allows it to work as a functional if not particularly terrific compressor. An empty buffer to GPL3 "delta" is about 17K. "gzip -9" (== GZipOutputStream defaults) is 12k with comparable execution times (gzip a bit better and scales better). Not that it is very useful in general but concatenating 10x GPL3 texts and then performing the same operation produces just an ~18K delta vs ~100K+ gzip stream.

It's pretty memory hungry on the best setting though; windows could be used.

I'm also curious as to applying this to line-by-line diffs.

Update: Ahah, not so good with some binary data. It creates pretty good deltas but the performance drops a few orders of magnitude. Any data with lots of '00 00 00' type sequences blow out the hash collision rate and break the performance of the open hashing algorithm. A straight java collections implementation (with arraylists and boxed integer's and all) scales a lot better (and surprisingly runs about the same speed in general) although it uses way more memory.

It can be mitigated by limiting the search for an empty slot or other trade-offs at a pretty hefty cost to the generated delta size. Maybe detecting runs would help some.

lambdas and finals

I just tried using -verbose:gc to get some basic memory statistics and it looks like the previous implementation uses way more temporary memory than it should. It's not allocating any differently so i can only think it's something to do with the foreach/lambda code, and this probably explains the extra runtime.

To confirm i moved a local variable i had accessed as a final to the an object field: yep, it drops ~500K of garbage. Moving the same lambda code to a field variable as well saved another ~500K too, but even then it's still ~1M more than the other implementation.

"hmm".

versioning, deltas and diffs

I've been jumping around a bit with hacking lately and this weekend i've been playing with a versioned database and related stuff. I have a fairly long article written about it but it's not quite ready yet. I came up with a few variations and some of the features included cheap copies, cheap branches, proper renaming and so on. Apart from the fun-factor I do have an end-use in mind but i'm still some way off.

One goal was also efficient storage so I worked on a system which stored changes as reverse deltas. After some searching i came across the vcdiff format and some java code to work with it - but much of each implementation is pretty undecipherable code and to be honest just not that hot and I had a niggling interest to look a bit further into the problem.

So today I basically wrote my own from scratch and surprisingly it creates smaller diffs than jvcdiff given that it is literally an afternoons hacking and a naive approach at best (at least on a single example). Runtime is comparable although in either case they are very quick. I'm still trying to get hard statistics on memory use and gc load but i'm pretty sure it's better too.

I also looked at badiff which works with very large files, but that isn't something I need and it generated very fat diffs, used a ton of memory, and wasn't very fast (putting it lightly, see below).

Anyway the problem turned out to be quite interesting along the way.

Delta Algorithm?

It's more or less based on Bentley-McIlroy's paper `Data Compression Using Long Common Strings' although that paper is so brief I didn't get a great deal from it after a skim read. It's guts I think is more or less based on Rabin-Karp's paper `Efficient randomised pattern-matching algorithms' but that was too dense for weekend reading so after my eyes glazed over I did a quick scan of the wikipedia article and then thought about what problem it was actually solving.

In more human terms I think it basically devolves into this: compare every sub-string in the target array to every sub-string in the source array to identify the longest match. Use it to do shit. The details come in trying to make this practical and efficient and what `shit' you do with `it'.

So the following algorithm is kinda probably the same thing but only based on a loose understanding thereof.

# algorithm to generate a list of editing commands
# which transforms one array into another

hsize = length of sub-string
hstep = bytes to step in input

htable = hash table mapping hash value to location

# build hash table
for each location in source in steps of hstep
  htable.addHash(source, location)
end

# find matches
for each location in target
  hash = generateHash(target, location)

  for each candidate in htable.lookup(hash)
     determine longest actual match
       of source @ candidate against target @ location
  end

  if there was a match and longest > threshold
     if any data not matched
       output an append command
     endif
     output copy command
     location += match length
  fi
end
add any left over

As one can see it's actually rather simple and mirrors the one-sentence English text description in a straightforward way. The hash table is only used as a helper to narrow the search space to possible candidates rather than having to check them all.

It requires the full source data but could trivially stream the target data. Restoring the target requires the source plus the list of editing commands.

hsize is typically 4-8 and determines the runtime and compression rate. A smaller number gives more potential candidates to match against (and more hashing clashes) but a larger hash will not match smaller strings.

hstep sets how often the source is sampled. Using a hstep of 1 gives the best result by far but results in a large hashtable and longer running time.

Rolling hashes are used so the hash length has no effect on that calculation but it does affect the runtime in other ways.

There are two interesting problems here - rolling hashes and the hash table itself.

Rolling hash

I tried to implement a Rabin-Karp hash but lost patience trying to get my head around the modulo maths. I think i just had some indexing bug rather than mathematical ones but in the end I felt more comfortable with a cyclic hash based on rotating bit arithmetic. The mathematics are still messy but even the 6510 cpu has a 'ror' instruction and so the actual code is trivial for a digital computer.

// Copyright (C) 2015 Michael Zucchi
// License is GNU GPL v3 or later
public class CyclicHash {
        int b;
        int hash;
        int fix0;
        static final int[] random = new int[256];

        static {
                Random r = new Random();
                for (int i = 0; i < random.length; i++)
                        random[i] = r.nextInt();
        }

        public CyclicHash(int b) {
                this.b = b;
                this.fix0 = ((b - 1) * 9) & 31;
        }

        public int init(byte[] data, int off) {
                hash = 0;
                for (int i = 0; i < b; i++)
                        hash = rotateLeft(hash, 9) ^ random[data[i + off] & 0xff];
                return hash;
        }

        public int update(byte leave, byte enter) {
                int leaving = rotateLeft(random[leave & 0xff], fix0);
                int entering = random[enter & 0xff];

                hash = rotateLeft(hash ^ leaving, 9) ^ entering;
                return hash;
        }
}

A random table is used to create the hash values for each possible byte. To add a new value to the hash it is first rotated and then xor'd with the random hash for the new byte. To remove an old value one only has to rotate by the length of the block and xor it again. Rotation is supposed to be modulo 32 but the java implementation in Integer is not and doesn't match the javadocs - to make sure I do it first (actually it worked before i did this so i know it works but this simplifies the call). The shift amount of 9 i plucked out of my arse.

Choosing better `random' values has some effect but this works.

Approximate Hash Table

I don't just want a simple hash table I want a hash table which lists all the potential match candidates against a given hash value. In java this could be expressed as:

        HashMap<Integer,List<Integer>> htable;

However this will (should, i imagine) have scalability and gc issues and boxed primitives are quite heavy weight to start with.

There are some good optimised java libraries for primitive hash tables but i don't think they support multiple entries off the same hash key and they tend to be gigantic libraries targeted at different problems. Anyway it's the sort of problem i find interesting (fun, actually) so I came up with my own.

I sat outside in the last of the rays of the sun with knuth and a beer and read up on open addressing hash tables. Typically i'm partial to the chained variety for their conceptual simplicity but from looking at some libraries like trove it seems open addressing has benefits in some cases and particularly for java. Armed with the knowledge and the setting sun I came up with a space efficient 'approximate' hash table which suits the problem at hand.

// Copyright (C) 2015 Michael Zucchi
// License is GNU GPL v3 or later
class MatchMap {

        int[] values;
        int mask;
        int logN;

        static final int EMPTY = Integer.MAX_VALUE;

        public MatchMap(int N) {
                int size;

                logN = 31 - Integer.numberOfLeadingZeros(N);
                size = 1 << (logN + 2);
                mask = size - 1;
                values = new int[size];

                Arrays.fill(values, EMPTY);
        }

        public void add(int hash, int value) {
                int i = hash & mask;
                int c = ((hash >> logN) | 1) & mask;

                while (values[i] != EMPTY)
                        i = (i + c) & mask;

                values[i] = value;
        }

        public void foreach(int hash, IntConsumer func) {
                int i = hash & mask;
                int c = ((hash >> logN) | 1) & mask;
                int v;

                while ((v = values[i]) != EMPTY) {
                        func.accept(v);
                        i = (i + c) & mask;
                }
        }
}

It uses open addressing with double hashing and doesn't bother with store the hash key, only the value. Knowing the size is important - but also known in this application. Deletes are a bit messy but fortunately not required. The table is sized so that the load factor is well less than 0.5.

Adding a new entry just starts probing at the modulo-size location and searches for an empty slot, stepping by the second hash (higher bits in the function or'd with 1 to make it odd). Knuth suggests using an odd number for the step which makes a lot of sense; it will eventually visit every location. This is an effective technique against clustering that occurs if linear probing is performed (I also tried that).

Retrieval just does the same thing. Any values are considered potential matches even if they were stored by a completely different hash key. Given that each location needs to be compared directly with the data arrays it doesn't make mistakes very costly and it even seems to improve the delta output in some cases.

One might note here ... this in total ... is probably less code than just the extra scaffolding required to use a java collections hashtable+list for the same purpose and probably 4-10x more space efficient. open-vcdiff has ~1KLOC for this purpose although that's about half comments explaining it's complexity - i don't even care to understand it enough to know if any of that is actually justified.

These two functions are 80% of the smarts and about 30% of the code. The rest is just to wrap it in some 'efficient' byte-streaming encoding, settle on a magic number, and write a decoder. vcdiff encoding would be nice but the encoding is a little more work than I have time left this weekend.

A result

I created a delta from the text versions of GPL 2.0 to GPL 3.0 using this and compared to jvcdiff and badiff. As I said i don't think jvcdiff is the best implementation and it appears to just be a port of open-vcdiff, but it's what I have handy. For jvcdiff i'm forcing a single window but otherwise using the defaults.

              badiff      jvcdiff    DEZ1
                                     (6,1)       (6,2)       (6,4)       (8,1)       (8,8)       (16,16)
source size   18092
target size   35147
patch size    47833       28829      18064       19053       20421       19128       23159       28877

encode        0.174870    0.002000   0.001550    0.001200    0.000950    0.001250    0.000720    0.001000
decode        0.000700    0.000170   0.000150    0.000150    0.000150    0.000120    0.000090    0.000060

hardware is kaveri cpu, jvm is oracle java 8 64-bit.

FWIW gzip is about 16K for both concatenated together, but that isn't really the same thing.

The value in brackets is (hsize,hstep). A larger hstep will reduce the size of the hash table generated and thus affect memory use, possibly significantly. hsize affects the minimum length of a match as well as hashing conflicts at the low end.

Times are in seconds and approx/rounded from the last few of 200 runs - jvcdiff has much more complex code so it takes nearly this many before it is fully optimised by the jvm, before that it is approximately 0.01s for encode. Because it's fast it might need even more loops for the jvm to notice. Due to randomisation of the hash in these test runs the DEZ1 encoding can vary between runs.

Dunno what's up with badiff, maybe this data isn't suitable for it and/or it's tuned to larger datasets.

As I said I don't have any hard data on the memory footprint but for DEZ1 the (x,1) worst-case it uses a 64K element int array for the hash table, plus the memory for the input and output buffers and half a dozen small utility objects. Decoding needs no extra memory apart from input, patch, and output buffer; and the output size is encoded in the delta so it only needs one allocation of non-garbage.

After writing most of this I tried the (16,16) option and considering the patch size is close to jvcdiff perhaps that matches the parameters it uses. Actually I kinda knew about this but forgot about it. Even then it is surprising that the delta size is about the same as jvcdiff uses much more advanced encoding techniques from VCDIFF that should reduce the size.

Format

There's probably not much point going into a lot of detail in the file format but lets just say i chose the most compact and simplest format I could.

Integers are encoded using a big-endian stream of 7-bits data + 1 bit continue indicator, much the same as vcdiff and dozens of other binary encodings.

Because I only have two opcodes as a further optimisation I encode the opcode as a single bit along with the length. This allows lengths of up to 63 bytes (hmm, 64 if i add one ...) to be encoded together with the opcode in a single byte.

I just encode the copy source address as an integer every time which means they will tend to take 3-4 bytes in typical files. This is where a lot of the smarts in vcdiff comes from but it also requires more faffing about and runtime overheads.

Improvements

So a relatively cheap and possibly significant improvement is to include the the incremental output as part of the search space: i.e. the part of the output which will have been decoded up until that time. As long as the hash table is big enough that should be relatively simple in the encoder and isn't much more work in the decoder.

I don't see the need for run-length encoding as it rarely comes up in real text (and yeah, real men use tabs).

Since the code is so simple it is open to gains from micro-optimisations. Being simple does help the compiler a lot to start with though and hotspot compiles faster code with less lead-time than the more complex implementations.

The hash table is a critical part of the performance and could be tuned further, not that there is much code to tune.

Currently any match of 4+ is encoded as a copy even if an add might be smaller. I don't know if 4 is even the best value. A look-ahead of a few characters may also yield longer matches and more compact encodings. Other address compression techniques as used in vcdiff could be used.

Scalability is unknown. My application is 'post-sized' text so it should be fine there. Increasing the block size (hsize) tends to speed up the encoder as it reduces the search space, but it misses small matches so the delta size increases. A two-tiered approach might work: a longer hash for the whole file and a smaller hash for a local region.

Having said all that, 'simplicity' and 'good enough' also have their own considerable charm and appeal so I might just clean it up and do some better testing and tuning and leave it at that.

Closing Thoughts

Well ... I guess that was a fun afternoon, ... which seems to have gone well beyond midnight, oops - these long articles take hours to write. Given I went in blind i'm pretty surprised that the results are at least comparable to existing code with such modest effort.

I'm also surprised - or maybe not surprised if i let my ego talk - at how simple the algorithms actually are compared to the complexity of any implementation i've seen. And it's a little difficult to see this case as a matter of being a subject expert thus making it appear simpler than it really is. The Bentley/McIlroy paper mentions a working implementation in ~150LOC of C which was hard to believe looking at these libraries; but now i do.

And also ... back to work (later) today. It's been an indescribable 3 month break, but not as good as that sounds. I guess I lost 5 kilos in the last 6 weeks.

Wednesday 18 March 2015

from link to dlsym call

I've been spending an inordinate amount of time the last few days playing with make and doing some work on zcl. Been staying up way too late and then doing it all again the next day. Been a bit of fun poking at little problems though.

The java makefile system i've been working on has got a little fiddly due to some features i'm trying to implement; but so far all problems have been solvable in a fairly straightforward way and it's mostly now taking time because i'm writing some fairly long article(s) about the work. Most of the complication is with generated source and resource files and dealing with the command line to jar which is like but not quite the same as tar.

So I took a break to play with zcl again. Because the sdk version of libOpenCL may not match the system version i've added an option to compile with dynamic linking - i think the linker can also do this automagically but it's something i'd need to learn about first. But just changing it over turned out to be pretty easy an infact most of the work is hidden in a single auto-generated file together with a few small bits of checking to throw exceptions when functions are not available.

I wrote a horrible bit of perl that parses the very strict format of cl.h and creates a list of static function pointers and an init function to set them up, as well as C macros which automagically change the source to go via the function pointers rather than via dynamic linking.

So for example it creates the following bits for a given function:

...
static cl_int (*pclBuildProgram)(cl_program, cl_uint, const cl_device_id *, const char *, void (CL_CALLBACK *)(cl_program, void *), void *); 
...
static void init_clfunctions(void) {
        ...
        pclBuildProgram = cl_symbol("clBuildProgram");
        ...
}
...
#define clBuildProgram (*pclBuildProgram)
...

And I just include this at the head of the file and call init during the JNI OnLoad. Saved most of the work since none of the rest of the code needs to change unless I want to add pointer null checks. This opens the possibility of a "sdk-free" build of zcl since without the need to link it can get everything it needs from the header files which I can bundle.

I also looked into a mechanism for implementing CL extensions - I think i've worked out a decent solution (~interfaces on the platform) but haven't (fully) implemented any yet. I'm having second thoughts on whether the array interfaces for asynchronous calls for buffers are really adding enough to counter the work required to implement them and the bloat of the api size, but for now they will remain.

It's not really so "simple binding for java" anymore at ~10KLOC, but then again most of the parts are simple on their own.

After I remembered how to get going in OpenCL I did a little bit of testing on some of the new interfaces in opencl 2.0 - so far they seem to function; although for some reason my catalyst driver is only reporting the CPU device so it's not the best test - i think i need to reboot the machine because of memory fragmentation. Netbeans has turned into (more of) a pig of late and a few hundred tabs in firefox tends to swallow (all) memory. I guess I made a mistake on the memory size in this machine.

Monday 16 March 2015

zcl 0.4

Another thing over the weekend I did was update zcl to cover OpenCL 2.0 and drop out a 0.4 release. It's completely untested at this point.

See the homepage

Once I do some testing and perhaps revisit the libCL linking mechanism I guess I'll do a 1.0, but given 0.3 was a year ago, ... yeah - don't hold your breath :)

metaprogramming make

I spent the last few days mucking about with writing some in-depth cookbook articles on using make to build java projects but only just came across this cool series of articles about metaprogramming make today. I haven't read it all yet but i wish i'd known about using $(info ) earlier.

As a quick preview the following is a roughly minimal example of the sort of GNU Makefile i've got working:

JAVA_PROGRAMS = thing

thing_JAVA_DIRS=src
thing_JAVAC_LIBS=lib/library-0.jar     # compile-time
thing_JAR_LIBS=lib/library-0.jar       # run-time & added to manifest
thing_MANIFEST=thing-manifest          # sets entry point or other

DIST_VERSION=-0.0
DIST_NAME=thing
DIST_EXTRA=Makefile java.make COPYING

include java.make

This provides jar, sources, javadoc, package (binary with bundled libs, source + doc jars), and dist targets automatically. All build stages go into 'build' and all final outputs go into 'bin'.

java.make is only about 100 lines of make code, mostly a template for all the targets.

I've also started on a junit.make which does some basic junit stuff but I don't really know much about using that and running it outside of some 'mega-tool' doesn't seem very common.

Friday 13 March 2015

ahh shit, google code is being scrapped

Joy. I guess it's not making google enough money.

I'll make a copy of all my projects and then delete them. Probably sooner rather than later (probably immediately because i'd rather get it out of the way, i have a script going now). I'm keeping all the commit history for now although i never find it particularly useful. Links in old posts may break.

Some of them may or may not appear later somewhere else, or they may just sit on my hdd until i forget about them and delete them by accident or otherwise.

Wherever they end end up though it wont be in one of the other commercial services because i'm not interested in doing this again.

I need to do the same to this blog, ... but that can wait.

Update: So ... it seems someone did notice. Don't worry i've got a full backup of everything. In part I was pissed off with google and in part I wanted to make it immediately obvious it was going away to see if anyone actually noticed. Who could know from the years of feedback i've never received.

If you are interested in particular projects then add a comment about them anywhere on this blog. All comments are moderated so they wont appear till i let them through but there is no need to post more than once (i'll delete any nonsense or spam unless it makes you look like a bit of a wanker). I will see about enabling anonymous comments for the moment if they are not already on. I can then decide what to do depending on the interest.

I will not be using github or sourceforge nor the like.

In a follow-up post i'll see where i'm going with them as well as pondering the future location and shape of the blog itself. I've already written some stuff to take the blogger atom stuff, strip out the posts, download images, and fix the urls.

(in resp to Peter below) I knew the writing was on the wall when google code removed binary hosting: it's kinda useless for it's purpose without it. This is why all my new projects subsequent to that date are hosted differently.

Thursday 12 March 2015

glNext/Vulkan/SPIR/stuff

Decided it was time i had a look into 'the state of things' wrt these new apis. I've not been keeping up with OpenCL of late either so I guess I should refresh on that (oh that's right, c++ kernels, well fuck that shit).

Vulkan definitely looks interesting. I mean vulkan looks like a huge amount of boilerplate from what little is available so far (the GDC 2015 slides from khronos) but it all makes plenty of sense. Pity it's still work in progress so all I can do is read about snippets. One hopes JavaFX will use it as a backend in the future though, and maybe i'll do a ZVK to compliment ZCL myself (it's good way to learn about an api).

There's a bizarre thread on the khronos forums where one poster wants to scare away beginner programmers from Vulkan least they get confused by such wild concepts as memory allocation and asynchronous synchronisation. That's ... well that's just wrong. All that information hiding in OpenGL is what makes it such a pain to understand beyond trivial cases, particularly since fixed pipeline went away. If people are using C (as they must to use it) then malloc is hardly an insurmountable barrier. And asynchronous sync is the only technique which makes multi-threading code manageable and really should be the main technique for MT code taught in any software engineering course. To be honest I think it's all that learning about critical sections and mutexes which do all the damage and make multi-threading seem so complicated and they should be something unlearnt as quickly as possible (as a general MT sync technique that is, obviously they are needed if only to implement more useful primitives like queues and barriers).

Anyway we know how it will play out for beginners and learning professionals alike: we'll all just use google to find the first example of the boilerplate that compiles and then keep using that for years. If I was khronos I'd make sure it's their up to date examples and documentation that using google finds and not some shitty agglomeration site full of out-dated shit.

SPIR-V

SPIR looks a little "odd" to me at first glance, but i'm not a compiler writer so it probably should. To my mind i would describe it (a bit) like assembly language that has been pre-processed and annotated ready for an optimising compiler and then it's internal object representation flattened into a flat global space referenced by index. Rather than registers or addresses results are referenced by the instruction that produces them. It should make a frontend compiler fairly simple as the compiler doesn't need to deal with registers or stacks, and it makes a backend a bit simpler as some of the work is already done. And apparently it's not a file format, although it sure looks like one and i can't imagine it would make sense to anyone to create a different one to store the same information.

I wonder what this mean for HSA (or infact, any other compiler system using it's own virtual processor)? I've not looked into it so maybe something is already announced but if not i suspect SPIR is in it's future too. BRIG takes quite a different approach in that a specific register-based virtual machine is targeted and it's up to the frontend compiler to perform register optimisation and the backend isn't (supposed to be) much more than a machine code translator (which is still strictly speaking, a compiler). I just can't see hardware vendors keeping both maintained.

Oh my gods, it's full of compilers ...

One realisation I had when playing with the soft-GPU code on the parallella was that modern GPU drivers are mostly just compilers with a bit of memory and queue management bolted on the side. And similarly that modern GPUs are really just moderately wide vector CPUs with a little bit logic for fragment generation and texture lookup. of HSA itself is more about compiler writers than the end-user too.

It's just a bit of a personal bummer because although compilers are a bit interesting they're not enough interesting to me to spend the years required to get up to speed.

Tuesday 10 March 2015

dun dit dun

Following on from the last post I did some i/o sutff - went with an s-expression based thing since it's a trivial parser and easy to use - then I rewrote the whole model and some of the ui again. Added zoom for instance. Then I played with it for a while but couldn't manage to create the sort of sounds I was trying to create. I couldn't find modules that did what I wanted to do and trying to construct them from primitives got messy and didn't work either.

I was starting to lose interest.

So then I wrote a whole new synthesiser from scratch followed by enough of a midi sequencer to get a midi of Lazy Jones working - and I think it sounded better than the bundled Java midi synthesiser (more hard/harsh just like the original). It uses a thread to render a frame of samples at a time based on a master clock and uses a simple 'runLater()' style mechanism to synchronise parameter updates on the various components so threading issues are just hidden.

I tried a few other files and none worked either at all or very well but after another day of hacking I've got fully polyphonic voices and adsr envelopes and so on and so forth. Most midi files apart from Lazy Jones still sound kinda shit because I don't have any real instruments or the midi processing units (and well, it's midi), but they do play with the right timing and tone (I think, hard to know since midi files usually sound a bit out to me). For efficiency I created a multi channel oscillator and envelope bank which works pretty well but limits the instruments to simple waveform + ADSR envelope.

Most of it was pretty straightforward considering i just did it off the top of my head but I found I had to do some filtering on the amplitude changes to avoid pops when using a sooth waveform. One problem I have is that square and sawtooth are just so much louder than triangle; some filtering seems to help but I guess that's just the way it is and it's up to the musician to compensate.

Probably wont get much farther than that but it was a fun few days. It can be made an interesting real-time problem and parts are paralellisable so if I was keen i'd probably try some epiphany code but i don't think i'm that keen and it's fairly light cpu wise anyway.

I guess I forgot about the space invaders sfx. If the latency is acceptable i might try just using my synthesiser as a real-time sound mechanism, maybe maybe.

Friday 6 March 2015

Interactive synthesiser

Rather than 'play some more' i went head-first into writing a better display/model/control thing for the synthesiser stuff.

Doesn't look much different on the surface since the stylesheet is much the same but pretty much everything is backed by controls and skins and observable properties.

I have a 'Workbench' control which is the main controller and it handles the human wiring process and so on. I abstracted the connection points as 'Sockets' in a way which allows me to add other non-JSyn objects such as the dials which just feed out their value to some input port. Not shown but you can also reconfigure input ports to change their name and bounds so that they connect properly to dials.

I added the ability to unplug and move the ends of a given wire which works quite nicely. If you grab a wire near one end and 'pull' it far enough it detaches and becomes a free end, which can either be attached to some other compatible socket or letting go just deletes that wire.

I/O next I guess? Always the pain ...

FWIW The circuit above makes a pretty basic "idling chopper" "fop-fop" sound.

Thursday 5 March 2015

Connecting the dots (with wires)

Yesterday I hacked up a "workbench" gui for JSyn. I just started experimenting with auto-generating the gui based on the UnitGenerator interfaces, it's designed with this in mind so it is fairly simple.

Blocks are added by clicking on the workspace, wires are connected by dragging a line between the discs, blocks can be arbitrarily dragged around, wires or blocks are deleted with delete or backspace keys; typical basic graphical user interface stuff.

One of the more difficult problems was how to wire up the ports ...

The difficulty arises because one uses container classes to form grouped gui components and do the layout, and then you want to be able to bind a linkage between an inner component of the group and components working in a different local coordinate system.

A diagram of the layout hierarchy as it currently exists helps to explain (this is just a prototype so the layout is a bit shit).

     Ellipse    Label
        +----+----+
            HBox
             +
          EndPoint (Pane)
       - ----+---- -
            VBox
       - ----+---- -
         Generator (BorderPane)    Wire (Line)
       - ----+-----------------------+
           Pane

So the Line representing the wire needs to connect to the Ellipse level, but the Line is in the coordinate space of the lower Pane and the Ellipse is in the coordinate space of the HBox which is in the coordinate space of the EndPoint, which is ...

So I created a new set of properties socketX, socketY, which reside on each EndPoint which maintain the location of the centre of the Ellipse relative to the lower Pane. The calculation is straightforward and based on transforming to and from scene-relative coordinates. It gets updated whenever the EndPoint is moved (relative to the Generator, i.e. after layout), or when the Generator is moved (i.e. relative to the Pane). The Wire can then just bind to these locations using javafx properties. I've got each generator maintaining the endpoint updates relative to it's parent but i'll probably have to come up with something better.

There's still a problem to solve though; the user interface allows you to plug a wire in to one socket and then drag it to another socket (i.e. click and drag). Unfortunately javafx currently has no way of picking a widget - i.e. looking it up by location or by the mechanism the event system uses. At first I tried using drag and drop mechanisms but it interfered with other gui operations and I couldn't find a way to show the wire following the mouse pointer.

So unfortunately I have to maintain a separate list of all the EndPoints in a model which I then scan for a match during the drag gesture. But given that this allows me to use layouts and stylesheets and so on, this is acceptable.

Duty Cycle

I created a new square wave oscillator type for JSyn as i was surprised the one it comes with doesn't have a duty cycle (pulse with) parameter. It was one of the more useful features of the square wave oscillator on the SID chip and features heavily in C=64 music. I think it could be created using a sawtooth generator and a gate but this is easier to use and was simple to make.

The circuit above generates a rather grating undulating / phasing sound effect like a 'robot thinking', or with a lower frequency a (metallic) misfiring engine.

I guess i'll add more of the unitgenerators from JSyn and do some experimentation with the code I have. If it's interesting enough i will probably dig deeper into making it a real application; it will need a better system model and probably a way to create a control panel separate from the inner workings of a circuit, as well as oscilloscope and spectrum analyser blocks (JSyn has most of this but it's all Swing).

Tuesday 3 March 2015

bang!

Been getting pretty bored of being stuck in the house. Enough so that I coded up a basic space invaders game for something to do. At this point almost everything is there apart from the mother ship going past and some other game state things or animations. I've tried to keep the ship movement and wave structure correct (something clones usually try to 'improve') based on the C=64 version.

I'm using javafx for the graphics (i.e. most sprites are just a Rectangle - the actual shape isn't terribly important for the mechanics, and saves copyright issues) but the barriers are WritableImage's and i do the hit detection and bomb erosion on separate image arrays that back them. Because of the way the hit detection works it would probably be easier just to do everything on a bitmap manually but I didn't feel like starting from scratch after starting with a scenegraph version.

Yesterday I added some sound effects. The invader explosion sound is a flat beep but the rest are more or less correct including the 4-note 'tune' which speeds up as the invader movement does. It doesn't sound right when there's only one left but otherwise it's close.

This lead me to play a bit with the java sound api ... which is somewhat better than I thought it was; e.g. it provides automatic mixing and so on. It also meant I had to do some sound synthesis for the tune and effects. So far i'm generating sample Clip objects from some synthesis code I wrote myself but today I looked into it deeper and found some existing synthesiser libraries that should let me generate some better sounds. But i'll need to play with them a fair bit before getting anywhere. Actually unless I find something i'll probably have to write some tools for it anyway which starts to make it a fairly significant effort.

Last time i played with synthesisers was with the SID around 1990 and I never got very far with that, so i'm starting from a point of somewhat ignorance here.