I think the benchmark can be fairly simple. First, you use samples and pixel comparison to confirm that all features of a codec are supported. Second, you run something like timecodec to measure the FPS the codec achieves on a given architecture. Third, you measure how well the codec deals with errors in the stream by using files that are broken in a specific way. From these three values you can calculate several others, including a weighted percentage.
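To make the last step concrete, here is a rough sketch of how the three measurements could be folded into one percentage. Everything in it is an assumption for illustration: the `CodecResult` fields, the `weighted_score` function, the default weights, and the idea of normalising FPS against a chosen `reference_fps` are all hypothetical, not something the benchmark already defines.

```python
from dataclasses import dataclass


@dataclass
class CodecResult:
    """Raw measurements for one codec on one machine (hypothetical layout)."""
    features_passed: int  # feature samples whose output matched pixel for pixel
    features_total: int   # feature samples tested
    fps: float            # decode speed measured on the test machine
    broken_decoded: int   # deliberately broken files the codec coped with
    broken_total: int     # deliberately broken files tested


def weighted_score(result, reference_fps, weights=(0.5, 0.25, 0.25)):
    """Combine the three measurements into a single percentage (0 to 100).

    The weights (features, speed, error handling) are placeholders and
    must sum to 1. Speed is capped at reference_fps so that a very fast
    codec cannot hide missing features.
    """
    w_feat, w_speed, w_err = weights
    if abs(w_feat + w_speed + w_err - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    feature_ratio = result.features_passed / result.features_total
    speed_ratio = min(result.fps / reference_fps, 1.0)
    error_ratio = result.broken_decoded / result.broken_total

    return 100.0 * (
        w_feat * feature_ratio
        + w_speed * speed_ratio
        + w_err * error_ratio
    )


if __name__ == "__main__":
    # Made-up numbers, only to show the call.
    example = CodecResult(
        features_passed=18, features_total=20,
        fps=45.0,
        broken_decoded=6, broken_total=8,
    )
    print(f"{weighted_score(example, reference_fps=60.0):.1f}%")
```

How to weight the three parts is the real design question. The split above favours feature coverage only as a starting point for discussion.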