A Fabulous Daala Holiday Update
Dec. 23rd, 2014 04:09 pm
Before we get into the update itself, yes, the level of magenta in that banner image got away from me just a bit. Then it was just begging for inappropriate abuse of a font...
Ahem.
Hey everyone! I just posted a Daala update that mostly has to do with still image performance improvements (yes, still images in a video codec - go read it to find out why!). The update includes metric plots showing our improvement on objective metrics over the past year and relative to other codecs. Since objective metrics are only of limited use, there are also side-by-side interactive image comparisons against JPEG, VP8, VP9, x264 and x265.
The update text (and demo code) was originally written for a July update, as most of the still image work happened in the beginning of the year. That update got held up and was never released officially, though it had been discovered and discussed at forums like Doom9. I regenerated the metrics and image runs using the latest versions of all the codecs involved (only Daala and x265 improved) for this official better-late-than-never progress report!
What about thinking outside the box of I-/B-/P-Frames?
Date: 2014-12-25 07:10 pm (UTC)
I enjoy reading the updates on Daala, not so much because I'm urgently waiting for the next video codec, but because the fresh ideas about the technology are interesting to read.
While preceding updates contained a lot of "out of the box" thinking, I was surprised at how "conventional" your description of the challenge of encoding I-frames came across.
Did you ever consider ditching the whole I-/B-/P-frame methodology? If not, please do: after all, all frames are displayed for the same length of time, so there's no obvious sense in spending a lot of bits on a few of them and only a few bits on most of the others. When you say that there have to be reference frames for seeking and such: that's not actually true. It would be just as possible to encode a Group Of Pictures as a whole, where decompression would yield all frames of the GOP at once. If experience with Vector Quantization has shown us one thing, it is the efficiency of encoding as many correlated pieces of information as possible - and the frames inside a GOP certainly are highly correlated. So why not treat time (inside a GOP) as just one more dimension of vectors that also contain spatial and color information in the other dimensions, and encode them all together, spending the available bits on all frames of the GOP equally instead of handling some "I-frame" specially?
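As a rough illustration of the idea (this is my own toy sketch, not anything from Daala): if you stack a few highly correlated frames into one 3D block and run a single separable 3D DCT over it, with time treated as just another dimension, the energy does compact into a small corner of low-frequency coefficients.

```python
import numpy as np
from scipy.fft import dctn

# Hypothetical toy GOP: 4 frames of an 8x8 gradient that brightens slightly
# per frame, i.e. highly correlated like real neighboring frames.
frames = np.stack([
    np.linspace(0, 1, 64).reshape(8, 8) + 0.05 * t
    for t in range(4)
])  # shape (4, 8, 8): time, height, width

# One separable 3D DCT over the whole block, time as just another dimension.
coeffs = dctn(frames, norm='ortho')

total = np.sum(coeffs ** 2)
# Low temporal and spatial frequencies only: a 2x4x4 corner of the block.
corner = np.sum(coeffs[:2, :4, :4] ** 2)
print(corner / total)  # nearly all of the energy sits in that small corner
```

Of course this toy block is unrealistically well-behaved; whether the same compaction survives real motion is exactly the question.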
I understand that encoding/decoding whole GOPs requires some memory and would impose a lower limit on latency for streams, depending on the number of frames in a GOP - but hey, that's not so different from h.264 etc., where there are already multiple frames depending on each other such that you have to decode "future" frames before you can display the "current" one...
I hope I didn't just miss any preceding discussion of the ideas above; if so, I apologize and would be grateful for a link to it.
Thanks for listening and keep up the good work!
Re: What about thinking outside the box of I-/B-/P-Frames?
Date: 2014-12-25 09:01 pm (UTC)
Motion compensation via 3D transform is likely kind of doomed. Frames are, relatively speaking, 'far apart' temporally, and the motion changes between them aren't very smooth. Not much useful redundancy just falls out from handling the block of frames all at once, and then you have to buffer multiple hundreds of megabytes of frame data, gigabytes for HD. Someday that much memory will be free, but by then we'll likely be up to super-mega-128k+-UHD video and we'll need orders of magnitude more.
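A quick back-of-envelope check of the buffering claim, assuming 8-bit 4:2:0 frames (1.5 bytes per pixel) and a hypothetical 64-frame GOP; other formats or bit depths only push the numbers up:

```python
# 1080p frame in 8-bit 4:2:0: full-resolution luma plus two quarter-
# resolution chroma planes averages out to 1.5 bytes per pixel.
width, height = 1920, 1080
bytes_per_pixel = 1.5
gop_frames = 64          # hypothetical GOP length for this estimate

frame_bytes = width * height * bytes_per_pixel
gop_bytes = frame_bytes * gop_frames

print(frame_bytes / 2**20)  # ~3 MiB per frame
print(gop_bytes / 2**20)    # ~190 MiB for the whole GOP
```

At 4K resolution the same GOP is roughly four times larger, which is where the "gigabytes" figure comes from.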
Re: What about thinking outside the box of I-/B-/P-Frames?
Date: 2014-12-26 12:01 am (UTC)
But even if it's not possible to avoid keeping frames somewhat separate with regard to their temporal placement - wouldn't it still be possible, and even beneficial, to overcome the paradigm of "one (I-)frame is the golden reference, taking many more bits to encode than every other frame in a GOP"?
I would assume that even if you choose which frame to encode as an I-frame cleverly, chances are that this I-frame will contain parts (e.g. out-of-focus or motion-blurred areas) that could have been derived more efficiently from another frame in the GOP, where the same objects are more in focus or less motion-blurred.
I could envision that all frames of a GOP are first scanned for regions that are (a) rich in detail and (b) have less detail-rich counterparts in other frames of the GOP, and then any frame of the GOP could be declared "the reference frame for a certain region", to which the other frames only encode differences.
BTW: Has "blurring" a region, in general, ever been considered as a useful transformation for predicting part of a frame from another frame that holds a "sharper" version of the same region? I would expect that one could often find both "blurred" and "sharp" versions of the same objects within a sequence of frames, due to motion of that object starting or stopping.
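A minimal sketch of why this could help, using a naive box blur as a stand-in for whatever blur model an encoder might signal (all names here are made up for illustration): if the target region really is a blurred version of a sharp reference, blurring the reference before differencing leaves a much smaller residual than differencing against the sharp reference directly.

```python
import numpy as np

def box_blur(img, k=3):
    """Naive k x k box blur with edge padding (illustration only)."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

rng = np.random.default_rng(0)
sharp = rng.random((16, 16))                  # stand-in for an in-focus region
noise = 0.01 * rng.standard_normal((16, 16))  # small coding/sensor noise
target = box_blur(sharp) + noise              # same region, motion-blurred

# Residual after predicting the target from a blurred vs. a sharp reference.
residual_blur = np.abs(target - box_blur(sharp)).mean()  # ~ noise level
residual_raw = np.abs(target - sharp).mean()             # much larger
print(residual_blur, residual_raw)
```

The open question an encoder would face is signalling which blur (kernel, strength) to apply per region cheaply enough to be worth it.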
While I am in brainstorming mode, one more completely different, wild idea: you've certainly heard of seam carving and the C/C++ library "Liquid Rescale" that implements it. I wonder whether anybody has ever considered using seam carving for compression purposes. I am not quite sure this would work, but in theory one could retarget an image to a smaller size during compression (finally compressing the resulting smaller image) and do the reverse during decompression. That, of course, would lose information, and maybe it's of no practical use. But unless falsified, one could speculate that an image retargeted to really small dimensions might be usable as an interesting "prediction" starting point for reconstructing the full-size image, because seam carving tends to get rid of image areas that aren't so important to human viewers anyway.
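For readers unfamiliar with the mechanics being proposed, here's a bare-bones sketch of one seam-carving step (my own toy code, unrelated to Liquid Rescale's implementation): compute a gradient-based energy map, find the minimum-energy vertical seam with dynamic programming, and delete one pixel per row along it.

```python
import numpy as np

def remove_vertical_seam(img):
    """Remove one minimum-energy 8-connected vertical seam (illustration only)."""
    h, w = img.shape
    # Simple gradient-magnitude energy map.
    energy = np.abs(np.gradient(img, axis=0)) + np.abs(np.gradient(img, axis=1))
    # DP table: cheapest seam cost reaching each pixel from the top row.
    cost = energy.copy()
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(x - 1, 0), min(x + 2, w)
            cost[y, x] += cost[y - 1, lo:hi].min()
    # Backtrack the cheapest seam from the bottom row upward.
    seam = [int(cost[-1].argmin())]
    for y in range(h - 2, -1, -1):
        x = seam[-1]
        lo, hi = max(x - 1, 0), min(x + 2, w)
        seam.append(lo + int(cost[y, lo:hi].argmin()))
    seam.reverse()
    # Drop one pixel per row along the seam.
    return np.array([np.delete(row, x) for row, x in zip(img, seam)])

img = np.random.default_rng(1).random((10, 12))
smaller = remove_vertical_seam(img)
print(smaller.shape)  # one column's worth of pixels carved out
```

The compression question would then be whether the decoder can cheaply recover (or be told) where the removed seams were, so the carved-down image can be re-expanded as a prediction.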
Hope you don't get bored reading my ideas, but I had to spill them somewhere :-)
Re: What about thinking outside the box of I-/B-/P-Frames?
Date: 2015-01-04 12:12 am (UTC)
Yes, I know, there would be signalling costs, but maybe there could be some way to make this efficient...