MPEG-4: The Interactive Revolution
MPEG-4 is an international standard (ISO/IEC 14496) developed by the
Moving Picture Experts Group (MPEG). It defines the deployment of non-proprietary
multimedia content independently of platform or transmission medium. It has
drawn on a number of existing technologies while at the same
time adding a number of innovative tools and concepts.
It is also more than just another highly efficient standard for delivering
video and audio sequences. It integrates a number of key technologies into
a solid and uniform platform by avoiding the use of proprietary and non-interworking
formats, while promoting and delivering user interactivity to the wider consumer
market, beyond the desktop computer.
Traditional radio and television broadcasting has up until now been a process
of preparing the entire presentation at the point of origin (i.e. radio or
TV studio), where the individual components, such as video footage, titles,
background music, voice-over, etc., are mixed together in the studio, modulated
onto a carrier and transmitted over the airwaves.
An internet web page presentation on the other hand, does exactly the opposite.
All the mixing and rendering is performed at the receiver, i.e. the user’s
computer. A server sends the makeup of the presentation in textual form as
a series of commands using a language known as HyperText Markup Language or
HTML. The receiving programme, known as a web browser, interprets the commands
and then negotiates the download of each component used in the presentation,
which it then renders on the user’s screen.
There are pros and cons with each system. Perhaps the biggest difference between
them is that broadcast TV is a one-way communications system, where the viewer
is passive, whereas with the internet, the user is an active participant. However,
broadcast TV offers much better quality and performance for real time transmission
of video content.
The above comparison provides an insight into part of what the MPEG-4 standard
attempts to integrate. In the executive overview of the Overview of
the MPEG-4 Standard from the International Organization for Standardization
(ISO), the following extract summarizes very succinctly what the standard aims
to achieve: MPEG-4 builds on the proven success of three fields:
- Digital television
- Interactive graphics applications (synthetic content)
- Interactive multimedia (World Wide Web, distribution of and access to content)
MPEG-4 provides the standardized technological elements enabling the integration
of the production, distribution and content access paradigms of the three fields.
Probably the major advancement that MPEG-4 introduces is a new level of interactivity,
where the user (not just a viewer anymore) takes part in the presentation.
This interactivity is even more pronounced than it is with the internet.
The standard also caters for a wider range of devices, operating over varying
communications channels, from broadcast TV and broadband networks down to low
bit-rate networks such as dial-up connections and wireless networks (mobile
telephone). As with the internet, content compilation and composition are performed
at the receiving equipment. Unlike the internet, an MPEG-4 enabled receiving
device is capable of intelligently rendering the presentation depending on
its capabilities or limitations, scaling down content when necessary or even
ignoring it altogether. This means that content needs only to be designed once,
leaving it up to the receiving device to decide what and how to render.
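
As a rough illustration of that idea, the sketch below (with invented object
kinds, bit rates and device profiles, rather than anything taken from the
standard itself) shows a receiver deciding, object by object, whether to render
it fully, scale it down, or skip it.

```python
# Hypothetical sketch of receiver-side adaptation; object kinds, bit rates and
# device limits are invented for illustration, not taken from the MPEG-4 standard.
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    kind: str            # e.g. "video", "audio", "3d_mesh"
    bitrate_kbps: int    # bandwidth the full-quality version would need

@dataclass
class DeviceProfile:
    max_bitrate_kbps: int
    supports_3d: bool

def plan_rendering(scene, device):
    """Decide, object by object, whether to render fully, scale down, or skip."""
    plan, budget = {}, device.max_bitrate_kbps
    for obj in scene:
        if obj.kind == "3d_mesh" and not device.supports_3d:
            plan[obj.name] = "skip"                 # device cannot handle it at all
            continue
        if obj.bitrate_kbps <= budget:
            plan[obj.name] = "render in full"
            budget -= obj.bitrate_kbps
        else:
            plan[obj.name] = "render scaled down"   # e.g. lower resolution or frame rate
            budget = 0
    return plan

scene = [SceneObject("main video", "video", 800),
         SceneObject("voice over", "audio", 64),
         SceneObject("spinning logo", "3d_mesh", 120)]
phone = DeviceProfile(max_bitrate_kbps=400, supports_3d=False)
print(plan_rendering(scene, phone))
```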
The eventual aim is that the TV set of the future will be able to handle
interactive multimedia content along with high quality broadcast content. Similarly,
the internet will provide broadcast-quality performance and presentations within
its already multimedia-rich layouts.
As was stated earlier, the internet uses HTML to describe scene content. HTML
on its own, however, only describes the placement of the content and not its
behaviour. Usually, designers need to resort to scripting languages in order
to achieve dynamic, interactive presentations. MPEG-4 provides a much more
versatile language known as Binary Format for Scenes or BIFS, which is
closely related to the Virtual Reality Modelling Language (VRML), a language
widely used on the internet for modelling and manipulating 3-D objects.
BIFS offers a way to describe not only the scene’s contents and their
placement, but also how objects behave in response to user events. It can also
be used to animate objects and change their characteristics dynamically. One
other feature of BIFS, in keeping with the MPEG-4 standard’s overall philosophy
of making the most efficient use of the available bandwidth, is that
it is a binary format and not text as in the case of HTML and VRML, making
it much more compact.
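
To make the compactness point concrete, here is a toy comparison (not the
actual BIFS encoding): the same scene update, "move object 7 to (120, 45)",
expressed once as markup-style text and once as a few packed binary bytes.

```python
import struct

# A toy scene update: "move object 7 to position (120, 45)".
textual = 'move object="7" x="120" y="45"'        # markup-style text form

# The same update packed as binary: one command byte, an object id, and two
# 16-bit coordinates. Purely illustrative; not the actual BIFS encoding.
MOVE_COMMAND = 0x01
binary = struct.pack(">BBhh", MOVE_COMMAND, 7, 120, 45)

print(len(textual), "bytes as text")
print(len(binary), "bytes packed as binary")
```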
At the heart of any presentation is its content, and MPEG-4 is very precise
about how that content is defined. A single element used in a presentation
is referred to as an object. Objects can be combined to produce compound objects,
and a collection of objects and/or compound objects make up a scene.
An object can be natural or synthetic (i.e. computer generated), and includes
still images, audio, video, text, 2D and 3D meshes, and synthetic face and
body objects. Each object is independent of all other objects and can exist
in two- or three-dimensional space; this applies even to sound. As has been said, objects can
be combined to form new, compound objects such as a human figure with its associated
voice. The designer or author creates a scene by combining as many objects
and compound objects as are required.
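
That object/compound-object/scene hierarchy is essentially a tree, and can be
pictured with a few lines of Python. The class and attribute names below are
invented for this sketch; they are not the standard’s actual node types.

```python
# Illustrative scene-graph model: objects, compound objects, and a scene.
class MediaObject:
    def __init__(self, name, kind, **attrs):
        self.name = name
        self.kind = kind        # "video", "audio", "image", "mesh", ...
        self.attrs = attrs      # placement, size, colour, and so on

class CompoundObject:
    """A group of objects treated as one unit, e.g. a figure plus its voice."""
    def __init__(self, name, children):
        self.name = name
        self.children = children

class Scene:
    def __init__(self):
        self.nodes = []
    def add(self, node):
        self.nodes.append(node)

# A human figure combined with its associated voice track.
presenter = CompoundObject("presenter", [
    MediaObject("presenter video", "video", position=(100, 50)),
    MediaObject("presenter voice", "audio"),
])

scene = Scene()
scene.add(MediaObject("backdrop", "image", position=(0, 0)))
scene.add(presenter)
print(len(scene.nodes), "top-level nodes in the scene")
```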
An obvious advantage in using objects is reusability, since an object is
defined once but can be used as often as required in a scene, with each instance
having different characteristics such as size or colour. Perhaps the major
advancement, however, is the ability for the author to allow the end-user
to interact with the scene’s objects: move them to a different
location, change an object’s characteristics, or change the viewpoint. For
example, in a hypothetical advertisement where the latest-model four-wheel-drive
is cruising down a serene country road, the user can change the colour of
the car, swap in the 2-door or 4-door model, add or remove roof-racks,
or even change the background scenery from an autumn to a summer’s day.
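
In that advertisement, each car could simply be an instance of a single object
definition, with a user event rewriting one of its characteristics. A minimal
sketch of that idea, with invented attribute names rather than real BIFS fields:

```python
# One object definition, several instances, each with its own characteristics.
car_model = {"doors": 4, "colour": "silver", "roof_racks": False}

def make_instance(template, **overrides):
    """Create an independent copy of the object with per-instance attributes."""
    instance = dict(template)
    instance.update(overrides)
    return instance

car_in_scene = make_instance(car_model, colour="red")
car_in_inset = make_instance(car_model, doors=2, colour="blue")

def on_user_event(instance, field, value):
    """A user event changes one characteristic of a rendered instance."""
    instance[field] = value

on_user_event(car_in_scene, "colour", "green")    # viewer repaints the car
on_user_event(car_in_scene, "roof_racks", True)   # ... and adds the roof-racks
print(car_in_scene)
print(car_in_inset)
```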
Special consideration has also been given to synthetic or computer generated
media for both graphics and audio. As an example, text to speech takes plain
text as input, with parameters that characterise the type of voice to be used
(age, gender, speech rate), and generates intelligible synthetic speech. A
special language called Structured Audio Orchestra Language (SAOL) allows for
the definition of instruments that can emulate characteristic sounds of natural
acoustic instruments. These instruments can be combined to produce an orchestra,
while a musical composition or score is created by sending a time-sequenced
set of commands through a language known as Structured Audio Score Language
(SASL). Added functionality, which can also be applied to other audio objects,
includes speed change without a change in pitch or a change in pitch without
altering the time scale.
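
Conceptually, then, SAOL describes how an instrument produces sound and SASL
supplies the time-sequenced score that plays it. The Python sketch below mimics
that split (an instrument as a function from pitch and duration to samples, a
score as a list of timed note events); it is not SAOL or SASL syntax.

```python
import math

SAMPLE_RATE = 8000

def simple_instrument(frequency_hz, duration_s):
    """Stand-in for a SAOL instrument: a plain, linearly decaying sine tone."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * frequency_hz * t / SAMPLE_RATE) * (1.0 - t / n)
            for t in range(n)]

# Stand-in for a SASL score: (start time in seconds, frequency in Hz, duration).
score = [(0.0, 440.0, 0.5),
         (0.5, 554.4, 0.5),
         (1.0, 659.3, 1.0)]

def render_score(score, instrument):
    """Mix the time-sequenced note events into a single sample buffer."""
    total = int(SAMPLE_RATE * max(start + dur for start, _, dur in score))
    out = [0.0] * total
    for start, freq, dur in score:
        offset = int(SAMPLE_RATE * start)
        for i, sample in enumerate(instrument(freq, dur)):
            out[offset + i] += sample
    return out

samples = render_score(score, simple_instrument)
print(len(samples), "samples rendered")
```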
In the realm of synthetic graphics, things are just as interesting. Among
others, two objects are defined, the facial animation object and the body
animation object, both working in much the same way. Facial Definition Parameters
(FDP) and Facial Animation Parameters (FAP) control facial characteristics
while the corresponding Body Definition Parameters (BDP) and Body Animation
Parameters (BAP) control the virtual body model. It is thus possible to describe
and render a face with almost any characteristics, which can change dynamically
and whose behaviour can also be defined. Lip movement can be synchronised with
text-to-speech output (creating a compound object) to produce a talking head.
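
The definition/animation split can be pictured as follows: the definition
parameters fix the face geometry once, and each frame of animation parameters
then displaces a few control points. The sketch below is purely illustrative;
the parameter names are invented and do not correspond to actual FDP or FAP fields.

```python
# Illustrative face model: definition parameters fix the neutral geometry once,
# animation parameters then displace control points frame by frame.
neutral_face = {                    # stand-in for Facial Definition Parameters
    "mouth_left":  (-1.0, 0.0),
    "mouth_right": ( 1.0, 0.0),
    "jaw":         ( 0.0, -1.5),
}

def apply_animation(face, fap_frame):
    """Apply one frame of (invented) animation parameters as displacements."""
    animated = dict(face)
    for point, (dx, dy) in fap_frame.items():
        x, y = animated[point]
        animated[point] = (x + dx, y + dy)
    return animated

# Two frames of a talking head: the jaw drops slightly, then returns.
fap_stream = [{"jaw": (0.0, -0.4)},
              {"jaw": (0.0,  0.0)}]

for frame in fap_stream:
    print(apply_animation(neutral_face, frame))
```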
Artificial movement can also be applied to real objects. For example, a static
two-dimensional image of a flag can be made to wave and flutter in the wind
by mapping a 2-D mesh onto the image. Because the flag and the mesh are tied
together, any mathematical transformation applied to the mesh produces the
corresponding deformations in the underlying image, giving the illusion of movement.
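
A rough sketch of the waving flag, assuming a regular grid of mesh points and a
travelling sine wave as the transformation (the formula is illustrative, not
something the standard prescribes). Only the mesh deformation is shown; the
step that resamples the image pixels to follow the mesh is omitted.

```python
import math

def flag_mesh(width, height, step):
    """A regular 2-D mesh of control points laid over the flag image."""
    return [(x, y) for y in range(0, height + 1, step)
                   for x in range(0, width + 1, step)]

def wave(mesh, time_s, amplitude=4.0, wavelength=40.0, speed=60.0):
    """Displace each mesh point vertically with a travelling sine wave; pixels
    tied to each mesh cell would follow, giving the flutter illusion."""
    return [(x, y + amplitude * math.sin(2 * math.pi * (x - speed * time_s) / wavelength))
            for x, y in mesh]

mesh = flag_mesh(width=120, height=80, step=20)
for t in (0.0, 0.1):                          # two consecutive animation frames
    print(wave(mesh, time_s=t)[:3])           # first few displaced control points
```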
Another feature is alpha coding, whereby for any given graphics-based object,
the level of transparency of the rendered pixels is determined by the corresponding
pixel value in the alpha channel. That is, the value of a pixel in the alpha
channel determines whether the corresponding pixel in the image is visible
or not, and how it blends with any underlying graphical content. This can be
used not only to create arbitrarily shaped objects (i.e. not just rectangular),
but also to provide smooth blending with any background objects, thus avoiding
what is known as aliasing, which appears as jagged edges around any overlaid objects.
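
The blending rule itself is straightforward: with the alpha value normalised to
the range 0 to 1, each rendered pixel is a weighted mix of the object pixel and
whatever lies beneath it, as in this small sketch.

```python
def blend_pixel(foreground, background, alpha):
    """Per-channel alpha blend: alpha 1.0 keeps the object pixel, 0.0 shows the
    background; intermediate values give the smooth, non-jagged edges."""
    return tuple(round(alpha * f + (1.0 - alpha) * b)
                 for f, b in zip(foreground, background))

logo_pixel = (200, 30, 30)       # reddish object pixel (R, G, B)
scene_pixel = (10, 80, 160)      # underlying background pixel

print(blend_pixel(logo_pixel, scene_pixel, alpha=1.0))    # fully opaque
print(blend_pixel(logo_pixel, scene_pixel, alpha=0.25))   # mostly transparent edge
```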
Objects are represented in a special coded form, so that a minimum amount
of information needs to be transmitted to the receiving terminal, and hence
rendering of the object can be accomplished locally on the receiving equipment.
Local rendering provides several advantages. The local equipment can determine
how best to render any object based on its capabilities.
For example, currently on the internet, something as simple as screen resolution
poses a major problem for web creators, who are forced to design for
the most common resolution, with pages overflowing the screen boundaries on
lower-resolution displays and leaving wide, blank spaces on larger ones. And if providers
wish to cater for wireless devices, then a completely separate design needs
to be created, not only due to screen resolutions and capabilities, but also
bandwidth constraints. On MPEG-4 enabled devices this is not a problem, as
the receiving equipment decides how best to render each object.
MPEG-4 boasts highly efficient compression techniques for both video and
audio. MPEG-4’s Advanced Audio Coding codec, for example, offers much
better quality and smaller file sizes than the very popular MP3. The
scheme used to transport the content is just as efficient while at the same
time being highly versatile. All content is conveyed in elementary streams
(ES), and any object may require one or more elementary streams. The standard
further defines object descriptors (OD) to keep track of which elementary streams
belong to each object.
An object descriptor keeps a list of elementary stream descriptors, which
in turn link elementary streams with an object. Elementary stream descriptors
can also be used to describe the type of media decoder required to render the
object, configuration information used by the synchronisation layer, intellectual
property information and quality of service (QoS) requirements.
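
This bookkeeping maps naturally onto two small records: a descriptor for each
elementary stream and an object descriptor that collects them. The field names
below are invented for illustration and do not reflect the standard’s actual
descriptor syntax.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ESDescriptor:
    es_id: int
    decoder_type: str       # which media decoder the stream needs
    qos: str                # e.g. "best-effort" or "guaranteed"
    ip_info: str = ""       # intellectual property information, if any

@dataclass
class ObjectDescriptor:
    od_id: int
    es_descriptors: List[ESDescriptor] = field(default_factory=list)

# One audiovisual object carried over two elementary streams.
talking_head = ObjectDescriptor(od_id=1, es_descriptors=[
    ESDescriptor(es_id=101, decoder_type="video", qos="guaranteed"),
    ESDescriptor(es_id=102, decoder_type="audio", qos="guaranteed"),
])

for es in talking_head.es_descriptors:
    print(f"object {talking_head.od_id} uses stream {es.es_id} ({es.decoder_type})")
```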
Synchronisation is another consideration and is achieved by time stamping
the streams. This is especially important in cases where real-time operation
is necessary. Here, a constant end-to-end delay must be established and
data streams must contain timing information. This timing information establishes
a time base, derived from the encoder’s clock, to which the decoder at the
receiving equipment must synchronise. Time stamps are then attached and indicate
when the data must be decoded and presented.
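
A simplified sketch of that timing model: each access unit carries a decoding
time stamp expressed on the shared time base, and the receiver, whose clock is
assumed to be locked to the encoder’s, waits until its clock reaches the stamp
before decoding and presenting. The structure below is illustrative only.

```python
import time
from dataclasses import dataclass

@dataclass
class AccessUnit:
    decode_ts: float        # seconds on the shared time base
    payload: str

def play(units, time_base_start):
    """Decode and present each unit when the local clock, assumed locked to the
    encoder's time base, reaches the unit's time stamp."""
    for unit in sorted(units, key=lambda u: u.decode_ts):
        elapsed = time.monotonic() - time_base_start
        if unit.decode_ts > elapsed:
            time.sleep(unit.decode_ts - elapsed)   # constant end-to-end delay assumed
        print(f"t={unit.decode_ts:5.2f}s decode and present: {unit.payload}")

units = [AccessUnit(0.00, "audio block 1"),
         AccessUnit(0.00, "video frame 1"),
         AccessUnit(0.04, "video frame 2")]

play(units, time_base_start=time.monotonic())
```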
Streams requiring similar quality of service connections are then multiplexed
together using MPEG-4’s FlexMux tool, before being transported according
to the standard’s Delivery Multimedia Integration Framework specification
(DMIF). The DMIF provides a transparent delivery interface for the application
regardless of whether the source is a remote service or on local storage media.
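
Conceptually, the multiplexing step interleaves time-ordered packets from
several elementary streams into one sequence, tagging each packet with an index
so the receiver can demultiplex it again. The toy sketch below captures that
interleaving; it is not the actual FlexMux packet format.

```python
from heapq import merge

# Each elementary stream is a time-ordered list of (time stamp, payload) pairs.
video_es = [(0.00, "video frame 1"), (0.04, "video frame 2")]
audio_es = [(0.00, "audio block 1"), (0.02, "audio block 2")]

def interleave(*streams):
    """Merge several time-ordered streams into one sequence, tagging each packet
    with its stream index so the receiver can demultiplex it again."""
    tagged = ([(ts, idx, payload) for ts, payload in stream]
              for idx, stream in enumerate(streams))
    return list(merge(*tagged))

for ts, stream_index, payload in interleave(video_es, audio_es):
    print(f"t={ts:.2f}s stream {stream_index}: {payload}")
```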
The MPEG-4 standard has been in circulation for a few years now, and as yet
there hasn’t been a lot of action to speak of. Most of what has gone
on has taken place on the internet. Apple, for example, has adopted the standard
in its QuickTime platform (it should be noted that the MPEG-4 file format is
based on the QuickTime file format), while IBM has released a
free Java-based toolkit for generating MPEG-4 content. IBM claims that,
because of this, the toolkit will run on any platform supporting Java.
If the entertainment industry does not get actively involved, it will most
probably take an MP3-style revolution on the internet to get things moving.
While MPEG-4 is designed to work with almost any bandwidth, in reality it will
be the spread of broadband internet connections that will allow MPEG-4 content
to develop. Another very promising platform is the cell phone where data speeds
are on the increase, and where both manufacturers and service providers are
looking for new ways to attract customers. Of course, there must also be a
regular supply of fresh content, beyond the usual array of trailers that are
currently used for demos.
Certainly, the entertainment industry’s low profile may be justified; it quite
possibly fears a spillover of the piracy of copyrighted material seen with MP3
into visual content. If MPEG-4 is to make its mark though, it will have to
do so soon, otherwise it may end up being dissected, with only bits and pieces
being used, such as its MP4 file format for highly compressed video. But this,
of course, defeats the purpose of the standard in the first place, which was
to integrate all these various technologies and provide enhanced interactivity
for users at all levels.