MPEG-4: The Interactive Revolution
MPEG-4 is an international standard (ISO/IEC 14496) developed by the
Moving Picture Experts Group (MPEG). It defines the deployment of non-proprietary
multimedia content independently of platform or transmission medium. It has
drawn on a number of existing technologies while at the same
time adding a number of innovative tools and concepts.
It is also more than just another highly efficient standard for delivering
video and audio sequences. It integrates a number of key technologies into
a solid and uniform platform by avoiding the use of proprietary and non-interworking
formats, while promoting and delivering user interactivity to the wider consumer
market, beyond the desktop computer.
Traditional radio and television broadcasting has up until now been a process
of preparing the entire presentation at the point of origin (i.e. radio or
TV studio), where the individual components, such as video footage, titles,
background music, voice-over, etc., are mixed together in the studio, modulated
onto a carrier and transmitted over the airwaves.
An internet web page presentation on the other hand, does exactly the opposite.
All the mixing and rendering is performed at the receiver, i.e. the user’s
computer. A server sends the makeup of the presentation in textual form as
a series of commands using a language known as HyperText Markup Language or
HTML. The receiving programme, known as a web browser, interprets the commands
and then negotiates the download of each component used in the presentation,
which it then renders on the user’s screen.
There are pros and cons with each system. Perhaps the biggest difference between
them is that broadcast TV is a one-way communications system, where the viewer
is passive, whereas with the internet, the user is an active participant. However,
broadcast TV offers much better quality and performance for real time transmission
of video content.
The above comparison provides an insight into part of what the MPEG-4 standard
attempts to integrate. In the executive overview of the Overview of
the MPEG-4 Standard from the International Organization for Standardization
(ISO), the following extract summarizes very succinctly what the standard aims
to achieve: MPEG-4 builds on the proven success of three fields:
- Digital television
- Interactive graphics applications (synthetic content)
- Interactive multimedia (World Wide Web, distribution of and access to content)
MPEG-4 provides the standardized technological elements enabling the integration
of the production, distribution and content access paradigms of the three fields.
Probably the major advancement that MPEG-4 introduces is a new level of interactivity,
where the user (not just a viewer anymore) takes part in the presentation.
This interactivity is even more pronounced than it is with the internet.
The standard also caters for a wider range of devices, operating over varying
communications channels, from broadcast TV and broadband networks down to low
bit-rate networks such as dial-up connections and wireless networks (mobile
telephone). As with the internet, content compilation and composition are performed
at the receiving equipment. Unlike the internet, an MPEG-4 enabled receiving
device is capable of intelligently rendering the presentation depending on
its capabilities or limitations, scaling down content when necessary or even
ignoring it altogether. This means that content needs only to be designed once,
leaving it up to the receiving device to decide what and how to render.
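
As a rough illustration of that idea, the sketch below (with invented object
kinds, bit rates and device profiles, rather than anything taken from the
standard itself) shows a receiver deciding, object by object, whether to render
it fully, scale it down, or skip it.

```python
# Hypothetical sketch of receiver-side adaptation; object kinds, bit rates and
# device limits are invented for illustration, not taken from the MPEG-4 standard.
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    kind: str            # e.g. "video", "audio", "3d_mesh"
    bitrate_kbps: int    # bandwidth the full-quality version would need

@dataclass
class DeviceProfile:
    max_bitrate_kbps: int
    supports_3d: bool

def plan_rendering(scene, device):
    """Decide, object by object, whether to render fully, scale down, or skip."""
    plan, budget = {}, device.max_bitrate_kbps
    for obj in scene:
        if obj.kind == "3d_mesh" and not device.supports_3d:
            plan[obj.name] = "skip"                 # device cannot handle it at all
            continue
        if obj.bitrate_kbps <= budget:
            plan[obj.name] = "render in full"
            budget -= obj.bitrate_kbps
        else:
            plan[obj.name] = "render scaled down"   # e.g. lower resolution or frame rate
            budget = 0
    return plan

scene = [SceneObject("main video", "video", 800),
         SceneObject("voice over", "audio", 64),
         SceneObject("spinning logo", "3d_mesh", 120)]
phone = DeviceProfile(max_bitrate_kbps=400, supports_3d=False)
print(plan_rendering(scene, phone))
```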
The eventual aim is that the TV set of the future will be able to handle
interactive multimedia content along with high quality broadcast content. Similarly,
the internet will provide broadcast-quality performance and presentations within
its already multimedia-rich layouts.
As was stated earlier, the internet uses HTML to describe scene content. HTML
on its own, however, only describes the placement of the content and not its
behaviour. Usually, designers need to resort to scripting languages in order
to achieve dynamic, interactive presentations. MPEG-4 provides a much more
versatile language known as Binary Format for Scenes or BIFS, which is
closely related to the Virtual Reality Modelling Language (VRML), a language
widely used on the internet for modelling and manipulating 3-D objects.
BIFS offers a way to describe not only the scene’s contents and their
placement, but also how objects behave in response to user events. It can also
be used to animate objects and change their characteristics dynamically. One
other feature of BIFS, in keeping with the MPEG-4 standard’s overall philosophy
of making the most efficient use of the available bandwidth, is that
it is a binary format and not text as in the case of HTML and VRML, making
it much more compact.
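
To make the compactness point concrete, here is a toy comparison (not the
actual BIFS encoding): the same scene update, "move object 7 to (120, 45)",
expressed once as markup-style text and once as a few packed binary bytes.

```python
import struct

# A toy scene update: "move object 7 to position (120, 45)".
textual = 'move object="7" x="120" y="45"'        # markup-style text form

# The same update packed as binary: one command byte, an object id, and two
# 16-bit coordinates. Purely illustrative; not the actual BIFS encoding.
MOVE_COMMAND = 0x01
binary = struct.pack(">BBhh", MOVE_COMMAND, 7, 120, 45)

print(len(textual), "bytes as text")
print(len(binary), "bytes packed as binary")
```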
At the heart of any presentation is its content, and MPEG-4 is very precise
about how that content is defined. A single element used in a presentation
is referred to as an object. Objects can be combined to produce compound objects,
and a collection of objects and/or compound objects make up a scene.
An object can be natural or synthetic (i.e. computer generated), and includes
still images, audio, video, text, 2D and 3D meshes, and synthetic face and
body objects. Each object is independent of all other objects and can exist
in two- or three-dimensional space; this applies even to sound. As has been said, objects can
be combined to form new, compound objects such as a human figure with its associated
voice. The designer or author creates a scene by combining as many objects
and compound objects as are required.
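
That object/compound-object/scene hierarchy is essentially a tree, and can be
pictured with a few lines of Python. The class and attribute names below are
invented for this sketch; they are not the standard’s actual node types.

```python
# Illustrative scene-graph model: objects, compound objects, and a scene.
class MediaObject:
    def __init__(self, name, kind, **attrs):
        self.name = name
        self.kind = kind        # "video", "audio", "image", "mesh", ...
        self.attrs = attrs      # placement, size, colour, and so on

class CompoundObject:
    """A group of objects treated as one unit, e.g. a figure plus its voice."""
    def __init__(self, name, children):
        self.name = name
        self.children = children

class Scene:
    def __init__(self):
        self.nodes = []
    def add(self, node):
        self.nodes.append(node)

# A human figure combined with its associated voice track.
presenter = CompoundObject("presenter", [
    MediaObject("presenter video", "video", position=(100, 50)),
    MediaObject("presenter voice", "audio"),
])

scene = Scene()
scene.add(MediaObject("backdrop", "image", position=(0, 0)))
scene.add(presenter)
print(len(scene.nodes), "top-level nodes in the scene")
```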
An obvious advantage in using objects is reusability, since an object is
defined once but can be used as often as required in a scene, with each instance
having different characteristics such as size or colour. Perhaps the major
advancement, however, is the ability for the author to allow the end-user
to interact with the scene’s objects: move them to a different
location, change an object’s characteristics, or change the viewpoint. For
example, in a hypothetical advertisement where the latest-model four-wheel-drive
is cruising down a serene country road, the user can change the colour of
the car, swap in the 2-door or 4-door model, add or remove roof-racks,
or even change the background scenery from an autumn to a summer’s day.
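
In that advertisement, each car could simply be an instance of a single object
definition, with a user event rewriting one of its characteristics. A minimal
sketch of that idea, with invented attribute names rather than real BIFS fields:

```python
# One object definition, several instances, each with its own characteristics.
car_model = {"doors": 4, "colour": "silver", "roof_racks": False}

def make_instance(template, **overrides):
    """Create an independent copy of the object with per-instance attributes."""
    instance = dict(template)
    instance.update(overrides)
    return instance

car_in_scene = make_instance(car_model, colour="red")
car_in_inset = make_instance(car_model, doors=2, colour="blue")

def on_user_event(instance, field, value):
    """A user event changes one characteristic of a rendered instance."""
    instance[field] = value

on_user_event(car_in_scene, "colour", "green")    # viewer repaints the car
on_user_event(car_in_scene, "roof_racks", True)   # ... and adds the roof-racks
print(car_in_scene)
print(car_in_inset)
```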
Special consideration has also been given to synthetic or computer generated
media for both graphics and audio. As an example, text to speech takes plain
text as input, with parameters that characterise the type of voice to be used
(age, gender, speech rate), and generates intelligible synthetic speech. A
special language called Structured Audio Orchestra Language (SAOL) allows for
the definition of instruments that can emulate characteristic sounds of natural
acoustic instruments. These instruments can be combined to produce an orchestra,
while a musical composition or score is created by sending a time-sequenced
set of commands through a language known as Structured Audio Score Language
(SASL). Added functionality, which can also be applied to other audio objects,
includes speed change without a change in pitch or a change in pitch without
altering the time scale.
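
Conceptually, then, SAOL describes how an instrument produces sound and SASL
supplies the time-sequenced score that plays it. The Python sketch below mimics
that split (an instrument as a function from pitch and duration to samples, a
score as a list of timed note events); it is not SAOL or SASL syntax.

```python
import math

SAMPLE_RATE = 8000

def simple_instrument(frequency_hz, duration_s):
    """Stand-in for a SAOL instrument: a plain, linearly decaying sine tone."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * frequency_hz * t / SAMPLE_RATE) * (1.0 - t / n)
            for t in range(n)]

# Stand-in for a SASL score: (start time in seconds, frequency in Hz, duration).
score = [(0.0, 440.0, 0.5),
         (0.5, 554.4, 0.5),
         (1.0, 659.3, 1.0)]

def render_score(score, instrument):
    """Mix the time-sequenced note events into a single sample buffer."""
    total = int(SAMPLE_RATE * max(start + dur for start, _, dur in score))
    out = [0.0] * total
    for start, freq, dur in score:
        offset = int(SAMPLE_RATE * start)
        for i, sample in enumerate(instrument(freq, dur)):
            out[offset + i] += sample
    return out

samples = render_score(score, simple_instrument)
print(len(samples), "samples rendered")
```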
In the realm of synthetic graphics, things are just as interesting. Among
others, two objects are defined, the facial animation object and the body
animation object, both working in much the same way. Facial Definition Parameters
(FDP) and Facial Animation Parameters (FAP) control facial characteristics
while the corresponding Body Definition Parameters (BDP) and Body Animation
Parameters (BAP) control the virtual body model. It is thus possible to describe
and render a face with almost any characteristics, which can change dynamically
and whose behaviour can also be defined. Lip movement can be synchronised with
text-to-speech output (creating a compound object) to produce a talking head.
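
The definition/animation split can be pictured as follows: the definition
parameters fix the face geometry once, and each frame of animation parameters
then displaces a few control points. The sketch below is purely illustrative;
the parameter names are invented and do not correspond to actual FDP or FAP fields.

```python
# Illustrative face model: definition parameters fix the neutral geometry once,
# animation parameters then displace control points frame by frame.
neutral_face = {                    # stand-in for Facial Definition Parameters
    "mouth_left":  (-1.0, 0.0),
    "mouth_right": ( 1.0, 0.0),
    "jaw":         ( 0.0, -1.5),
}

def apply_animation(face, fap_frame):
    """Apply one frame of (invented) animation parameters as displacements."""
    animated = dict(face)
    for point, (dx, dy) in fap_frame.items():
        x, y = animated[point]
        animated[point] = (x + dx, y + dy)
    return animated

# Two frames of a talking head: the jaw drops slightly, then returns.
fap_stream = [{"jaw": (0.0, -0.4)},
              {"jaw": (0.0,  0.0)}]

for frame in fap_stream:
    print(apply_animation(neutral_face, frame))
```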
Artificial movement can also be applied to real objects. For example, a static
two-dimensional image of a flag can be made to wave and flutter in the wind
by mapping a 2-D mesh onto the image. Because the flag and the mesh are tied
together, any mathematical transformation applied to the mesh produces the
corresponding deformations in the underlying image, giving the illusion of movement.
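
A rough sketch of the waving flag, assuming a regular grid of mesh points and a
travelling sine wave as the transformation (the formula is illustrative, not
something the standard prescribes). Only the mesh deformation is shown; the
step that resamples the image pixels to follow the mesh is omitted.

```python
import math

def flag_mesh(width, height, step):
    """A regular 2-D mesh of control points laid over the flag image."""
    return [(x, y) for y in range(0, height + 1, step)
                   for x in range(0, width + 1, step)]

def wave(mesh, time_s, amplitude=4.0, wavelength=40.0, speed=60.0):
    """Displace each mesh point vertically with a travelling sine wave; pixels
    tied to each mesh cell would follow, giving the flutter illusion."""
    return [(x, y + amplitude * math.sin(2 * math.pi * (x - speed * time_s) / wavelength))
            for x, y in mesh]

mesh = flag_mesh(width=120, height=80, step=20)
for t in (0.0, 0.1):                          # two consecutive animation frames
    print(wave(mesh, time_s=t)[:3])           # first few displaced control points
```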
Another feature is alpha coding, whereby for any given graphics-based object,
the level of transparency of the rendered pixels is determined by the corresponding
pixel value in the alpha channel. That is, the value of a pixel in the alpha
channel determines whether the corresponding pixel in the image is visible
or not, and how it blends with any underlying graphical content. This can be
used not only to create arbitrarily shaped objects (i.e. not just rectangular),
but also to provide smooth blending with any background objects, thus avoiding
what is known as aliasing, which appears as jagged edges around any overlaid objects.
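
The blending rule itself is straightforward: with the alpha value normalised to
the range 0 to 1, each rendered pixel is a weighted mix of the object pixel and
whatever lies beneath it, as in this small sketch.

```python
def blend_pixel(foreground, background, alpha):
    """Per-channel alpha blend: alpha 1.0 keeps the object pixel, 0.0 shows the
    background; intermediate values give the smooth, non-jagged edges."""
    return tuple(round(alpha * f + (1.0 - alpha) * b)
                 for f, b in zip(foreground, background))

logo_pixel = (200, 30, 30)       # reddish object pixel (R, G, B)
scene_pixel = (10, 80, 160)      # underlying background pixel

print(blend_pixel(logo_pixel, scene_pixel, alpha=1.0))    # fully opaque
print(blend_pixel(logo_pixel, scene_pixel, alpha=0.25))   # mostly transparent edge
```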
Objects are represented in a special coded form, so that a minimum amount
of information needs to be transmitted to the receiving terminal, and hence
rendering of the object can be accomplished locally on the receiving equipment.
Local rendering provides several advantages. The local equipment can determine
how best to render any object based on its capabilities.
For example, currently on the internet, something as simple as screen resolution
poses a major problem for web creators, who are forced to design for
the most common resolution, with pages overflowing the screen boundaries on
lower-resolution displays and leaving wide, blank spaces on larger ones. And if providers
wish to cater for wireless devices, then a completely separate design needs
to be created, not only due to screen resolutions and capabilities, but also
bandwidth constraints. On MPEG-4 enabled devices this is not a problem, as
the receiving equipment decides how best to render each object.
MPEG-4 boasts highly efficient compression techniques for both video and
audio. MPEG-4’s Advanced Audio Coding codec, for example, offers much
better quality and smaller file sizes than the very popular MP3. The
scheme used to transport the content is just as efficient while at the same
time being highly versatile. All content is conveyed in elementary streams
(ES), and any object may require one or more elementary streams. The standard
further defines object descriptors (OD) to keep track of which elementary streams
belong to each object.
An object descriptor keeps a list of elementary stream descriptors, which
in turn link elementary streams with an object. Elementary stream descriptors
can also be used to describe the type of media decoder required to render the
object, configuration information used by the synchronisation layer, intellectual
property information and quality of service (QoS) requirements.
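
This bookkeeping maps naturally onto two small records: a descriptor for each
elementary stream and an object descriptor that collects them. The field names
below are invented for illustration and do not reflect the standard’s actual
descriptor syntax.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ESDescriptor:
    es_id: int
    decoder_type: str       # which media decoder the stream needs
    qos: str                # e.g. "best-effort" or "guaranteed"
    ip_info: str = ""       # intellectual property information, if any

@dataclass
class ObjectDescriptor:
    od_id: int
    es_descriptors: List[ESDescriptor] = field(default_factory=list)

# One audiovisual object carried over two elementary streams.
talking_head = ObjectDescriptor(od_id=1, es_descriptors=[
    ESDescriptor(es_id=101, decoder_type="video", qos="guaranteed"),
    ESDescriptor(es_id=102, decoder_type="audio", qos="guaranteed"),
])

for es in talking_head.es_descriptors:
    print(f"object {talking_head.od_id} uses stream {es.es_id} ({es.decoder_type})")
```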
Synchronisation is another consideration and is achieved by time stamping
the streams. This is especially important in cases where real-time operation
is necessary. Here, a constant end-to-end delay must be established and
data streams must contain timing information. This timing information establishes
a time base, derived from the encoder’s clock, to which the decoder at the
receiving equipment must synchronise. Time stamps are then attached and indicate
when the data must be decoded and presented.
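
A simplified sketch of that timing model: each access unit carries a decoding
time stamp expressed on the shared time base, and the receiver, whose clock is
assumed to be locked to the encoder’s, waits until its clock reaches the stamp
before decoding and presenting. The structure below is illustrative only.

```python
import time
from dataclasses import dataclass

@dataclass
class AccessUnit:
    decode_ts: float        # seconds on the shared time base
    payload: str

def play(units, time_base_start):
    """Decode and present each unit when the local clock, assumed locked to the
    encoder's time base, reaches the unit's time stamp."""
    for unit in sorted(units, key=lambda u: u.decode_ts):
        elapsed = time.monotonic() - time_base_start
        if unit.decode_ts > elapsed:
            time.sleep(unit.decode_ts - elapsed)   # constant end-to-end delay assumed
        print(f"t={unit.decode_ts:5.2f}s decode and present: {unit.payload}")

units = [AccessUnit(0.00, "audio block 1"),
         AccessUnit(0.00, "video frame 1"),
         AccessUnit(0.04, "video frame 2")]

play(units, time_base_start=time.monotonic())
```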
Streams requiring similar quality of service connections are then multiplexed
together using MPEG-4’s FlexMux tool, before being transported according
to the standard’s Delivery Multimedia Integration Framework specification
(DMIF). The DMIF provides a transparent delivery interface for the application
regardless of whether the source is a remote service or on local storage media.
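
Conceptually, the multiplexing step interleaves time-ordered packets from
several elementary streams into one sequence, tagging each packet with an index
so the receiver can demultiplex it again. The toy sketch below captures that
interleaving; it is not the actual FlexMux packet format.

```python
from heapq import merge

# Each elementary stream is a time-ordered list of (time stamp, payload) pairs.
video_es = [(0.00, "video frame 1"), (0.04, "video frame 2")]
audio_es = [(0.00, "audio block 1"), (0.02, "audio block 2")]

def interleave(*streams):
    """Merge several time-ordered streams into one sequence, tagging each packet
    with its stream index so the receiver can demultiplex it again."""
    tagged = ([(ts, idx, payload) for ts, payload in stream]
              for idx, stream in enumerate(streams))
    return list(merge(*tagged))

for ts, stream_index, payload in interleave(video_es, audio_es):
    print(f"t={ts:.2f}s stream {stream_index}: {payload}")
```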
The MPEG-4 standard has been in circulation for a few years now, and as yet
there hasn’t been a lot of action to speak of. Most of what has gone
on has taken place on the internet. Apple, for example, has adopted the standard
in its QuickTime platform (it should be noted that the MPEG-4 file format is
based on the QuickTime file format), while IBM has released a
free Java-based toolkit for generating MPEG-4 content. IBM claims that,
because of this, the toolkit will run on any platform supporting Java.
If the entertainment industry does not get actively involved, it will most
probably take an MP3-style revolution on the internet to get things moving.
While MPEG-4 is designed to work with almost any bandwidth, in reality it will
be the spread of broadband internet connections that will allow MPEG-4 content
to develop. Another very promising platform is the cell phone where data speeds
are on the increase, and where both manufacturers and service providers are
looking for new ways to attract customers. Of course, there must also be a
regular supply of fresh content, beyond the usual array of trailers that are
currently used for demos.
Certainly, the entertainment industry’s low profile may be justified; it quite
possibly fears a spillover of the piracy of copyrighted material seen with MP3
into visual content. If MPEG-4 is to make its mark though, it will have to
do so soon, otherwise it may end up being dissected, with only bits and pieces
being used, such as its MP4 file format for highly compressed video. But this,
of course, defeats the purpose of the standard in the first place, which was
to integrate all these various technologies and provide enhanced interactivity
for users at all levels.