EPUB 3 Media Overlays

February 20, 2014

Adapted with permission from EPUB 3 Best Practices (O'Reilly Media)

This article will concentrate on how to provide synchronized audio narration, one of the key new additions introduced in EPUB 3. Media overlays, as you’re about to see, are the secret behind how this magic works.

When you watch words get highlighted in your reading system as a narrator speaks, the term media overlay probably doesn’t immediately jump to mind to describe the experience, but what you are in fact witnessing is a media type (audio) being layered on top of your text content. Apple was the first to adopt this technology, adding it to their support of EPUB 2 under the name Read Aloud while the EPUB 3 specification was still being finalized, but other reading systems, like Readium, have since appeared offering beta support. Even Amazon has jumped into the synchronization game, if not with the same media overlays technology.

But the ability to synchronize text and audio in ebooks isn’t a new development in the grander scheme of ebook formats. It goes back fifteen years in DAISY digital talking books, and even further in antecedents to that format. The value of text and audio synchronization for learning has an equally long history, which is why the technology is so important across the entire spectrum of readers. Media overlays are more than just an accessibility feature of EPUB 3, in other words.

Overlays enable any reader to quickly and easily switch from one reading modality to the other. For example, you don’t have to be interested in reading along with the words to find value. You might switch from visually reading a book while at home to listening aurally while commuting on the train to work. You might want to temporarily switch from visual to audio reading while turbulence is rocking you around on your latest business trip and making you sick to your stomach, and then switch back when it passes.

But overlays run even deeper than the ability to switch from visual to aural for casual reading. Imagine you are working in an environment where you need your hands free, but also need to hear instructions or have a QA checklist read back. If you turn on a media overlay, you can listen while you work. No need for embedded videos. No need to rely on potentially incomprehensible text-to-speech (TTS) rendering.

Foreign language learning is another example of where overlays shine. Without precise audio pronunciations, books that teach second languages are often extremely difficult to follow. Listen to the proper pronunciation while you read and you’ll be fluent in no time. There’s a lot more to media overlays than just storybook reading for children.

The audiovisual magic that overlays enable in EPUBs is just the tip of the iceberg, too. Overlays represent a bridge between the audio and text worlds, and between mainstream and accessibility needs, which is what really makes this new technology so exciting.

From a mainstream publisher’s perspective, overlays provide a bridge between audiobook and ebook production streams. Create a single source using overlays, and you could transform and distribute across the spectrum from audio-only to text-only. If you’re going to create an audiobook version of your ebook, it doesn’t make sense not to combine production, which is exactly what EPUB 3 allows you to do now. Your source content is more valuable by virtue of its completeness, and you can also choose to target and distribute your ebook with different modalities enabled for different audiences.

From a reader’s perspective, media overlays ensure they can purchase a format that provides adaptive modalities to meet their reading needs: they can choose which playback format they prefer, or purchase a book with multiple modalities and switch between them as the situation warrants—listening while driving and visually reading when back at home, for example.

In other words, media overlays are the answer to a lot of problems on both sides of the accessibility fence.

The EPUB Spectrum

Audio synchronization brings a new dimension to ebooks. The old line that clearly separated ebook functionality from audiobook playback is gone, and in its place is a new spectrum that crosses from one extreme to the other. The new possibilities between the traditional formats can be broadly broken down into three categories:

Full text and audio
Providing full text and full audio narration can be seen as the center of the spectrum, blending the full power of ebooks with audiobooks.
Mixed text and audio
A step in the text direction is full text with partial audio. Not every producer will want to narrate every section of their book. Back matter (bibliographies, indexes, etc.), for example, is often tedious and expensive to narrate. Media overlays can be provided for the primary narrative, and back matter left to the reading system’s text-to-speech functionality to voice.
Structured audio
A step in the direction of audiobooks yields structured audio. Picture being able to quickly and easily move through an audiobook using the power of the EPUB navigation document and you can see the benefit of an audiobook wrapped up as an EPUB.

The most common use for media overlays will undoubtedly be to provide either a full or partial audio track covering the text content of the EPUB. Looking at the DAISY production world, where similar kinds of books have been made for many years now, it’s possible to project two models for how these kinds of EPUBs will be recorded.

The first is to load the publication into recording software that allows the narrator to synchronize the text with their narration as they go. The DAISY Tobi tool is one example of software that enables this kind of recording (it is currently being upgraded to support EPUB 3). The other model, which is seeing some traction already, is to record the audio (e.g., using Audacity) and then export synchronization points and merge them with the text, whether manually through another application or automatically.

Which of these models will work for your production process depends on your intended outputs. Recording on top of the text is simple and easy for anyone from a self-publisher to a large publisher to do. The downside is it requires waiting until the very end of the text production process before recording can begin (i.e., after the EPUB output and QA stages, meaning archiving and reusing the audio may require extracting it from the EPUB). Recording the audio separately and then synchronizing has many challenges for books of any size, and is most effective when the process can be automated (e.g., automatically synchronizing the text with the audio using third-party tools that can analyze audio waveforms and return the sync points).
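
To illustrate the second model concretely: an audio editor such as Audacity can export a label track as plain tab-separated start/end/label lines. If the labels are made to match element IDs in the content document, each exported line maps directly onto one synchronization point in the overlay. The filename, timings, and IDs below are hypothetical:

```xml
<!-- Hypothetical label-track export (tab-separated: start, end, label):
       24.500    29.268    c01h01
     Each such line supplies the clipBegin/clipEnd values and the
     fragment identifier for one par element in the overlay: -->
<par>
   <text src="chapter_001.xhtml#c01h01"/>
   <audio src="audio/chapter_001.mp4"
          clipBegin="0:00:24.500" clipEnd="0:00:29.268"/>
</par>
```

The merge itself is then mechanical, which is why this model lends itself well to scripting.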

Full audio also does not mean that a human has to narrate the entire text, either. Because support for the new text-to-speech enhancements is going to take time to develop, prerecording synthetic narration and synchronizing it with an overlay is a possibility. This approach would be useful for recording back matter for distribution, for example.

Using EPUB to create structured audio files is another intriguing possibility, and one that has been largely overlooked so far. Text content is still needed, but in this scenario the content of your EPUB is simply a document that lists the major headings (so that you have text points to synchronize the audio with; we'll get to those details shortly).

If you start with a full text-and-audio-synchronized EPUB, though, it would be simple to process it down to this type of enhanced audiobook. It will take some XSLT programming talent, but all you’d have to do is strip away everything but the major headings from both your content and overlay documents to get your structured audio EPUB.
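
As a rough sketch of that idea, the content-document half of the transform can be little more than an identity template plus a rule that drops body prose. This assumes XHTML content documents; a real transform would also need to prune the corresponding par elements from the overlay documents:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
   <!-- identity template: copy everything through unchanged -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>
   <!-- drop paragraph prose, keeping sections and their headings -->
   <xsl:template match="h:p"/>
</xsl:stylesheet>
```
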

Overlays in a Nutshell

Although the only realistic way to work media overlays into your ebook production process is through recording tools and/or applications that can automatically map your audio to your text, this article wouldn’t be complete if we left it at that and didn’t look under the hood to see exactly how the technology works.

The first thing to understand is that an overlay is just a specialized kind of XML document, one that contains the instructions a reading system uses to synchronize the text display with the audio playback. Overlays are expressed using a subset of SMIL that we’ll cover as we move along, combined with the epub:type attribute for adding semantic information.


SMIL (pronounced “smile”) is the shorthand way of referring to the Synchronized Multimedia Integration Language. For more information on this technology, see http://www.w3.org/TR/SMIL

Each content document in your publication can have its own overlay document, which may synchronize audio with all or some of the text in it. After the reader initializes playback of the audio, each successive overlay document is loaded in parallel with the loading of the new content document. For example, when the end of Chapter 1 is reached, the reading system automatically loads Chapter 2 and simultaneously looks up its corresponding overlay document to continue playback. This is how the appearance of seamless playback is maintained, much the way the reader never notices the transition from one content document to the next while reading visually.

The order of the instructions in each overlay document is what defines the logical reading order for the ebook when in playback mode. The reading system automatically moves through these instructions one at a time, or a reader can manually navigate forward and backward, including escaping and skipping of unwanted structures (e.g., using the forward and back arrows, similar to how rich markup assists traversing content using an assistive technology). See Skipping versus Escaping for more.

As a reading system encounters each synchronization point, it determines from the provided information which element has to be shown (by its id) and the corresponding position in the audio track at which to start the narration. The reading system will then load and highlight the word or passage for you at the same time that you hear the audio play. When the audio track reaches the end point you’ve specified (or the end of the audio file if you haven’t specified one), the reading system checks the next synchronization point to see what text and audio to load next.

This process of playback and resynchronization continues over and over until you reach the end of the book, giving the appearance to the reader that their system has come alive and is guiding them through it.

Synchronization Granularity

As you might suspect at this point, the reading system can’t synchronize or play content back in any way other than what has been defined. For example, as a reader, you cannot dynamically change from word-to-word to paragraph-by-paragraph synchronization to suit your reading preference. The magic is only as magical as the content creator makes it, at least at this time.

With only a predefined level of playback granularity available, the decision on how fine a playback experience to provide has typically been influenced by the disability you were targeting, going back to the origins of the functionality in accessible talking books. Books for blind and low-vision readers are often only synchronized to the heading level, for example, and omit the text entirely (structured audio). Readers with dyslexia or cognitive issues, however, often benefit more from word-level synchronization, using full-text and full-audio playback to aid in comprehension and following the narrative.

But reader ability is not the only consideration in a mainstream format like EPUB 3. Coarser synchronization (for example, at the phrase or paragraph level) can be useful in cases where the defining characteristics of a particular human narration (flow, intonation, emphasis) add an extra dimension to the prose, such as with spoken poetry or religious verses. Word-level synchronization can add value in educational contexts where a reader may need to step back over words multiple times.

One convenient aspect of media overlays is that even though you can only define a single overlay per content document, it doesn’t mean you can’t mix granularity levels within one. Consider language learning. You may opt to synchronize to the paragraph level for sections of explanatory prose while providing word-by-word synchronization for practical examples the reader is most likely to want to easily step through. Finer granularity may only add value in certain situations, in other words.
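
Jumping ahead slightly to the overlay syntax covered below, such a mixed-granularity overlay might look like the following sketch, with coarse and fine par elements side by side in the same seq (the filenames, IDs, and timings are hypothetical):

```xml
<seq id="lesson" epub:textref="lesson_004.xhtml#sec01"
     epub:type="bodymatter chapter">
   <!-- explanatory prose: one par per paragraph -->
   <par>
      <text src="lesson_004.xhtml#p001"/>
      <audio src="audio/lesson_004.mp4"
             clipBegin="0:00:00.000" clipEnd="0:00:21.118"/>
   </par>
   <!-- practice phrase: one par per word so the reader can step through -->
   <par>
      <text src="lesson_004.xhtml#w001"/>
      <audio src="audio/lesson_004.mp4"
             clipBegin="0:00:21.118" clipEnd="0:00:21.752"/>
   </par>
   <par>
      <text src="lesson_004.xhtml#w002"/>
      <audio src="audio/lesson_004.mp4"
             clipBegin="0:00:21.752" clipEnd="0:00:22.401"/>
   </par>
</seq>
```
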

But let's turn now to the practical construction of an overlay to see why complexity increases with the granularity of your synchronization. Understanding this issue will give you better insight into the model you ultimately decide to use.

Constructing an Overlay

Every overlay document begins with a root smil element and a body, as exemplified in the following markup:

<smil xmlns="http://www.w3.org/ns/SMIL"
      xmlns:epub="http://www.idpf.org/2007/ops"
      version="3.0">
   <body>
   </body>
</smil>
There’s nothing exciting going on here but a couple of namespace declarations and a version attribute on the root. These are static in EPUB 3, so of little interest beyond their existence. There is no required metadata in the overlays themselves, which is why even though a head element exists in the specification there is rarely a need to use it (it exists only for custom metadata, so this article doesn’t cover it).

Of course, in order to illustrate how to build up this shell and include it in an EPUB, we're going to need some content. For the rest of this section, we'll be using the Moby Dick ebook that Dave Cramer, a member of the EPUB working group, built as a proof of concept of the specification. This book is available from the EPUB 3 Sample Content site.

Figure 6-1 shows the first page of Chapter 1, displayed in Readium. We’ll be looking at synchronizing this text as we go.

If you look at the content document for Chapter 1 in the source, you’ll see that the HTML markup has been structured to showcase different levels of text/audio synchronization. After the chapter heading, for example, the first paragraph has been broken down to achieve fine synchronization granularity (word and sentence level), whereas the following paragraph hasn’t been subdivided and represents a very coarse level of synchronization.

Figure 6-1. Chapter 1 of Moby Dick, displayed in Readium

Compressing the markup in the file to just what we’ll be looking at, we have:

<section id="c01">
   <h1 id="c01h01">Chapter 1. Loomings.</h1>
   <p id="c01p0001">
      <span id="c01w00001">Call</span>
      <span id="c01w00002">me</span>
      <span id="c01w00003">Ishmael.</span>
      <span id="c01s0002">Some years ago...</span> ...
   </p>
   <p id="c01p0002">
      There now is your insular city of the Manhattoes, belted round by...
   </p>
</section>

You’ll notice that each element containing text content has an id attribute, because that’s what you’ll reference when you synchronize with the audio track.


A less intrusive alternative to using elements and IDs is found in EPUB Canonical Fragment Identifiers. Canonical Fragment Identifiers can reference anywhere into the text, but no tools can generate these yet, and no reading systems support them for overlay playback or highlighting.

The markup additionally includes span tags to differentiate words and sentences in the first p tag. The second paragraph only has an id attribute on it, because you’re going to omit synchronization on the individual text components it contains to show paragraph-level synchronization.


You can now use this information to start building the body of the overlay. Switching back to the empty overlay document, the first element to include in the body is a seq (sequence):

<seq id="seq001"
     epub:textref="chapter_001.xhtml#c01"
     epub:type="bodymatter chapter">
   ...
</seq>

The seq element serves the same grouping function the corresponding section element does in the markup, and you’ll notice the textref attribute references the section’s id. The logical grouping of content inside the seq element likewise enables escaping and skipping of structures during playback, as you’ll see with some structural considerations discussed later.

The epub:type attribute has also reappeared. Similar to its use in content documents, it provides structural information about the kind of element the seq represents. Here it conveys that this seq represents a chapter in the body matter. Although the attribute isn’t required, there’s little benefit in adding seq elements if you omit any semantics, because a reading system will not be able to provide skippability and escapability behaviors unless it can identify the purpose of the structure.

It may seem redundant to have the same semantic information in both the markup and overlay, but remember that each is tailored to different rendering and playback methods. Without this information in the overlay, the reading system would have to inspect the markup file to determine what the synchronization elements represent, and then resynchronize the overlay using the markup as a guide. Not a simple process. A single channel of information is much more efficient, although it does translate into a bit of redundancy. You also wouldn’t typically be crafting these documents by hand, and a recording application could potentially pick up the semantics from the markup and apply them to the overlay for you.

Parallel Playback

You can now start defining synchronization points by adding par (parallel) elements to the seq, which is the only other step in the process of building an overlay document. The par element, as its name suggests, is going to define what has to be synchronized in parallel. It must contain a child text element and may contain a child audio element, which define the fragment of your content and the associated portion of an audio file to render, respectively. (We’ll address why the audio element is optional in Advanced Synchronization.)

For example, here’s the entry for the primary chapter heading:

<par id="heading1">
   <text src="chapter_001.xhtml#c01h01"/>
   <audio src="audio/mobydick_001_002_melville.mp4"
          clipBegin="0:00:24.500"
          clipEnd="0:00:29.268"/>
</par>

The text element contains a src attribute that identifies the filename of the content document to synchronize with and a fragment identifier (the value after the # character) that indicates the unique ID of a particular element within that content document. In this case, the element indicates that chapter_001.xhtml needs to be loaded and the element with the id c01h01 displayed (the h1 in our sample content, as expected).

The audio element likewise identifies the source file containing the audio narration in its src attribute and defines the starting and ending offsets within it using the clipBegin and clipEnd attributes. As indicated by these attributes, the narration of the heading text begins at the mid-24-second mark (to skip past the preliminary announcements) and ends just after the 29-second mark. The milliseconds appended to the start and end values give an idea of the level of precision needed to create overlays and why people typically don’t mark them up by hand. If you are only as precise as a second, the reading system may move readers to the new prose at the wrong time or start narration in the middle of a word or at the wrong word.

But those concerns aside, that’s all there is to basic text and audio synchronization. So, as you can now see, no reading system witchcraft was required to synchronize the text document with its audio track! Instead, the audio playback is controlled by timestamps that precisely determine how an audio recording is mapped to the text structure. Whether synchronizing down to the word or moving through by paragraph, this process doesn’t change.

To synchronize the first three words “Call me Ishmael” in the first paragraph, for example, simply repeat the process of matching element IDs and audio offsets:

<par id="word1">
   <text src="chapter_001.xhtml#c01w00001"/>
   <audio src="audio/mobydick_001_002_melville.mp4"
          clipBegin="0:00:31.335" clipEnd="0:00:31.623"/>
</par>
<par id="word2">
   <text src="chapter_001.xhtml#c01w00002"/>
   <audio src="audio/mobydick_001_002_melville.mp4"
          clipBegin="0:00:31.623" clipEnd="0:00:31.857"/>
</par>
<par id="word3">
   <text src="chapter_001.xhtml#c01w00003"/>
   <audio src="audio/mobydick_001_002_melville.mp4"
          clipBegin="0:00:31.857" clipEnd="0:00:32.621"/>
</par>

You’ll notice each clipEnd matches the next element’s clipBegin here because you have a single continuous playback track. Finding each of these synchronization points manually is not so easy, though, as you might imagine.

Synchronizing to the sentence level, however, means only one synchronization point is required for all the words the sentence contains, reducing the time and complexity of the process by orders of magnitude. The par is otherwise constructed exactly like the previous example:

<par id="sentence2">
   <text src="chapter_001.xhtml#c01s0002"/>
   <audio src="audio/mobydick_001_002_melville.mp4"
          clipBegin="0:00:32.621" clipEnd="0:00:44.428"/>
</par>

The process of manually creating overlays is primarily complicated by the total number of time and text synchronizations involved, as is undoubtedly becoming clear. Moving up another level, paragraph-level synchronization reduces the work by several more orders of magnitude, as all the sentences can be skipped. Here's the only entry you'd have to make for the entire 28-second paragraph:

<par id="paragraph2">
   <text src="chapter_001.xhtml#c01p0002"/>
   <audio src="audio/mobydick_001_002_melville.mp4"
          clipBegin="0:00:44.428" clipEnd="0:01:12.428"/>
</par>

But manually creating these files is rarely a realistic option, except perhaps with very short children’s books. Narrating on top of the text makes the process much simpler, but only to a point. Narrating to the heading, paragraph, or even sentence level can be done relatively easily with trained narrators, as each of these structures provides a natural pause point for the person reading, a simplifier not available when performing word-level synchronization.

Using a recording tool to synchronize to the word level is generally too complex to do while narrating, even if the narrator is assisted by someone else controlling the setting of sync points while they talk. Human language is too fast and fluid to keep up with, and a manual process is too prone to errors (e.g., syncing too quickly or too slowly).

Another process that holds promise is to use applications that can take the text and audio and generate the media overlay for you, although none are known to exist at this time specifically for EPUB overlays. As noted at the outset, there are commercial programs that can analyze an audio file and return the start and end point that corresponds to a provided string of text. It’s already possible to use these to feed in the text from an EPUB and generate an overlay, but the process requires a developer to glue the pieces together. The costs of software and development will put this process out of reach of any but large-scale producers, at least at this time.

One other consideration to be aware of is that the finer the granularity you choose, the less likely you are to run into playback issues. If you opt to synchronize to the paragraph level, and only part of the paragraph is showing in the viewport, the page won't turn until the entire paragraph has been read (i.e., narration will continue onto the part hidden on the next page). The reader will have to manually switch pages, or wait. (This problem gets masked in fixed-layout books by virtue of each page constituting a new document.)

Although the ideal is to provide word-level synchronization, all of the above considerations will play into the kind of synchronized EPUB you can create, which is why there is no simple answer to which you should produce. You’re going to have to find a balance between what would provide the best playback experience and what your production processes are capable of handling.

Adding to the Container

Although that wraps up the overlay document itself, you're not completely done yet. There are still a few quick details to run through to include this overlay in your EPUB.

Assuming you save the overlay as chapter_001_overlay.smil, the first step is simply to include an entry for it in the manifest:

<item id="chapter_001_overlay"
      href="chapter_001_overlay.smil"
      media-type="application/smil+xml"/>

You then need to include a media-overlay attribute on the manifest entry for the corresponding content document for chapter one:

<item id="chapter_001"
      href="chapter_001.xhtml"
      media-type="application/xhtml+xml"
      media-overlay="chapter_001_overlay"/>

The value of this attribute is the ID you created for the overlay in the previous step.

Finally, you need to add metadata to the publication file indicating the total length of the audio for each individual overlay and for the publication as a whole. For completeness, you’ll also include the name of the narrator:

<meta property="media:duration" refines="#chapter_001_overlay">0:14:43</meta>
<meta property="media:duration">0:23:46</meta>
<meta property="media:narrator">Stuart Wills</meta>

The refines attribute on the first meta element specifies the ID of the manifest item we created for the overlay, because this is how you match the time value up to the content file it belongs to. The lack of a refines attribute on the next duration meta element indicates it contains the total time for the publication (only one can omit the refines attribute).

Styling the Active Element

The last detail to look at is how to control the appearance of the text as it is being narrated. You aren’t at the mercy of the reading system, but can tailor the appearance however you want through CSS.

The Media Overlays specification defines a special media:active-class metadata property that tells the reading system which CSS class to apply to the active element.

For example, if you defined the following meta tag in the package document metadata:

<meta property="media:active-class">-epub-media-overlay-active</meta>

you could then apply a yellow background to each section of prose as it is read by defining the following CSS class definition:

.-epub-media-overlay-active {
   background-color: rgb(255,255,0);
}

Figure 6-2 shows how Readium displays the highlighting when media overlay playback is turned on.

Figure 6-2. Text highlighting in Readium

Be mindful of the reading experience before going too crazy with this functionality. If you set a dark background color, for example, you may obscure the text that is being read. Setting a thick border to surround each word as it is read can also reduce the legibility of your text. Simple yellow backgrounds are traditionally how highlighting is done, as they are not distracting and do not obscure the text.

And do not add this CSS class to any of the elements in your content documents, or they will have a yellow background by default. The reading system will automatically apply the class to each element referenced in a par’s text child element as it becomes active. You only need to make sure that every content document that has an overlay links to the style sheet that defines this class. If the reading system cannot find it in an attached style sheet, it may use its own default rendering, or it may use nothing at all. Avoid the risk.
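
For example, each content document that has an overlay might link the shared style sheet in its head. The file path below is hypothetical; any style sheet defining the active class will do:

```xml
<link rel="stylesheet" type="text/css" href="css/overlay.css"/>
```
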

Structural Considerations

We briefly touched on the need to escape nested structures, and skip unwanted ones, but let’s go back to this functionality as a first best practice, because it is critical to the usability of the overlays feature in exactly the same way markup is to content-level navigation.

It might be tempting to assume that only par elements matter for playback, and to build overlays that are nothing more than a continuous sequence of these elements. Resist that temptation: it's the equivalent of tagging everything in the body of a content file using div or p tags.

Using seq elements for problematic structures like lists and tables provides the information a reading system needs to escape from them.

For example, here’s how to structure a simple list:

<seq id="seq002" epub:type="list" epub:textref="chapter_012.xhtml#ol01">
   <par id="list001item001" epub:type="list-item">
      <text src="chapter_012.xhtml#ol01i01"/>
      <audio src="audio/chapter12.mp4" clipBegin="0:26:48" clipEnd="0:27:11"/>
   </par>
   <par id="list001item002" epub:type="list-item">
      <text src="chapter_012.xhtml#ol01i02"/>
      <audio src="audio/chapter12.mp4" clipBegin="0:27:11" clipEnd="0:27:29"/>
   </par>
</seq>

A reading system can now discover from the epub:type attribute the nature of the seq element and of each par it contains. If the reader indicates at any point during the playback of the par element list items that they want to jump to the end, the reading system simply continues playback at the next seq or par element following the parent seq. If the list contained sub-lists, you could similarly wrap each in its own seq element to allow the reader to escape back up through all the levels.

A similar nested seq process is critical for table navigation: a seq to contain the entire table, individual seq elements for each row, and table-cell semantics on the par elements containing the text data.

A simple three-cell table could be marked up using this model as follows:

<seq epub:type="table" epub:textref="ch007.xhtml#tbl01">
   <seq epub:type="table-row" epub:textref="ch007.xhtml#tbl01r01">
      <par epub:type="table-cell">...</par>
      <par epub:type="table-cell">...</par>
      <par epub:type="table-cell">...</par>
   </seq>
</seq>

You could also use a seq for the table cells if they contained complex data:

<seq epub:type="table-cell" epub:textref="ch007.xhtml#tbl01r01c01">
   ...
</seq>

But attention shouldn’t be given only to seq elements when it comes to applying semantics. Readers also benefit when par elements are identifiable, particularly for skipping:

<par id="note21" epub:type="note">
   <text src="notes.xhtml#c02note03"/>
   <audio src="audio/notes.mp4" clipBegin="0:14:23.146" clipEnd="0:15:11.744"/>
</par>

If all notes receive the semantic as in the previous example, a reader could disable all note playback, ensuring the logical reading order is maintained. All secondary content that isn’t part of the logical reading order should be so identified so that it can be skipped.

This small extra effort to mark up structures and secondary content does a lot to make your content more accessible.

Advanced Synchronization

Synchronizing text and audio is not the only feature media overlays provide. We've already made the case that not every section has to be narrated by a human; some sections, such as back matter, can be left to the reading system's text-to-speech capabilities. That doesn't mean you have to omit an overlay for the sections you don't narrate and leave it to the reader to turn on TTS playback.

You might skip human narration for a bibliography, for example, but provide human narration for additional appendices that follow it. In this case, you want to provide a linear progression through the content, and allow the reader to decide when to drop in and out of playback mode. If you omit an overlay for the TTS section, the reader will immediately drop out of playback when it is reached, and have to manually navigate to the next section and then re-enable the new overlay.

To create an overlay that automatically triggers the reading system TTS capabilities, you simply omit an audio element from the par. To return to the earlier Moby Dick example, you could set the first heading to be read by TTS like this:

<par id="heading1">
   <text src="chapter_001.xhtml#c01h01"/>
</par>

When all of the referenced text content has been rendered by the TTS engine, the reading system will move to the next par.

Another application for media overlays is to automatically begin playback of embedded multimedia resources. Overlays wouldn’t be nearly so powerful if a reader had to skip all your audio and video clips in order to listen to the playback, or drop out of playback each time one was encountered.

To automatically start playback of an audio or video clip, you simply reference its id in a text element.

For example, if you had the following video element in Chapter 1:

<video id="vid01" src="video/v01.mp4" controls="controls"/>

you could initiate its playback with the following par:

<par id="video1">
   <text src="chapter_001.xhtml#vid01"/>
</par>

When encountering a text element that references an audio or video element, the reading system is expected to play the clip in its entirety before moving on to the next par.

You can also layer audio narration on multimedia clips by including an audio element in the par (e.g., to provide video scene descriptions for blind readers):

<par id="video1">
   <text src="chapter_001.xhtml#vid01"/>
   <audio src="desc/c01.mp4" clipBegin="0:08:30.000" clipEnd="0:17:33.500"/>
</par>

When adding an audio track, be aware that the length of the audio clip will constrain how much of the video gets rendered. If your audio track is shorter than the length of the video, for example, playback will end when the audio track runs out.


This automatic playback functionality is only available for content referenced from the audio and video elements. You cannot use overlays to initiate content in an object, for example.

Audio Considerations

Before moving on, it's worth looking briefly at the audio files containing the narration. Like audio in content documents, the audio narration referenced in overlay documents can also live outside the EPUB container. When hosting remotely, the manifest entry for the overlay must indicate this fact by including the remote-resources property:

<item id="chapter_001_overlay"
      href="chapter_001_overlay.smil"
      media-type="application/smil+xml"
      properties="remote-resources"/>

Remote hosting may cause playback issues for readers, because there will inevitably be a delay the first time that playback is initiated and the reader has to wait for the buffer to fill. Unless the reading system can detect and buffer the next document’s audio track before reaching the document, which none are known to do at this time, this lag may occur with each new document in the spine.
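
For completeness, in the overlay document itself a remotely hosted clip is simply referenced by an absolute URL in the audio element's src attribute. The host and timings below are hypothetical:

```xml
<par id="heading1">
   <text src="chapter_001.xhtml#c01h01"/>
   <audio src="https://example.com/audio/chapter_001.mp4"
          clipBegin="0:00:24.500" clipEnd="0:00:29.268"/>
</par>
```
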

The sheer size of the audio narration may require external hosting, but one option to increase performance might be to include the first audio file in the container so that it can begin playing immediately. Assuming that reading systems get more sophisticated in their rendering of overlays, this would potentially allow the reading system to focus its resources on buffering the next remote audio file.