I am Romain Deltour, I work for the DAISY Consortium, and I am going to spend
the first half hour talking about the DAISY Pipeline 2 project, which is an open framework
for automatic document processing. I am going to give an introduction to the tool, then
briefly describe some possible production workflows, and then if we have some time
left for questions I'd be happy to answer them. So let's first talk about the background
- there is an ever-existing demand for accessible content. This demand comes from a wide range
of user groups and content must be published through a wide variety of distribution channels - online
content, CDs, SD cards - and to an increasing variety of devices. Visually impaired users
use different kinds of devices, like Braille displays, hardware DAISY players, laptops,
iPads, tablets, whatever. Because there is a wide range of user groups and a variety
of devices, there are a lot of different output formats.
This project is the follow-up to the Pipeline 1 project, which was started in 2002 by the
DAISY Consortium and is now in maintenance mode. It has been quite successful and widely used
in the DAISY community to produce accessible content. But at the time we created
the project, some technologies and standards were not ready. Now we've decided to come
up with the Pipeline 2 project and totally redesign the software to better rely on open
standards and new technologies.
Our high-level objective is to be efficient - enable the tool to produce many documents
in a short time. When a publisher wants to produce a newspaper for the next day, it has
to be quite efficient. The tool has to be low-cost, which means it is easy to develop,
easy to adapt to a new publishing workflow, and easy to maintain over time. And it needs
to be versatile, which means it adapts well, interoperates with different
systems, and can produce different output formats.
Our approach is to come up with a modular system - why a modular system? Because if
your system is modular, based on several components, it's easier to extend. If I want to augment
an EPUB 3 production with MathML processing, I just add a MathML-aware module to the system.
It must be easy to customize - if I have some special needs in my organization to produce
the content, I must be able to tweak the production workflow to meet these needs. Modularity
also makes the tool easier to integrate and opens it to both commercial and non-profit use.
The module system is a plugin system itself, so commercial companies can come up with their
own plugins for the platform.
The other big item in our approach is to promote single-source publishing - that's not a requirement,
but it is recommended. What is single-source publishing? It means that we use an XML master
document in order to produce different output formats. Markus said earlier that when
your document has a satisfying level of structure and good semantic
inflections, then you get the accessibility features almost for free,
and that's what we are talking about now. If we have a sufficiently rich XML master,
then with automatic production we can transform it into a variety of accessible output formats
such as DAISY digital talking books, EPUB 3 books, Braille content, or large print. Of course,
this is neither a requirement nor a limitation of the tool. It is what we're suggesting, but
the production workflow can be adapted to different use cases and workflows. We can
transform input formats into these XML masters, then into other output formats, or we can
go a totally different route. And as for the XML master, currently we are focusing on DAISY
A.I., also known as the DAISY 4 Authoring and Interchange standard, which is the successor
to the DTBook format. It's an authoring format with an XML schema that can be used to describe
almost every document.
The third big item of our approach is that we focus on accessibility and quality. A valid
EPUB book is not necessarily accessible. We strive to produce some content that is inherently
accessible and inherently well-structured, which makes it a quality publication.
Now as for the architecture, I won't dive into the tiniest technical details here, but
just to give you an overview of what kind of technologies we are using. We are relying
on W3C standards, notably XProc, XSLT, and XPath, which are all open recommendations
and native XML processing technologies. We do that because it's easier to produce
XML from XML with technologies that were made for the job. XProc is
the XML pipeline language: a language developed to orchestrate XML processing
steps in a workflow. It has been a W3C Recommendation since May 2010, and there are already many
open source and commercial engines available. Then, on top of this core XProc engine, we're
adding a module system. Again, each step in the production workflow is an independent
cohesive software component, which we call a module. It can be implemented in several
technologies like XSLT, XPath, and Java code, and it is all orchestrated by XProc. Then
we have the runtime framework, which is like the glue code that ties all these components
together. It makes the XProc engine aware of the modules and makes it possible to run
these modules with the XProc engine. It's based on Java technology and the OSGi
module system, which helps us come up with a service-oriented approach where we can plug
in different pieces of functionality like job management, logging, web services, you
name it.
So that's it for the architecture: we have this core XProc technology, the core processing
technologies are implemented with open standard recommendations, and everything runs
in a Java-based open source runtime which implements a module system.
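To give a rough idea of what orchestrating steps with XProc looks like, here is a minimal pipeline sketch that chains an XSLT transformation and a RELAX NG validation. The stylesheet and schema file names are illustrative, not actual Pipeline 2 modules:

```xml
<p:declare-step version="1.0"
    xmlns:p="http://www.w3.org/ns/xproc">
  <p:input port="source"/>   <!-- the input XML document -->
  <p:output port="result"/>  <!-- the transformed, validated document -->

  <!-- Step 1: transform the source with an XSLT stylesheet
       ("dtbook-to-html.xsl" is a hypothetical name) -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="dtbook-to-html.xsl"/>
    </p:input>
  </p:xslt>

  <!-- Step 2: validate the result against a RELAX NG schema -->
  <p:validate-with-relax-ng>
    <p:input port="schema">
      <p:document href="html5.rng"/>
    </p:input>
  </p:validate-with-relax-ng>
</p:declare-step>
```

In a real Pipeline 2 workflow, each of these steps would be provided by an independent module that the runtime framework makes available to the XProc engine.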
So what are the deployment options for our tool? We have the possibility to use the tool
as a command-line tool; this is already available, and it will be revamped early next year.
The tool can also be called through a RESTful web service API, which is already available
and will be gradually enriched and improved based on feedback. We also want to develop
a web application for the tool, a web UI that you can access with your browser. The target
release for this web application is June 2012. And ultimately, we'll also come up with a
lightweight standard desktop UI. It's going to be a sequence of dialogs to guide you through
the conversion process. The goal is to be able to embed that into third party applications.
For instance, if you have ever used the Word Save as DAISY plugin, it calls the Pipeline
under the hood and it pops up a sequence of dialogs to invoke the Pipeline process. So
that's the kind of user interface we're looking at. This desktop UI is planned for the first
half of 2013.
Now, rather than demoing an automated tool (it's not very interesting: I just start the
process, it runs, and it gives me a file, so there's no real point in showing that), I'm
going to describe briefly some sample workflows that are available
or in the works for the tool. First, I'm going to briefly talk about EPUB production, how
we do that, and what it takes. I won't dive into every step of this workflow, but basically
we have a generic process when we talk about EPUB production. We look at the input file
set, we determine the reading order of the file set, and then based on this reading order
we process the content to convert it into HTML5, possibly add some media overlays. When
we've done that, we extract the metadata from the documents, we automatically create a navigation
document, then we package the file set (zip it). What's interesting here is that we try
to have each of these steps as independent as possible from the previous ones which means
that they are interchangeable (if possible) and reusable for different production workflows,
depending on the input and output. We try to automate, of course, as much as we can.
For instance, the navigation creation is fully automated based on the structure of the content
document. We look at the HTML markup, and if it is well structured, if it has the proper
markup and top-level sections, semantic inflections, and things like that, we can automatically
generate the navigation files.
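As an illustration of that step, here is roughly what the automated navigation generation produces from a well-structured content document. The file names, ids, and headings are illustrative:

```xml
<!-- Well-structured content document (chapter1.xhtml, hypothetical) -->
<section id="c1">
  <h1>Chapter 1</h1>
  <section id="c1s1">
    <h2>Section 1.1</h2>
    <!-- content -->
  </section>
</section>

<!-- EPUB 3 navigation document generated from the heading structure -->
<nav epub:type="toc" xmlns:epub="http://www.idpf.org/2007/ops">
  <ol>
    <li><a href="chapter1.xhtml#c1">Chapter 1</a>
      <ol>
        <li><a href="chapter1.xhtml#c1s1">Section 1.1</a></li>
      </ol>
    </li>
  </ol>
</nav>
```

The nesting of the generated table of contents simply mirrors the nesting of the sections and headings, which is why good structure in the source matters so much.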
I'm now displaying another workflow diagram that shows an instance of the workflow applied
to a DAISY 3 to EPUB 3 conversion. It basically shows how we use some of the components when
we have a complete conversion requirement. For instance, when we have a DAISY 3 file
set, to determine the reading order of the final EPUB 3 publication, we are looking at
the DAISY navigation file for that. When we want to generate media overlays in an EPUB
3 publication produced from a DAISY 3 book, we take the existing .smil
files and audio from the DAISY 3 file set.
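To give an idea of the result, here is a minimal EPUB 3 media overlay fragment (EPUB 3's flavor of SMIL), with the carried-over audio clips re-pointed at the new HTML content. File names and clip times are illustrative:

```xml
<smil xmlns="http://www.w3.org/ns/SMIL"
      xmlns:epub="http://www.idpf.org/2007/ops" version="3.0">
  <body>
    <seq id="s1" epub:textref="chapter1.xhtml">
      <!-- one par per synchronized phrase -->
      <par id="p1">
        <text src="chapter1.xhtml#phrase1"/>
        <!-- audio clip carried over from the DAISY 3 file set -->
        <audio src="audio/chapter1.mp3"
               clipBegin="0:00:00.000" clipEnd="0:00:07.500"/>
      </par>
    </seq>
  </body>
</smil>
```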
This next workflow is very high-level. I'm going to briefly describe how we can use this
tool to add some advanced TTS annotations to an EPUB publication. The workflow goes like
this: start with the XML master, for instance in the DAISY Authoring and Interchange format;
then, depending on the original markup of the document, it may need to be improved.
For instance, I give here an example of a sentence, which is: "Have you seen the movie
'La Vita e Bella?'" It's just a paragraph, it's tagged as a paragraph, and "La Vita e
Bella" is not marked up. So we can have some preprocessing tools to enrich this markup
and make it better by tagging "La Vita e Bella" within a name element. This kind of markup
enrichment can either be fully automated or require human interaction. Sometimes there are
things that an automated tool cannot do, and in that case we need some human interaction.
Here, for instance, to identify movie titles or proper names, e-mail addresses, whatever,
we can query some databases to do that. So once we have this properly tagged XML master,
we then transform that into an EPUB. During this transformation, we can plug in a module
that will talk to a remote lexicon, and the lexicon will know
how to pronounce this foreign proper name. So the module will be able to add pronunciation
annotations to the produced EPUB file. The annotation says: I am using this phonetic alphabet,
and the phonetic description of the title is this. The interesting part here is that this is really a good
use case for automated production because we can automatically query some organization-wide
lexicons and data to improve the publication - either improve the document we intend to
archive, or improve the live publication process. The Pipeline 2 project has built-in support
for remote services - we can basically make some HTTP requests and call some web services.
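For the sentence above, such an annotation could be expressed with the SSML attributes that EPUB 3 content documents allow; the IPA transcription here is only illustrative:

```xml
<p xmlns:ssml="http://www.w3.org/2001/10/synthesis">
  Have you seen the movie
  <!-- ssml:ph carries the phonetic transcription (illustrative IPA),
       in the alphabet declared by ssml:alphabet -->
  <span ssml:alphabet="ipa"
        ssml:ph="la ˈvita ɛ ˈbɛlla">La Vita e Bella</span>?
</p>
```

A reading system or TTS engine that understands these attributes can then pronounce the title correctly instead of guessing.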
So if an organization makes its lexicons available to the public or to partners as an
online service, this service can be called from the automated production tool to enrich
the outcome of the EPUB 3 production with these text-to-speech annotations or other accessibility
features. It's particularly interesting for text-to-speech annotations because building
a comprehensive and rich lexicon is very time-consuming and costly, so organizations usually
maintain large lexicons that they gradually enrich and enhance over time.
So in this example I showed how to add some inline pronunciation hints, but
you can also extract a small subset of the big organization lexicon and
integrate it natively within your EPUB.
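Such an embedded subset could take the form of a small PLS (Pronunciation Lexicon Specification) document referenced from the EPUB. The entry below is illustrative:

```xml
<lexicon version="1.0" xml:lang="it" alphabet="ipa"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <lexeme>
    <grapheme>La Vita e Bella</grapheme>
    <!-- illustrative IPA transcription -->
    <phoneme>la ˈvita ɛ ˈbɛlla</phoneme>
  </lexeme>
</lexicon>
```

The advantage over inline hints is that the publication carries only the handful of entries it actually needs, while the full lexicon stays maintained in one place in the organization.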
To summarize, the Pipeline 2 project is an open platform. It's all based on open source
software, either from third-party developers or from the DAISY Consortium. It's currently
licensed under the LGPL, which is a commercial-friendly license, but we are discussing
other license possibilities such as the Apache License.
It's a collaborative project: it's maintained and led by the DAISY Consortium, but it also
involves DAISY Consortium members such as the National Library for the Blind of Norway, the Swiss
Library for the Blind, RNIB, and other organizations. It also
has built-in accessibility and customizability, which means that we intend to make the tool
extensible for special purposes and customizable to specific organization workflows.
What is available today in terms of concrete conversions? We can go from DTBook (DAISY XML)
to DAISY A.I., the new version of DAISY XML. We can go from DAISY A.I. to EPUB 3. And we
can go from DAISY 2.02 digital talking books to EPUB 3, with optional media overlays. But
that's just for the first iteration of the software. We have many things in the works;
at the top of the list is improved EPUB 3 support. We're going to come up with additional
input formats that will be transformable into EPUB books. I'm thinking of other DAISY formats
like DAISY 3, HTML, RTF, things like that. By improved EPUB 3 support, I'm also talking
about these TTS annotations. This is not yet available, but deployment is planned
for next year. Also in the works is Braille production. Several prototype
solutions will be developed by an independent working group
focused on this Braille topic. We'll also work on TTS-based production. Users usually
prefer text narrated by a human, but sometimes there is no time for human
narration, for instance when you want to deliver newspapers. More and
more reading devices and reading systems have their own TTS engines built in, so they
can speak text themselves. At the same time, if you do the TTS processing
upfront, you can rely on more server resources, more processing power, and better lexicons,
which all makes for a better result. So we are still targeting TTS-based production.
Okay, that's it. Thank you.