
JAPP: Jallib Automated Publishing Process

Author: Sébastien Lelong | Jallib Group
This document describes the main processing steps that bring DITA-based content to the justanotherlanguage.org website.

What is JAPP, and why use it?

JAPP, the Jallib Automated Publishing Process, is the name given to the process used to publish DITA-based content to the justanotherlanguage.org website. The idea is to automatically monitor DITA files and publish their content. And because it's a nice acronym, JAPP sounds like a nice, powerful tool...

DITA files are XML files. This document doesn't explain why DITA is used, but basically, one nice feature of DITA is the ability to assemble documents, and even small portions of documents, into a new one. This target document can then be compiled into several output formats, like HTML and PDF (and many more).

Our website is based on Drupal. Drupal is a widely used CMS (Content Management System), and quite fun to use and develop for. To ease and speed up the publication of documents, and of content in general, JAPP automates the process of monitoring DITA files, compiling them into HTML, preprocessing the HTML to suit the way Drupal stores content, and publishing the content to Drupal.

Overview

Figure 1. JAPP Overview

JAPP works as follows:
  1. JAPP monitors DITA files. This is actually done by a buildbot [1], which listens for SVN changes.
  2. If something has changed, JAPP tells the DITA compiler to produce an HTML file from each modified DITA file (it iterates over the changes).
  3. The produced HTML files are then preprocessed. For example, URLs of links and images are changed to match Drupal's.
  4. The preprocessed HTML is then used to generate a special kind of email, with specific instructions in it. This email is sent to justanotherlanguage __at__ gmail dot com.
  5. This email address is monitored on Drupal's side. When a new email arrives, Drupal retrieves it and processes it, creating a node [2] from the email's content and instructions.
  6. Finally, a website reviewer may publish the content.

Monitoring DITA files

This section is empty, mainly because it's not done yet :)

Compiling DITA files to produce HTML

Compiling DITA files is done using the DITA Open Toolkit. This is probably the easiest part, provided your DITA environment is well configured (DITA configuration is not explained here, though). A shell script is dedicated to this task: dita2html.sh. It takes a DITA file and an output directory as arguments, and basically builds the appropriate command line to run the compiler.

Once compiled, the HTML files are written to this output directory. Images are copied as well, following the paths used in the original DITA files. For example, if a DITA image specifies a path like images/mypix.jpg, the DITA compiler will create an images directory inside the output directory and copy mypix.jpg into it.
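
To make the calling convention concrete, here is a minimal Python sketch of how a caller might prepare the dita2html.sh invocation and predict where an image will land. The argument order (DITA file first, output directory second), the script location and the file name tutorial_pwm1.xml are assumptions made for illustration; the real script may differ.

```python
from pathlib import PurePosixPath


def build_compile_command(dita_file, output_dir, script="./dita2html.sh"):
    """Return the argument list used to run the wrapper script on one DITA file."""
    # Assumed order: DITA file first, output directory second.
    return [script, str(dita_file), str(output_dir)]


def expected_image_path(output_dir, image_ref):
    """Return where an image referenced in the DITA source should be copied."""
    # The relative path used in the DITA file is kept under the output directory.
    return PurePosixPath(output_dir) / image_ref


if __name__ == "__main__":
    command = build_compile_command("tutorial_pwm1.xml", "out")  # hypothetical names
    print(command)
    print(expected_image_path("out", "images/mypix.jpg"))
    # To actually run the compiler: subprocess.run(command, check=True)
```

Keeping the command as a list (rather than one shell string) avoids quoting problems when file names contain spaces.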

This is important because this output directory contains everything needed to render the original DITA file. Everything. And this is a problem... If the DITA file contains references to other DITA files, those other DITA files will get compiled too! [3]

Preprocessing HTML files

Preprocessing HTML is a major step in JAPP. This is a tricky part, luckily automated... So, when DITA produces HTML, there are a few things to adjust:
  • DITA produces a full HTML page, but we just want the content. So we need to extract the inner "body" content.
  • When creating a node in Drupal, you need to specify a title. In the HTML, this title is in an "h1" element. This element also has to be removed, otherwise we'd get the title twice (see later, when publishing by email).
  • Image URLs have to be adjusted too, to match Drupal's. So <img src="images/mypix.jpg"/> should be converted to <img src="/sites/default/files/mypix.jpg"/> or something like that.
  • Links to other HTML pages should be adjusted too. This is very tricky, as Drupal does not use the same links as DITA. By default, a Drupal URL looks like node/20, while you specified a DITA file named amazing-page, for instance. Luckily, we have a chance to make it work anyway. Using the path module, pages can be accessed via another, more human-friendly URL: http://...../content/amazing-page.
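
The four adjustments above can be sketched with a few regular expressions. This is only an illustration of the idea, not the actual htmlize.py code: the function names, the /sites/default/files/ and /content/ prefixes, and the use of the standard src attribute on images are assumptions, and a real implementation would probably use a proper HTML parser.

```python
import posixpath
import re

FILES_PREFIX = "/sites/default/files/"  # assumed location of Drupal files
CONTENT_PREFIX = "/content/"            # assumed prefix of path-module aliases


def extract_body(html):
    """Keep only what is inside <body>...</body>."""
    match = re.search(r"<body[^>]*>(.*?)</body>", html, re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else html


def split_title(body):
    """Return (title text, body without its first h1 element)."""
    match = re.search(r"<h1[^>]*>(.*?)</h1>", body, re.DOTALL | re.IGNORECASE)
    if not match:
        return "", body
    title = re.sub(r"<[^>]+>", "", match.group(1)).strip()
    return title, body[:match.start()] + body[match.end():]


def rewrite_images(body):
    """Point every image at the Drupal files directory, keeping the file name."""
    def repl(m):
        return m.group(1) + FILES_PREFIX + posixpath.basename(m.group(2)) + m.group(3)
    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', repl, body, flags=re.IGNORECASE)


def rewrite_links(body):
    """Turn relative links such as amazing-page.html into /content/amazing-page."""
    def repl(m):
        return m.group(1) + CONTENT_PREFIX + posixpath.basename(m.group(2)) + m.group(4)
    # The negative lookahead skips absolute URLs (http:, mailto:, ...).
    pattern = r'(<a\b[^>]*?\bhref=")(?!\w+:)([^"#]+?)(\.html)(")'
    return re.sub(pattern, repl, body, flags=re.IGNORECASE)


def preprocess(html):
    """Apply all steps and return (title, content)."""
    title, body = split_title(extract_body(html))
    return title, rewrite_links(rewrite_images(body))
```

Calling preprocess() on a compiled page gives back the title and the adjusted content separately, which is what the next steps need.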
Preprocessing HTML does all of this, using the htmlize.py script. It takes an HTML file as input. It generates a subdirectory, usually named topublish, and puts several things into it:
  • content: contains the inner body, with URLs and images adjusted. The title is removed.
  • title: contains the title found in the original HTML file.
  • path: this is the path that will be used on Drupal's side. It corresponds to the DITA filename without its extension.
  • attachments: this is a directory containing all images extracted by the DITA compiler. It can also contain any other file: whatever is in this directory will be sent as attachments in the email. Attachments are shown at the end of the Drupal page. So this is a good place to put a PDF version of the DITA file, for instance.

Once all of this is created, the whole content is ready to be published.
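
As an illustration of the layout described above, the following sketch writes a topublish directory next to an HTML file. The function name and its parameters are made up for the example, and the path file is assumed to hold just the file name without extension; the real htmlize.py may organise things differently.

```python
import shutil
import tempfile
from pathlib import Path


def write_topublish(html_file, title, content, images=()):
    """Write title, content, path and attachments next to the given HTML file."""
    html_file = Path(html_file)
    target = html_file.parent / "topublish"
    attachments = target / "attachments"
    attachments.mkdir(parents=True, exist_ok=True)

    (target / "content").write_text(content, encoding="utf-8")
    (target / "title").write_text(title, encoding="utf-8")
    # The path is the file name without its extension, e.g. "amazing-page".
    (target / "path").write_text(html_file.stem, encoding="utf-8")

    # Images (or any other file, such as a PDF) become email attachments later.
    for image in images:
        shutil.copy(image, attachments / Path(image).name)
    return target


if __name__ == "__main__":
    # Demo in a throwaway directory, with made-up file names.
    with tempfile.TemporaryDirectory() as tmp:
        html = Path(tmp) / "amazing-page.html"
        html.write_text("<html></html>", encoding="utf-8")
        image = Path(tmp) / "mypix.jpg"
        image.write_bytes(b"not a real image")
        target = write_topublish(html, "Amazing page", "<p>Hello</p>", [image])
        print(sorted(p.name for p in target.iterdir()))
```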

Sending content via email

The Python script publish.py handles sending emails. From the previous step, several files have been created: content, path, title, and all attached images. This script just glues everything together to build an email:
  • it builds a multipart email, to hold both the content and the attachments
  • it concatenates the mailhandler commands with the content (more on this later)
  • the email's subject is what is stored in the title file, as the subject will be used to set the node's title.
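
Here is a hedged sketch of how such an email could be assembled with Python's standard email package. It is not the actual publish.py: the function signature, the addresses used when calling it and the content/ prefix added to the path are assumptions, and the pass and path commands simply mirror the example given later in this document.

```python
import mimetypes
from email.message import EmailMessage


def build_email(sender, recipient, title, content, path, password, attachments=()):
    """Build a multipart email carrying mailhandler commands, content and files.

    attachments is an iterable of (filename, bytes) pairs.
    """
    # Commands come first, at the very beginning of the body.
    commands = f"pass: {password}\npath: content/{path}\n"

    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = title  # becomes the node's title
    # The HTML is sent as raw text, in a text/plain part.
    message.set_content(commands + "\n" + content)

    for filename, data in attachments:
        ctype, _ = mimetypes.guess_type(filename)
        maintype, subtype = (ctype or "application/octet-stream").split("/", 1)
        message.add_attachment(data, maintype=maintype, subtype=subtype,
                               filename=filename)
    return message
```

The resulting message can then be handed to smtplib, for instance with SMTP.send_message(), using whatever mail server is available.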

Setting up Drupal, with MailHandler and MailSave

Now let's have a look at what's happening on Drupal's side. Drupal needs to monitor an email address and produce content from submitted emails. This can be done using the mailhandler and mailsave modules.

The mailhandler module takes care of monitoring an email address, retrieving emails and processing them. The From address of the email identifies a Drupal user who is allowed to access all the mailhandler machinery, create nodes, upload files, etc. This special user is named japp.

Processing emails means mailhandler is able to extract some special commands. These commands must appear at the very beginning of the email, and tell mailhandler what to do. There can be default commands (configured in mailhandler) and email-specific commands. The current configuration is the following:

Default commands

type: page
status: 0
promote: 0
pathauto_perform_alias: 0

Here we tell mailhandler to create a Page node type by default. A Page, compared to a Story for instance, doesn't display author information. This is what we want, because authorship is given in the original HTML, not by Drupal (otherwise all nodes would appear to be owned by user japp).

We then tell mailhandler that the created node won't be published (only special users, like reviewers, can see it). It won't be promoted to the front page either, and no automatic aliasing will be performed (we'll set our own path).

These are the default commands, valid for all submitted emails.

Email specific commands

pass: mysecret
path: content/tutorial_pwm1

This is an example. The password is used here to identify user japp, to avoid spamming issues. We also tell Drupal which path to use for this content.
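
To show how the two sets of commands fit together, here is a small sketch that splits the leading "key: value" lines from an email body and lets them override the defaults. It only illustrates the idea: the parsing rules (commands stop at the first line that is not "key: value") are an assumption, and mailhandler's real parser may behave differently.

```python
import re

# Default commands, as configured in mailhandler (see above).
DEFAULT_COMMANDS = {
    "type": "page",
    "status": "0",
    "promote": "0",
    "pathauto_perform_alias": "0",
}

COMMAND_LINE = re.compile(r"(\w+):\s*(.*)")


def split_commands(body):
    """Return (commands found at the top of the body, remaining content)."""
    lines = body.splitlines()
    commands = {}
    for index, line in enumerate(lines):
        match = COMMAND_LINE.fullmatch(line)
        if not match:
            break  # first non-command line: the content starts here
        commands[match.group(1)] = match.group(2).strip()
    else:
        index = len(lines)  # every line was a command (or the body was empty)
    content = "\n".join(lines[index:]).lstrip("\n")
    return commands, content


def effective_commands(body):
    """Merge defaults with email-specific commands; the email wins on conflicts."""
    commands, content = split_commands(body)
    return {**DEFAULT_COMMANDS, **commands}, content
```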

The remaining content is what will be used to produce the node's content. The title is taken from the email's subject. Note that we submit HTML, but as raw text: we don't submit a real HTML MIME part. So, an example of content, with the tags written out literally, could be:

This is my <b>content</b>

and not:

This is my content (with "content" rendered in bold, as a real HTML part would show it)

mailsave is used afterwards to extract attachments, just as if they had been uploaded via the website.

JAPP module: playing with Drupal

Currently, JAPP publishes content on the website, but manual operations are still needed: you need to tell Drupal where to attach your page, that is, in which menu or book it should appear.

Automating this part may come in another JAPP version. For now, manual operations in Drupal may be required to assemble pages together. Luckily, this only has to be done once at the beginning, then very rarely, as content will be updated more often than created (I guess).

Now, imagine the following scenario. You create a DITA document named "my tutorial". It gets published to the website. Now you update your tutorial. Should this trigger another page creation? If so, each update would need manual operations. And the URL of the updated page couldn't be the same as the first one, so all referring pages would need to be updated! Instead, the node's content must be updated, and its attachments renewed. For this, we would need to know, when sending the email, whether a previous page exists and what its node ID is. That can't be done on the sending side: when sending emails, we have no way to know whether a node exists for this page and, if so, what its ID is...

So, a solution is to write a Drupal module that implements hook_mailhandler. In Drupal, a hook is a special PHP function (it has a special name, here ending with _mailhandler). This hook is called before the node is created. From the given path command, it looks up the corresponding node in Drupal's database. Does it exist? If not, a page is created (the hook just lets mailhandler process the email as usual). If it exists, the hook fetches its node ID and revision ID (version), and adds a "nid: XX" command and a "vid: YY" command. mailhandler then continues its processing, now knowing that the email is for an existing node.
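
The decision taken by the hook can be sketched as follows. The real hook is a PHP function inside Drupal; this Python version only illustrates the logic, with a plain dictionary standing in for the database lookup and made-up node and revision IDs.

```python
def add_update_commands(commands, lookup_node):
    """Return the commands, with nid/vid added when the path maps to a node.

    lookup_node is any callable taking a path and returning (nid, vid),
    or None when no node exists yet for that path.
    """
    existing = lookup_node(commands.get("path"))
    if existing is None:
        # Unknown path: leave the commands alone, a new page gets created.
        return dict(commands)
    nid, vid = existing
    # Known path: tell mailhandler which node and revision to update.
    return {**commands, "nid": str(nid), "vid": str(vid)}


if __name__ == "__main__":
    # Stand-in for Drupal's database: path alias -> (node ID, revision ID).
    known_nodes = {"content/tutorial_pwm1": (20, 57)}  # hypothetical IDs
    print(add_update_commands({"path": "content/tutorial_pwm1"}, known_nodes.get))
    print(add_update_commands({"path": "content/brand-new-page"}, known_nodes.get))
```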

The japp module lives in the jallib SVN repository.

Reviewing content

A final step consists of checking for unpublished content on the website, reviewing it and publishing it. It is also during this final step that content gets attached to a menu, a book... This step isn't necessary for content updates: nodes are updated in place in that case, which means the menu or book configuration is kept.

[1] buildbot is a Python application which can monitor many sources of information, like an SVN (Subversion) repository, and react according to given rules.
[2] A node is the name given to a page in Drupal. There can be many node types, like Page, Story, Book, Blog, etc.
[3] I, Seb, still haven't found the magical compiler option, after trying a lot...