This blog post is for anyone who wants an insight into some of the automated IT processes that Visiolink uses to process newspapers and magazines. Happy reading!
Visiolink's SaaS core backend can essentially be broken down into three parts:
- Automatic collection of the right PDF files from the publisher's FTP account at the right time and queuing them for further processing.
- Queue-based conversion of PDF files to two different formats for web and smartphones/tablets respectively.
- A wide range of web services from which the web and device clients extract data.
I plan to concentrate here on the first part, which has recently changed beyond recognition. This transformation has been caused by new requirements in terms of release times and by the fact that there may be several versions of the same publication. Since no two publications are the same, their deadlines, release times and the output from their editorial systems often differ too.
Internally, in our world, this system is known as a normaliser. But what exactly is it that's being normalised? Well, it's like this... Our systems expect publishers to upload single-page PDF files to their FTP accounts. All these individual pages have to be combined into a publication and queued for further processing, so they can initially be displayed in our AUTHENTIC and DYNAMIC solutions on smartphones and the web. But how?
Clearly, a publication consists of a number of pages, and each page may belong to a section. In addition, a publication also has a publication date, because there may be files for lots of different dates. Consequently, each PDF file must, at minimum, specify which page and date it relates to. If the publication has several sections, this must also be specified. In addition, if the editors need to modify the content during the course of the publication period, there may be several versions of each page. And all this information must be included in the file names.
Since file naming, including date format, varies from one publisher to another, it is managed by a set of specific regular expressions for each customer, each of which contains a “group” for each of the required elements and, where appropriate, a pre-defined section. In addition, each customer's configuration can specify a tolerance threshold for the number of missing pages within a section, the release time and other details.
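To make the idea of one regular expression with a group per element more concrete, here is a minimal sketch in Python. The file name layout (`gazette_20240315_B_003_v2.pdf`), the group names, the date format and the default section are all assumptions made up for illustration; they are not Visiolink's actual rules.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

# Hypothetical naming rule for one customer, e.g. "gazette_20240315_B_003_v2.pdf".
# Section and version are optional; date and page are required.
FILENAME_RULE = re.compile(
    r"^gazette_(?P<date>\d{8})_(?:(?P<section>[A-Z])_)?(?P<page>\d{3})"
    r"(?:_v(?P<version>\d+))?\.pdf$"
)
DATE_FORMAT = "%Y%m%d"   # assumed; the date format varies per publisher
DEFAULT_SECTION = "A"    # assumed pre-defined section when the name has none


class PageFile(NamedTuple):
    publication_date: date
    section: str
    page: int
    version: int


def parse_filename(name: str) -> Optional[PageFile]:
    """Return the elements encoded in a file name, or None if the rule does not match."""
    match = FILENAME_RULE.match(name)
    if match is None:
        return None
    try:
        publication_date = datetime.strptime(match["date"], DATE_FORMAT).date()
    except ValueError:
        return None  # eight digits that do not form a real date
    return PageFile(
        publication_date=publication_date,
        section=match["section"] or DEFAULT_SECTION,
        page=int(match["page"]),
        version=int(match["version"] or 1),
    )
```

A file whose name matches no rule simply yields `None` here, so it can be ignored when the pages are combined into a publication.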
Everything was easier in the “good old days”
In the old days, the only way of triggering a completed file collection was to upload a so-called "XML trigger". This XML file would contain the publication date (the same date as the date in the PDF file names), a publication time and a version. This file was often created and uploaded manually, which frequently gave rise to confusion and mistakes. This method of completing a file collection is still available today for those who wish to use it.
As the number of publications on our systems increased, the number of requirements in this respect also increased. The most important requirement was automatic collection of files from a specific time in the day without an XML trigger. This was added and worked well for a long time.
All this was before the industry's requirement for early release, i.e. the ability to release the publication the day/evening before the publication date. The system was therefore only designed to collect files from the same date as the run time. In other words, there was no differentiation between processing date and publication date. So if, for example, you were an evening paper that in exceptional situations published slightly late, just after midnight, the files would not be collected automatically. Moreover, once a paper had been run, it was not possible to re-run it with updated pages. Clearly, both situations could be handled manually by logging into our administration interface and forcing collection from a specific date, but our focus has always been on automating the overall process, so that when the industry comes to us with new requirements, we can incorporate them into our automated process.
As time went on, early release became a requirement within the newspaper industry. With the help of the above-mentioned XML trigger, we were able to accommodate this requirement easily. The advantage of the XML trigger is that the customer has 100% control over when the files are collected, and can also use it to re-run a publication. As already mentioned, however, the XML trigger also had a number of disadvantages, and since more and more customers wanted automatic early release and re-runs, we decided to build a new normaliser.
In the “even better recent times”, things became more complex
So we incorporated the concept of “file sets” in the new normaliser. A file set essentially consists of all the files for a specific date which are eligible for inclusion. The first major task in a normaliser run is therefore to find all the files on the customer's FTP, match them with the file naming rules and group them by date.
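The first step of a run, matching file names against the naming rules and grouping them by date, could be sketched as follows. The naming rule (`20240315_003.pdf`) and the function name are hypothetical, and the FTP listing is represented as a plain list of names rather than a real FTP connection.

```python
import re
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, Iterable, List

# Hypothetical rule: names look like "20240315_003.pdf" (date, then page number).
RULE = re.compile(r"^(?P<date>\d{8})_(?P<page>\d{3})\.pdf$")


def group_into_file_sets(names: Iterable[str]) -> Dict[date, List[str]]:
    """Group file names by publication date, skipping names that match no rule."""
    file_sets = defaultdict(list)
    for name in names:
        match = RULE.match(name)
        if match is None:
            continue  # not a page file we recognise
        try:
            publication_date = datetime.strptime(match["date"], "%Y%m%d").date()
        except ValueError:
            continue  # digits that do not form a valid date
        file_sets[publication_date].append(name)
    # Sort dates and names so that successive runs are easy to compare.
    return {day: sorted(files) for day, files in sorted(file_sets.items())}
```

Each key of the resulting dictionary then corresponds to one file set in the sense described above.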
Each run stores information about each individual known file set, so that a later run can see whether a file has been modified since the last run. If this is the case, and the rest of the configuration so permits, the file set will be collected and queued for processing. However, this collection criterion is not applied in all cases: a file set is not collected if a file has recently been uploaded or deleted. This is designed to ensure that we don't collect a file set while the customer is in the process of uploading or deleting files. In that case, we wait and collect the files shortly after the last change has been made.
Without going into too much detail, this file set information and the collection configuration make it relatively easy to determine, for each individual date, whether files for that date are to be collected and when changes have been made to the files on the FTP account.
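One way to picture this decision is to compare snapshots of a file set between runs. The sketch below is only an illustration under assumptions: a snapshot is taken to be a mapping from file name to last-modified time, "recently" is modelled as a hypothetical five-minute quiet period, and an upload or deletion is detected as a difference in file names between two consecutive runs.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

Snapshot = Dict[str, datetime]  # file name -> last-modified time (assumed shape)


def should_collect(
    collected: Optional[Snapshot],  # snapshot at the last collection, if any
    previous: Optional[Snapshot],   # snapshot seen by the previous run, if any
    current: Snapshot,              # snapshot seen by this run
    now: datetime,
    quiet_period: timedelta = timedelta(minutes=5),  # hypothetical setting
) -> bool:
    """Collect only if the file set changed since last collection and has settled."""
    if not current or current == collected:
        return False  # nothing to collect, or nothing changed since last time
    recently_modified = any(now - mtime < quiet_period for mtime in current.values())
    # A different set of names than in the previous run means files were
    # uploaded or deleted in between, so the customer may still be working.
    added_or_removed = previous is not None and previous.keys() != current.keys()
    return not (recently_modified or added_or_removed)
```

With this approach a file set that is still changing is skipped, and a later run picks it up once two consecutive runs see the same files and no file is newer than the quiet period.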
The decision to overhaul the system also resulted in a number of ideas for further improvements, which were implemented. Key improvements include:
- Rather than only being able to collect files the day and evening before the publication date, the system has been designed so that a number of days in the past and in the future, relative to the run time, can be configured within which the files for a publication should be collected. The number of days in the future is particularly useful for magazines, which are ready well before they are released. The number of days in the past can be used for re-runs. Subsequently, we have also added a specific time of day for this time limit, which - in cases where there is a publication deadline - corresponds to the time when we should have received all the files.
- Rather than only being able to collect files automatically between a specific time of day and midnight, we have added the option of configuring one or more arbitrary periods of the day. For many publications, the system simply needed to be configured to collect files throughout the day, so there would be no manual runs.
- Previously, only log information from the last run was available. With the new system, all recent logs are available, so information can be accessed for each individual run. This means, for example, that in the administration interface we can display both the log for the latest run and the log for the latest run in which files were collected.
- Clearly, what we are dealing with is important journalistic content, so, if the front page is missing, a file set will be considered flawed. It is also possible to configure the system in such a way as to ensure that section front pages are not missing. If the file set is sent for processing, the missing pages will become blank pages later on in the process, which doesn't give the reader the optimum experience, particularly in a coverflow display. And, generally speaking, the system doesn't know how many pages there should be in a given publication, so we can't stop a collection where pages/files are missing at the end of a section. Consequently, it is also possible to configure whether the number of pages in a section should be divisible by 2 or 4, which is generally a sensible number for what were originally physical publications. All these configuration options are there to help individual publishers achieve a high level of quality without errors and omissions in their digital publications.
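The configuration options in the list above (days in the past and future, periods of the day, and the quality checks on a file set) can be illustrated with a small sketch. All class, field and function names and the default values are hypothetical, and the time-of-day limit tied to a publication deadline is not modelled.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta
from typing import Dict, List, Set, Tuple


@dataclass
class CollectionConfig:
    # All field names and defaults are hypothetical.
    days_past: int = 0          # how far back a publication date may lie (re-runs)
    days_future: int = 1        # how far ahead it may lie (early release, magazines)
    periods: Tuple[Tuple[time, time], ...] = ((time(0, 0), time(23, 59, 59)),)
    max_missing_pages: int = 0  # tolerated gaps per section
    require_section_fronts: bool = True
    page_multiple: int = 4      # 2 or 4; 1 disables the check


def in_collection_window(cfg: CollectionConfig, publication_date: date, now: datetime) -> bool:
    """True if this publication date may be collected by a run happening at `now`."""
    today = now.date()
    earliest = today - timedelta(days=cfg.days_past)
    latest = today + timedelta(days=cfg.days_future)
    in_period = any(start <= now.time() <= end for start, end in cfg.periods)
    return earliest <= publication_date <= latest and in_period


def file_set_problems(cfg: CollectionConfig, pages_by_section: Dict[str, Set[int]]) -> List[str]:
    """Return reasons why a file set looks flawed; an empty list means it passes."""
    problems = []
    for index, (section, pages) in enumerate(pages_by_section.items()):
        if not pages:
            problems.append(f"section {section}: no pages")
            continue
        # Page 1 of the first section is treated as the front page of the publication.
        if 1 not in pages and (index == 0 or cfg.require_section_fronts):
            problems.append(f"section {section}: front page missing")
        last_page = max(pages)
        missing = last_page - len(pages)  # gaps below the highest page seen
        if missing > cfg.max_missing_pages:
            problems.append(f"section {section}: {missing} page(s) missing")
        if last_page % cfg.page_multiple != 0:
            problems.append(
                f"section {section}: {last_page} pages is not a multiple of {cfg.page_multiple}"
            )
    return problems
```

Because pages missing at the end of a section cannot be detected directly, the page-multiple check is the only part of this sketch that can catch them, which mirrors the reasoning in the last bullet.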
Looking to the future
This is a very superficial overview of the new normaliser system, and there has naturally been a whole host of quirks that have made it challenging and exciting to develop. We believe, however, that the product will be extremely valuable to both our customers and ourselves. We are currently in the process of migrating the many publications over to the new normaliser. And even though we believe we've accommodated the majority of the industry's current requirements, there's no doubt that in five years' time we'll know a whole lot more than we do today.