AVAA Toolkit

Introduction

Corpus analysis is a complex task: it requires learning editors for different file formats as well as multiple tools, which are often command-line based or require programming knowledge.

$ATK makes it easy to create pipelines connecting ecosystems to process raw data (automated transcription, format conversion...), and to query large corpora of annotations coming from various sources in order to extract advanced statistics and generate beautiful, always up-to-date charts and timelines.

$ATK is also a flexible converter: it takes as input XML files describing the style and operations used to generate an HTML document, and exports only the relevant portions of videos and their thumbnail snapshots, minimizing the final document size and potential load times if hosted online.

Annotations Formats

$ATK understands the following file formats

Some formats are available thanks to the TEI-CORPO project.

Raw Media Formats

$ATK can also process the following media types

Prerequisite

The following software must be installed:

Installation

Simply extract the latest release zip

Other XML folders can be added to the editor by passing their paths as arguments (edit the .bat file to see how, and check the CLI arguments).

Update

When $ATK is already installed, follow these steps to update:

  1. Close $ATK if it is running
  2. Delete these folders: scripts, editor, tests
  3. Download the latest release zip
  4. From the zip, copy to your installation folder (replace existing) avaa-toolkit.jar, include, scripts, editor, tests
  5. Restart $ATK
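
Steps 2 and 4 above can be scripted. The following is only a sketch, not an official updater: the function name `update_install` is hypothetical, and it assumes the listed entries sit at the top level of the release zip. Close $ATK before running it.

```python
import shutil
import zipfile
from pathlib import Path

# Names taken from the update steps above.
FOLDERS_TO_DELETE = ["scripts", "editor", "tests"]
ENTRIES_TO_COPY = ["avaa-toolkit.jar", "include", "scripts", "editor", "tests"]


def update_install(release_zip: Path, install_dir: Path) -> None:
    """Delete the stale folders (step 2), then copy the new entries (step 4)."""
    for name in FOLDERS_TO_DELETE:
        shutil.rmtree(install_dir / name, ignore_errors=True)

    with zipfile.ZipFile(release_zip) as archive:
        for member in archive.namelist():
            # Assumption: the entries are at the top level of the archive.
            top_level = member.split("/", 1)[0]
            if top_level in ENTRIES_TO_COPY:
                # extract() overwrites files that already exist.
                archive.extract(member, install_dir)
```

Note that the include folder is not deleted, only overwritten, so files you added there are kept.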

Editor

An editor for $ATK's XML documents is available in the browser.

To begin, start $ATK by running the launcher (avaa-toolkit.bat on Windows or avaa-toolkit.sh on Linux),
then navigate with your browser to avaa-toolkit.org.
If no internet connection is available, use the offline editor provided in your installation folder (open index.html).

XML Structure

$ATK will process XML files and convert them to HTML.
It expects a document with the following structure

Queries

$ATK is all about querying and filtering annotations. Inside the VIEW or CHART tag, complex queries can be built to extract only specific annotations. This is done via the SELECT tag, whose attributes can be combined to make a curated selection of annotations:

Attributes of type regexp (*-match) have additional options:

When multiple attributes are used, the selection will consist only of the annotations fulfilling all the constraints.
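
The "all constraints must hold" rule is a logical AND over the attributes. The sketch below illustrates it in Python. It is not $ATK's implementation: the `Annotation` fields `tier` and `value` and the `select` function are hypothetical, and using `re.search` (match anywhere in the text) is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    tier: str   # hypothetical field
    value: str  # hypothetical field


def select(annotations, **patterns):
    """Keep the annotations whose fields match every given regular expression."""
    compiled = {field: re.compile(p) for field, p in patterns.items()}
    return [
        a for a in annotations
        # all() implements the AND: one failing constraint rejects the annotation.
        if all(rx.search(getattr(a, field)) for field, rx in compiled.items())
    ]
```

For example, `select(annotations, tier=r"^speaker", value=r"hello")` keeps only annotations that satisfy both patterns, so adding a constraint can only narrow the selection.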

##scripts##

Processor Pipelines

When using processors, a pipeline is created for each section of the document.

A pipeline initially contains a virtual copy of the corpus and its associated media files.
The media files are then modified sequentially by each processor.

Pipeline Input Modes

The pipeline can be fed different initial media files, by defining the processor-pipeline-input setting.

  1. corpus: this is the default mode if the processor is placed at the beginning of a section, and will feed the pipeline with the corpus media files
  2. section-assets: this is the default mode if the processor is placed after a view which exported clips, and will feed the pipeline with all the clips/snapshots exported by the section up to the point where this processor is reached
  3. all-assets: this mode must be manually selected, and will feed the pipeline with all the clips/snapshots exported by the document up to the point where this processor is reached

The corpus mode is useful to process corpus files directly (audio anonymization, format conversion...), while the all-assets mode could be used, for instance, to apply effects only to the exported media of a document intended for sharing with peers.
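
The default rules above can be summarized as a small decision function. This is an illustrative sketch only: `resolve_input_mode` is a hypothetical name, and falling back to corpus for any processor that does not follow an exporting view is an assumption, since the text only describes the two default cases.

```python
from typing import Optional

VALID_MODES = ("corpus", "section-assets", "all-assets")


def resolve_input_mode(explicit: Optional[str], after_exporting_view: bool) -> str:
    """Return the input mode for a processor.

    explicit: value of the processor-pipeline-input setting, or None if unset.
    after_exporting_view: True if the processor follows a view that exported clips.
    """
    if explicit is not None:
        if explicit not in VALID_MODES:
            raise ValueError(f"unknown input mode: {explicit}")
        return explicit
    # all-assets is never chosen implicitly; it must be selected manually.
    return "section-assets" if after_exporting_view else "corpus"
```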

Pipeline Chain

Processors inside a pipeline (that is, for now, a section of the document) are executed one after another, each processor working on the results of the previous one.

Complex chains of processors can be built to automate heavy tasks, alleviating the burden of manually running each step and verifying its consistency.
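
Conceptually, a chain is a sequential fold: the output of one processor becomes the input of the next. The sketch below models this with plain Python functions over file names. The `run_pipeline` function and the two processors are hypothetical stand-ins, not $ATK processors.

```python
from typing import Callable, Iterable, List

Processor = Callable[[List[str]], List[str]]


def run_pipeline(media_files: Iterable[str], processors: Iterable[Processor]) -> List[str]:
    """Run each processor on the result of the previous one."""
    current = list(media_files)  # working copy, the caller's list is untouched
    for processor in processors:
        current = processor(current)
    return current


# Hypothetical processors, for illustration only.
def to_wav(files: List[str]) -> List[str]:
    return [f.rsplit(".", 1)[0] + ".wav" for f in files]


def anonymize(files: List[str]) -> List[str]:
    return [f.replace(".wav", ".anon.wav") for f in files]
```

Here `run_pipeline(["a.mp4", "b.mp4"], [to_wav, anonymize])` hands the converted files to the second processor, which never sees the original .mp4 names.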

Pipeline and Views

Views placed after a processor (in the same section) will inherit its modified media files when exporting clips and snapshots.
This is helpful when extracting annotations from cuts of raw media files, to avoid processing long corpus media files when testing samples, or when preprocessing a media file before it is exported into clips by later views.

Processors generating annotations will make these annotations immediately available in the main corpus (and not only for the current pipeline), hence for all subsequent views and processors in the document.

Styling

The style can be changed via CSS. The generated HTML makes it easy to target specific elements or apply styling rules to the whole page. Each view has its own structure of elements, and a simple "Inspect Element" in the browser will reveal the selectors.

Embedded CSS

Styles can be defined directly in the XML file, by using a STYLE tag.

These styles will only apply to this specific HTML document.

<STYLE>
  .view-timeline td { border-color:red; }
  .view-timeline tr.tier-header { text-align:right; }
</STYLE>

CSS File

Styles can be defined in a separate CSS file, that must be placed in the include folder.

All the generated HTML documents will load this file and have these styles in common.

e.g. my-styles.css

h2 { color:green; }
section { border-left: 2px solid gray; }

Styling Views

Views generate simple HTML code and try to follow common guidelines, so that applying styles is straightforward.

Annotations' text labels always have the annotation class, so for instance to change the color of all annotations:

.view .annotation { color:red; }

Making PDF

$ATK can also generate PDF, though interactive features like videos or dynamic charts won't work in this format, for obvious reasons.

Chrome (or Chromium) must be installed on the system, and the cli argument --pdf must be specified.

The Chrome executable should be detected automatically. If detection fails, provide its path with the --chrome-exe argument.

If everything works correctly, a file.pdf should be generated alongside the file.html document.
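
If automatic detection fails and you are unsure where Chrome lives, a small script can search the PATH for a path to pass to --chrome-exe. This is a sketch and not how $ATK itself detects Chrome: `find_chrome` is a hypothetical helper, and the candidate names are assumptions covering common installs.

```python
import shutil
from typing import Iterable, Optional

# Assumption: common executable names; adjust the list for your system.
DEFAULT_CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
    "chrome",
)


def find_chrome(candidates: Iterable[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return the full path of the first candidate found on the PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

A result of None only means that no candidate is on the PATH. Chrome may still be installed elsewhere, in which case its path must be looked up manually.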

Command Line

$ATK is made for the command line and can integrate seamlessly in any tool chain.

##cli##

Advanced Video Processing

Some processors require a full ffmpeg version to work.

Troubleshooting

Installation and first run

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.UnsupportedClassVersionError:
    org/avaatoolkit/Main has been compiled by a more recent version of the Java Runtime (class file version 55.0),
        this version of the Java Runtime only recognizes class file versions up to X

Solution: Your Java runtime is outdated. Follow the steps in the Custom Java Runtime section below.
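
The class file version in the message above maps directly to a Java release: for Java 5 and later, the release number is the major version minus 44, so 55.0 means Java 11. The sketch below reads that number from the first 8 bytes of a .class file (the magic number 0xCAFEBABE, then the minor and major versions, all big-endian). The function name is hypothetical.

```python
import struct


def class_file_java_release(header: bytes) -> int:
    """Return the Java release targeted by a class file, given its first 8 bytes."""
    if len(header) < 8:
        raise ValueError("need at least 8 bytes")
    magic, _minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    # Valid for Java 5 and later (major version 49 and up).
    return major - 44
```

To inspect a real file, pass it the bytes from `open(path, "rb").read(8)`.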


java.net.BindException: Couldn't bind to any port in the range `42042:42042`.
    at org.glassfish.grizzly.AbstractBindingHandler.bind(AbstractBindingHandler.java)
    at org.glassfish.grizzly.nio.transport.TCPNIOTransport.bind(TCPNIOTransport.java)
    at org.glassfish.grizzly.http.server.NetworkListener.start(NetworkListener.java:)
    at org.glassfish.grizzly.http.server.HttpServer.start(HttpServer.java)
    at org.avaatoolkit.server.Daemon.start(Daemon.java)
    at org.avaatoolkit.Main.main(Main.java)

Solution: The toolkit is already running with the --server argument. Close it before starting a new instance.
Solution: Your firewall has a strict policy regarding localhost port bindings. Add a rule to allow localhost:42042.
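
To tell the two causes apart, check whether something is already bound to the port. This sketch tries to bind to it with a plain socket. `port_is_free` is a hypothetical helper, and binding to 127.0.0.1 is an assumption about the address the server uses.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True
```

If `port_is_free(42042)` returns False, another process (most likely a running $ATK instance) holds the port. If it returns True and the error persists, the firewall is the more likely cause.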

Custom Java Runtime

On some operating systems, the installed Java runtime might not be up to date, which prevents $ATK from executing properly.
To run $ATK, at least Java 11 is required. To install a valid runtime only for $ATK:

  • Go to OpenJDK and download the archive for your system
  • Extract the archive into the $ATK installation folder
  • Rename the extracted jdk-20.0.x folder to jdk
  • The folder tree structure should be avaa-toolkit / jdk / bin /
  • The launcher should now automatically use the runtime provided in the jdk folder
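
The rename and the folder check can be scripted as follows. This is a sketch under assumptions: `adopt_extracted_jdk` is a hypothetical helper, and it assumes the archive was extracted into a folder whose name starts with "jdk-" directly inside the installation folder.

```python
from pathlib import Path


def adopt_extracted_jdk(install_dir: Path) -> Path:
    """Rename an extracted jdk-* folder to jdk and verify that jdk/bin exists."""
    target = install_dir / "jdk"
    if not target.exists():
        # Assumption: the extracted folder is named jdk-<version>.
        candidates = sorted(p for p in install_dir.glob("jdk-*") if p.is_dir())
        if not candidates:
            raise FileNotFoundError(f"no extracted jdk-* folder in {install_dir}")
        candidates[-1].rename(target)
    if not (target / "bin").is_dir():
        raise FileNotFoundError(f"expected a bin folder inside {target}")
    return target
```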