Get the Text-to-Speech from Articulate Storyline 360

I needed a process to extract the TTS content from an Articulate Storyline file, and make it reproducible.  Here's my process and outcome.

The Spec

I started investigating the format and, to my surprise, it uses the same approach as MS Office: a zip archive with many files containing the relevant data.

The story/story.xml file inside the zip contains the list of all slides, in order, in two places:

  1. the sceneLst tag contains the list of scenes, each with its slides by ID,
  2. the toc tag contains the list of slides by the refG attribute, which matches an ID present in each slide.

sceneLst

The structure of the sceneLst node (story/sceneLst) looks like this:

<sceneLst>
  <scene g="86678ffa-72f4-4a33-87ba-06660cb7d6a5" 
         verG="e50b1569-154f-4faa-9556-1f054857ff8b" 
         name="An example" 
         desc="" 
         primaryId="00000000-0000-0000-0000-000000000000" 
         sceneType="scene" 
         collapse="false">
    <sldIdLst>
      <sldId>R6NvRGVHRMwC</sldId>
      <sldId>R6TQMcrHEncM</sldId>
      ...
    </sldIdLst>
  </scene>
  <scene ...>
    <sldIdLst>
      ...
    </sldIdLst>
  </scene>
</sceneLst>

It is a list of scene nodes, each with a list of sldId nodes.
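As a sketch, pulling the slide IDs out of such a fragment with xml.dom.minidom looks like this (the slide_ids helper and the trimmed XML are illustrative, not part of the final script):

```python
from xml.dom.minidom import parseString

# A trimmed stand-in for the sceneLst fragment of story/story.xml
SCENE_LST = """<sceneLst>
  <scene g="86678ffa-72f4-4a33-87ba-06660cb7d6a5" name="An example">
    <sldIdLst>
      <sldId>R6NvRGVHRMwC</sldId>
      <sldId>R6TQMcrHEncM</sldId>
    </sldIdLst>
  </scene>
</sceneLst>"""

def slide_ids(scene_lst_xml):
    """Return {scene name: [slide IDs in order]} for a sceneLst fragment."""
    doc = parseString(scene_lst_xml)
    result = {}
    for scene in doc.getElementsByTagName("scene"):
        ids = [node.firstChild.data
               for node in scene.getElementsByTagName("sldId")]
        result[scene.getAttribute("name")] = ids
    return result

print(slide_ids(SCENE_LST))
# {'An example': ['R6NvRGVHRMwC', 'R6TQMcrHEncM']}
```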

toc

The structure of the toc node is somewhat similar:

<toc g="b3cb855a-dbd8-4bef-8990-0c0871574584" 
     verG="a2c27a8d-8d11-4e37-af79-eb5ee2d6989a" 
     projectId="e1fb1df7-a022-4b46-8188-6f033e4b343c">
  <entryLst>
    <tocSceneEntry g="9aabbd20-fd20-474c-bacf-8d27f6350dde" 
                   verG="86341bf7-ed25-4811-a60c-3931a34ca671" 
                   refG="86678ffa-72f4-4a33-87ba-06660cb7d6a5" 
                   corG="898b081a-cb37-4156-b7f8-8f3548fd0eff" 
                   expanded="true">
      <entryLst>
        <tocSlideEntry g="af15fa61-7c9c-44e7-8004-b6ae34492dd5" 
                       verG="ee293cb0-4225-4b30-907b-aa6e5d9b3c73" 
                       refG="84bb9348-7232-4a61-9fc3-0799a8b5a8e5" 
                       corG="10d347fc-fe2a-41ae-a77c-2c2950ba77a7" 
                       expanded="true">
          <entryLst/>
        </tocSlideEntry>
        <tocSlideEntry ...>
          <entryLst/>
        </tocSlideEntry>
       </entryLst>
    </tocSceneEntry>
  </entryLst>
  ...
</toc>

The refG attribute of each tocSlideEntry matches the ID stored in the corresponding slide.

Get the text to speech

The story/_rels/story.xml.rels file contains a list of (sldId, slide XML file name) pairs.

For the text-to-speech content, we look for the ttsPrps tag in the slide XML file, at the path sld/shapeList/tts/ttsPrps.

Closed captioning

Resources for each slide are stored in slides/_rels/(slide name).xml.rels and look something like this:

<?xml version="1.0" encoding="utf-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
	<Relationship Type="media" Target="/story/media/R6aYHe1Ciu1x.png" Id="Rf6f47d320aa74f1c"/>
	<Relationship Type="media" Target="/story/media/R5glk1klqvrM.mp3" Id="Rcb2472a5c22347e7"/>
	<Relationship Type="media" Target="/story/media/R5xtk5W3yuYF.vtt" Id="Rb4d0d34c283547a2"/>
	<Relationship Type="media" Target="/story/media/R6ZuoNYZStjK.jpg" Id="R588c9b3261194ae4"/>
</Relationships>
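Reading the Target paths out of such a rels file is straightforward with minidom. Here is a sketch (media_targets is an illustrative helper, not from the original script):

```python
from xml.dom.minidom import parseString

# A stand-in .rels fragment, trimmed from the example above
RELS = """<?xml version="1.0" encoding="utf-8"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Type="media" Target="/story/media/R5glk1klqvrM.mp3" Id="Rcb2472a5c22347e7"/>
  <Relationship Type="media" Target="/story/media/R5xtk5W3yuYF.vtt" Id="Rb4d0d34c283547a2"/>
</Relationships>"""

def media_targets(rels_xml, suffix=""):
    """Return the Target of every media Relationship, optionally filtered by file suffix."""
    doc = parseString(rels_xml)
    return [rel.getAttribute("Target")
            for rel in doc.getElementsByTagName("Relationship")
            if rel.getAttribute("Type") == "media"
            and rel.getAttribute("Target").endswith(suffix)]

print(media_targets(RELS, ".vtt"))
# ['/story/media/R5xtk5W3yuYF.vtt']
```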

Among these are the .vtt files for the slide, which contain the closed captions for the audio. The format of a .vtt file is:

WEBVTT
Kind: captions
Source: Articulate Closed Captions Editor
Source Version: 3.80.31058.0

00:00:00.150 --> 00:00:04.992
This is an example! 

00:00:05.142 --> 00:00:09.292
This is another subtitle! 

In principle, we can leave the closed captions as they are, because the audio lengths should not vary too much between the different voices.
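If you do need the cues programmatically, they can be pulled out with a small regular expression. This is a sketch that handles simple single-line cues only; parse_cues and the sample text are illustrative:

```python
import re

# A trimmed stand-in for a Storyline-generated .vtt file
VTT = """WEBVTT
Kind: captions

00:00:00.150 --> 00:00:04.992
This is an example!

00:00:05.142 --> 00:00:09.292
This is another subtitle!
"""

# One timestamp line followed by one line of cue text
CUE_RE = re.compile(r"(\d\d:\d\d:\d\d\.\d{3}) --> (\d\d:\d\d:\d\d\.\d{3})\n(.+)")

def parse_cues(vtt_text):
    """Return (start, end, text) triples for each cue in a simple VTT file."""
    return CUE_RE.findall(vtt_text)

print(parse_cues(VTT)[0])
# ('00:00:00.150', '00:00:04.992', 'This is an example!')
```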

The Code

I wrote a bit of code to do this automatically and export a CSV file with a File name and a Content field. The code is in Python, and it's fast enough to retrieve the data from 60+ slides.

Helper functions

Since the data is formatted as XML (barring the .vtt files), I wrote a few helper functions.

The first function extracts the attributes of an XML node (using xml.dom.minidom):

def get_attributes(node):
    return dict(node.attributes.items())

The second function extracts the text content from a list of nodes:

def get_text(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)
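A quick sanity check of the two helpers on a toy node (both are repeated here so the snippet runs standalone):

```python
from xml.dom.minidom import parseString

def get_attributes(node):
    return dict(node.attributes.items())

def get_text(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

# A made-up sldId node with an extra attribute for demonstration
doc = parseString('<sldId kind="demo">R6NvRGVHRMwC</sldId>')
node = doc.documentElement

print(get_attributes(node))       # {'kind': 'demo'}
print(get_text(node.childNodes))  # R6NvRGVHRMwC
```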

Get the ID -> Filename Mapping

This function opens the rels file described above and creates a dictionary where the keys are the IDs and the values are the file names associated with those IDs:

def get_id_file(zf):
    '''
    Params:
    - zf - The PyZipFile object created for the .storyline file
    Returns: a dict containing the slide ID (key) and slide file name (value)
    '''
    map = zf.open("story/_rels/story.xml.rels")
    map_xml = parseString(map.read())
    result = {}
    for item in map_xml.getElementsByTagName("Relationship"):
        attrs = get_attributes(item)
        if attrs['Type'] == 'slide':
            result[attrs['Id']] = attrs['Target']

    return result
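Since the function only needs something with an open() method, it can be exercised without a real .storyline file by building a minimal zip in memory. The rels content below is a made-up stand-in, and the helpers are repeated so the sketch is self-contained:

```python
import io
from xml.dom.minidom import parseString
from zipfile import ZipFile

def get_attributes(node):
    return dict(node.attributes.items())

def get_id_file(zf):
    '''Return a dict mapping slide IDs to slide file names.'''
    rels = zf.open("story/_rels/story.xml.rels")
    rels_xml = parseString(rels.read())
    result = {}
    for item in rels_xml.getElementsByTagName("Relationship"):
        attrs = get_attributes(item)
        if attrs['Type'] == 'slide':
            result[attrs['Id']] = attrs['Target']
    return result

# A made-up rels file: one slide relationship, one media relationship
RELS = ('<Relationships>'
        '<Relationship Type="slide" Target="/story/slides/R6NvRGVHRMwC.xml" Id="R6NvRGVHRMwC"/>'
        '<Relationship Type="media" Target="/story/media/pic.png" Id="Rmedia1"/>'
        '</Relationships>')

# Build a zip in memory containing just the rels file
buf = io.BytesIO()
with ZipFile(buf, "w") as out:
    out.writestr("story/_rels/story.xml.rels", RELS)

print(get_id_file(ZipFile(buf)))
# {'R6NvRGVHRMwC': '/story/slides/R6NvRGVHRMwC.xml'}
```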

Processing all scenes

Once we have the mapping, we can start processing the scenes. The function below gets the list of scenes from the story/story.xml file described above.

def process_scenes(zf, map):
    # Read the story
    story = zf.open("story/story.xml")
    story_xml = parseString(story.read())

    scenes = story_xml.getElementsByTagName("sceneLst")[0].childNodes

    for scene in scenes:
        # Each child of sceneLst is a scene node
        process_scene(zf, scene, map)

As you can see, it just iterates through the scenes and calls process_scene, the function to process one scene, described just below.

Processing one scene

The single scene processor is as follows:

def process_scene(zf, scene, map):
    '''
    Params:
    - zf - The PyZipFile object created for the .storyline file
    - scene - the `scene` XML node
    - map - the slide ID/slide file name dict
    '''
    scene_properties = get_attributes(scene)

    # Get the slide IDs (the first child is the sldIdLst node)
    idx = 1
    for sld_id_node in scene.childNodes[0].childNodes:
        sld_id = get_text(sld_id_node.childNodes)
        file_name = map[sld_id]

        # Strip the leading '/' from the target path
        process_slide(zf, idx, file_name[1:], scene_properties['name'])
        idx += 1

The function does the following:

  1. Gets the scene name (from scene_properties)
  2. Gets the list of slide IDs
  3. For each ID, retrieves the file name and calls process_slide to, well, process the slide :)

We also keep a counter (idx), to disambiguate slides with the same name.

Processing one slide

The slide processor function does all the necessary work to extract the text-to-speech (TTS) string(s):

def process_slide(zf, slide_idx, file_name, scene_name):
    '''
    Params:
    - zf - The PyZipFile object created for the .storyline file
    - slide_idx - the slide order number
    - file_name - the file name for the slide
    - scene_name - the scene name - used for the output file name CSV column
    '''
    # Read the slide
    slide = zf.open(file_name)
    buf = slide.read().decode("utf-8")
    # Insert newlines between tags, otherwise minidom chokes on the file
    buf = ">\n<".join(buf.split("><"))

    scene_xml = parseString(buf)
    scene_attrs = get_attributes(scene_xml.getElementsByTagName("sld")[0])

    tts_items = scene_xml.getElementsByTagName("ttsPrps")

    i = 0
    if tts_items:
        for tts in tts_items:
            tts_attrs = get_attributes(tts)
            print(f"\"{scene_name}/{slide_idx:02d} - {scene_attrs['name']}.{i:02d}.mp3\",\"{tts_attrs['synthTxt']}\"")
            i += 1
    else:
        print(f"\"{scene_name}/{scene_attrs['name']}\",\"NO TTS FOR THIS SLIDE\"")

There are a few things that happen here:

  1. We open and read the slide's XML file
  2. We create the minidom representation of the file
  3. We identify all TTS components in tts_items
  4. For each component, we write a CSV line containing a representation of an output file name and the TTS content

A few notes here:

  • We add a \n in between > and <, because otherwise minidom barfs
  • The output file name has a path component (the scene name), and the file name component containing the order of the slide and the slide name. This is to avoid the case where multiple slides have the same name
  • The TTS string may have line breaks
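Because the TTS strings may contain line breaks (and possibly quotes), the manual quoting in the print calls is fragile; Python's csv module handles the escaping for you. A sketch of the alternative (write_rows and the sample row are mine, not from the original script):

```python
import csv
import io

def write_rows(rows, out):
    """Write (file name, content) rows as properly escaped CSV."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(["File name", "Content"])
    writer.writerows(rows)

# A content string with an embedded newline and quotes
buf = io.StringIO()
write_rows([("Slide 1/01 - Glossary.00.mp3", 'Line one\nLine "two"')], buf)
print(buf.getvalue())
```

The embedded quotes come out doubled and the newline stays inside the quoted field, which Excel and other CSV readers handle correctly.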

All together now!

Now that we have everything, the main part of the script is:

from zipfile import PyZipFile
from xml.dom.minidom import parseString

# Open the zip file
zf = PyZipFile("storyline.story")

id_to_slide = get_id_file(zf)

print("File name, Content")

process_scenes(zf, id_to_slide)

This will allow us to get a CSV like this:

File name, Content
"Slide 1/01 - Glossary.00.mp3","ABC. The Alphabet"
"Slide 1/02 - Glossary.00.mp3","BEL. Belgium"
"Slide 1/03 - Glossary.00.mp3","RO. Romania"

You can import the file into Excel following this StackOverflow question.

HTH,