17 Jan 2013

WEBVTT: Refactoring the Parser

So I've been working on refactoring the parser a bit over the last week. I talked about this in the one of my previous posts. You can check out the work I've done so far on my GitHub.

Squashing Code is Nice

So I have successfully squashed the node and cuetext token data code and it's worked out nicely. This reduces the overall amount of code that needs to be written and the complexity of the code that needs to be written to access those structs. Here's what they look like now:

Token:

struct
webvtt_cuetext_token_t {
  webvtt_cuetext_token_type token_type;
  webvtt_string tag_name; // Only used for start token and end token types.
  union {
    webvtt_string text;
    webvtt_timestamp time_stamp;
    webvtt_cuetext_start_token_data *start_token_data;
  };
};

/**
* Represents a start tag in the cue text.
* These take the form of <[TAG_NAME].[CLASSES] [POSSIBLE_ANNOTATION]> in the cue text.
*/
struct
webvtt_cuetext_start_token_data_t {
  webvtt_string_list *css_classes;
  webvtt_string annotations;
};

Node:

struct
webvtt_node_t {
  /**
  * The specification asks for uni directional linked list, but we have added a parent
  * node in order to facilitate an iterative cue text parsing solution.
  */
  webvtt_node *parent;
  webvtt_node_kind kind;

  union {
    webvtt_string text;
    webvtt_timestamp time_stamp;
    webvtt_internal_node_data *internal_data;
  };
};

struct
webvtt_internal_node_data_t {
  webvtt_string annotation;
  webvtt_string_list *css_classes;

  webvtt_uint alloc;
  webvtt_uint length;
  webvtt_node **children;
};

Here is one example how it simplifies the code:

Before:

if( webvtt_create_node_from_token( token_ptr, &temp_node_ptr, current_node_ptr ) != WEBVTT_SUCCESS )
  /* Do something here? */
  continue;
  else
  {
  webvtt_attach_internal_node( (webvtt_internal_node_ptr)current_node_ptr->concrete_node, temp_node_ptr );
  if( WEBVTT_IS_VALID_INTERNAL_NODE( temp_node_ptr->kind ) )
    current_node_ptr = temp_node_ptr;
  }
}

After:

if( webvtt_create_node_from_token( token, &temp_node, current_node ) != WEBVTT_SUCCESS )
  /* Do something here? */
  { continue; }
  else {
    webvtt_attach_internal_node( current_node, temp_node );
     if( WEBVTT_IS_VALID_INTERNAL_NODE( temp_node->kind ) )
       { current_node = temp_node; }
      }
  }
}

Tokens vs Nodes

In my last blog post about refactoring the parser I talked about removing the token stuff and instead just using webvtt_node straight away. I haven't done that yet because I have discovered a problem with it. The problem is that tokens have a way of representing an end tag whereas nodes do not. This is important because when the tokenizer returns an end tag token a new node is not instantiated, instead the current node will change to the current nodes parent. Therefore, since a node has no idea of what an end tag is it cannot replace a token... yet. I'm still not convinced that we can't do this in some way.

Those Pesky Pointer Typedefs

I've also begun to remove all the pointer typedefs I've been using throughout my code. I liked it at first, but now that I have been thinking about it just using straight pointer notation is probably better.

  1. It's more clear as to what is actually going on.
  2. I was typedef-ing pointers as a [object]_ptr. So if I instead just use pointer notation it will decrease the character count everywhere. Instead of [object]_ptr everywhere we will just have *[object].

Memory Leaks in C++ Bindings

The last thing I have done for now isĀ  changed the C++ bindings over to instantiate the entire tree of nodes when the cue is created. This way the cue knows about the nodes and will be able to release them at the end if it's lifespan. In this way we can avoid memory leaks.

UTF16 to UTF8

The final thing that I need to do is get the parser using UTF8 instead of UTF16. Caitlin is currently doing a major overhaul on the webvtt_string stuff to change it from using UTF16 to instead use our webvtt_byte which is straight UTF8. That way we can have all those nice string functions that we created for UTF16 with our UTF8 stuff.

Forward to Review

I'm looking to get this done as well as debugging the unit tests by Friday. Then I believe the parser will be ready for review.

- Tagged in cdot, seneca college, mozilla, and open-source.

comments powered by Disqus

Wow coding Copyleft Richard Eyre 2013

Code for this site can be found on GitHub.