For this new Le lab session, we wanted to learn SVG and build something useful for the biologists around us, so we drafted Franklin, our DNA annotation tool.

Le lab #3 - Franklin the DNA sequence annotation tool

On July 1st, 2016, we published an update on Franklin (Le lab #4).

A long time ago in a galaxy region far, far away, researchers close to us asked us to write a tool that would help them maintain their knowledge database on the genomic sequences they are working on. Since then, we unfortunately had no time to design such a tool, but a few weeks ago, we decided that a Le lab session would be appropriate to draft something. And this was a lot of fun!


Hurry to annotate your sequences?

Try Franklin now!


Key problems

For a (molecular) biologist, nucleic acid sequences are the base material they work with. Most of the time, they focus their research on sequences of a few kb, like plasmids or chromosome regions. Working on pieces of DNA means adding a lot of information (a.k.a. annotations) to the sequence itself, like exons, primers or SNPs. This metadata is crucial for designing their experiments, so it must be shared across their teams or, better, with the whole scientific community.

Unfortunately, to their knowledge, there is no simple and freely accessible tool to achieve this. So they call Captain MS Office™ to help them. And here we are: they share an MS Word™ file as a reference knowledge database. Yes, you read that right. We cannot blame them for that: it's painful to maintain, index, search, etc., but it works!

For those of you who are not aware of what a sequence looks like, here is an example:

>gi|671162122:c7086083-7083225 Drosophila melanogaster chromosome 3R
ATGGTCACTCTAATCGCAGTCTGCAATTTACGTGTTTCCAACTTAACGCCCCCAAGTTAATAGCCGTAAT
CATTTGAAAAGAAAGGCACGCACGCACAACGCCATGCGGATCGAACCTGGGGACTCCTTTTGGACGAAAA
AGGCGATGTTTTCCAACGCAGAAAGGCAGTACTTTGAGACGGTCCGTCCGCGGAAGACCAGTGTGAGTAA
AAGTTGACCGTCGATGGCGATTTCACAAGTGACGTTTAAGTGGCGGGAACTTCTACTCACAAATCCCTGA
GCCCTGTGATATGATTTATTTTATGGAGCCGTGATCCGGACGAAAAATGCACACACATTTCTACAAAAAT
ATGTACATCGCGGTGCGATTGTGTCGCTTAAAGCACACGTACACCCACTGTCACACTCACACTCACATGC
ATACACCGCCGGCGAACTTTGGTGTAGTTGGCCACCCTACGAAATTCAACCGCTTCGATTCGAATTTTCG
AATCAACAGTTATTGGCAGTCGAACAAAGGCGGCAAACTTTCGAGTTGCAGAAAAGTTAACGCATTCGAT
TAACCTTTCAGCTTCCGGGCTCCACCGCGCCCAACATAGCCGCTCCGGTAACAAAGGCCACGAAGAAGAA
...

From a computer scientist's point of view, a DNA sequence is a string over a restricted alphabet ([ATGC], but not only), and an annotation can be modeled as:

{
  "positionFrom": 123,
  "positionTo": 145,
  "label": "primer",
  "comment": "Reference: R2D2/C6PO used in project X"
}
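
To illustrate how such an annotation maps back onto the sequence string, here is a minimal sketch. The helper name annotatedSubsequence is hypothetical, and the coordinate convention (1-based, both ends inclusive) is an assumption: the model above does not say which convention Franklin uses.

```javascript
// Hypothetical helper, not part of Franklin's code base.
// Assumption: positionFrom and positionTo are 1-based and inclusive.
function annotatedSubsequence(sequence, annotation) {
  const { positionFrom, positionTo } = annotation;
  if (
    !Number.isInteger(positionFrom) ||
    !Number.isInteger(positionTo) ||
    positionFrom < 1 ||
    positionTo < positionFrom ||
    positionTo > sequence.length
  ) {
    throw new RangeError(
      `Annotation out of bounds: ${positionFrom}..${positionTo}`
    );
  }
  // slice() is 0-based and end-exclusive, hence the "- 1" on the start only.
  return sequence.slice(positionFrom - 1, positionTo);
}

// Usage with a made-up annotation on the first bases of the example above.
const exampleSequence = 'ATGGTCACTCTAATCGCAGTCTGCAATTTACG';
const examplePrimer = {
  positionFrom: 4,
  positionTo: 9,
  label: 'primer',
  comment: 'Hypothetical example',
};
console.log(annotatedSubsequence(exampleSequence, examplePrimer)); // GTCACT
```

Rejecting out-of-range positions early keeps a badly edited annotation from silently pointing at the wrong bases.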

Considering this, we thought that a dedicated web tool could save them time and help them be more efficient. This is where Franklin comes to the rescue!

Scope of this Le lab session

As a week is really short to design a new tool, we decided to focus on four main features:

  • Import a sequence from a FASTA file, like the Tailor gene,
  • Render the sequence in an SVG image,
  • Create, Edit & Remove labels,
  • Create, Edit & Remove annotations.
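
As an illustration of the first feature, a FASTA file can be parsed in a few lines: a line starting with ">" opens a record, and the following lines hold its sequence. The sketch below is not Franklin's actual importer; the function name parseFasta and the shape of the returned records are assumptions made for this example.

```javascript
// Minimal FASTA parser sketch (hypothetical, not Franklin's importer).
// Returns an array of { header, sequence } records.
function parseFasta(text) {
  const records = [];
  let current = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '') continue; // skip blank lines
    if (line.startsWith('>')) {
      // A header line starts a new record.
      current = { header: line.slice(1).trim(), sequence: '' };
      records.push(current);
    } else if (current !== null) {
      // Sequence lines are concatenated; anything before the first
      // header is ignored.
      current.sequence += line.toUpperCase();
    }
  }
  return records;
}

// Usage with the first two lines of the example sequence shown earlier.
const fasta = [
  '>gi|671162122:c7086083-7083225 Drosophila melanogaster chromosome 3R',
  'ATGGTCACTCTAATCGCAGTCTGCAATTTACGTGTTTCCAACTTAACGCCCCCAAGTTAATAGCCGTAAT',
  'CATTTGAAAAGAAAGGCACGCACGCACAACGCCATGCGGATCGAACCTGGGGACTCCTTTTGGACGAAAA',
].join('\n');
const [record] = parseFasta(fasta);
console.log(record.header);
console.log(record.sequence.length); // 140
```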

As we fell in love with React, the interface has been designed as a pure SPA. The sequence, along with its annotations, is rendered as SVG elements:

<svg
  version="1.1"
  baseProfile="full"
  width={this.state.width}
  height={this.state.height}
  xmlns="http://www.w3.org/2000/svg"
>
  <rect width="100%" height="100%" />

  <Annotations
    labels={this.props.labels}
    {...this.state}
  />

  <Sequence
    sequence={this.props.sequence}
    positionFrom={this.props.positionFrom}
    {...this.state}
  />
</svg>

Once rendered, here is what it looks like:

Franklin's Snapshot

Performance issues

We wanted to build an SVG-based application so that an annotated sequence can be exported as an image, then integrated in a publication or edited with a vector graphics editor. Along the way, we learnt many things about SVG, in particular that it can cause severe performance issues when a web browser tries to parse and render it in a web view.

Using a lot of SVG groups (<g> tags) with transformations applied to them turned out to be rather inefficient and had to be finely tuned. Moreover, with huge sequences, the number of DOM nodes grows with the sequence length until the web browser can no longer handle it. For this first release, Franklin is perfectly fluid for sequences of a few kb, but it is not designed for full-genome annotation.
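
One common way to keep the node count bounded, whatever the sequence length, is to render only the rows that are currently visible (often called windowing). This is not something Franklin does in this release; the sketch below only shows the arithmetic, and every name in it (visibleRows, basesPerRow, rowHeight, scrollTop, viewportHeight) is hypothetical.

```javascript
// Windowing sketch (hypothetical, not Franklin's renderer): compute which
// rows of the sequence intersect the viewport, so that only those rows
// need to be turned into SVG elements.
function visibleRows(sequenceLength, basesPerRow, rowHeight, scrollTop, viewportHeight) {
  const totalRows = Math.ceil(sequenceLength / basesPerRow);
  const firstRow = Math.max(0, Math.floor(scrollTop / rowHeight));
  const lastRow = Math.min(
    totalRows - 1,
    Math.floor((scrollTop + viewportHeight - 1) / rowHeight)
  );
  const rows = [];
  for (let row = firstRow; row <= lastRow; row++) {
    const from = row * basesPerRow; // 0-based, inclusive
    const to = Math.min(from + basesPerRow, sequenceLength); // exclusive
    rows.push({ row, from, to });
  }
  return rows;
}

// A 1,000,000-base sequence shown 100 bases per row in a 600px viewport
// with 20px rows needs 30 rendered rows instead of 10,000.
console.log(visibleRows(1000000, 100, 20, 0, 600).length); // 30
```

Each returned range can then be passed to sequence.slice(from, to) to get the text of that row.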

In a few words, there is still room for improvement, like splitting our main SVG element into mixed HTML/SVG elements or dropping SVG in favor of a <canvas>, but this was not the purpose of this Le lab session. In the end, we released a proof of concept with reasonable performance.

Roadmap to a usable tool

Franklin is still in an early stage of development, but here is what we have in mind to make it usable for modern scientists:

  1. Built-in exon support: define the exons of your sequence and choose to display either the coding or the full sequence,
  2. Fuzzy search: use regular expressions to look for a pattern in a (spliced) sequence, with support for searching both strands,
  3. CSV export: save your annotations in a file for sharing and further analysis,
  4. Data persistence à la Monod: work on your sequence safely with client-side encryption, and share your work with your colleagues.
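
To give an idea of what the search on both strands could look like, here is a minimal sketch. It is not an existing Franklin feature: the function names, the shape of the returned hits and the 1-based inclusive coordinates are assumptions, and splicing is left out entirely. The reverse strand is searched by matching the pattern against the reverse complement, then mapping the hit back to forward-strand coordinates.

```javascript
// Hypothetical sketch of a regular expression search on both strands.
const COMPLEMENT = { A: 'T', T: 'A', G: 'C', C: 'G' };

function reverseComplement(sequence) {
  let result = '';
  for (let i = sequence.length - 1; i >= 0; i--) {
    // Assumption: any character other than A, T, G or C becomes "N".
    result += COMPLEMENT[sequence[i]] || 'N';
  }
  return result;
}

function searchBothStrands(sequence, pattern) {
  const forward = sequence.toUpperCase();
  const reverse = reverseComplement(forward);
  const hits = [];
  for (const match of forward.matchAll(new RegExp(pattern, 'g'))) {
    if (match[0].length === 0) continue; // ignore empty matches
    hits.push({
      strand: '+',
      positionFrom: match.index + 1,
      positionTo: match.index + match[0].length,
      match: match[0],
    });
  }
  for (const match of reverse.matchAll(new RegExp(pattern, 'g'))) {
    if (match[0].length === 0) continue;
    // Index i on the reverse complement is position (length - i) on the
    // forward strand, using 1-based coordinates.
    const positionTo = forward.length - match.index;
    hits.push({
      strand: '-',
      positionFrom: positionTo - match[0].length + 1,
      positionTo,
      match: match[0],
    });
  }
  return hits;
}

// "TTT" is absent from the forward strand but present on the reverse one,
// where it pairs with the "AAA" at forward positions 5 to 7.
console.log(searchBothStrands('ATGCAAA', 'TTT'));
```

Reporting reverse-strand hits in forward coordinates means they can be stored with the same positionFrom/positionTo fields as any other annotation.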

Seems pretty cool, huh?

Final remarks

There was a lot of entropy in this session: recent events forced us to split it over non-consecutive days, and we clearly underestimated the magnitude of the task. Still, we have the feeling that Franklin has the potential to be handy for some researchers.

As you might expect, the source code of Franklin is available on GitHub. Feel free to report bugs and/or contribute.

Last but not least, we decided to name it Franklin, not because of him, but rather in homage to Rosalind Elsie Franklin, whose major contributions to the discovery of the structure of DNA were recognized far too late.

Update (2016-07-01) — You can get more information about Franklin by reading: Le lab #4 - Franklin the DNA annotation ninja is back.

