WIGファイルを構文解析する - parser combinatorの使い方。 Eitherによるエラー処理

解説

WIGフォーマットは、ゲノム座標上に数値データを載せるときに使われる。このデータは一次元の大きな配列を表現しているのだが、その他のアノテーション（データを説明するメタデータ、グラフデータの幅、ステップの量など）も適切に処理する必要がある。

WIGフォーマットのサンプル

http://genome.ucsc.edu/goldenPath/help/wiggle.htmlより。

browser position chr19:49304200-49310700
browser hide all
#	150 base wide bar graph at arbitrarily spaced positions,
#	threshold line drawn at y=11.76
#	autoScale off viewing range set to [0:25]
#	priority = 10 positions this as the first graph
#	Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="variableStep" description="variableStep format"  visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
49304701 10.0
49304901 12.5
49305401 15.0
49305601 17.5
49305901 20.0
49306081 17.5
49306301 15.0
49306691 12.5
49307871 10.0
#	200 base wide points graph at every 300 bases, 50 pixel high graph
#	autoScale off and viewing range set to [0:1000]
#	priority = 20 positions this as the second graph
#	Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off viewLimits=0:1000 color=0,200,100 maxHeightPixels=100:50:20 graphType=points priority=20
fixedStep chrom=chr19 start=49307401 step=300 span=200
1000
 900
 800
 700
 600
 500
 400
 300
 200
 100

方法

Scalaのparser combinatorを使った構文解析を行う。

サンプルコード

Wig.scala (Genome Weaverプロジェクトより)

WIGフォーマットの各行に対応するクラスを定義

WIG formatは一行ずつ解析できる仕様になっており、各々の行に対応するクラスを定義する。

object WIG {
  sealed abstract class Header extends WIG
  case class Comment(line: String) extends WIG
  case class Browser(line: String) extends Header
  case class Track(property: Map[String, String]) extends Header
  case class VariableStep(chrom: String, span: Int = 1) extends Header
  case class FixedStep(chrom: String, start: Int, step: Int = 1, span: Int = 1) extends Header
  sealed abstract class Data extends WIG
  case class VariableStepValue(position: Int, value: Float) extends Data
  case class FixedStepValue(value: Float) extends Data

  case class Error(message:String) extends WIG
}

sealed abstract class WIG

構文解析の結果

上記のサンプルデータから以下のような出力を得たい:

Browser(browser position chr19:49304200-49310700)
Browser(browser hide all)
Comment(#   150 base wide bar graph at arbitrarily spaced positions,)
Comment(#   threshold line drawn at y=11.76)
Comment(#   autoScale off viewing range set to [0:25])
Comment(#   priority = 10 positions this as the first graph)
Comment(#   Note, one-relative coordinate system in use for this format)
Track(Map(yLineMark -> 11.76, name -> variableStep, priority -> 10, autoScale -> off, description -> variableStep format, color -> 50,150,255, yLineOnOff -> on, viewLimits -> 0.0:25.0, type -> wiggle_0, visibility -> full))
VariableStep(chr19,150)
VariableStepValue(49304701,10.0)
VariableStepValue(49304901,12.5)
VariableStepValue(49305401,15.0)
VariableStepValue(49305601,17.5)
VariableStepValue(49305901,20.0)
VariableStepValue(49306081,17.5)
VariableStepValue(49306301,15.0)
VariableStepValue(49306691,12.5)
VariableStepValue(49307871,10.0)
Comment(#   200 base wide points graph at every 300 bases, 50 pixel high graph)
Comment(#   autoScale off and viewing range set to [0:1000])
Comment(#   priority = 20 positions this as the second graph)
Comment(#   Note, one-relative coordinate system in use for this format)
Track(Map(name -> fixedStep, priority -> 20, autoScale -> off, description -> fixedStep format, color -> 0,200,100, viewLimits -> 0:1000, maxHeightPixels -> 100:50:20, type -> wiggle_0, visibility -> full, graphType -> points))
FixedStep(chr19,49307401,300,200)
FixedStepValue(1000.0)
FixedStepValue(900.0)
FixedStepValue(800.0)
FixedStepValue(700.0)
FixedStepValue(600.0)
FixedStepValue(500.0)
FixedStepValue(400.0)
FixedStepValue(300.0)
FixedStepValue(200.0)
FixedStepValue(100.0)

正規表現による字句解析(lexical analysis)

RegexParser を使うと手軽に正規表現を使った構文解析が行える。空白文字（white spaces)はデフォルトで無視してくれるのでルール中に記述する必要はない。

WIGフォーマットの要素をBNF記法で表すと、以下のようになる。

header        := paramName param* 
param         := paramName "=" paramValue
paramName     := [A-Za-z0-9:\-_\.]	
paramValue    := stringLiteral | quote | value
value         := [^\"\s]+
stringLiteral := '"' (.*) '"'  // 単純表記。実際に使うパターンは以下のコード例を参照
quote         := "'" (.*) "'"  // 単純表記。実際に使うパターンは以下のコード例を参照

この文法に対応する解析構文を行うには、Parse[A]の型を返す要素(elem)をparser内に定義する。正規表現(文字列から.rで作成される) を記述すると、implicit conversionによってParse[String]型の要素に変換される。

object WIGParser extends RegexParsers with Logger {
  // remove quotation symbols
  protected def unquote(s: String): String = s.substring(1, s.length() - 1)
  def paramName: Parser[String] = """[A-Za-z0-9:\-_\.]+""".r
  def value: Parser[String] = """[^\"'\s]+""".r

ダブルクォート、シングルクォートを含んだ文字列のパターン。エスケープシーケンス、Unicode文字列も表現できるように配慮。

  def stringLiteral: Parser[String] = ("\"" + """([^"\p{Cntrl}\\]|\\[\\/bfnrt]|\\u[a-fA-F0-9]{4})*""" + "\"").r ^^ 
  { unquote(_)  }

  def quote: Parser[String] = ("'" + """([^'\p{Cntrl}\\]|\\[\\/bfnrt]|\\u[a-fA-F0-9]{4})*""" + "'").r ^^ 
  { unquote(_) }

^^ { ... } をパターンの後につなげると、出力結果をパターンマッチにより加工できる。マッチした結果を後の処理で扱いやすい形に変更する際に使う。

パターンを組み合わせた構文解析 (parsing)

正規表現の記述力だけでは限界があるため、正規表現でマッチした要素を組み合わせてより複雑な構文を記述できる。

| (or)、 ~ (パターンの連結), 、rep(パターンの繰り返し) 、repsep（パターンを区切り文字のパターンを挟んで繰り返し連結）などが使える。

コード例:

  def paramValue: Parser[String] = stringLiteral | quote | value

  def param: Parser[(String, String)] = paramName ~ "=" ~ paramValue ^^ {
	 case key ~ "=" ~ value => (key, value)
  }
  def header: Parser[(String, Map[String, String])] = paramName ~ rep(param) ^^ {
    case p ~ params => (p, Map() ++ params)
  }

ヘッダの解析

RegexParserのparseAllを呼び出すとパターンに文字列をマッチさせる。成功するとSuccess(マッチした結果、残りのテキスト)が返り、失敗すると NoSuccessが返る。

ここでEither を使い、構文解析に失敗したときの処理(Left)と、成功した時の処理(Right)を同時に扱えるようにするのがコードを複雑にしないコツ。

以下はWIGのheader行(track, fixedStep, variableStep)を解析を開始するコード。Eitherを返す。

  def parseHeader(line: String) : Either[NoSuccess, (String, Map[String, String])] = {
    parseAll(header, line) match {
      case Success(result, next) => Right(result)
      case failure : NoSuccess => Left(failure)
    }
  }

行単位で処理を分ける

RegexParsersでは構文定義を短く書けるが、awkやANTLRのようにオートマトンを生成するわけではなく、正規表現によるマッチを繰り返すので残念ながら速度が速くない。プログラミング言語の解析程度なら問題ないが、ゲノム情報処理のように大規模データ全体を構文解析するには速度的に厳しい。行ごとに処理が分けられる文法なら、RegexParsersによる処理は必要な行に対してのみ行うと良い。

def parseLine(line: String): WIG = {

    // 文字列の変換中に例外が発生したらLeft(例外)、成功すればRight(値)を返す
    def convert[A](s:String, f:String => A) : Either[Throwable, A] = 
      scala.util.control.Exception.allCatch.either(f(s))
    
    def toInt(s:String) = convert(s, {_.toInt})
    def toFloat(s:String) = convert(s, {_.toFloat})
    def invalidLine = WIG.Error("invalid line: " + line)

    def parse : Either[NoSuccess, WIG] = {
      if (line.startsWith("#"))
        Right(WIG.Comment(line))
      else if (line.startsWith("browser"))
        Right(WIG.Browser(line))
      else if (line.startsWith("track"))
        parseHeader(line).right.map(header => WIG.Track(header._2))
      else if (line.startsWith("variableStep")) {
        parseHeader(line).right.map{
          case (name, props) =>
            (props.get("chrom"), toInt(props.getOrElse("span", "1"))) match {
              case (Some(chr), Right(sp)) => WIG.VariableStep(chrom=chr, span=sp)
              case _ => invalidLine
            }
        }
      }
      else if (line.startsWith("fixedStep")) {
        parseHeader(line).right.map{
          case (name, props) =>
            val chrom = props.get("chrom")
            val start = props.get("start")
            val step = toInt(props.get("step").getOrElse("1"))
            val span = toInt(props.get("span").getOrElse("1"))
            (chrom, start, step, span) match {
              case (Some(chr), Some(s), Right(st), Right(sp)) =>
                WIG.FixedStep(chrom=chr, start=s.toInt, step=st, span=sp)
              case _ => invalidLine
            }
        }
      }
      else {
        // data line
        val c = line.trim.split("""\s+""")
        val r = c match {
          case Array(step, value) => (toInt(step), toFloat(value)) match {
            case (Right(st), Right(v)) => WIG.VariableStepValue(st, v)
            case _ => invalidLine
          }
          case Array(value) => toFloat(value) match {
            case Right(v) => WIG.FixedStepValue(value.toFloat)
            case _ => invalidLine
          }
          case _ => invalidLine
        }
        Right(r)
      }
    }

    parse match {
      case Right(m) => m
      case Left(error) => WIG.Error(error.toString)
    }
 }

Eitherを用いて、エラー処理による分岐を減らしたコードにする

Either[A, B] は、AまたはBを返す型である。

上記のように Either[（エラー情報）, （結果）]を返すとき、成功した場合は引き続きの処理を行いたいが、エラーの場合はそのまま次のコードにエラーを伝えたい場合がある。 Either.rightを呼び出すと、値の内容がRightの型の場合は次の処理を行い、Leftの型の場合は以降の処理を無視してLeftの内容（この場合はエラー情報を）そのまま返すことができる。

  // parseHeader(line)   :  Either[NoSuccess, (String, Map[String, String])] を返す
  if (line.startsWith("track"))
    parseHeader(line).right.map(header => WIG.Track(header._2))  // Either[NoSuccess, WIG] を返す

Eitherを使うことで、エラーを含んだデータであっても処理の流れを妨げないようにできる。

WIGファイルを構文解析する parser combinatorの使い方。 Eitherによるエラー処理

解説