Thursday, January 15, 2009

SMILES Parser

SMILES and Mathematica normal expression are essentially the same. Both are a presentation of a graph by breaking cycles into labeled trees. It is straightforward to parse a SMILES string into an expression so that further operations can be easily applied in Mathematica functions.
With the aid of string patterns (i.e. regular expression), recursive programing and the ability to convert a string into an expression (i.e. macro expansion), an expandable parser is not hard to write.
First, there are two kinds of information encoded in SMILES: atomic and structural information. Atomic information includes atom symbol, type, bonds, isotope weight, charge, valence, etc. Structural information includes atom order, the cyclic labeling, branches, etc. Configuration around double bonds, chirality are structural information from a chemist's viewpoint. However, they are not graphic properties in the sense of topology and can only be encoded into some predefined conventions, such as the orders of atoms. If we need independent representation form of them, we can encode such information as 3D coordinates of atoms along other information. Atomic information is very limited in SMILES and can be thought as known knowledge or can be obtained with other means. For the sake of simplicity, we may neglect it now.
We divide the parsing job into two parts, static definition of atom dictionary and all sorts of string patterns and scan function of a SMILES string.

Atom Dictionary

smiElements:= {
{"C", 4},
{"H",1},
{"N", 3,5},
{"O",2},
{"S",2,4,6},
{"B",3},
{"F",1},
{"Cl",1},
{"Br",1},
{"I",1}
};
smiElementsAromatic:={"c","n","o","p","s","as","se"};


String Patterns

(* Need to be refined. *)
smiAtomDefault := (#[[1]] & /@ smiElements);
smiChargePatttern := (("-" | "+") ..) | (("-" | "+") ~~ DigitCharacter ...);
smiChiralPattern := ("@" ..) | ("@" ~~ DigitCharacter) | ("@" ~~ LetterCharacter ~~ LetterCharacter ~~ DigitCharacter);
smiAtomCustom := "[" ~~ DigitCharacter ... ~~ smiAtomDefault ~~
smiChiralPattern ... ~~ ("H" ~~ DigitCharacter ...) ... ~~
smiChargePatttern ... ~~ "]";
smiAtomPattern := smiAtomCustom | smiAtomDefault;
smiAtomAromatic := smiElementsAromatic | ( "[" ~~ DigitCharacter ... ~~ smiElementsAromatic ~~ smiChiralPattern ... ~~ ("H" ~~ DigitCharacter ...) ... ~~smiChargePatttern ... ~~ "]");
smiBondPattern := "-" | "=" | "#" | ":" | "/" | "\\" | ".";
smiBranchBra := "(";
smiBranchKet := ")";
smiBranchEither = smiBranchBra | smiBranchKet;
smiCyclicLabel := (DigitCharacter ~~ ("%" ~~ DigitCharacter ~~ DigitCharacter) ...) | (DigitCharacter ~~ DigitCharacter ~~ ("%" ~~ DigitCharacter ~~ DigitCharacter) ...)
smiAtomAny := ("" | smiBondPattern) ~~ (smiAtomPattern | smiAtomAromatic) ~~ ("" |smiCyclicLabel);


When scanning a SMILES string, we break it into basic nodes as defined in the pattern of smiAtomAny, then different cases are handled accordingly. There are three situations in general: an atomic node, an atomic node with breaking labels and a branching node.

Parse Smiles

smiParseSmiles[s_String, type_: "String"] :=
Module[{smiParseAtom, smiParse,smiGetLabelIndex, smiParseCyclicLabelMeta},

smiGetLabelIndex[i_String] := Module[{pos, temp}, ...];
smiParseCyclicLabelMeta[bond_String, y_String] := Module[{z, zz, zzz}, ...];
smiParseAtom[ss_String] /; StringMatchQ[ss, smiAtomAny] :=StringReplace[...];
smiParseAtom[ss_String] /; StringMatchQ[ss, smiBranchBra] := (...);
smiParseAtom[ss_String] /; StringMatchQ[ss, smiBranchKet] := (...);
smiParse[ss_String] := Block[{$RecursionLimit = Infinity},StringReplace[ss, ...];

StringReplace[s,
StartOfString ~~ a : (smiAtomPattern | smiAtomAromatic) ~~
c : ("" | smiCyclicLabel) ~~ rest___ ~~ EndOfString :>
"Molecule[\"" ~~ a ~~ "\"" ~~
If[StringQ[c] && StringLength[c] > 0,
smiParseCyclicLabelMeta[
If[StringMatchQ[a, smiAtomAromatic], ",Aromatic",
",Single["], c], ""] ~~
If[StringQ[rest] && StringLength[rest] > 0, smiParse[rest]] ~~
"]"]
];


Cycling labeled are re-organized with an unique number from the natural number sequence. Cycle breakage labeling and branching handling uses a similar mechanism: a stack-like (either in the form of a data structure or a function) data structure is used to store intermediate results. As an open symbol is met, a new item is built. When a closure symbol is met, the stored intermedate is handled and pop up one item from the stack. Otherwise, we push the item into the stack. To avoid using global variables which is convenient in recursive coding, we define the sub-functions inside smiParseSmiles.

The function was tested on the 150 smiles strings in the OpenBabel package. Total time of 30 seconds was costed on my laptop. Below is an example SMILES string and the parsed result.

Example SMILES

OC(=O)C1=C(C=CC=C1)C2=C3C=CC(=O)C(=C3OC4=C2C=CC(=C4Br)O)Br


Normal Expression Represenation

Molecule["O",
Single["C", Double["O"],
Single["C", Single[R[1]],
Double["C",
Single["C", Double["C", Single["C", Double["C", Double[R[1]]]]]],
Single["C", Single[R[2]],
Double["C", Double[R[3]],
Single["C",
Double["C",
Single["C", Double["O"],
Single["C",
Double["C", Double[R[3]],
Single["O",
Single["C", Single[R[4]],
Double["C", Double[R[2]],
Single["C",
Double["C",
Single["C", Double["C", Double[R[4]], Single["Br"]],
Single["O"]]]]]]]], Single["Br"]]]]]]]]]]]

No comments: