Life in Blue: January 2009

Monday, January 26, 2009

Pfizer buys Wyeth

2008 saw many "impossible" things happen. Giant Wall Street banks went bankrupt, and Obama won the presidency of the United States. Now, in the new year of 2009, Pfizer wants to buy Wyeth for $68 billion, and it would become the largest merger if it is completed as planned.
The New York Times report is linked below.
http://www.nytimes.com/2009/01/26/business/26drug.html?_r=1&hp

Thursday, January 15, 2009

SMILES Parser

A SMILES string and a Mathematica normal expression are essentially the same thing: both represent a graph by breaking its cycles and writing the result as a labeled tree. It is therefore straightforward to parse a SMILES string into an expression, so that further operations can be applied easily with Mathematica functions.
With the aid of string patterns (i.e. regular expressions), recursive programming, and the ability to convert a string into an expression (i.e. macro expansion), an extensible parser is not hard to write.
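As a minimal sketch of the "string into expression" step, the snippet below builds the text of a tree for ethanol (SMILES CCO) by hand and converts it with ToExpression. The head smiMolecule is a hypothetical stand-in for Molecule, assumed here only because Molecule is a built-in symbol in Mathematica 12 and later; Single is an inert label as in the parsed result further down.

```mathematica
(* Hand-written text of the target tree for ethanol, CCO.
   smiMolecule is a hypothetical head used in place of Molecule. *)
str = "smiMolecule[\"C\", Single[\"C\", Single[\"O\"]]]";

(* ToExpression turns the string into an ordinary expression. *)
expr = ToExpression[str];

Head[expr]      (* smiMolecule *)
Length[expr]    (* 2: the root atom "C" and one Single[...] branch *)
expr[[2, 2]]    (* Single["O"] *)
```

Once the string has become an expression, the usual part extraction and pattern matching functions apply to it directly.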
First, two kinds of information are encoded in SMILES: atomic and structural. Atomic information includes the atom symbol, type, bonds, isotope weight, charge, valence, etc. Structural information includes the atom order, the cycle labels, branches, etc. From a chemist's viewpoint, the configuration around double bonds and chirality are also structural information. However, they are not graph properties in the topological sense and can only be encoded through predefined conventions, such as the order of atoms. If we need an independent representation of them, we can encode them as 3D atomic coordinates alongside the other information. Atomic information is very limited in SMILES; it can be treated as known or obtained by other means. For the sake of simplicity, we neglect it for now.
We divide the parsing job into two parts: the static definitions of the atom dictionary and the string patterns, and a function that scans a SMILES string.

Atom Dictionary


smiElements:= {
     {"C", 4},
     {"H",1},
     {"N", 3,5},
     {"O",2},
     {"S",2,4,6},
     {"B",3},
     {"F",1},
     {"Cl",1},
     {"Br",1},
     {"I",1}
};
smiElementsAromatic:={"c","n","o","p","s","as","se"};
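One possible use of the dictionary is a valence lookup. The sketch below works on a copy of a few entries; the names elems, smiValenceRules and smiValences are hypothetical and do not appear in the parser.

```mathematica
(* A few entries copied from smiElements: symbol followed by allowed valences. *)
elems = {{"C", 4}, {"H", 1}, {"N", 3, 5}, {"O", 2}, {"S", 2, 4, 6}};

(* Turn each entry into a rule: symbol -> list of valences. *)
smiValenceRules = (First[#] -> Rest[#]) & /@ elems;

(* Hypothetical helper: an unknown symbol is returned unchanged. *)
smiValences[sym_String] := sym /. smiValenceRules;

smiValences["N"]   (* {3, 5} *)
smiValences["S"]   (* {2, 4, 6} *)
```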

String Patterns


(* Needs to be refined. *)
smiAtomDefault := (#[[1]] & /@ smiElements);
smiChargePatttern := (("-" | "+") ..) | (("-" | "+") ~~ DigitCharacter ...);
smiChiralPattern := ("@" ..) | ("@" ~~ DigitCharacter) | ("@" ~~ LetterCharacter ~~ LetterCharacter ~~ DigitCharacter);
smiAtomCustom :=  "[" ~~ DigitCharacter ... ~~ smiAtomDefault ~~ 
   smiChiralPattern ... ~~ ("H" ~~ DigitCharacter ...) ... ~~ 
   smiChargePatttern ... ~~ "]";
smiAtomPattern := smiAtomCustom | smiAtomDefault;
smiAtomAromatic := smiElementsAromatic | (  "[" ~~ DigitCharacter ... ~~ smiElementsAromatic ~~  smiChiralPattern ... ~~ ("H" ~~ DigitCharacter ...) ... ~~smiChargePatttern ... ~~ "]");
smiBondPattern := "-" | "=" | "#" | ":" | "/" | "\\" | ".";
smiBranchBra := "(";
smiBranchKet := ")";
smiBranchEither = smiBranchBra | smiBranchKet;
smiCyclicLabel := (DigitCharacter ~~ ("%" ~~ DigitCharacter ~~ DigitCharacter) ...) | (DigitCharacter ~~ DigitCharacter ~~ ("%" ~~ DigitCharacter ~~ DigitCharacter) ...);
smiAtomAny := ("" | smiBondPattern) ~~ (smiAtomPattern | smiAtomAromatic) ~~ ("" |smiCyclicLabel);

When scanning a SMILES string, we break it into basic nodes as defined by the pattern smiAtomAny and then handle each case accordingly. In general there are three situations: an atomic node, an atomic node with cycle-breaking labels, and a branching node.
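The splitting into nodes can be tried out with StringCases. The sketch below uses a deliberately reduced set of patterns (a handful of atoms, no bracket atoms, single-digit labels only), so atom, bond and node are simplified stand-ins for smiAtomPattern, smiBondPattern and smiAtomAny, not the definitions above.

```mathematica
(* Reduced patterns: two-letter symbols come first so that "Cl" is not read as "C". *)
atom = "Cl" | "Br" | "C" | "N" | "O" | "S" | "c" | "n" | "o" | "s";
bond = "-" | "=" | "#" | ":";

(* A node is an optional bond, an atom, and any trailing ring-label digits. *)
node = (bond | "") ~~ atom ~~ (DigitCharacter ...);

(* Phenyl acetate: atomic nodes, labeled nodes, and branch parentheses. *)
StringCases["CC(=O)Oc1ccccc1", node | "(" | ")"]
(* {"C", "C", "(", "=O", ")", "O", "c1", "c", "c", "c", "c", "c1"} *)
```

Each token then falls into one of the three situations: a plain atomic node ("C", "=O"), an atomic node with a label ("c1"), or a branch symbol.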

Parse Smiles


smiParseSmiles[s_String, type_: "String"] := 
  Module[{smiParseAtom, smiParse,smiGetLabelIndex, smiParseCyclicLabelMeta},
   
   smiGetLabelIndex[i_String] := Module[{pos, temp}, ...];
   smiParseCyclicLabelMeta[bond_String, y_String] := Module[{z, zz, zzz}, ...];
   smiParseAtom[ss_String] /; StringMatchQ[ss, smiAtomAny] :=StringReplace[...]; 
   smiParseAtom[ss_String] /; StringMatchQ[ss, smiBranchBra] := (...);
   smiParseAtom[ss_String] /; StringMatchQ[ss, smiBranchKet] := (...);
   smiParse[ss_String] := Block[{$RecursionLimit = Infinity},StringReplace[ss, ...]];
   
  StringReplace[s, 
     StartOfString ~~ a : (smiAtomPattern | smiAtomAromatic) ~~ 
       c : ("" | smiCyclicLabel) ~~ rest___ ~~ EndOfString :> 
      "Molecule[\"" ~~ a ~~ "\"" ~~ 
       If[StringQ[c] && StringLength[c] > 0, 
        smiParseCyclicLabelMeta[
         If[StringMatchQ[a, smiAtomAromatic], ",Aromatic", 
          ",Single["], c], ""] ~~ 
       If[StringQ[rest] && StringLength[rest] > 0, smiParse[rest], ""] ~~ 
       "]"]
   ];

Cycle labels are renumbered with unique numbers taken from the natural number sequence. Cycle-breaking labels and branches are handled by a similar mechanism: a stack-like structure (either a data structure or a function) stores the intermediate results. When an opening symbol is met, a new item is built. When a closing symbol is met, the stored intermediate result is processed and one item is popped from the stack. Otherwise, the item is pushed onto the stack. To avoid global variables, which are convenient in recursive code, we define the sub-functions inside smiParseSmiles.
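The renumbering of cycle labels might be sketched as follows. This is not the elided smiGetLabelIndex code; renumberLabels is a hypothetical helper that assumes the labels have already been extracted, in reading order, as a list of strings. The first occurrence of a label opens a ring and receives a fresh number; the second occurrence closes it and frees the label for reuse.

```mathematica
renumberLabels[labels_List] :=
 Module[{open = {}, next = 0, pos},
  Map[
   Function[lab,
    (* look for a currently open ring carrying this label *)
    pos = Position[open, {lab, _}, {1}, 1];
    If[pos === {},
     (* opening label: assign a fresh index and remember it *)
     next++; AppendTo[open, {lab, next}]; next,
     (* closing label: reuse the stored index and drop the entry *)
     With[{idx = open[[pos[[1, 1]], 2]]},
      open = Delete[open, pos[[1]]]; idx]]],
   labels]]

(* The label "1" is reused for a second ring after the first one has closed. *)
renumberLabels[{"1", "2", "1", "1", "2", "1"}]
(* {1, 2, 1, 3, 2, 3} *)
```

Because open is local to the Module, no global state is needed, which is the same reason the sub-functions are defined inside smiParseSmiles.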

The function was tested on the 150 SMILES strings in the OpenBabel package; the whole run took 30 seconds on my laptop. Below is an example SMILES string and its parsed result.

Example SMILES


OC(=O)C1=C(C=CC=C1)C2=C3C=CC(=O)C(=C3OC4=C2C=CC(=C4Br)O)Br

Normal Expression Representation


Molecule["O", 
 Single["C", Double["O"], 
  Single["C", Single[R[1]], 
   Double["C", 
    Single["C", Double["C", Single["C", Double["C", Double[R[1]]]]]], 
    Single["C", Single[R[2]], 
     Double["C", Double[R[3]], 
      Single["C", 
       Double["C", 
        Single["C", Double["O"], 
         Single["C", 
          Double["C", Double[R[3]], 
           Single["O", 
            Single["C", Single[R[4]], 
             Double["C", Double[R[2]], 
              Single["C", 
               Double["C", 
                Single["C", Double["C", Double[R[4]], Single["Br"]], 
                 Single["O"]]]]]]]], Single["Br"]]]]]]]]]]]
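Once a molecule is in this form, ordinary expression functions can query it. The sketch below uses a small hand-written tree for acetic acid (SMILES CC(=O)O) rather than the long result above, and the hypothetical head smiMolecule in place of Molecule, which is a built-in symbol in Mathematica 12 and later.

```mathematica
(* Acetic acid, CC(=O)O, written by hand in the same tree convention. *)
acid = smiMolecule["C", Single["C", Double["O"], Single["O"]]];

(* Atom symbols are the string leaves of the tree. *)
atoms = Cases[acid, _String, Infinity]   (* {"C", "C", "O", "O"} *)

(* Element counts of the heavy atoms. *)
Tally[atoms]                             (* {{"C", 2}, {"O", 2}} *)

(* Number of double bonds: subexpressions with head Double. *)
Count[acid, _Double, Infinity]           (* 1 *)
```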

Thursday, January 08, 2009

Representation of Molecules

There are many ways to represent molecules in the computer, such as connection tables, Z-matrices, Cartesian coordinate lists, SMILES, InChI, SDF, MOL2, PDB, etc. The various formats serve different purposes. For most uses in cheminformatics, a format should be efficient and compact in terms of size, import and export, and all sorts of operations. For internal use, human readability is not so important. A molecule can be naturally viewed as a labeled graph. Robert B. Nachbar proposed a hierarchical data structure (a tree) for genetic programming in his molecular evolution studies. As in the SMILES format, rings are broken to turn a cyclic graph into a tree, and each broken bond is marked with a label. He called this format the Normal Expression representation. In Mathematica, a sample representation is as follows. This view is not only natural but also efficient to parse, though it seems clumsy at first sight.

Ball-Stick Model

Mathematica Normal Expression

Tree Structure
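To make the ring-breaking idea concrete, here is a small sketch for cyclohexane (SMILES C1CCCCC1). The tree is written by hand in the convention of this blog's parser output, with R[1] marking the two ends of the broken bond; the head smiMolecule is a hypothetical stand-in for Molecule, assumed because Molecule is a built-in symbol in Mathematica 12 and later.

```mathematica
(* Cyclohexane: a chain of six carbons whose ends both carry the label R[1]. *)
ring = smiMolecule["C", Single[R[1]],
   Single["C", Single["C", Single["C", Single["C",
       Single["C", Single[R[1]]]]]]]];

(* Every ring label appears exactly twice, once at each end of the broken bond. *)
Tally[Cases[ring, R[k_] :> k, Infinity]]    (* {{1, 2}} *)

(* Bonds of the original graph: tree edges plus one per broken ring bond. *)
nAtoms = Count[ring, _String, Infinity];                   (* 6 *)
nRings = Length[Union[Cases[ring, R[k_] :> k, Infinity]]]; (* 1 *)
nAtoms - 1 + nRings                                        (* 6 *)
```

TreeForm[ring] draws the same expression as a tree.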

Although the molecular evolution work has been continued by researchers at the University of Leiden (they even set up a company to sell their software product), the representation format has been abandoned. Instead, they adopted the so-called TreeSMILES format, which is nearly identical to SMILES. Since Lisp was used in their implementation, I don't see why such a format was abandoned. In fact, in Lisp the format has another advantage: closure scoping and on-the-fly compilation make it possible to represent a molecule as a compiled function. Such an implementation is not only highly efficient but also terse to code (the molecular representation function also serves as the accessor and other related functions).

Wednesday, January 07, 2009

Reader on Screen

The Reader is a novel about the relationship between a teenage boy and a Nazi policewoman during WWII. I first read it four (or eight?) years ago. As I remember, I could not stop reading once I started. The work has all the characteristics of a masterpiece: a touching angle on a major historical event, the conflict between humanity and justice, desire and sex, etc. It was simply an unforgettable reading experience as the author unfolded his story of the heroine, Hanna.

The same title in today's New York Times caught my attention. Clicking the link, I was eager to find out whether this movie is based on the novel I had read. The tagline, "How far would you go to protect a secret?", did not make me sure, but as I read the story summary, all the feelings came back afresh.

The movie, starring Kate Winslet and Ralph Fiennes, has won several Golden Globe nominations. I am eager to see it, though I am not sure whether I should expect much of it.
