Over the past two weeks, I needed to build an in-house compound collection database. There are many open and commercial ones available on the Internet, but they do not serve my needs, so I decided to build one of my own. I have no previous experience in building databases, so I started out by wondering whether I could use Mathematica or Lisp to build a database from scratch, without relying on a third-party engine. I later realized the enormous amount of work that would require.
After searching the Mathematica 7 documentation, I found that it provides an easy-to-use interface to many database systems, such as MySQL and Oracle. The Mathematica versions of the SQL functions are quite limited, and you cannot use them as freely as normal functions (the expressions have to be convertible to SQL syntax), but you can always use SQLExecute[] to run arbitrary SQL statements. The interface functions take care of data type conversion. I chose the open-source MySQL as my database engine.
Mathematica 7 introduced new Import capabilities for SDF and SMILES files, in addition to its support for MOL2 and PDB. It also provides a closed-source database called ChemicalData. At first, I thought this functionality might save me a lot of time, since most of my data sources are available in either SDF or SMILES format. After trying it on several large commercial catalogs, I abandoned the idea. Real-life files do not conform 100% to the file format standards; they are full of all sorts of oddities. Import fails in so many situations that I would have had to write a lot of wrapper code to handle the errors. So I started writing my own parser, and the job was much simpler than I had expected. The SDF format is quite old, and its definition is in the FORTRAN style, with fixed-width columns.
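To make the parsing idea concrete, here is a minimal sketch of a tolerant SDF reader. It is written in Python rather than Mathematica, and it is not the parser described in this post: the function names (`split_records`, `parse_record`, `parse_sdf`) are hypothetical, and it assumes V2000 molfiles with records terminated by `$$$$`. The point it illustrates is that the fixed-width columns can be read with plain string slices, and that a malformed record can be logged and skipped instead of aborting the whole import.

```python
import re

def split_records(text):
    """Yield each SDF record as a list of lines; records end with a '$$$$' line."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record:
                yield record
            record = []
        else:
            record.append(line)
    if any(l.strip() for l in record):
        yield record  # tolerate a missing final delimiter

def parse_record(lines):
    """Parse one record: molecule name, atoms (symbol, x, y, z), and data items."""
    if len(lines) < 4:
        raise ValueError("record shorter than the 4-line header")
    counts = lines[3]
    # FORTRAN-style fixed-width fields: atom count in columns 1-3, bond count in 4-6.
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    if len(lines) < 4 + n_atoms + n_bonds:
        raise ValueError("atom/bond block is truncated")
    atoms = []
    for line in lines[4:4 + n_atoms]:
        # Three 10-character coordinate fields, one space, then the element symbol.
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))
    data, key = {}, None
    for line in lines[4 + n_atoms + n_bonds:]:
        if line.startswith(">"):
            match = re.search(r"<([^>]*)>", line)
            key = match.group(1) if match else None
            if key is not None:
                data[key] = []
        elif key is not None:
            if line.strip():
                data[key].append(line)
            else:
                key = None  # a blank line ends the current data item
    return {"name": lines[0].strip(), "atoms": atoms,
            "data": {k: "\n".join(v) for k, v in data.items()}}

def parse_sdf(text):
    """Return (records, errors); bad records are reported by position, not fatal."""
    records, errors = [], []
    for index, lines in enumerate(split_records(text), start=1):
        try:
            records.append(parse_record(lines))
        except ValueError as exc:
            errors.append((index, str(exc)))
    return records, errors
```

Collecting the errors alongside the good records makes it possible to load a large catalog in one pass and inspect the odd entries afterwards.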
Although Mathematica was not much help in parsing the input files, it is a good choice for coding the workflow and for pre- and post-processing of data items. Cheminformatics toolkits take time to write, but Mathematica's interface to external programs offers a convenient pipe for employing third-party toolkits such as Open Babel, OEChem, and JChem.
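The "pipe to an external toolkit" pattern can be sketched as follows. This is a Python illustration of the general idea, not the Mathematica code used here; `run_tool` is a hypothetical helper, and the Open Babel command line in the comment assumes `obabel` is installed and on the PATH.

```python
import subprocess

def run_tool(command, input_text, timeout=60):
    """Send input_text to an external program's stdin and return its stdout.

    Raises RuntimeError if the program exits with a non-zero status, so a
    failing toolkit call does not silently produce an empty result.
    """
    result = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(
            "%s failed with status %d: %s"
            % (command[0], result.returncode, result.stderr.strip())
        )
    return result.stdout

# Hypothetical usage, assuming Open Babel is installed:
#   canonical = run_tool(["obabel", "-ismi", "-ocan"], "OCC\n")
```

Keeping the toolkit behind one small function like this means the rest of the workflow does not care which external program does the chemistry.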
After debugging many obvious and less obvious errors, the system is finally running smoothly. On my 2-core Lenovo laptop with 2 GB of memory, the single-process code loads about 800,000 compounds (1.8 GB on disk), including the redundancy check and property calculation, in 12 hours. The average CPU load is about 1.5, and the time cost per entry does not seem to increase even as the database grows dramatically.
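One way to get a redundancy check whose cost stays roughly flat is to let the database enforce uniqueness through an index. The sketch below illustrates that idea under several assumptions: it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server, the table and column names are hypothetical, and it assumes duplicates are detected by a canonical string key such as a canonical SMILES (the post does not say which key was actually used).

```python
import sqlite3

def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) compounds table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS compounds (
               id INTEGER PRIMARY KEY,
               canonical_key TEXT NOT NULL UNIQUE,  -- indexed by the UNIQUE constraint
               name TEXT
           )"""
    )
    return conn

def add_compound(conn, canonical_key, name):
    """Insert a compound unless its key is already stored.

    Returns True if a new row was inserted, False if it was a duplicate.
    """
    cursor = conn.execute(
        "INSERT OR IGNORE INTO compounds (canonical_key, name) VALUES (?, ?)",
        (canonical_key, name),
    )
    return cursor.rowcount == 1

conn = open_db()
print(add_compound(conn, "CCO", "ethanol"))        # new entry
print(add_compound(conn, "CCO", "ethyl alcohol"))  # same key, skipped
conn.commit()
```

Because the uniqueness check is an index lookup rather than a scan of the table, its cost grows only slowly with the number of rows, which would be consistent with the flat per-entry time observed above. In MySQL, a UNIQUE index combined with `INSERT IGNORE` plays a similar role.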
Wednesday, February 04, 2009