What is the algorithm behind this structure proposal engine?
To understand the algorithm, consider a simple example: assume your
spectrum consists of ONE SINGLE line, a doublet at 160.0 ppm. Which
structures fulfill all constraints derived from the C-NMR spectrum?
e.g.
- 1,2,4,5-Tetrazine (161.0 ppm)
- Formic acid anhydride (158.5 ppm)
- 1,2-Diformylhydrazine (159.2 ppm)
- Formylhydrazine (159.2 ppm)
Instead of taking the exact shift value, the spectrum is translated
into a PATTERN like: "In the typical lowfield heteroaromatic region
there is a doublet; all other regions of the spectrum contain no
signal." This is similar to Wolfgang Bremser's approach named
SAHO-search; internally, the SAHO implementation used in CSEARCH is
also used here.
This spectral information (the peak table) is encoded into a
15-character 'Spectral Hash Key' which describes the spectral pattern.
The spectrum of acetone (29 Q, 206 S) is described as: "There is one
singlet in the typical CO region and one quartet in the region
around 30 ppm; there are no other signals."
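The translation of a peak table into a pattern can be illustrated with a
minimal Python sketch. The region boundaries, the region names and the
output format are assumptions made for illustration only; they are not the
SAHO regions used by CSEARCH, and the result is not the real 15-character
Spectral Hash Key.

```python
from collections import Counter

# Hypothetical shift regions in ppm (lower bound inclusive, upper exclusive).
# These boundaries are illustrative only, NOT the SAHO regions of CSEARCH.
REGIONS = [
    ("aliphatic", 0.0, 50.0),
    ("heteroatom-substituted", 50.0, 100.0),
    ("aromatic/olefinic", 100.0, 150.0),
    ("lowfield heteroaromatic", 150.0, 190.0),
    ("carbonyl", 190.0, 240.0),
]


def region_of(shift):
    """Return the name of the hypothetical region containing the shift."""
    for name, low, high in REGIONS:
        if low <= shift < high:
            return name
    raise ValueError(f"shift {shift} ppm is outside all regions")


def spectral_pattern(peaks):
    """Translate a peak table [(shift_ppm, multiplicity), ...] into a pattern.

    Only the number of lines of each multiplicity per region is kept;
    the exact shift values are discarded.
    """
    counts = Counter((region_of(shift), mult.upper()) for shift, mult in peaks)
    return tuple(sorted(counts.items()))


acetone = [(29.0, "Q"), (206.0, "S")]
query = [(160.0, "D")]
tetrazine = [(161.0, "D")]

print(spectral_pattern(acetone))
# The query and 1,2,4,5-tetrazine fall into the same pattern:
print(spectral_pattern(query) == spectral_pattern(tetrazine))
```

Because the exact shift values are dropped, the doublet at 160.0 ppm and
the doublet at 161.0 ppm produce the same pattern, which is why tetrazine
appears as a proposal for the single-line example.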
The description of the spectral pattern is then converted into a
unique hash code that gives the URL of a webpage holding all structure
proposals whose spectra have an identical description. The situation
is complicated by the fact that a line cannot always be attributed to
exactly one region of the spectrum; therefore, up to 1,024 alternate
spectral hash keys can be constructed for ONE spectrum in this
implementation. The structures are accessed as InChIKeys and the
spectra as SpectralKeys, allowing direct access to the corresponding
webpage hosted on 'http://nmrpredict.orc.univie.ac.at'.
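One way to picture the alternate keys: a line close to a region boundary
may be assigned to either neighbouring region, and every combination of
such choices yields its own key. Ten ambiguous lines with two choices each
would give 2^10 = 1,024 combinations; whether the actual implementation
arrives at its limit this way is an assumption. The sketch below uses
hypothetical boundaries and a hypothetical tolerance, and a truncated SHA-1
digest stands in for the real 15-character key format, which is not
described here.

```python
import hashlib
from itertools import product

# Hypothetical region boundaries (ppm) and tolerance; illustrative only.
BOUNDARIES = [50.0, 100.0, 150.0, 190.0]
TOLERANCE = 3.0  # a line this close to a boundary counts as ambiguous


def candidate_regions(shift):
    """Return the region indices a line may belong to (one or two)."""
    index = sum(shift >= b for b in BOUNDARIES)
    candidates = {index}
    for i, boundary in enumerate(BOUNDARIES):
        if abs(shift - boundary) <= TOLERANCE:
            # Boundary i separates region i from region i + 1.
            candidates.update((i, i + 1))
    return sorted(candidates)


def alternate_keys(peaks):
    """Build one stand-in key per distinct combination of region choices."""
    choices = [candidate_regions(shift) for shift, _ in peaks]
    mults = [mult.upper() for _, mult in peaks]
    keys = set()
    for assignment in product(*choices):
        pattern = sorted(zip(assignment, mults))
        text = ";".join(f"{region}{mult}" for region, mult in pattern)
        # Truncated SHA-1 is a stand-in, not the real key format.
        digest = hashlib.sha1(text.encode("ascii")).hexdigest()
        keys.add(digest[:15].upper())
    return keys


print(len(alternate_keys([(160.0, "D")])))               # unambiguous line
print(len(alternate_keys([(151.0, "D"), (49.0, "Q")])))  # two ambiguous lines
```

Combinations that lead to the same pattern collapse into one key, so the
number of keys is at most 2 to the power of the number of ambiguous lines.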
How many structures are available?
70,000,000 structures have been taken from the PUBCHEM-Compounds and
PUBCHEM-Substances files, corresponding to the downloads from November
and December 2007. This set also includes the structures from
Chemspider deposited at PUBCHEM.
How many C-NMR spectra have been calculated?
The number of calculated C-NMR spectra is somewhat smaller than the
number of structures contained in the PUBCHEM files. Inorganics,
polymers and compounds exceeding the limits of CSEARCH (e.g. more than
99 carbons, or more than 63 oxygens, etc.) have been skipped
automatically.
How many structure-spectrum pairs are available?
2,963,385,376 structure-spectrum pairs are available, roughly 3 billion.
What is the typical search time for 3 billion structure-spectrum pairs?
The typical search time for 3 billion structure-spectrum pairs is
below 2 seconds, in most cases below 1 second. Redundant information
(the same structure-spectrum pair) is removed automatically during the
search; this costs a few milliseconds of CPU time.
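Removing redundant pairs can be sketched as follows, assuming that the
hits for several alternate hash keys are merged and that each hit is
identified by a structure key and a spectral key. The function name and
the data layout are hypothetical and chosen for illustration; the
identifiers are placeholders, not real InChIKeys or SpectralKeys.

```python
def merge_hits(pages):
    """Merge hits from several result pages, dropping repeated pairs.

    Each page is a list of (structure_key, spectral_key) tuples. The first
    occurrence of a pair is kept and the original order is preserved.
    """
    seen = set()
    unique = []
    for page in pages:
        for pair in page:
            if pair not in seen:
                seen.add(pair)
                unique.append(pair)
    return unique


# Placeholder identifiers, not real InChIKeys or SpectralKeys.
page_a = [("STRUCTURE-1", "KEY-A"), ("STRUCTURE-2", "KEY-A")]
page_b = [("STRUCTURE-2", "KEY-A"), ("STRUCTURE-3", "KEY-B")]

print(merge_hits([page_a, page_b]))
```

A set lookup takes constant time on average, so the extra cost grows only
linearly with the number of hits.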
Which technology has been used for spectrum calculation?
All spectra have been calculated using the NN prediction engine of
the CSEARCH software, enhanced by the auto-stereo recognition feature
as implemented in the NMRPredict program.
What happens during a search?
Your peak list is 'translated' into a URL pointing to the
corresponding webpage on the 'NMRPredict' server. This webpage,
addressed by your hash code, lists all structures from the PUBCHEM
collection that have NMR spectra 'similar' to your peak list. These
structures might give some hint about your unknown molecule. Where
experimental data are available, the corresponding information from
the CSEARCH, SPECINFO and NMRShiftDB collections is linked directly to
the structures.
Is there a link to experimental NMR data available?
Yes. Where an experimental spectrum is available, it is linked via
the InChIKey. The collections indexed are CSEARCH, SPECINFO and
NMRShiftDB as well as the 'University of Mainz' collection.
How to perform a search?
Method 1:
- Put your peak list into a 'Hash-Code generator', which generates the
URL and downloads the corresponding page using the command 'curl'
- Keep in mind: not every possible spectral pattern occurs in the
PUBCHEM collection, so sometimes you will get a 'Page not found
(404)' error
- If you are interested in using this Structure Elucidation Engine,
feel free to contact me
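The download step of Method 1, including the 'Page not found (404)' case,
could look roughly like the Python sketch below. It uses the standard
library instead of 'curl'. Only the host name is taken from this page; the
path layout in page_url, the function names and the opener parameter are
hypothetical, and the real hash-code generator is not reproduced here.

```python
import urllib.error
import urllib.request

BASE_URL = "http://nmrpredict.orc.univie.ac.at"  # host named on this page


def page_url(hash_key):
    """Build a page URL from a hash key; the path layout is hypothetical."""
    return f"{BASE_URL}/{hash_key}.html"


def fetch_proposals(hash_key, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the page does not exist.

    The opener parameter exists only so the function can be tried out
    without network access.
    """
    try:
        with opener(page_url(hash_key)) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 404:
            # This spectral pattern does not occur in the collection.
            return None
        raise


# Example call (performs a real HTTP request, hence commented out):
# body = fetch_proposals("HYPOTHETICALKEY1")
# print("no proposals" if body is None else f"{len(body)} bytes received")
```

With several alternate hash keys for one spectrum, the function would be
called once per key, and keys that return None would simply be skipped.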
Method 2:
Please keep in mind:
- All comparisons are based on CALCULATED spectra
- No ranking is done, because the predicted shift values are not
available; the spectral information is encoded only in the hash key
- Take the structures you see as PROPOSALS, not as the FINAL SOLUTION
to your request
- At least the compound class can usually be derived
- Although the PUBCHEM files hold approx. 70,000,000 structures, this
amount represents only a small segment of existing and upcoming
organic chemistry
Page written by: Wolfgang.Robien(at)univie.ac.at on November 13th, 2007
Page online since: April 16th, 2008