CSEARCH performance versus ACD CNMR/Predictor performance:
A few weeks ago I published a webpage about a quality evaluation of
NMRShiftDB-data, where I mentioned an average deviation of 2.22ppm
before a first correction cycle and 2.19ppm afterwards. This
improvement of 0.03ppm corresponds to ca. 6,200ppm overall, because
more than 200,000 shifts were involved.
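The arithmetic behind this figure can be sketched as follows. The exact number of shifts is not given here (only "more than 200,000"), so the count below is back-calculated from the two quoted figures and is an inference, not a reported value.

```python
# Sketch: how a small change in the mean deviation adds up over a large dataset.
# The two mean deviations and the "ca. 6200ppm" total are quoted in the text;
# the exact shift count is NOT given there, so it is back-calculated here.

MEAN_BEFORE_PPM = 2.22     # average deviation before the first correction cycle
MEAN_AFTER_PPM = 2.19      # average deviation after the first correction cycle
QUOTED_TOTAL_PPM = 6200.0  # "ca. 6200ppm" overall


def total_improvement(mean_gain_ppm: float, n_shifts: int) -> float:
    """Summed improvement = improvement per shift * number of shifts."""
    return mean_gain_ppm * n_shifts


# Rounded to two decimals to suppress floating-point noise.
gain_per_shift = round(MEAN_BEFORE_PPM - MEAN_AFTER_PPM, 2)

# Shift count implied by the quoted total (an inference, not a reported figure).
implied_shifts = QUOTED_TOTAL_PPM / gain_per_shift

print(f"gain per shift: {gain_per_shift:.2f} ppm")
print(f"total for 200,000 shifts: {total_improvement(gain_per_shift, 200_000):.0f} ppm")
print(f"shift count implied by 6200 ppm: about {implied_shifts:.0f}")
```

With exactly 200,000 shifts the total would be 6,000ppm; the quoted ca. 6,200ppm fits a count slightly above that, in line with "more than 200,000".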
Now ACD has published a paper "NMR Prediction Accuracy Validation"
using nearly the same dataset (in the meantime the most prominent
errors have been removed by Christoph Steinbeck, therefore it
corresponds in principle to the data I had in hand after the first
correction cycle). They obtained an average deviation of 1.59ppm, which
is definitely much better than 2.22ppm (or 2.19ppm).
Let's go back to the starting point:
The starting point was a quality evaluation of NMRShiftDB-data; I was
not interested in the absolute value of the deviation between the
CSEARCH prediction and the experimental values given in NMRShiftDB, I
was simply interested in the question: "Is there a visible improvement
when you spend a few hours on data-correction, both on a statistical
basis and also with a few specific examples?"
My finding was that, by spending less than an afternoon, you can
improve a collection of some 20,000 NMR-spectra having more than
200,000 shift values by 0.03ppm or ca. 6,200ppm in total. That's an
amazing result ! NMRShiftDB has been online for about 5 years, the 20
most important contributors are mentioned on their homepage by name and
I think many other people have contributed to this project. All these
people together were unable to spend less than an afternoon on
data-correction within 5 years! That's exactly the point nobody seems
to (be willing to) understand ! This is the central point in ALL
open-source projects: you always need a strong personality to push
things - it doesn't matter if this project is called NMRShiftDB or
LINUX or whatever its name is.
Peter Murray-Rust's blog tells us:
...... It contains mechanisms for assessing data quality automatically.
For example software can be run that will indicate whether values are
seriously in error.........
That's really a great idea AFTER being 5 years on the web ! We need
HIGH-QUALITY data, agreed ! The production of high-quality data is
tightly connected with ongoing data checking (and correction) - it's
named GLP, but the problem is a tiny 'c' before the term "GLP" !
We always have to improve our algorithms because of that tiny 'c'
and we always have to correct known errors immediately for the same
reason - let me know which data-correction protocol has been applied to
the NMRShiftDB-data leading to deviations of about 130ppm (= 2/3 of the
usual C-NMR shift range) BEFORE you put them on the web ! Maybe
Christoph could say a few words to the community about the
data-checking protocols available in NMRShiftDB ?
Now it's really fun to see what happens afterwards:
ACD - a global player in the field of NMR spectrum prediction and
computer-assisted structure elucidation - now has access to a
"benchmark" value of a competitor and comes up with a better value for
its own prediction engine (1.59ppm versus 2.19/2.22ppm). That's
absolutely legal, informative and might push science to the advantage
of the community.
In order to make things clearer, I will describe the technology used
during my evaluation:
- My calculations leading to a value of 2.19/2.22ppm have been done
using my CSEARCH-environment - I stated that very clearly on my
webpage; the technology used was 'Neural Network Technology' based on a
network datafile, which was generated in 1996 from approx. 60,000
reference data.
- As is well-known within the scientific community I have close
cooperations with nearly all vendors of spectroscopic databases
(BIORAD, MODGRAPH, SPECINFO and many others) - therefore a total of
some 750,000 X-nuclei NMR-spectra is online in my internal collection.
In order to be honest with all of them I not only separate their data
very well, I also separate all secondary files in the same way. That's
exactly the reason why I have many (yes: many!) datafiles for the
neural network parameters available, although I always use the same
algorithm.
- Within the CSEARCH-environment I can do a 'solvent-specific'
prediction - e.g. for benzene I can produce a C13-prediction of
129.0ppm for CDCl3 (nearly all literature values of benzene in CDCl3
are between 128.0 and 129.0ppm) and I can produce with the SAME
ALGORITHM and the SAME NETWORK DATAFILE a value of 140.4ppm (YES one
hundred and forty !) just setting a switch to another solvent ! It's
for "Lewis acids/Magic Acids" (the known literature values from
CSEARCH-databases are between 135 and 145ppm, therefore 140.4ppm isn't
that bad !)
- From the arguments above it's really nice to have a "validation
set" named NMRShiftDB which has about 85% of the solvents 'UNREPORTED'
- it's clear that most of the data have been obtained in the usual
NMR-solvents and the effect isn't that dramatic.
- Another central point with neural network technology is that a
network can do interpolation very well, but extrapolation is forbidden
! Within CSEARCH there is a switch which enables/disables the feature
'Ignore Extrapolation' - for testing purposes I sometimes also accept
extrapolated values internally. Although they bias the absolute value
of the average deviation - because the network has never really learned
those functional groups - they have no influence on the difference !
- I can show that I can cover a total range of approximately 4ppm on
average for a single prediction when changing only the
solvent-selection and using the identical network algorithm and the
identical network datafiles - again tested on the NMRShiftDB-dataset.
A few more words about NMRShiftDB data-quality:
Some people feel well at 30 degrees, some don't. The question regarding
250, 300 or more errors falls into this category - I think the point is
again how well they are hidden within the data. Databases can be
automatically checked - if it is possible to find 250 or 300 obvious
errors automatically within minutes of CPU-time, they should be
corrected immediately. As stated on Ryan's blog, ACD has found 8%
misassignments during their data-extraction process - we should learn
from this horrible number (my opinion !) that we need highly validated
datasets and not 'centers of epidemics', which in turn have a (bad)
influence on the data quality of NMR-assignments in the upcoming
literature - or do we like 'applied error-propagation' ?
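An automatic check of this kind can be sketched in a few lines. The record layout, the 20ppm threshold and the sample values below are assumptions for illustration only; this is not the protocol used by CSEARCH, ACD or NMRShiftDB.

```python
from typing import List, NamedTuple


class Shift(NamedTuple):
    """One assigned carbon shift with its predicted counterpart (hypothetical layout)."""
    spectrum_id: str
    atom: int
    experimental_ppm: float
    predicted_ppm: float


def deviation(s: Shift) -> float:
    """Absolute deviation between experiment and prediction in ppm."""
    return abs(s.experimental_ppm - s.predicted_ppm)


def mean_abs_deviation(shifts: List[Shift]) -> float:
    """Average absolute deviation over all shifts."""
    return sum(deviation(s) for s in shifts) / len(shifts)


def flag_suspects(shifts: List[Shift], threshold_ppm: float = 20.0) -> List[Shift]:
    """Return shifts deviating by more than the threshold, worst first.

    The 20 ppm default is an arbitrary illustrative cutoff.
    """
    suspects = [s for s in shifts if deviation(s) > threshold_ppm]
    return sorted(suspects, key=deviation, reverse=True)


# Invented sample records; the last one mimics a gross error of about 130 ppm.
sample = [
    Shift("hypothetical-1", 1, 128.4, 128.9),
    Shift("hypothetical-1", 2, 21.3, 22.0),
    Shift("hypothetical-2", 1, 170.2, 40.1),
]

for s in flag_suspects(sample):
    print(f"{s.spectrum_id} atom {s.atom}: off by {deviation(s):.1f} ppm")
```

A flagged entry is only a candidate for manual inspection: a large deviation can also mean that the prediction is wrong, so the list would still need a human decision before any correction.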
Copied from Ryan's blog:
------begin of copy-----
Let me say, I am very confused by the positioning of this question
to Christoph Steinbeck:
"Why do you "reinvent" existing systems - there are a lot of
systems (with much better performance !) already around (a few in
alphabetical order: ACD, CSEARCH, KnowItAll, NMRPredict, SDBS,
SPECINFO)"
Why reinvent existing systems?
To improve! To provide better resources for NMR spectroscopists and
scientists around the world!
-----end of copy-------
OK, the scientific community needs improvements - agreed !
What is the improvement, when you wait for a molecular formula search
at NMRShiftDB over some 20,000 compounds for about 40 seconds ? How
long does it take with ACD-software ? I
think it will be milliseconds, as with CSEARCH and NMRPredict.
What is the improvement in the NMRShiftDB HOSE-code implementation
compared to ACD, compared to CSEARCH ?
Any new data-checking algorithm, which finds errors overlooked by ACD
and/or CSEARCH ?
It's free - agreed, that is an advantage for many of us in academia !
Being free of charge is not in itself scientific progress, but it might
support science.
OK - it's free, that's great ! Let's compare it against MESTREC - a
one-year subscription to "NMRPredict ONLINE FULL" costs Euro 155.-,
which allows you to access 425,000 NMR-spectra. Let's disregard the
software and the performance, just focus on the number of spectra.
425,000 spectra ........ 155 Euro
20,000 spectra ........ x Euro --------> x=7.29 Euro
That means: If you access "NMRPredict ONLINE FULL" you have to pay,
instead of 0.00 Euro, an amount of 7.29 Euro per year (this corresponds
to less than 2 packs of cigarettes !) with respect to the same number
of spectra !
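The rule of three used above can be written out as a short calculation. The function name is chosen here for illustration; the price and spectra counts are the ones quoted in the text.

```python
def prorated_price(full_price_eur: float, full_spectra: int, subset_spectra: int) -> float:
    """Price of a subset of spectra if the cost scales linearly with their number."""
    return full_price_eur * subset_spectra / full_spectra


# 425,000 spectra cost 155 Euro per year; what would 20,000 spectra cost?
x = prorated_price(155.0, 425_000, 20_000)
print(f"x = {x:.2f} Euro")  # prints "x = 7.29 Euro"
```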
Now let's focus on the resource - the data of NMRShiftDB - itself:
As stated previously there were obvious and easily detectable errors in
the first version I analyzed. There seems to be no validation
process during data input and nobody seems to have been responsible for
data corrections over the years, although there are nearly 1100
registered users and many contributors. After my report and some
external help the errors have been (at least) partly corrected. (BTW:
Christoph never asked me for a copy of my error reports)
It is a strong disadvantage that about 85% of the data have no solvent
given - how can you use them to validate an algorithm which can handle
solvent-specific predictions ? It would be fine to run the predictions
twice, once with 'solvent-specific prediction' disabled and afterwards
enabled. From these findings the algorithm could be improved.
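Such a two-pass comparison could look like the sketch below. The `predict` callable, the record layout and the toy predictor are hypothetical stand-ins (CSEARCH's actual interface is not described here), and the "experimental" numbers are invented placeholders, not measured values; only the 129.0 and 140.4ppm predictions for benzene are taken from the text above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, object]
Predictor = Callable[[str, Optional[str]], float]


def average_deviation(records: List[Record], predict: Predictor,
                      solvent_specific: bool) -> float:
    """Mean absolute deviation; the solvent is passed on only if the mode is enabled."""
    total = 0.0
    for r in records:
        solvent = r["solvent"] if solvent_specific else None
        total += abs(predict(r["structure"], solvent) - r["experimental_ppm"])
    return total / len(records)


def compare_modes(records: List[Record], predict: Predictor) -> Tuple[float, float]:
    """Return (deviation with the switch off, deviation with the switch on).

    Only records with a reported solvent can distinguish the two modes.
    """
    usable = [r for r in records if r["solvent"] is not None]
    if not usable:
        raise ValueError("no record has a reported solvent")
    return (average_deviation(usable, predict, False),
            average_deviation(usable, predict, True))


def toy_predict(structure: str, solvent: Optional[str]) -> float:
    """Toy predictor for benzene only, using the two values quoted in the text."""
    return 140.4 if solvent == "magic acid" else 129.0


# Invented records; the entry without a solvent is skipped by compare_modes.
records = [
    {"structure": "benzene", "solvent": "CDCl3", "experimental_ppm": 128.4},
    {"structure": "benzene", "solvent": "magic acid", "experimental_ppm": 141.0},
    {"structure": "benzene", "solvent": None, "experimental_ppm": 128.5},
]

off, on = compare_modes(records, toy_predict)
print(f"solvent-specific off: {off:.2f} ppm, on: {on:.2f} ppm")
```

The sketch also makes the limitation visible: every record with an unreported solvent drops out of the comparison, which is exactly the problem with a dataset where about 85% of the solvents are missing.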
The most necessary work (and the most cumbersome) when creating a
spectral database is the input of the data itself - this is
time-consuming and definitely not fun like algorithm development. Why
is it done twice (or even more often) ? It would be an advantage to the
scientific community if all the database suppliers didn't focus on the
same journals, which have the most data in them - the goal should be to
create the highest diversity. I definitely do not feel threatened by
the NMRShiftDB-project - I know it is simply stupid to extract the same
data from the same journals twice. Why was no effort made by
Christoph Steinbeck to cover journals which are not covered by the
people already working in this field when he started about 5 years ago
? Even if somebody doesn't feel comfortable working together, it should
be possible to get a list of journals currently under extraction. Under
these criteria NMRShiftDB would be a valuable resource and an excellent
dataset for testing, because there would be more unique structures
there. There is an overlap of 57% between ACD's collection and
NMRShiftDB based on the number of carbons as reported in the ACD-paper
- in other words 57% of the data have been entered twice. No other
challenging things around ?
At the end:
Congratulations to ACD - they are at 1.59ppm:
1.59ppm average deviation for more than 200,000 predictions based on
the 20,000 NMRShiftDB-structures using their ACD/CNMR Predictor
Software is really great !
Congratulations to MODGRAPH - they are at 1.40ppm:
Using the same NMRShiftDB-dataset, MODGRAPH's CSEARCH-derived
NMRPredict-program performs already at 1.40ppm using their
"BEST"-technology !
Page written on May 31st, 2007 by Wolfgang.Robien(at)University.of.Vienna(univie.ac.at)
Page online since: May 31st, 2007
Last modification: June 4th, 2007