Antony Williams stated here:
..........
What was interesting about Robien’s post was the fact that it focused
on the application of Neural Networks to prediction........
According to my list of publications, our neural network approach was
first presented to the scientific community at the Gemen conference on
Computers in Chemistry in 1996. It was the diploma work of my former
coworker V. Purtuc here at Vienna University. We have used this
approach for more than 10 years, and it has proven its capabilities in
many real-world structure verification problems during our daily work.
For us it's really an 'old hat' ;-) - one procedure for checking our
data among many other features within CSEARCH.
Copied from Tony's blog:
Begin of sequence---------------------------------------------------------------
.............
After analyzing the data, over
200,000 individual chemical shifts I can say DON’T judge by the lowest
common denominator. There is some junk on there..as seen by Wolgang
Robien. But our estimates after our analysis is likely less than 250
data points in error ............. The addition or improvement of
rigorous checking algorithms to the NMRSHIFTDB is the next natural step
and flagging data to the submitter will have them check and validate
the quality of their input. This will catch many errors during the
submission process...........
End of copied sequence----------------------------------------------------------
Within this paragraph there are 2 central messages:
Message #1:
===========
OK, less than 250 data points are in error:
Let's check it - but we have a simple problem here: There are more than
200,000 shift values - is a 'data point' therefore a single CHEMICAL
SHIFT VALUE, or is a 'data point' a STRUCTURE? Let's leave this
question open; we do our job first and then we go into details.
Message #2:
===========
The second part is the verification of my central message: There are
not even simple checks during data input - keep in mind: 'NMRSHIFTDB'
is an approach based on a reference collection of (mainly) C13-NMR
spectra which are used for prediction AND VERIFICATION of unknown
structures. If we do a bad job here, we have another wrong example
which improves all statistical parameters - in other words: applied
error propagation! For the NMR specialists: We should be able to
learn from the distribution of shift values in the ipso- and
para-position of a tosyl group. In ca. 50% of the cases the assignment
is correct - the other 50% are wrong, depending on the cited literature.
(Sorry to say so: When you peek at your neighbor's work during an
examination, you should first have some rough idea of your neighbor's
capabilities!)
Before we start checking, a few remarks about 'latest additions':
The data of the 4 isomers of decalin-2-ol in NMRSHIFTDB were
'summarized' as one single isomer having 4 sets of chemical shift
values - I mentioned this software bug in one of my emails to Christoph
Steinbeck. Now we find 3 of these isomers on the homepage of NMRShiftDB
as 'LATEST ADDITIONS' (screendump done on May 8th, 2007).
In my personal opinion this is just the correction of a severe software
bug - it seems that we have different opinions about ethics in science.
Now let's apply a more complete CSEARCH-based procedure for data
checking, as can be seen on the previously shown pages:
Screendumps are from CSEARCH; the CSEARCH database was created on
March 10th, 2007. Within 'nmrshiftdb' it is possible that an error
shown here has already been corrected after Christoph Steinbeck was
informed about this investigation.
Let's start easy: It is well known that the 2 methyl groups within an
isoprene unit give signals around 17 and 25 ppm; a standard example for
this fact is geraniol. The cis-Me is around 17 ppm, the trans-Me around
25 ppm. By applying a basic formula from vector algebra this assignment
can be checked automatically - as done in CSEARCH since its early days
in the 1980s. CSEARCH tells us that this single type of error occurs in
57 structures, in rare cases even twice in the same structure, because
it contains more than one partial structure of this particular type.
In the picture above one wrong and one correct assignment can be seen
.....
A quite interesting variation of this topic can be seen in the next
picture:
Now we have 57 structures found by one simple check and (at least) 104
misassigned shift values ....
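The isoprene-unit check described above can be sketched in a few lines.
This is only an illustration, assuming that 2D drawing coordinates of
the atoms are available: which formula CSEARCH actually uses is not
stated here, so the 2D cross product below is just one plausible
choice, and all function names, coordinates and shift values are
hypothetical.

```python
def cross_z(origin, p, q):
    """z-component of (p - origin) x (q - origin) for 2D points."""
    return ((p[0] - origin[0]) * (q[1] - origin[1])
            - (p[1] - origin[1]) * (q[0] - origin[0]))


def is_cis(c_a, c_b, methyl, chain):
    """True if the methyl (on c_a) and the chain atom (on c_b) lie on
    the same side of the C=C axis running through c_a and c_b."""
    return cross_z(c_a, c_b, methyl) * cross_z(c_a, c_b, chain) > 0


def methyls_swapped(c_a, c_b, chain, me1, me2):
    """me1 and me2 are (xy, shift_ppm) pairs for the two methyl groups
    on c_a. Returns True if the cis methyl carries the larger shift,
    i.e. the assignment looks exchanged (cis expected near 17 ppm,
    trans near 25 ppm). If me1 is not cis, me2 is assumed to be."""
    (xy1, shift1), (xy2, shift2) = me1, me2
    if is_cis(c_a, c_b, xy1, chain):
        cis_shift, trans_shift = shift1, shift2
    else:
        cis_shift, trans_shift = shift2, shift1
    return cis_shift > trans_shift


# Hypothetical drawing: C=C along the x-axis, chain pointing "up".
C_A, C_B, CHAIN = (0.0, 0.0), (1.0, 0.0), (1.5, 0.87)
ME_UP, ME_DOWN = (-0.5, 0.87), (-0.5, -0.87)

correct = methyls_swapped(C_A, C_B, CHAIN, (ME_UP, 17.0), (ME_DOWN, 25.0))
exchanged = methyls_swapped(C_A, C_B, CHAIN, (ME_UP, 25.0), (ME_DOWN, 17.0))
```

With these made-up coordinates the first call reports a consistent
assignment and the second one an exchanged pair. A real check would
also have to find the partial structure first, which is left out here.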
Now let's do a spectrum prediction using these data - the selected
structure is quite simple. From the screendump below it can be seen
that the predicted values are quite reasonable, but the ranges are
fairly large, because even in trivial structures there are a lot of
assignment errors. (The prediction uses HOSE-code technology with
stereochemistry; spheres are used over 3 bonds.)
When analyzing the predictions having unusually large ranges, the
following distributions were obtained:
These 6 extremely large distributions are caused by 105 assignment
errors within the data.
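The idea of looking for unusually large ranges can be sketched as
follows. This is not the CSEARCH procedure, only a minimal
illustration: the environment keys stand in for HOSE codes, and the
function name, the 5 ppm threshold and the sample values are all
assumptions.

```python
from collections import defaultdict


def wide_environments(records, max_range_ppm=5.0):
    """Group shift values by environment key and return the groups
    whose spread (max - min) exceeds max_range_ppm.

    records: iterable of (environment_key, shift_ppm) pairs.
    Returns {environment_key: (min_shift, max_shift, count)}.
    """
    by_env = defaultdict(list)
    for env, shift in records:
        by_env[env].append(shift)

    flagged = {}
    for env, shifts in by_env.items():
        if max(shifts) - min(shifts) > max_range_ppm:
            flagged[env] = (min(shifts), max(shifts), len(shifts))
    return flagged


# Made-up data: one environment contains a value that looks exchanged.
sample = [
    ("cis-Me", 17.1), ("cis-Me", 17.6), ("cis-Me", 25.5),
    ("aromatic-CH", 128.0), ("aromatic-CH", 128.4),
]
suspicious = wide_environments(sample)
```

Every flagged group still needs visual inspection, because a large
range can also have a legitimate chemical reason.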
When doing this job in a more systematic way, not using specific
examples as given here, the total number of incorrect assignments
significantly exceeds the above-mentioned limit of 250. The
intermediate number is around 300 at the moment, but ca. 1,000 pages of
printouts are waiting for visual inspection. From my experience with
C13-NMR databases a reasonable estimate for the number of misassigned
shift values will be between 500 and 600, which corresponds to
ca. 0.25-0.3%. After correcting these errors the data will reach the
usual quality as within CSEARCH and NMRPredict.
I definitely do not claim that collections like CSEARCH, NMRPredict
and SPECINFO are free of errors - the desired level of errors is always
0.0%, a value which can't be reached - the acceptable limit is clearly
below 0.1%; maybe 0.05% is a good compromise between dream and reality.
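The percentages used here follow from simple arithmetic, sketched below.
The total of 200,000 shift values is taken as a round number (the text
says "more than 200,000", so the rates are slight overestimates); the
function names are only for illustration.

```python
TOTAL_SHIFTS = 200_000  # round number; the text says "more than 200,000"


def error_rate_percent(n_errors, n_total=TOTAL_SHIFTS):
    """Error rate in percent for n_errors wrong values out of n_total."""
    return 100.0 * n_errors / n_total


def max_errors(limit_percent, n_total=TOTAL_SHIFTS):
    """Number of wrong values that corresponds to a given limit."""
    return limit_percent * n_total / 100.0


rate_claimed = error_rate_percent(250)   # the "less than 250" claim
rate_low = error_rate_percent(500)       # lower estimate
rate_high = error_rate_percent(600)      # upper estimate
allowed_at_01 = max_errors(0.1)          # acceptable limit of 0.1%
allowed_at_005 = max_errors(0.05)        # compromise of 0.05%
```

On this basis 250 errors already correspond to 0.125%, the estimate of
500 to 600 errors gives 0.25-0.3%, and the limits of 0.1% and 0.05%
would allow about 200 and 100 wrong shift values respectively.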
If we go back to the question about the definition of a data point:
Based on shift values an error rate of ca. 0.3% is to be expected; based
on structures it would be 1.25%. Therefore let's assume a data point
corresponds to a single chemical shift, which also corresponds to my
understanding of Tony Williams' statement.
When recalling e.g. methyl 4-ethynylbenzoate the following display is
obtained (screendump done on May 8th, 2007):
Now we find MEASURED SHIFTS (according to the header) exact to
0.0001 ppm - scrolling down reveals that these shifts have been
calculated using the SPARTAN package:
The example above shows the 'conversion' of experimental values to
calculated ones - with protons the 'conversion' goes the other way
round.
Not a severe error - but very confusing to an occasional user. When
this error has been corrected, maybe a message will appear on the
homepage that a new feature is available ......... ;-)
Another nice feature (or bug) of NMRShiftDB can easily be detected
when downloading the SD-file named "NmrshiftdbWithSignals.sdf.zip":
A quite simple test: Run the following command on it:
unzip NmrshiftdbWithSignals.sdf.zip ; grep -i 'spartan' nmrshiftdb.sdf | wc -l
From this simple command you learn that 168 lines contain the word
'Spartan'. Now perform a search on nmrshiftdb.org
(expert search --> search by condition: activate the 'Spartan' box)
as shown below. You find 171 entries - no problem at all; today is
already May 9th, 2007 - the download was done on March 10th, 2007
- 3 calculated spectra have obviously been added in the meantime.
Now let's have a more detailed look at the result - we inspect the
first 25 lines containing the string 'Spartan':
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
A line like '0:Spartan 1:Spartan 2:Spartan 3:Spartan' tells us that
there are 4 spectra available for this particular structure - and all
of them were calculated using the 'Spartan' package ...... therefore
we have more than 168 calculated spectra. Running a set of "grep"
commands ( grep -i '0:spartan' nmrshiftdb.sdf | wc -l ;
grep -i '1:spartan' nmrshiftdb.sdf | wc -l ; ..... ), we identify a
total of 489 entries with shift values calculated using the
Spartan package - but the search on the website finds only 171.
The explanation is quite easy: The download file, which is available to
the scientific community, doesn't match the actual database - a severe
software error occurs when the database is exported into an
SD-file ......
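The two counts compared above (lines mentioning 'Spartan' versus
individual 'N:Spartan' entries) can also be obtained in one pass. The
following is a minimal sketch, assuming the entries appear as
whitespace-separated tokens like '0:Spartan' as in the lines listed
above; the function name is hypothetical.

```python
import re

# One token per calculated spectrum, e.g. "0:Spartan", "3:spartan".
TOKEN = re.compile(r"\b\d+:spartan\b", re.IGNORECASE)


def count_spartan(lines):
    """Return (lines_with_spartan, spartan_tokens).

    lines_with_spartan mirrors `grep -i 'spartan' | wc -l`;
    spartan_tokens is the sum over all 'N:Spartan' tokens.
    """
    n_lines = 0
    n_tokens = 0
    for line in lines:
        if "spartan" in line.lower():
            n_lines += 1
        n_tokens += len(TOKEN.findall(line))
    return n_lines, n_tokens


# Small made-up sample in the style of the lines listed above.
sample_lines = [
    "0:Spartan 1:Spartan 2:Spartan",
    "0:Spartan 1:Spartan",
    "0:Unreported",
    "0:Spartan 1:Spartan 2:Spartan 3:Spartan",
]
sample_counts = count_spartan(sample_lines)
```

For the real file the function would be called on the opened
nmrshiftdb.sdf, e.g. count_spartan(open("nmrshiftdb.sdf",
errors="replace")); the first number should then match the grep result
and the second the summed total.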
My 3 basic questions to Christoph Steinbeck are:
1) Why do you "reinvent" existing systems - there are a lot of systems
(with much better performance!) already around (a few in
alphabetical order: ACD, CSEARCH, KnowItAll, NMRPredict, SDBS, SPECINFO)?
2) What is the progress of science in this particular project
justifying financial support from the German taxpayers via DFG?
3) I have sent this information about the data quality AND the
software problems to you and to the members of the Scientific Advisory
Board - therefore these problems have been known since the end of March
2007. When an error in a consumer product is detected, usually a
'recall' starts - in your case distribution continues and seems to be
intensified.
At the end of this page I want to cite the words of Tony Williams used
in his blog:
I give a thumbs up to the quality of the NMRSHIFTDB. We’ve
validated it ......
So, my compliments to Christoph and the team. The quality is excellent
and there are “large errors” but minimum in number..........
Nothing to add!
Wolfgang Robien / May 8+9th, 2007