An Evaluation of NMRSHIFTDB by Antony Williams
from Chemspider and a few further remarks by
Wolfgang Robien (May 8+9th, 2007)


Antony Williams stated here:


.......... What was interesting about Robien’s post was the fact that it focused on the application of Neural Networks to prediction........


According to my list of publications, our neural-network approach was first presented to the scientific community at the Gemen conference on Computers in Chemistry in 1996. It was the diploma thesis of my former coworker V. Purtuc here at Vienna University. We have used this approach for more than 10 years, and it has proven its capabilities in many real-world structure-verification problems during our daily work. For us it is really 'old hat' ;-) - one procedure among many other features within CSEARCH for checking our data.


Copied from Tony's blog:

Begin of sequence---------------------------------------------------------------

............. After analyzing the data, over 200,000 individual chemical shifts I can say DON’T judge by the lowest common denominator. There is some junk on there..as seen by Wolgang Robien. But our estimates after our analysis is likely less than 250 data points in error ............. The addition or improvement of rigorous checking algorithms to the NMRSHIFTDB is the next natural step and flagging data to the submitter will have them check and validate the quality of their input. This will catch many errors during the submission process...........

End of copied sequence----------------------------------------------------------


Within this paragraph there are 2 central messages:


Message #1:
===========


OK, less than 250 data points are in error:

Let's check it. We immediately have a problem: there are more than 200,000 shift values, so is a 'data point' a single CHEMICAL SHIFT VALUE or a STRUCTURE? Let's leave this question open; we do our job first and then go into details.


Message #2:
===========


The second part verifies my central message: there are not even simple checks during data input. Keep in mind that 'NMRSHIFTDB' is an approach based on a reference collection of (mainly) C13-NMR spectra which are used for prediction AND VERIFICATION of unknown structures. If we do a bad job here, we get another wrong example which nevertheless improves all statistical parameters - in other words: applied error propagation! For the NMR specialists: we should be able to learn from the distribution of shift values in the ipso- and para-position of a tosyl group. In ca. 50% of the cases the assignment is correct - the other 50% are wrong, depending on the cited literature. (Sorry to say so: when you copy from your neighbor's work during an examination, you should first have some rough feeling for your neighbor's capabilities!)


Before we start checking, a few remarks about 'latest additions':


The data of the 4 isomers of decalin-2-ol in NMRSHIFTDB were 'summarized' as one single isomer having 4 sets of chemical shift values - I mentioned this software bug in one of my emails to Christoph Steinbeck. Now we find 3 of these isomers on the homepage of NMRShiftDB as 'LATEST ADDITIONS' (screendump done on May 8th, 2007)


[Screendump: Ethics in science]



In my personal opinion this is just the correction of a severe software bug, not a 'latest addition' - it seems that we have different opinions about ethics in science.




Now let's apply a more complete CSEARCH-based procedure for data checking than the one seen on the previously shown pages:


The screendumps are from CSEARCH; the CSEARCH database was created on March 10th, 2007. It is possible that an error shown here has already been corrected within 'nmrshiftdb' since Christoph Steinbeck was informed about this investigation.


Let's start easy: it is well known that the 2 methyl groups within an isoprene unit give signals around 17 and 25 ppm; the standard example is geraniol. The cis-Me is around 17 ppm, the trans-Me around 25 ppm. By applying a basic formula from vector algebra, this assignment can be checked automatically - as CSEARCH has done since its early days in the 1980s. CSEARCH tells us that this single type of error occurs in 57 structures, in rare cases twice in the same structure because it contains more than one partial structure of this particular type.
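The kind of check described here can be sketched in a few lines of shell and awk. This is only an illustration and not the actual CSEARCH implementation: the function names, the input layout and the sample coordinates and shifts are made up for this sketch. It assumes 2D coordinates for the double bond A=B, a methyl carbon M on A and the chain substituent R on B; the sign of the 2D cross product tells on which side of the double bond an atom lies.

```shell
# methyl_side AX AY BX BY MX MY RX RY   (hypothetical helper)
# Double bond A=B in 2D; M is a methyl carbon on A, R the chain substituent on B.
# Prints "cis" if M and R lie on the same side of the A=B axis,
# "trans" if they lie on opposite sides, "undefined" if one of them is collinear.
methyl_side() {
  LC_ALL=C awk -v ax="$1" -v ay="$2" -v bx="$3" -v by="$4" \
               -v mx="$5" -v my="$6" -v rx="$7" -v ry="$8" 'BEGIN {
    dx = bx - ax; dy = by - ay                # bond vector from A to B
    sm = dx * (my - ay) - dy * (mx - ax)      # z-component of (B-A) x (M-A)
    sr = dx * (ry - by) - dy * (rx - bx)      # z-component of (B-A) x (R-B)
    if (sm * sr > 0)      { print "cis" }
    else if (sm * sr < 0) { print "trans" }
    else                  { print "undefined" }
  }'
}

# flag_swapped_methyls   (hypothetical helper)
# Reads "ID CIS_PPM TRANS_PPM" lines from stdin and prints every ID whose
# cis-methyl carries the larger shift, i.e. a probable swapped assignment.
flag_swapped_methyls() {
  LC_ALL=C awk 'NF >= 3 && ($2 + 0) > ($3 + 0) { print $1 }'
}

# Made-up demonstration data, not taken from any database:
methyl_side 0 0 1 0 -0.5 0.87 1.5 0.87     # methyl on the same side as the chain
methyl_side 0 0 1 0 -0.5 -0.87 1.5 0.87    # methyl on the opposite side
printf '%s\n' 'entry1 17 25' 'entry2 25 17' | flag_swapped_methyls
```

The geometric step only decides which methyl is cis and which is trans; the comparison of the two shift values then needs nothing more than the 17 ppm / 25 ppm rule stated above.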


[Picture: Geraniol-type error]

In the picture above one wrong and one correct assignment can be seen .....

A quite interesting variation of this topic can be seen in the next picture:


[Picture: Variation]


Now we have 57 structures found by one simple check and (at least) 104 misassigned shift values ....


Now let's do a spectrum prediction using these data - the selected structure is quite simple. The screendump below shows that the predicted values are quite reasonable, but the ranges are fairly large, because even in trivial structures there are a lot of assignment errors. (The prediction uses HOSE-code technology with stereochemistry; spheres extend over 3 bonds.)


[Screendump: prediction]


Analyzing the predictions with unusually large ranges gave the following distributions:


[Six pictures: Distribution]

These 6 extremely large distributions are caused by 105 assignment errors within the data.
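Unusually wide distributions of this kind can be located mechanically. The following is a minimal sketch, assuming the reference shifts have been exported as lines of the form "environment-code shift"; the export format, the function name wide_ranges, the 5 ppm limit and the sample values are all assumptions for illustration and do not describe the actual CSEARCH procedure.

```shell
# wide_ranges LIMIT   (hypothetical helper)
# Reads "CODE SHIFT_PPM" lines from stdin, where CODE stands for one
# structural environment (e.g. a HOSE code), and prints
# "CODE MIN MAX" for every environment whose shift range exceeds LIMIT ppm.
wide_ranges() {
  LC_ALL=C awk -v limit="$1" '
    NF >= 2 {
      v = $2 + 0
      if (!($1 in lo) || v < lo[$1]) lo[$1] = v   # running minimum per code
      if (!($1 in hi) || v > hi[$1]) hi[$1] = v   # running maximum per code
    }
    END {
      for (k in lo)
        if (hi[k] - lo[k] > limit)
          printf "%s %.1f %.1f\n", k, lo[k], hi[k]
    }' | sort                                     # stable, readable output order
}

# Made-up demonstration data: one environment contains a swapped value.
printf '%s\n' 'Me_cis 17.6' 'Me_cis 17.8' 'Me_cis 25.7' \
              'Me_trans 25.6' 'Me_trans 25.8' | wide_ranges 5
```

Every environment reported this way still needs visual inspection: a wide range is only a hint that assignments may be wrong, not a proof.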


When this job is done in a more systematic way, not restricted to the specific examples given here, the total number of incorrect assignments significantly exceeds the above-mentioned limit of 250. The intermediate count is around 300 at the moment, but ca. 1,000 pages of printouts are still waiting for visual inspection. From my experience with C13-NMR databases, a reasonable estimate for the number of misassigned shift values is between 500 and 600, which corresponds to ca. 0.25-0.3%. After these corrections the data will reach the usual quality found within CSEARCH and NMRPredict.
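The percentages quoted above follow from simple division. A minimal sketch, assuming a round total of 200,000 shift values (the blog text only says "over 200,000") and a hypothetical helper named error_rate:

```shell
# error_rate ERRORS TOTAL   (hypothetical helper)
# Prints the error rate in percent with three decimals.
error_rate() {
  LC_ALL=C awk -v e="$1" -v t="$2" 'BEGIN { printf "%.3f\n", 100 * e / t }'
}

# 200000 is a rounded total (assumption); 250 is the limit claimed in the
# blog, 500 and 600 are the bounds of the estimate given in the text.
for n in 250 500 600; do
  printf '%s misassigned values: %s %%\n' "$n" "$(error_rate "$n" 200000)"
done
```

With these inputs, 250 errors correspond to 0.125%, while the estimate of 500 to 600 errors corresponds to 0.250% to 0.300%.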

I definitely do not claim that collections like CSEARCH, NMRPredict and SPECINFO are free of errors. The desired level of errors is always 0.0%, a value which can't be reached; the acceptable limit is clearly below 0.1%, and maybe 0.05% is a good compromise between dream and reality.

If we go back to the question about the definition of a data point: based on shift values an error rate of ca. 0.3% is to be expected; based on structures it would be 1.25%. Therefore let's assume that a data point corresponds to a single chemical shift, which also corresponds to my understanding of Tony Williams' statement.


Retrieving e.g. methyl 4-ethynylbenzoate gives the following display (screendump done on May 8th, 2007):


[Screendump: Software-Error]

Now we find MEASURED SHIFTS (according to the header) given to 0.0001 ppm - scrolling down reveals that these shifts were calculated using the SPARTAN package:

[Screendump: Software-Error]

The example above shows the 'conversion' of experimental values to calculated ones - with protons the 'conversion' goes the other way round.

[Screendump: Software-Error]


Not a severe error - but very confusing to an occasional user. When this error has been corrected, maybe a message will appear on the homepage that a new feature is available ......... ;-)


Another nice feature (or bug) of NMRshiftdb can be easily detected, when downloading the SD-file named "NmrshiftdbWithSignals.sdf.zip":


A quite simple test:

Run the following command on it:
unzip NmrshiftdbWithSignals.sdf.zip ; grep -i 'spartan' nmrshiftdb.sdf | wc -l

This simple command tells you that 168 lines contain the word 'Spartan'. Now perform a search on nmrshiftdb.org (expert search --> search by condition: activate the 'Spartan' box) as shown below. You find 171 entries - no problem at all: today is already May 9th, 2007, the download was done on March 10th, 2007, so 3 calculated spectra have obviously been added in the meantime.

[Screendump: Spartan]


Now let's have a more detailed look at the result - we inspect the first 25 lines containing the string 'Spartan':


0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan
0:Spartan 1:Spartan 2:Spartan 3:Spartan

A line like '0:Spartan 1:Spartan 2:Spartan 3:Spartan' tells us that there are 4 spectra available for this particular structure, and all of them were calculated with the 'Spartan' package ...... therefore we have more than 168 calculated spectra. Running a set of "grep" commands ( grep -i '0:spartan' nmrshiftdb.sdf | wc -l ; grep -i '1:spartan' nmrshiftdb.sdf | wc -l ; ..... ) gives a total of 489 entries with shift values calculated by the Spartan package in the download file - but the search on the website finds only 171.
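The difference between counting lines and counting spectra can be made explicit with two small helpers. This is a sketch only: the function names are invented here, and it assumes that every calculated spectrum appears as one "N:Spartan" token, as in the lines listed above. It relies on the -o option of grep, which is available in GNU and BSD grep but is not part of POSIX.

```shell
# count_spartan_lines FILE     (hypothetical helper)
# Number of lines mentioning Spartan; equivalent to grep -i ... | wc -l.
count_spartan_lines() {
  grep -ic 'spartan' "$1"
}

# count_spartan_spectra FILE   (hypothetical helper)
# Number of individual "N:Spartan" tokens, so a line with four tokens
# contributes four spectra. Requires grep -o (GNU/BSD extension).
count_spartan_spectra() {
  grep -ioE '[0-9]+:spartan' "$1" | wc -l | tr -d ' '
}

# Demonstration on three lines copied from the listing above:
sample=$(mktemp)
printf '%s\n' '0:Spartan 1:Spartan 2:Spartan' \
              '0:Spartan 1:Spartan 2:Spartan 3:Spartan' \
              '0:Spartan 1:Spartan' > "$sample"
count_spartan_lines "$sample"      # lines containing the word
count_spartan_spectra "$sample"    # individual spectra
rm -f "$sample"
```

On the real download the same two calls would be count_spartan_lines nmrshiftdb.sdf and count_spartan_spectra nmrshiftdb.sdf; the second should agree with the sum of the individual grep counts as long as each index occurs at most once per line.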

The explanation is quite easy: the download file which is available to the scientific community does not match the actual database - a severe software error occurs when the database is exported into an SD-file ......




My 3 basic questions to Christoph Steinbeck are:


1) Why do you "reinvent" existing systems? There are already a lot of systems around (with much better performance !) - a few in alphabetical order: ACD, CSEARCH, KnowItAll, NMRPredict, SDBS, SPECINFO.

2) What progress of science in this particular project justifies financial support from the German taxpayers via DFG ?

3) I have sent this information about the data quality AND the software problems to you and to the members of the Scientific Advisory Board - therefore these problems have been known since the end of March 2007. When an error in a consumer product is detected, usually a recall starts - in your case distribution continues and even seems to be intensified.






At the end of this page I want to cite the words of Tony Williams used in his blog:


I give a thumbs up to the quality of the NMRSHIFTDB. We’ve validated it ...... So, my compliments to Christoph and the team. The quality is excellent and there are “large errors” but minimum in number..........


Nothing to add !    Wolfgang Robien / May 8+9th, 2007

Thumbs up !