CSEARCH Performance versus ACD/CNMR Predictor Performance:


A few weeks ago I published a webpage about a quality evaluation of NMRShiftDB-data, where I mentioned an average deviation of 2.22ppm before a first correction cycle and 2.19ppm afterwards. This improvement of 0.03ppm corresponds to ca. 6,200ppm overall, because more than 200,000 shifts were involved.
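The arithmetic behind these two numbers can be sketched in a few lines of Python. The function name is hypothetical, and the count of exactly 200,000 shifts is a rounded assumption (the text only says "more than 200,000"). Because the averages are rounded to two decimals, the implied number of shifts is approximate as well. This is an illustration, not part of the CSEARCH evaluation itself.

```python
def total_improvement(mean_before_ppm, mean_after_ppm, n_shifts):
    """Summed improvement (ppm) implied by a change of the average deviation."""
    return (mean_before_ppm - mean_after_ppm) * n_shifts


# Rounded assumption: exactly 200,000 shifts (the text says "more than 200,000").
lower_bound = total_improvement(2.22, 2.19, 200_000)  # about 6,000 ppm

# Number of shifts that the quoted total of ca. 6,200 ppm would imply.
implied_shifts = 6200 / (2.22 - 2.19)  # about 206,700 shifts

print(round(lower_bound), round(implied_shifts))
```

So the quoted total of ca. 6,200ppm is consistent with a little under 207,000 shifts.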

Now ACD has published a paper "NMR Prediction Accuracy Validation" using nearly the same dataset (in the meantime the most prominent errors have been removed by Christoph Steinbeck, so it corresponds in principle to the data I had at hand after the first correction cycle). They obtained an average deviation of 1.59ppm, which is definitely much better than 2.22ppm (or 2.19ppm).

Let's go back to the starting point:

The starting point was a quality evaluation of NMRShiftDB-data; I was not interested in the absolute value of the deviation between the CSEARCH prediction and the experimental values given in NMRShiftDB, I was simply interested in the question: "Is there a visible improvement when you spend a few hours on data-correction, both on a statistical basis and also with a few specific examples?"

My finding was that by spending less than an afternoon you can improve a collection of some 20,000 NMR-spectra containing more than 200,000 shift values by 0.03ppm on average, or ca. 6,200ppm in total. That's an amazing result ! NMRShiftDB has been online for about 5 years, the 20 most important contributors are mentioned by name on its homepage, and I think many other people have contributed to this project. All these people together were unable to spend less than an afternoon on data-correction within 5 years ! That's exactly the point nobody seems to (be willing to) understand ! This is the central point in ALL open-source projects: you always need a strong personality to push things, no matter whether the project is called NMRShiftDB or LINUX or anything else.

Peter Murray-Rust's blog tells us:

...... It contains mechanisms for assessing data quality automatically. For example software can be run that will indicate whether values are seriously in error.........

That's really a great idea AFTER being 5 years on the web ! We need HIGH-QUALITY data, agreed ! The production of high-quality data is tightly connected with ongoing data checking (and correction) - it's named GLP, but the problem is a tiny 'c' before the term "GLP" !
We always have to improve our algorithms because of that tiny 'c', and for the same reason we always have to correct known errors immediately - let me know which data-correction protocol was applied to the NMRShiftDB-data, leading to deviations of about 130ppm (= 2/3 of the usual C-NMR shift range), BEFORE you put them on the web ! Maybe Christoph could say a few words to the community about the data-checking protocols available in NMRShiftDB ?


Now it's really fun to see what happens afterwards:

ACD - a global player in the field of NMR spectrum prediction and computer-assisted structure elucidation - now has access to a "benchmark" value of a competitor and comes up with a better value for its own prediction engine (1.59ppm versus 2.19/2.22ppm). That's absolutely legal, informative and might push science to the advantage of the community.

In order to make things clearer, I will describe the technology used during my evaluation:
A few more words about NMRShiftDB data-quality:

Some people feel comfortable at 30 degrees, some don't. The question regarding 250, 300 or more errors falls into this category - I think the point is, again, how well they are hidden within the data. Databases can be checked automatically - if it is possible to find 250 or 300 obvious errors automatically within minutes of CPU-time, they should be corrected immediately. As stated on Ryan's blog, ACD found 8% misassignments during their data-extraction process - we should learn from this horrible number (my opinion !) that we need highly validated datasets and not 'centers of epidemics', which in turn have a (bad) influence on the data quality of NMR-assignments in the upcoming literature - or do we like 'applied error-propagation' ?
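One way such an automatic check could look is sketched below in Python. This is a minimal illustration, not the protocol used by CSEARCH, ACD or NMRShiftDB: the record layout, the function name and the 20ppm threshold are assumptions chosen for the example, and the three records are invented.

```python
def flag_suspicious(records, threshold_ppm=20.0):
    """Return records whose |experimental - predicted| exceeds the threshold,
    sorted so that the largest deviations come first."""
    flagged = []
    for rec in records:
        deviation = abs(rec["experimental_ppm"] - rec["predicted_ppm"])
        if deviation > threshold_ppm:
            # Copy the record and attach the deviation for later inspection.
            flagged.append({**rec, "deviation_ppm": deviation})
    flagged.sort(key=lambda r: r["deviation_ppm"], reverse=True)
    return flagged


# Invented example data (hypothetical identifiers and shift values).
example = [
    {"id": "A-C1", "experimental_ppm": 128.5, "predicted_ppm": 127.9},
    {"id": "B-C4", "experimental_ppm": 14.1, "predicted_ppm": 144.0},
    {"id": "C-C2", "experimental_ppm": 77.0, "predicted_ppm": 52.5},
]

suspects = flag_suspicious(example)
for s in suspects:
    print(s["id"], round(s["deviation_ppm"], 1))
```

A flagged shift is only a candidate for manual inspection; a large deviation can come from a wrong assignment, a wrong structure or a weak prediction.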


Copied from Ryan's blog:

------begin of copy-----

Let me say, I am very confused by the positioning of this question to Christoph Steinbeck:

"Why do you "reinvent" existing systems - there are a lot of systems (with much better performance !) already around  (a few in alphabetical order: ACD, CSEARCH, KnowItAll, NMRPredict, SDBS, SPECINFO)"

Why reinvent existing systems? To improve! To provide better resources for NMR spectroscopists and scientists around the world!

-----end of copy-------

OK, the scientific community needs improvements - agreed !

What is the improvement when a molecular formula search over some 20,000 compounds at NMRShiftDB takes about 40 seconds ? How long does it take with ACD-software ? I think it will be milliseconds, as with CSEARCH and NMRPredict.

What is the improvement in the NMRShiftDB HOSE-code implementation compared to ACD, compared to CSEARCH ?

Is there any new data-checking algorithm that finds errors overlooked by ACD and/or CSEARCH ?

It's free - agreed, that is an advantage for many of us in academia ! Being free of charge is not in itself scientific progress, but it might support science.

OK - it's free, that's great ! Let's compare it against MESTREC - a one-year subscription to "NMRPredict ONLINE FULL" costs Euro 155.- and gives you access to 425,000 NMR-spectra. Let's disregard the software and the performance and just focus on the number of spectra.

425,000 spectra ........ 155 Euro   
 20,000 spectra ........  x  Euro     -------->    x=7.29 Euro

That means: for the same number of spectra, access to "NMRPredict ONLINE FULL" costs you 7.29 Euro per year instead of 0.00 Euro (this corresponds to less than 2 packs of cigarettes !).
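The rule of three used above can be checked with a short Python sketch. The function name is hypothetical, and the calculation deliberately ignores software and performance, as stated in the text.

```python
def prorated_price(full_price_eur, full_count, subset_count):
    """Price of a subset of spectra, scaled linearly from the full collection."""
    return full_price_eur * subset_count / full_count


# 425,000 spectra cost 155 Euro per year; what do 20,000 spectra correspond to?
x = prorated_price(155.0, 425_000, 20_000)
print(f"{x:.2f} Euro per year")
```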


Now let's focus on the resource - the data of NMRShiftDB - itself:

As stated previously, there were obvious and easily detectable errors in the first version I analyzed. There seems to be no validation process during data input, and for years nobody seems to have been responsible for data corrections, although there are nearly 1,100 registered users and many contributors. After my report and some external help, the errors have been (at least) partly corrected. (BTW: Christoph never asked me for a copy of my error reports)

It is a strong disadvantage that about 85% of the data carry no solvent information - how can they be used to validate an algorithm that handles solvent-specific predictions ? It would be useful to run the predictions twice, once with 'solvent-specific prediction' disabled and once with it enabled. From these findings the algorithm could be improved.

The most necessary work (and the most cumbersome one) when creating a spectral database is the input of the data itself - this is time-consuming and definitely not as much fun as algorithm development. Why is it done twice (or even more often) ? It would be an advantage to the scientific community if the database suppliers did not all focus on the same journals, namely those containing the most data - the goal should be to create the highest diversity. I definitely do not feel threatened by the NMRShiftDB-project - I know it is simply stupid to extract the same data from the same journals twice. Why did Christoph Steinbeck make no effort, when he started about 5 years ago, to cover journals which are not covered by the people already working in this field ? Even if somebody doesn't feel comfortable working together, it should be possible to get a list of journals currently under extraction. Under these criteria NMRShiftDB would be a valuable resource and an excellent dataset for testing, because it would contain more unique structures. There is an overlap of 57% between ACD's collection and NMRShiftDB, based on the number of carbons as reported in the ACD-paper - in other words, 57% of the data have been entered twice. No other challenging things around ?




At the end:
 

Congratulations to ACD - they are at 1.59ppm:

1.59ppm average deviation for more than 200,000 predictions based on the 20,000 NMRShiftDB-structures using their ACD/CNMR Predictor Software is really great !

Congratulations to MODGRAPH - they are at 1.40ppm:

Using the same NMRShiftDB-dataset, MODGRAPH's CSEARCH-derived NMRPredict-program already performs at 1.40ppm using their "BEST"-technology !


Page written on May 31st, 2007 by Wolfgang.Robien(at)University.of.Vienna(univie.ac.at)

Page online since: May 31st, 2007
Last modification: June 4th, 2007