Hi Joselaine,

On Wed, Apr 24, 2019 at 8:40 PM Joselaine Cáceres Gonzalez <joselainecaceres@gmail.com> wrote:
Hi Matt, thank you for your answer! The references I have for Malinowski's work and some applications are:
Malinowski, E.R., *Theory of error in factor analysis.* Analytical Chemistry, 1977. *49*(4): p. 606-612.
Malinowski, E.R., *Theory of the distribution of error eigenvalues resulting from principal component analysis with applications to spectroscopic data.* Journal of Chemometrics, 1987. *1*(1): p. 33-40.
Malinowski, E.R., *Statistical F-tests for abstract factor analysis and target testing.* Journal of Chemometrics, 1989. *3*(1): p. 49-60.
Malinowski, E.R., *Adaptation of the Vogt–Mizaikoff F-test to determine the number of principal factors responsible for a data matrix and comparison with other popular methods.* Journal of Chemometrics, 2004. *18*(9): p. 387-392.
McCue, M. and E.R. Malinowski, *Target Factor Analysis of the Ultraviolet Spectra of Unresolved Liquid Chromatographic Fractions.* Applied Spectroscopy, 1983. *37*(5): p. 463-469.
Beauchemin, S., D. Hesterberg, and M. Beauchemin, *Principal Component Analysis Approach for Modeling Sulfur K-XANES Spectra of Humic Acids.* Soil Science Society of America Journal, 2002. *66*: p. 83-91.
Wasserman, S.R., et al., *EXAFS and principal component analysis: a new shell game.* Journal of Synchrotron Radiation, 1999. *6*: p. 284-286.
OK, thanks, but what I was more interested in is the actual math for the tests you are doing. I don't think I understand what you mean by "eigenvalues explain exactly the amount of variance ... but they are different". Can you clarify? Giving an actual example might help.
Here are the results obtained by ATHENA. The fractional variance explained by each eigenvalue is calculated by dividing that eigenvalue by the sum of all of them, right?
Yes, that is my understanding too.
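In numpy, that calculation would be just (a minimal sketch, using the Athena eigenvalues from your table below):

    import numpy as np

    # eigenvalues as reported by Athena, largest first
    eigvals = np.array([8.864394, 1.227578, 0.708334, 0.129478, 0.045127,
                        0.012229, 0.009464, 0.001489, 0.000989, 0.000617,
                        0.000298])

    variance = eigvals / eigvals.sum()    # fractional explained variance
    cumvar = np.cumsum(variance)          # cumulative variance
    print(variance, cumvar)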
 #   Eigenvalue   Variance   Cumulative variance
 1    8.864394    0.80585    0.805854
 2    1.227578    0.11160    0.917452
 3    0.708334    0.06439    0.981846
 4    0.129478    0.01177    0.993617
 5    0.045127    0.00410    0.997719
 6    0.012229    0.00111    0.998831
 7    0.009464    0.00086    0.999692
 8    0.001489    0.00014    0.999827
 9    0.000989    0.00009    0.999917
10    0.000617    0.00006    0.999973
11    0.000298    0.00003    1
Here are the results obtained with a matrix calculator for the same data:
Yes. Note that for Athena, Sum Eigenvalues = 11, the number of spectra. What is "matrix calculator"?
Eigenvalue   Explained variance   Cumulative variance
1418.3057        0.80586              0.80586
 196.4118        0.11160              0.917453
 113.3331        0.06439              0.981846
  20.7162        0.01177              0.993617
   7.2205        0.00410              0.997719
   1.9566        0.00111              0.998831
   1.5144        0.00086              0.999692
   0.2381        0.00014              0.999827
   0.1583        0.00009              0.999917
   0.0987        0.00006              0.999973
   0.0477        0.00003              1.000000
Here, the sum of the eigenvalues is 1760 = 160*11. The explained variance looks essentially identical: the eigenvalues differ only by a constant scale factor that happens to be very close to 160. Without seeing your data, I might ask a) are there 160 energy points in the matrix you are using? and b) is the integral or the sum of the mean spectrum = 160? FWIW, when I compare scikit-learn PCA (as used in Larch) with Athena, I get answers that are significantly different, and by much more than just a scale factor. I am pretty sure that this is because of something in how scikit-learn treats the data: for a case where there are 9 input spectra, Athena gives eigenvalues that sum to 9, whereas scikit-learn's eigenvalues sum to 2.3... I don't know where that difference comes from.
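To make the scale-factor point concrete, here is a toy numpy sketch (not your data -- the 160 x 11 shape is only a guess from the factor of 160): the eigenvalues of X.T @ X and of X.T @ X / nrows differ by exactly nrows, while the fractional explained variance is unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(160, 11))        # 160 energy points x 11 spectra

    ev_raw = np.sort(np.linalg.eigvalsh(x.T @ x))[::-1]
    ev_cov = np.sort(np.linalg.eigvalsh(x.T @ x / x.shape[0]))[::-1]

    print(ev_raw / ev_cov)                # constant ratio: 160
    print(ev_raw / ev_raw.sum())          # same fractional explained variance
    print(ev_cov / ev_cov.sum())          # ... as this one

So anything built from the variance ratios should not care about that overall scale.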
The eigenvalues are then used to evaluate the IND function and the F test. The IND function reaches a minimum value when the set of primary components is separated from the secondary ones that only explain experimental error (in the equations, lambda are the eigenvalues, r the number of rows, c the number of columns, and n the number of primary components): [image: image.png]
The results obtained with the two sets of eigenvalues are different, but they reach the minimum at the same n. The F test also gives me similar levels of significance for the two sets, but I do not understand why I am not able to find the same eigenvalues that ATHENA does.
Yeah, I sort of doubt that any of those tests will give a different value for 'n' (the number of significant components) based on the scale of the eigenvalues themselves. That is, I think you can use the "fractional explained variance" (or "explained variance ratio"). I don't see an actual write-up of where the formula for IND comes from, and I do not quite get what `s` is -- is that `c-n`? I'd also be happy to try to add Malinowski's F test to the reporting for Larch's PCA analysis, but I don't quite understand it. Is the sum from n+1 to s or from n+1 to c (as with RE)? What is `j` in the denominator (isn't that outside the sum)? Similarly, I'd be happy to report SPOIL if I knew a workable definition...
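To make sure we are talking about the same equations, here is my best guess at IND and the F test in code, based on how I read Malinowski's papers. I am assuming s = min(r, c), which is just c (the number of spectra) for your data -- please tell me where this does not match your equations:

    import numpy as np
    from scipy.stats import f as f_dist

    def malinowski_ind(eigvals, nrows):
        # eigvals: eigenvalues, largest first (length c = number of spectra)
        # nrows:   r, number of rows (energy points) in the data matrix
        eigvals = np.asarray(eigvals, dtype=float)
        c = len(eigvals)
        ind = []
        for n in range(1, c):                      # trial number of primary components
            err = eigvals[n:].sum()                # sum of secondary (error) eigenvalues
            re = np.sqrt(err / (nrows * (c - n)))  # real error RE(n)
            ind.append(re / (c - n)**2)            # IND(n) = RE(n) / (c - n)^2
        return np.array(ind)                       # minimum suggests the number of components

    def malinowski_ftest(eigvals, nrows):
        # my guess at the 1989 F test: compare the n-th reduced eigenvalue
        #   REV_j = lambda_j / ((r - j + 1)(c - j + 1))
        # to the pool of error eigenvalues j = n+1 ... c
        eigvals = np.asarray(eigvals, dtype=float)
        c, r = len(eigvals), nrows
        fvals, siglevs = [], []
        for n in range(1, c):
            j = np.arange(n + 1, c + 1)            # indices of the error eigenvalues
            fval = (eigvals[n-1] / ((r - n + 1) * (c - n + 1))
                    * ((r - j + 1) * (c - j + 1)).sum() / eigvals[n:].sum())
            fvals.append(fval)
            siglevs.append(100 * f_dist.sf(fval, 1, c - n))  # %SL for F(1, c - n)
        return np.array(fvals), np.array(siglevs)

If that matches your equations, I can add it to the PCA report in Larch.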
By the way, I tested the possibility you suggested of not dividing the data by the standard deviation, and I still couldn't find the same eigenvalues.
I am almost certain that Athena is not removing the mean -- that might help explain the issue too.

Hope that helps. I'd like to understand these statistics well enough to report them.

--Matt