[Ifeffit] Question about PCA

Wed Apr 24 22:59:31 CDT 2019

Hi Joselaine,

On Wed, Apr 24, 2019 at 8:40 PM Joselaine Cáceres gonzalez <
joselainecaceres at gmail.com> wrote:

> Hi Matt, thank you for your answer!. The references I have about
> Malinowski´s work and some applications are:
>
> Malinowski, E.R., *Theory of error in factor analysis.* Analytical
> Chemistry, 1977. *49*(4): p. 606-612.
>
>  Malinowski, E.R., *Theory of the distribution of error eigenvalues
> resulting from principal component analysis with applications to
> spectroscopic data.* Journal of Chemometrics, 1987. *1*(1): p. 33-40.
>
> Malinowski, E.R., *Statistical F-tests for abstract factor analysis and
> target testing.* Journal of Chemometrics, 1989. *3*(1): p. 49-60.
>
> Malinowski, E.R., *Adaptation of the Vogt–Mizaikoff F-test to determine
> the number of principal factors responsible for a data matrix and
> comparison with other popular methods.* Journal of Chemometrics, 2004.
> *18*(9): p. 387-392.
>
> McCue, M. and E.R. Malinowski, *Target Factor Analysis of the Ultraviolet
> Spectra of Unresolved Liquid Chromatographic Fractions.* Applied
> Spectroscopy, 1983. *37*(5): p. 463-469.
>
> Beauchemin, S., D. Hesterberg, and M. Beauchemin, *Principal Component
> Analysis Approach for Modeling Sulfur K-XANES Spectra of Humic Acids.*
> Soils Science Society of America Journal, 2002. *66*: p. 83-91.
>
> Wasserman, S.R., et al., *EXAFS and principal component analysis: a new
> shell game.* Journal of Synchrotron Radiation, 1999. *6*: p. 284-286.
>

OK, thanks, but what I was more interested in was what is the math for
these tests that you are doing?

I think I don't understand what you mean by "eigenvalues explain exactly
> the amount of variance ... but they are different".  Can you clarify?
> Giving an actual example might help.
>
> Here are results obtained by ATHENA, the factional variance explained by
> each eigenvalue is calulated by dividing the eigenvalue between the sum of
> them all, right?:
>

Yes, that is my understanding too.

>
> #  Eignevalues    Variance    Cumulative variance
> 1 8,864394 0,80585 0,805854
> 2 1,227578 0,11160 0,917452
> 3 0,708334 0,06439 0,981846
> 4 0,129478 0,01177 0,993617
> 5 0,045127 0,00410 0,997719
> 6 0,012229 0,00111 0,998831
> 7 0,009464 0,00086 0,999692
> 8 0,001489 0,00014 0,999827
> 9 0,000989 0,00009 0,999917
> 10 0,000617 0,00006 0,999973
> 11 0,000298 0,00003 1
>
> Here are results obtained with matrix calculator for the same data:
>

Yes.  Note that for Athena, Sum Eigenvalues = 11, the number of spectra.

What is "matrix calculator"?

> Eigenvalue Explained Variance Cumulative variance
> 1418,3057 0,80586 0,80586
> 196,4118 0,11160 0,917453
> 113,3331 0,06439 0,981846
> 20,7162 0,01177 0,993617
> 7,2205 0,00410 0,997719
> 1,9566 0,00111 0,998831
> 1,5144 0,00086 0,999692
> 0,2381 0,00014 0,999827
> 0,1583 0,00009 0,999917
> 0,0987 0,00006 0,999973
> 0,0477 0,00003 1,000000
>

Here, Sum Eigenvalues = 1760.  = 160*11.

The explained variance looks essentially identical.  The eigenvalues differ
by a constant scale factor that happens to be very close to 160. Without
seeing your data, I might ask a) are there 160 energy points in the matrix
you are using? and b) is the integral or the sum of the mean specta = 160?

FWIW, when I compare scikit-learn PCA (as used in Larch) with Athena, I get
answers that are significantly different, and by much more than just a
scale factor.  I am pretty sure that this is because scikit-learch
For a case where there are 9 input spectra, Athen gives eigenvalues that
sum to 9, whereas scikit-learns' eigenvalues sum to 2.3....  I don't know
where that difference comes from.

> The eigenvalues are used then to evaluate the function IND and F test, and
> depending on the values of eigenvalues, function IND reach a minimum value
> when the set of primary components are separated from the secondary ones
> that just explained experimental errors (in the equations lambda are the
> eigenvalues, r the numbers of rows, c columns, n the number of primary
> components):
> [image: image.png]
>
>
> The results obtained with the two sets of eigenvalues are diferent but
> they reach the minimum in the same n. The F test also gives me similar
> levels of significance for the two sets, but I do not undestand why I´m not
> hable to find the same eigenvalues that ATHENA does.
>

Yeah, I sort of doubt that any of those tests will give a different value
for 'n' (significant components) based on the scale of the eigenvalues
themselves.  That is, I think you can use "fractional explained variance"
(or "explained variance ratio").

I don't see an actual write-up where the formula for IND comes from.  But I
do not quite get what `s` is -- is that `c-n` ?  I'd also be happy to try
to add Malinowski's F test to the reporting for Larch's PCA analysis, but I
don't quite understand it.  The sum is from n+1 to s or n+1 to c (as with
RE)?  What is `j` in the demoninator (isn't that outside the sum?).
 Similarly, I'd be happy to report SPOIL if I knew a workable definition...

By the way, I tested the possibility you told to not divide the data by the
> standard deviation and still couldn´t find the same
>
eigenvalues.
>

I am almost certain that Athena is not removing the mean -- that might help
explain the issue too.

Hope that helps.  I'd like to understand these statistics well enough to
report them.

--Matt
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://millenia.cars.aps.anl.gov/pipermail/ifeffit/attachments/20190424/06c0ce97/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 44705 bytes
Desc: not available
URL: <http://millenia.cars.aps.anl.gov/pipermail/ifeffit/attachments/20190424/06c0ce97/attachment-0001.png>