Re: [Ifeffit] Question about PCA

25 Apr 2019

      Hi Matt, Sorry I didn´t want to cause you too much trouble. The matrix I
have [D] is 161x11, but it happens that 160 is exactly the value of the
main diagonal in the  [Z] covariance matrix  [Z]=[D]T[D].  As as said, I
first performed step height normalization to each spectrum and then the Z
centralization of each one to find [D], and then use [D] to calculate [Z]
the covariance matrix . I extracted the eigenvalues  using two methods:
eigenvalue decomposition of [Z]  and also SVD of [D]. The second method
gives the square roots of the eigenvalues and with the first their values
directly. Both method gave reasonably equal eigenvalues betweem them.
Now I made the test to divide the eigenvalues (obtained by the second
method) between 160 and they are exactly equal to ATHENA results...

In relation to "s",  in the paper of Malinoksi of 2004 (Malinowski, E.R.,
Adaptation of the Vogt–Mizaikoff F-test to determine the number of
principal factors responsible for a data matrix and comparison with other
popular methods. Journal of Chemometrics, 2004. 18(9): p. 387-392) is
stated that s is equal to r or c, wichever is smaller, in this case s=c,
the number of spectra as you said.

You gave me a lot to think, actually I don´t really know if I understand
what you said: "...Athena *is* normalizing the data, but it is normalizing
along the first axis, not removing the mean spectra.   This gives
eigenvalues that sum to Nspectra and the same component eigenvectors as
Athena shows.  The first component is +/- 1 * mean_spectra."

Best regards
Joselaine

El jue., 25 abr. 2019 a las 11:49, Matt Newville (<
newville@cars.uchicago.edu>) escribió:
...
Hi Joselaine,
I'd like to update my answer, especially in regard to what Athena is
doing. As it turns out, I can very closely reproduce Athena's results with
a simple, straightforward implementation of PCA, and I think it might help
explain the differences you are seeing too.
In Python, for data that is arranged as  `data` with shape (Nspectra,
Nenergy), Athena appears to do
import numpy as np
def pca(data):
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    cor = np.dot(data.T, data) / data.shape[0]
    evals, var = np.linalg.eigh(cor)
    iorder = np.argsort(evals)[::-1]
    evals = evals[iorder]
    evec = np.dot(data, var)[:, iorder]
    return evec, evals
which is to say: Athena *is* normalizing the data, but it is normalizing
along the first axis, not removing the mean spectra.   This gives
eigenvalues that sum to Nspectra and the same component eigenvectors as
Athena shows.  The first component is +/- 1 * mean_spectra.
That appears to be different from what scikit-learn is doing, which
removes the mean spectra (along the other axis).  The eigenvectors the
scikit-learn reports are off by one from the ones reported by Athena.
I suspect the difference you see might also be due to how the data matrix
is ordered, but I don't know why your eigenvalues sum to 160*11 -- I might
have suspected they add to 160 or 11....
I'm not sure what the right approach should be, but I'd like to figure
this out soon.  I think adding Malinowski's IND and perhaps F would be fine
statistics, but I only see how to do that if `s=c` (that is the total
number of spectra).
--Matt
On Wed, Apr 24, 2019 at 10:59 PM Matt Newville 
wrote:
...
Hi Joselaine,
On Wed, Apr 24, 2019 at 8:40 PM Joselaine Cáceres gonzalez <
joselainecaceres@gmail.com> wrote:
...
Hi Matt, thank you for your answer!. The references I have about
Malinowski´s work and some applications are:
Malinowski, E.R., *Theory of error in factor analysis.* Analytical
Chemistry, 1977. *49*(4): p. 606-612.
Malinowski, E.R., *Theory of the distribution of error eigenvalues
resulting from principal component analysis with applications to
spectroscopic data.* Journal of Chemometrics, 1987. *1*(1): p. 33-40.
Malinowski, E.R., *Statistical F-tests for abstract factor analysis and
target testing.* Journal of Chemometrics, 1989. *3*(1): p. 49-60.
Malinowski, E.R., *Adaptation of the Vogt–Mizaikoff F-test to determine
the number of principal factors responsible for a data matrix and
comparison with other popular methods.* Journal of Chemometrics, 2004.
*18*(9): p. 387-392.
McCue, M. and E.R. Malinowski, *Target Factor Analysis of the
Ultraviolet Spectra of Unresolved Liquid Chromatographic Fractions.*
Applied Spectroscopy, 1983. *37*(5): p. 463-469.
Beauchemin, S., D. Hesterberg, and M. Beauchemin, *Principal Component
Analysis Approach for Modeling Sulfur K-XANES Spectra of Humic Acids.*
Soils Science Society of America Journal, 2002. *66*: p. 83-91.
Wasserman, S.R., et al., *EXAFS and principal component analysis: a new
shell game.* Journal of Synchrotron Radiation, 1999. *6*: p. 284-286.
OK, thanks, but what I was more interested in was what is the math for
these tests that you are doing?
I think I don't understand what you mean by "eigenvalues explain exactly
...
the amount of variance ... but they are different".  Can you clarify?
Giving an actual example might help.
Here are results obtained by ATHENA, the factional variance explained by
each eigenvalue is calulated by dividing the eigenvalue between the sum of
them all, right?:
Yes, that is my understanding too.
...
#  Eignevalues    Variance    Cumulative variance
1 8,864394 0,80585 0,805854
2 1,227578 0,11160 0,917452
3 0,708334 0,06439 0,981846
4 0,129478 0,01177 0,993617
5 0,045127 0,00410 0,997719
6 0,012229 0,00111 0,998831
7 0,009464 0,00086 0,999692
8 0,001489 0,00014 0,999827
9 0,000989 0,00009 0,999917
10 0,000617 0,00006 0,999973
11 0,000298 0,00003 1
Here are results obtained with matrix calculator for the same data:
Yes.  Note that for Athena, Sum Eigenvalues = 11, the number of spectra.
What is "matrix calculator"?
...
Eigenvalue Explained Variance Cumulative variance
1418,3057 0,80586 0,80586
196,4118 0,11160 0,917453
113,3331 0,06439 0,981846
20,7162 0,01177 0,993617
7,2205 0,00410 0,997719
1,9566 0,00111 0,998831
1,5144 0,00086 0,999692
0,2381 0,00014 0,999827
0,1583 0,00009 0,999917
0,0987 0,00006 0,999973
0,0477 0,00003 1,000000
Here, Sum Eigenvalues = 1760.  = 160*11.
The explained variance looks essentially identical.  The eigenvalues
differ by a constant scale factor that happens to be very close to 160.
Without seeing your data, I might ask a) are there 160 energy points in the
matrix you are using? and b) is the integral or the sum of the mean specta
= 160?
FWIW, when I compare scikit-learn PCA (as used in Larch) with Athena, I
get answers that are significantly different, and by much more than just a
scale factor.  I am pretty sure that this is because scikit-learch
For a case where there are 9 input spectra, Athen gives eigenvalues that
sum to 9, whereas scikit-learns' eigenvalues sum to 2.3....  I don't know
where that difference comes from.
...
The eigenvalues are used then to evaluate the function IND and F test,
and depending on the values of eigenvalues, function IND reach a minimum
value when the set of primary components are separated from the secondary
ones that just explained experimental errors (in the equations lambda are
the eigenvalues, r the numbers of rows, c columns, n the number of primary
components):
[image: image.png]
The results obtained with the two sets of eigenvalues are diferent but
they reach the minimum in the same n. The F test also gives me similar
levels of significance for the two sets, but I do not undestand why I´m not
hable to find the same eigenvalues that ATHENA does.
Yeah, I sort of doubt that any of those tests will give a different value
for 'n' (significant components) based on the scale of the eigenvalues
themselves.  That is, I think you can use "fractional explained variance"
(or "explained variance ratio").
I don't see an actual write-up where the formula for IND comes from.  But
I do not quite get what `s` is -- is that `c-n` ?  I'd also be happy to try
to add Malinowski's F test to the reporting for Larch's PCA analysis, but I
don't quite understand it.  The sum is from n+1 to s or n+1 to c (as with
RE)?  What is `j` in the demoninator (isn't that outside the sum?).
 Similarly, I'd be happy to report SPOIL if I knew a workable definition...
By the way, I tested the possibility you told to not divide the data by
...
the standard deviation and still couldn´t find the same
eigenvalues.
...
I am almost certain that Athena is not removing the mean -- that might
help explain the issue too.
Hope that helps.  I'd like to understand these statistics well enough to
report them.
--Matt
--
--Matt Newville <newville at cars.uchicago.edu> 630-252-0431
_______________________________________________
Ifeffit mailing list
Ifeffit@millenia.cars.aps.anl.gov
http://millenia.cars.aps.anl.gov/mailman/listinfo/ifeffit
Unsubscribe: http://millenia.cars.aps.anl.gov/mailman/options/ifeffit