Dear all,

I performed PCA on a set of 11 XANES spectra and was trying to use some of the functions and tests developed by Malinowski to extract the number of primary components. I found differences between the eigenvalues calculated with a matrix calculator and those calculated by ATHENA, but they are equivalent in some way, since they explain the same amount of variance, and when I calculate the IND function or Malinowski's F-test I reach the same conclusions.

To calculate the eigenvalues with the matrix calculator, I first normalized my data matrix (step-height normalization) with ATHENA, then exported the data to an Excel spreadsheet, where I centered them: z = (value - mean)/standard_dev. Then I used SVD and eigenvalue decomposition to find the eigenvalues; those obtained with SVD explain exactly the same amount of variance as those found with ATHENA, but the values themselves are different. In all the calculations I used the same number of points (161 points in each spectrum) in ATHENA and in the matrix calculator, over exactly the same energy interval.

I'd like to know if I am missing some step in the data pre-treatment that is preventing me from finding the same eigenvalues obtained by ATHENA. I'd also like to know if I can straightforwardly use the values obtained with ATHENA to evaluate the functions proposed by Malinowski, since these equations might differ depending on whether the calculations are made using "covariance about the origin" or "correlation about the origin", and I am not sure which is the case.

Thanks in advance!
Best regards,
Joselaine
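[Editor's note: the two routes described here (eigenvalue decomposition of the covariance matrix vs. SVD of the standardized data matrix) should agree exactly. A minimal numpy sketch of that equivalence, using random numbers as a stand-in for the actual spectra (161 points x 11 spectra), and assuming the z-scoring used Excel's sample standard deviation:]

```python
import numpy as np

rng = np.random.default_rng(42)
raw = rng.normal(size=(161, 11))     # stand-in: 161 energy points x 11 spectra

# z-score each spectrum, as described: z = (value - mean) / standard_dev
# (ddof=1 assumes the sample standard deviation, i.e. Excel's STDEV)
D = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Route 1: eigenvalue decomposition of Z = D^T D
Z = D.T @ D
evals = np.sort(np.linalg.eigvalsh(Z))[::-1]

# Route 2: SVD of D -- the squared singular values equal the eigenvalues of Z
svals = np.linalg.svd(D, compute_uv=False)

assert np.allclose(evals, svals**2)
assert np.isclose(Z.trace(), 160 * 11)   # each z-scored column has sum of squares 160
```

The trace check hints at the scale question raised later in the thread: with sample-standard-deviation scaling, the eigenvalues sum to (161 - 1) * 11 = 1760 rather than to the number of spectra.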
Hi Joselaine, On Tue, Apr 23, 2019 at 2:35 PM Joselaine Cáceres gonzalez < joselainecaceres@gmail.com> wrote:
Dear all,
I performed PCA on a set of 11 XANES spectra and was trying to use some of the functions and tests developed by Malinowski to extract the number of primary components.
I am not familiar with the details of those tests. Do you have a reference for them?

I found differences between eigenvalues calculated with a matrix calculator
and those calculated by ATHENA, but they are equivalent in some way, since they explain the same amount of variance, and when I calculate the IND function or Malinowski's F-test I reach the same conclusions. To calculate the eigenvalues with the matrix calculator, I first normalized my data matrix (step-height normalization) with ATHENA, then exported the data to an Excel spreadsheet, where I centered them: z = (value - mean)/standard_dev.
I believe that Athena does not do this centering. I would imagine that dividing by the standard deviation could skew the data.
Then I used SVD and eigenvalue decomposition to find the eigenvalues; those obtained with SVD explain exactly the same amount of variance as those found with ATHENA, but they are different.
I think I don't understand what you mean by "eigenvalues explain exactly the same amount of variance ... but they are different". Can you clarify? Giving an actual example might help.

In all the calculations I did, I used the same number of points (161 points in each spectrum) in ATHENA and in the matrix calculator, exactly the same energy interval.
I'd like to know if I am missing some step in the data pre-treatment that is preventing me from finding the same eigenvalues obtained by ATHENA, but I'd also like to know if I can straightforwardly use the values obtained with ATHENA to evaluate the functions proposed by Malinowski, since these equations might differ depending on whether the calculations are made using "covariance about the origin" or "correlation about the origin", and I am not sure which is the case.
I don't know the answer to any of those questions.

I can say that when I compare PCA with Athena and with Larch (using scikit-learn's PCA), I do get very similar-looking eigenvectors for the first few eigencomponents. To be clear, scikit-learn's PCA does first subtract out the mean (but does not divide by the standard deviation), whereas Athena identifies this mean as the 1st component. So there is a potential "off by 1" counting issue, but that is easily worked out.

The scalar values I get with scikit-learn's PCA are different from what Athena reports. scikit-learn's PCA returns both the eigenvalues (the explained variance) and the explained variance ratio -- weights that add to 1. I generally find the latter more useful, but maybe I don't understand what the eigenvalues can be used for.

Hope that helps. FWIW, I'm trying to learn all this stuff better too. Perhaps you can give some insight?

--Matt
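[Editor's note: the distinction drawn here can be sketched without scikit-learn itself. Per scikit-learn's PCA documentation, `explained_variance_` holds the covariance eigenvalues and `explained_variance_ratio_` their normalized weights; a numpy-only sketch on random stand-in data arranged (Nspectra, Nenergy):]

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(11, 161))     # stand-in: (Nspectra, Nenergy)

# scikit-learn-style PCA: subtract the mean spectrum, then diagonalize the
# sample covariance (no division by the standard deviation). The 11x11 Gram
# matrix has the same nonzero eigenvalues as the 161x161 covariance matrix.
centered = data - data.mean(axis=0)
gram = centered @ centered.T / (data.shape[0] - 1)
evals = np.sort(np.linalg.eigvalsh(gram))[::-1]

explained_variance = evals                      # analogous to PCA.explained_variance_
explained_variance_ratio = evals / evals.sum()  # analogous to PCA.explained_variance_ratio_

assert np.isclose(explained_variance_ratio.sum(), 1.0)
```

Because the mean spectrum is removed before diagonalizing, one of the 11 eigenvalues is essentially zero; the ratios over the full set still sum to 1.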
Hi Matt, thank you for your answer! The references I have about Malinowski's work and some applications are:

Malinowski, E.R., *Theory of error in factor analysis.* Analytical Chemistry, 1977. *49*(4): p. 606-612.
Malinowski, E.R., *Theory of the distribution of error eigenvalues resulting from principal component analysis with applications to spectroscopic data.* Journal of Chemometrics, 1987. *1*(1): p. 33-40.
Malinowski, E.R., *Statistical F-tests for abstract factor analysis and target testing.* Journal of Chemometrics, 1989. *3*(1): p. 49-60.
Malinowski, E.R., *Adaptation of the Vogt–Mizaikoff F-test to determine the number of principal factors responsible for a data matrix and comparison with other popular methods.* Journal of Chemometrics, 2004. *18*(9): p. 387-392.
McCue, M. and E.R. Malinowski, *Target Factor Analysis of the Ultraviolet Spectra of Unresolved Liquid Chromatographic Fractions.* Applied Spectroscopy, 1983. *37*(5): p. 463-469.
Beauchemin, S., D. Hesterberg, and M. Beauchemin, *Principal Component Analysis Approach for Modeling Sulfur K-XANES Spectra of Humic Acids.* Soil Science Society of America Journal, 2002. *66*: p. 83-91.
Wasserman, S.R., et al., *EXAFS and principal component analysis: a new shell game.* Journal of Synchrotron Radiation, 1999. *6*: p. 284-286.

I think I don't understand what you mean by "eigenvalues explain exactly the amount of variance ... but they are different". Can you clarify? Giving an actual example might help.
Here are the results obtained by ATHENA; the fractional variance explained by each eigenvalue is calculated by dividing the eigenvalue by the sum of them all, right?

#    Eigenvalue   Variance   Cumulative variance
1    8.864394     0.80585    0.805854
2    1.227578     0.11160    0.917452
3    0.708334     0.06439    0.981846
4    0.129478     0.01177    0.993617
5    0.045127     0.00410    0.997719
6    0.012229     0.00111    0.998831
7    0.009464     0.00086    0.999692
8    0.001489     0.00014    0.999827
9    0.000989     0.00009    0.999917
10   0.000617     0.00006    0.999973
11   0.000298     0.00003    1

Here are the results obtained with the matrix calculator for the same data:

Eigenvalue   Explained variance   Cumulative variance
1418.3057    0.80586              0.80586
196.4118     0.11160              0.917453
113.3331     0.06439              0.981846
20.7162      0.01177              0.993617
7.2205       0.00410              0.997719
1.9566       0.00111              0.998831
1.5144       0.00086              0.999692
0.2381       0.00014              0.999827
0.1583       0.00009              0.999917
0.0987       0.00006              0.999973
0.0477       0.00003              1.000000

The eigenvalues are then used to evaluate the IND function and the F-test. Depending on the values of the eigenvalues, the IND function reaches a minimum when the set of primary components is separated from the secondary ones that just explain experimental error (in the equations, lambda are the eigenvalues, r the number of rows, c the number of columns, and n the number of primary components): [image: image.png]

The results obtained with the two sets of eigenvalues are different, but they reach the minimum at the same n. The F-test also gives me similar levels of significance for the two sets, but I do not understand why I am not able to find the same eigenvalues that ATHENA does. By the way, I tested the possibility you suggested, not dividing the data by the standard deviation, and still couldn't find the same eigenvalues.

Best regards,
Joselaine
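[Editor's note: the attached image with the equations did not survive in the archive. As commonly quoted from Malinowski's 1977 paper, the indicator function is IND(n) = RE(n)/(c-n)^2 with RE(n) = sqrt( sum_{j=n+1..c} lambda_j / (r(c-n)) ). A pure-Python sketch using the matrix-calculator eigenvalues above (r = 161 rows, c = 11 spectra); treat the formula as a reading of the literature, not a transcription of the lost image:]

```python
import math

# Eigenvalues from the matrix-calculator table above (r = 161, c = 11)
evals = [1418.3057, 196.4118, 113.3331, 20.7162, 7.2205,
         1.9566, 1.5144, 0.2381, 0.1583, 0.0987, 0.0477]
r, c = 161, 11

def ind(evals, r, c, n):
    """Malinowski indicator function IND(n) = RE(n) / (c - n)**2,
    with RE(n) = sqrt(sum of the c - n smallest eigenvalues / (r*(c - n)))."""
    tail = sum(evals[n:])                  # secondary eigenvalues lambda_{n+1}..lambda_c
    re = math.sqrt(tail / (r * (c - n)))
    return re / (c - n) ** 2

values = [ind(evals, r, c, n) for n in range(1, c)]    # n = 1 .. c-1
best_n = min(range(1, c), key=lambda n: ind(evals, r, c, n))
```

Because IND(n) only rescales by a constant when all eigenvalues are multiplied by a constant factor inside RE, the location of the minimum (best_n) is the same for the ATHENA and matrix-calculator eigenvalue sets, consistent with what is reported above.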
_______________________________________________
Ifeffit mailing list
Ifeffit@millenia.cars.aps.anl.gov
http://millenia.cars.aps.anl.gov/mailman/listinfo/ifeffit
Unsubscribe: http://millenia.cars.aps.anl.gov/mailman/options/ifeffit
Hi Joselaine, On Wed, Apr 24, 2019 at 8:40 PM Joselaine Cáceres gonzalez < joselainecaceres@gmail.com> wrote:
Hi Matt, thank you for your answer! The references I have about Malinowski's work and some applications are:
Malinowski, E.R., *Theory of error in factor analysis.* Analytical Chemistry, 1977. *49*(4): p. 606-612.
Malinowski, E.R., *Theory of the distribution of error eigenvalues resulting from principal component analysis with applications to spectroscopic data.* Journal of Chemometrics, 1987. *1*(1): p. 33-40.
Malinowski, E.R., *Statistical F-tests for abstract factor analysis and target testing.* Journal of Chemometrics, 1989. *3*(1): p. 49-60.
Malinowski, E.R., *Adaptation of the Vogt–Mizaikoff F-test to determine the number of principal factors responsible for a data matrix and comparison with other popular methods.* Journal of Chemometrics, 2004. *18*(9): p. 387-392.
McCue, M. and E.R. Malinowski, *Target Factor Analysis of the Ultraviolet Spectra of Unresolved Liquid Chromatographic Fractions.* Applied Spectroscopy, 1983. *37*(5): p. 463-469.
Beauchemin, S., D. Hesterberg, and M. Beauchemin, *Principal Component Analysis Approach for Modeling Sulfur K-XANES Spectra of Humic Acids.* Soil Science Society of America Journal, 2002. *66*: p. 83-91.
Wasserman, S.R., et al., *EXAFS and principal component analysis: a new shell game.* Journal of Synchrotron Radiation, 1999. *6*: p. 284-286.
OK, thanks, but what I was more interested in is the actual math for these tests you are doing.

I think I don't understand what you mean by "eigenvalues explain exactly
the amount of variance ... but they are different". Can you clarify? Giving an actual example might help.
Here are the results obtained by ATHENA; the fractional variance explained by each eigenvalue is calculated by dividing the eigenvalue by the sum of them all, right?
Yes, that is my understanding too.
#    Eigenvalue   Variance   Cumulative variance
1    8.864394     0.80585    0.805854
2    1.227578     0.11160    0.917452
3    0.708334     0.06439    0.981846
4    0.129478     0.01177    0.993617
5    0.045127     0.00410    0.997719
6    0.012229     0.00111    0.998831
7    0.009464     0.00086    0.999692
8    0.001489     0.00014    0.999827
9    0.000989     0.00009    0.999917
10   0.000617     0.00006    0.999973
11   0.000298     0.00003    1
Here are results obtained with matrix calculator for the same data:
Yes. Note that for Athena, Sum Eigenvalues = 11, the number of spectra. What is "matrix calculator"?
Eigenvalue   Explained variance   Cumulative variance
1418.3057    0.80586              0.80586
196.4118     0.11160              0.917453
113.3331     0.06439              0.981846
20.7162      0.01177              0.993617
7.2205       0.00410              0.997719
1.9566       0.00111              0.998831
1.5144       0.00086              0.999692
0.2381       0.00014              0.999827
0.1583       0.00009              0.999917
0.0987       0.00006              0.999973
0.0477       0.00003              1.000000
Here, Sum Eigenvalues = 1760 = 160*11. The explained variance looks essentially identical. The eigenvalues differ by a constant scale factor that happens to be very close to 160. Without seeing your data, I might ask: a) are there 160 energy points in the matrix you are using? and b) is the integral or the sum of the mean spectra = 160?

FWIW, when I compare scikit-learn PCA (as used in Larch) with Athena, I get answers that are significantly different, and by much more than just a scale factor. I am pretty sure that this is because of how scikit-learn pre-treats the data. For a case where there are 9 input spectra, Athena gives eigenvalues that sum to 9, whereas scikit-learn's eigenvalues sum to 2.3.... I don't know where that difference comes from.
The eigenvalues are then used to evaluate the IND function and F-test, and depending on the values of the eigenvalues, the IND function reaches a minimum when the set of primary components is separated from the secondary ones that just explain experimental errors (in the equations, lambda are the eigenvalues, r the number of rows, c the number of columns, n the number of primary components): [image: image.png]

The results obtained with the two sets of eigenvalues are different, but they reach the minimum at the same n. The F-test also gives me similar levels of significance for the two sets, but I do not understand why I am not able to find the same eigenvalues that ATHENA does.
Yeah, I rather doubt that any of those tests will give a different value for 'n' (significant components) based on the scale of the eigenvalues themselves. That is, I think you can use the "fractional explained variance" (or "explained variance ratio").

I don't see an actual write-up of where the formula for IND comes from. But I do not quite get what `s` is -- is that `c-n`? I'd also be happy to try to add Malinowski's F-test to the reporting for Larch's PCA analysis, but I don't quite understand it. Is the sum from n+1 to s, or n+1 to c (as with RE)? What is `j` in the denominator (isn't that outside the sum)? Similarly, I'd be happy to report SPOIL if I knew a workable definition...

By the way, I tested the possibility you suggested, not dividing the data by the standard deviation, and still couldn't find the same eigenvalues.
I am almost certain that Athena is not removing the mean -- that might help explain the issue too. Hope that helps. I'd like to understand these statistics well enough to report them. --Matt
Hi Joselaine,
I'd like to update my answer, especially in regard to what Athena is doing.
As it turns out, I can very closely reproduce Athena's results with a
simple, straightforward implementation of PCA, and I think it might help
explain the differences you are seeing too.
In Python, for data that is arranged as `data` with shape (Nspectra,
Nenergy), Athena appears to do
    import numpy as np

    def pca(data):
        data = (data - data.mean(axis=0)) / data.std(axis=0)
        cor = np.dot(data.T, data) / data.shape[0]
        evals, var = np.linalg.eigh(cor)
        iorder = np.argsort(evals)[::-1]
        evals = evals[iorder]
        evec = np.dot(data, var)[:, iorder]
        return evec, evals
which is to say: Athena *is* normalizing the data, but it is normalizing
along the first axis, not removing the mean spectra. This gives
eigenvalues that sum to Nspectra and the same component eigenvectors as
Athena shows. The first component is +/- 1 * mean_spectra.
That appears to be different from what scikit-learn is doing, which removes the mean spectra (along the other axis). The eigenvectors that scikit-learn reports are off by one from the ones reported by Athena.
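[Editor's note: the two conventions contrasted here can be checked side by side on random stand-in data. A hedged numpy sketch, with D arranged (Nenergy, Nspectra): standardizing each spectrum and diagonalizing the correlation matrix gives eigenvalues that sum to Nspectra, while removing the mean spectrum and diagonalizing the sample covariance gives eigenvalues on an entirely different scale:]

```python
import numpy as np

rng = np.random.default_rng(1)
nenergy, nspec = 161, 11
D = rng.normal(size=(nenergy, nspec))     # stand-in for the real spectra

# Athena-like: z-score each spectrum, diagonalize the correlation matrix;
# the trace (and so the eigenvalue sum) is exactly Nspectra.
Z = (D - D.mean(axis=0)) / D.std(axis=0)
athena_evals = np.linalg.eigvalsh(Z.T @ Z / nenergy)
assert np.isclose(athena_evals.sum(), nspec)

# scikit-learn-like: treat spectra as samples, subtract the mean spectrum,
# diagonalize the sample covariance; the eigenvalue sum is the total variance,
# which has no reason to equal Nspectra.
X = D.T                                   # (Nspectra, Nenergy), samples in rows
Xc = X - X.mean(axis=0)                   # remove the mean spectrum
skl_evals = np.linalg.eigvalsh(Xc @ Xc.T / (nspec - 1))
```

This reproduces the qualitative observation above: identical data, two pre-treatments, eigenvalue sums on different scales.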
I suspect the difference you see might also be due to how the data matrix
is ordered, but I don't know why your eigenvalues sum to 160*11 -- I might
have suspected they add to 160 or 11....
I'm not sure what the right approach should be, but I'd like to figure this
out soon. I think adding Malinowski's IND and perhaps F would be fine
statistics, but I only see how to do that if `s=c` (that is the total
number of spectra).
--Matt
Hi Matt,

Sorry, I didn't want to cause you too much trouble. The matrix I have, [D], is 161x11, but it happens that 160 is exactly the value of the main diagonal of the covariance matrix [Z] = [D]T[D]. As I said, I first performed step-height normalization on each spectrum and then the z centralization of each one to find [D], and then used [D] to calculate the covariance matrix [Z]. I extracted the eigenvalues using two methods: eigenvalue decomposition of [Z], and SVD of [D]. The second method gives the square roots of the eigenvalues, and the first gives their values directly. Both methods gave reasonably equal eigenvalues. Now I made the test of dividing the eigenvalues (obtained by the second method) by 160, and they are exactly equal to the ATHENA results...

In relation to "s": in Malinowski's 2004 paper (Malinowski, E.R., Adaptation of the Vogt–Mizaikoff F-test to determine the number of principal factors responsible for a data matrix and comparison with other popular methods. Journal of Chemometrics, 2004. 18(9): p. 387-392) it is stated that s is equal to r or c, whichever is smaller; in this case s = c, the number of spectra, as you said.

You gave me a lot to think about; actually, I don't really know if I understand what you said: "...Athena *is* normalizing the data, but it is normalizing along the first axis, not removing the mean spectra. This gives eigenvalues that sum to Nspectra and the same component eigenvectors as Athena shows. The first component is +/- 1 * mean_spectra."

Best regards,
Joselaine
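[Editor's note: the 160 on the diagonal is exactly what z-scoring with the *sample* standard deviation (Excel's STDEV, ddof=1) produces: each standardized column of D then has sum of squares r - 1 = 160, so [Z] = D^T D is 160 times the correlation matrix, and its eigenvalues are 160 times the correlation-matrix eigenvalues that sum to the number of spectra. A small numpy check of that reasoning, on random stand-in data:]

```python
import numpy as np

rng = np.random.default_rng(7)
r, c = 161, 11
raw = rng.normal(size=(r, c))                 # stand-in for the normalized spectra

# z-score each spectrum with the sample std (ddof=1, as Excel's STDEV does)
D = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

Z = D.T @ D                                   # the 11x11 matrix [Z] = [D]T[D]
assert np.allclose(np.diag(Z), r - 1)         # main diagonal is exactly 160

evals_Z = np.sort(np.linalg.eigvalsh(Z))[::-1]
evals_corr = np.sort(np.linalg.eigvalsh(Z / (r - 1)))[::-1]
assert np.allclose(evals_Z, (r - 1) * evals_corr)   # eigenvalues differ by a factor of 160
assert np.isclose(evals_corr.sum(), c)              # correlation eigenvalues sum to 11
```

This matches the observation above that dividing the matrix-calculator eigenvalues by 160 reproduces ATHENA's values exactly.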
participants (2):
- Joselaine Cáceres gonzalez
- Matt Newville