-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unexpected spectra factors and some misc questions #24
Comments
Hi Cristian, thanks for your feedback. Regarding your first question: These behaviors is expected. Three aspects: a) Multiple factors getting named by the same gene set and b) cell type specific factors getting named by factors from a different cell type c) a factor does not get a gene set name assigned a) There is by design no 1:1 mapping between gene sets and factors. The factors are named by their overlap coefficient with all input gene sets, so also cell type specific gene sets will be used for naming all of the factors. But don't worry, Spectra only used them to fit the factors for the cell types you indicated. Maybe I will constrain the factor naming to the gene sets that were used for fitting the factors for the respective cell type. Personally, I felt it was helpful to perform the naming by computing overlap of all gene sets vs all factors because when you set a gene set as cell type specific, Spectra may still discover a similar factor outside of that cell type in an unsupervised way that is without using the gene set. Let me know if that makes sense. b) Because of this lack of 1:1 mapping, gene sets which are not very coherent (maybe they are a mix of several processes) can be split into several factors mapping to a same gene. You may want to look at the marker genes for each of these factors. I suspect they will use different genes from the gene set. We believe this behavior is desired as the gene sets cannot be regarded as ground truth and may contain several ground truth processes which should then be split into distinct factors by Spectra. I think this behavior will be reduced if you reduce lambda, however this will also lead to stronger supervision by the gene sets so you learn less from the data. If you have a lot more factors than gene sets you can try reducing the number of factors first. Another approach could also be to take a closer look at the marker genes and see if there are perhaps some important subprocesses in your gene set and whether the gene set can be split and renamed or made more coherent. Of course there are also other ways to name the factors e.g. by running GSEA on the marker genes/ gene scores/loadings. c) If there is an gene set vs factor marker genes overlap coefficient lower than the threshold in Regarding eta: Regarding the cell scores: Regarding the gene set dictionary: Regarding the GPU version: Let me know if that helps. Thanks, |
Hello Thomas, I appreciate your thorough explanation. Now things are much clearer. I will need to invest some time to digest all the output I got from Spectra taking into account all the remarks you made :) Will also test different lambda values and try to refine my gene sets so there are not 'promiuscous' gene sets that cover many cell processes. Regarding your comment on unassigned factors to a specific gene set, you are right that all entries are unique and named with no duplicates. However I am curious why the final number of factors is 'limited/sticked' to the number of gene sets. I would assume if there are factors that are not fitting that well or less 'relevant', then the number of final factors should change. Because I have the feeling now that from my 465 gene sets, Spectra will try to fit 465 factors even if some of them are not specific for a particular cell type (such as the example with jessa22_M2 (which appears at least 8 times and abrogates other gene sets). What happened with the gene sets that now, talking again about this particular gene set (jessa22_M2), has 'replaced'? Does it mean that the gene set 17 that in my dictionary belongs to 'all_selenoamino-acid metabolism' has an overlap with jessa22_M2 but is better defined by the last one? (although I have noticed that factors try to keep the same order as in the dictionary, sometimes they are shuffled and there is no direct 1:1 replacement with less fitting factors). About the cell scores, I had the feeling they are more 'arbitrary' units and not necessarily can be comparable across factors. Looking forward to read more about it once the revised version is publically available (or in a updated tutorial🙂). Lastly, if I run Spectra in an unsupervised way, how does it determines the number of factors? do I need to set and expected number of factors to be recovered? or will Spectra determine the best and more stable value where no more overlapping factors and more relevant ones are reached? Hope @russellkune can tell us a bit more about the GPU implementation of Spectra. Thanks in advance for the time and detailed answers. |
Hi there, I was trying to run spectra fully unsupervised (no dictionary nor cell types), but I am unsure how to set the model besides setting Thanks! |
Hi Cristian, |
Perhaps to add to that: If you use Spectra without cell types and without gene sets it should essentially behave like NMF. So probably it's better to run NMF for this use case (mind that in the paper we show that NMF is actually performing very poorly so I would maybe rather use something like scHPF if you want to go fully unsupervised). |
Good day,
I have run Spectra successfully using both CPU and GPU. However, I have noticed an unexpected behavior in both approaches. I used a downsampled version of my data (120K cells), normalized using
scran
, and grouped in major cell types (malignant, myeloid, lymphoid, and vascular). In addition to the global gene sets, I added some specific ones for each cell group, and in total, the dictionary contains 465 gene sets.When I examined and plot the output stored in
adata.uns['SPECTRA_overlap']
and their respective scoresadata.obsm['SPECTRA_cell_scores']
, I noticed that the order and naming of the gene sets changed. In some cases, gene sets appear in global and other categories more than one time and show different results. Also, some categories are called only global, or malignant or myeloid. I thought they could be a summary of all gene sets in that category, but once again, there are several with the same name and different scores/distributions. See the example below (In this case, the gene set called ‘jessa22_M2’ is in the dictionary only as an entry for the malignant category and contains a signature of 50 genes).I am using
Why do you think this is happening? Should I reduce
lam
(e.g., 0.01) to make it more strict?I also wonder how can you discriminate potentially novel programs as you described in the manuscript for the therapy-induced macrophages? You mentioned in the methods that you defined “new factors as factors with a graph dependency parameter η < 0.25 and modified factors as factors with a graph dependency parameter η ≥ 0.25”. Which of the outputs can I use to filter potentially novel factors?
I also read in the methods that the gene set dictionary is optional. In that case, will Spectra perform an unsupervised gene program detection?
About the SPECTRA_cell_scores, I saw that most of the scores are not higher than 0.02. Do you know if this is normal? I see these scores even for well-known gene sets that define, for instance, the cell cycle. (I see the scales presented in your manuscript are much higher)
Finally, as I mentioned in #21, GPU can reach a lower LR with fewer epochs than the CPU implementation. Does this suggest the GPU version can provide more accurate results in less time?
Thanks in advance! Sorry for the questions, but I am excited to get this working on my dataset (and understanding better the outputs)!
I'm looking forward to hearing your thoughts and suggestions.
The text was updated successfully, but these errors were encountered: