gene annotations and poll

It just occurred to me (you probably already had it in mind) that we need a system similar to CiteULike and Connotea for genes!!! In other words, let TSL be for genes what delicious is for links or Flickr is for photos! Imagine a situation in which we catalog (through NCBI eUtils http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) ALL genes known to date, and allow people to subscribe to their genes by giving them a card with lots of info. They can also tag them, annotate them, create groups around a gene, etc... then we have what we are looking for!! There would be no need for a voting system. In my particular project, the question is: which are the important genes for protein structure characterization? Well, the answer could be simple... the ones that the most people are working on and that generate the most interest == the ones that belong to the most libraries. Also, by having a tagging system, we can calculate "tag similarity" scores against other libraries such as Connotea or CiteULike (or you name it!). Your thoughts?
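To make the "most libraries" and "tag similarity" ideas concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the user names, gene symbols, tags, and the choice of Jaccard overlap as the similarity score are not from any existing TSL, Connotea, or CiteULike API.

```python
from collections import Counter

# Hypothetical data: each user's "library" maps gene symbols to the tags
# that user applied. None of these names come from a real system.
libraries = {
    "alice": {"BRCA1": {"cancer", "dna-repair"}, "TP53": {"cancer", "apoptosis"}},
    "bob": {"TP53": {"cancer", "structure"}, "HBB": {"hemoglobin"}},
    "carol": {"TP53": {"apoptosis"}, "BRCA1": {"cancer"}},
}


def gene_popularity(libs):
    """Count how many libraries contain each gene."""
    counts = Counter()
    for genes in libs.values():
        counts.update(genes.keys())
    return counts


def library_tags(library):
    """Collect every tag used anywhere in one library."""
    tags = set()
    for gene_tags in library.values():
        tags |= gene_tags
    return tags


def jaccard(tags_a, tags_b):
    """Jaccard similarity of two tag sets: |A & B| / |A | B| (0.0 if both are empty)."""
    union = tags_a | tags_b
    if not union:
        return 0.0
    return len(tags_a & tags_b) / len(union)


# Genes ordered by the number of libraries they appear in (most popular first).
ranking = gene_popularity(libraries).most_common()

# One possible "tag similarity" score between two libraries.
similarity = jaccard(library_tags(libraries["alice"]), library_tags(libraries["bob"]))
```

Jaccard is only one candidate score; a weighted measure (for example, counting how often each tag is used) might suit real data better.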

Yes!

I've been noodling in this general area too. The trick will be to provide a good exploring/discovery UI, something like a tree map or heat map. Tags provide multiple dimensions to explore along and to connect to other related areas. As you drill into the items with a given tag, you can see other related tags and explore those. You can also look for tag intersections, where content has been tagged with multiple specific values, e.g. http://www.technorati.com/tag/HTML%20and%20CSS One of the things I really like about Connotea is that it automatically creates a dimension on people too, showing me other users who have also used a given tag. Already, I'm seeing patterns in how people tag content. People who use the Malaria tag could be good candidates to notify and invite to the Malaria community. Let's take this idea to genes. A few questions to help define the appropriate dimensions (this is where my genome ignorance shows...):
True / false / refine: A genome represents a known collection of genes for a biological organism. This includes animals, plants, bacteria and viruses and more.
True / false / refine: A specific gene can belong to more than one genome.
True / false / refine: A protein is frequently studied as part of the field of molecular biology and can belong to more than one gene.
Annotations can occur for: - genome? - genes? - proteins?
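The three tag "dimensions" described above (tag intersections, related tags, and the people behind a tag) can be sketched in a few lines of Python. This is only an illustration under assumed data: the bookmark triples, user names, and item identifiers are hypothetical and do not reflect how Connotea or Technorati store anything.

```python
from collections import Counter

# Hypothetical bookmarks as (user, item, tags) triples.
bookmarks = [
    ("ana", "gene:PfCRT", {"malaria", "drug-resistance"}),
    ("ben", "gene:PfCRT", {"malaria", "transporter"}),
    ("ana", "gene:HBB", {"hemoglobin", "malaria"}),
    ("cy", "gene:TP53", {"cancer"}),
]


def items_with_all(marks, wanted):
    """Tag intersection: items whose tags, pooled across all users, include every wanted tag."""
    pooled = {}
    for _, item, tags in marks:
        pooled.setdefault(item, set()).update(tags)
    return sorted(item for item, tags in pooled.items() if set(wanted) <= tags)


def related_tags(marks, tag):
    """Count the other tags that appear on the same bookmark as `tag`."""
    counts = Counter()
    for _, _, tags in marks:
        if tag in tags:
            counts.update(tags - {tag})
    return counts


def users_of_tag(marks, tag):
    """The people dimension: every user who has applied `tag` at least once."""
    return {user for user, _, tags in marks if tag in tags}
```

For example, `users_of_tag(bookmarks, "malaria")` gives the set of people who might be invited to a Malaria community, and `related_tags(bookmarks, "malaria")` gives the tags to offer as the next drill-down step.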

RE: Yes!

Some specific answers:
A genome represents a known collection of genes for a biological organism. This includes animals, plants, bacteria and viruses and more.
    A genome is the collection of all genes in an organism. It usually comprises ALL the sequenced nucleic acid from the organism, even the regions with no known or annotated genes. The fact that a gene is not annotated does not mean it does not exist; it simply means we do not know it exists. Currently, there are about 285 complete genomes sequenced (see Genomes @ NCBI). 93% of those are for prokaryotic organisms (bacteria, etc...) and only 7% are for eukaryotes (including Human and Mouse). There are also 219 genomes almost completed and 510 in progress.
A specific gene can belong to more than one genome.
    A gene belongs to only one genome, and each gene is a unique entity. However, closely related organisms (such as human and chimpanzee) share a high degree of homology in their genes. Thus, two genes from two genomes may be very similar or even identical. All organisms need to perform the very same tasks at the cellular level. For example, we all need ATP for generating energy in our bodies (ATP is the gas we use to run). Thus, the genes (or rather proteins) that are essential for assimilating and using ATP in our cells are needed by all cellular organisms (from unicellular ones such as yeast to multicellular ones such as humans).
A protein is frequently studied as part of the field of molecular biology and can belong to more than one gene.
    That question does not have a straight answer. The generally accepted rule was that one gene == one protein, but it is now clear that this no longer holds. A gene may result in more than one protein, depending on several conditions. Generally, researchers work on a particular biological problem (or sometimes a disease) and concentrate their efforts on studying all the genes and proteins known to be somehow associated with the molecular mechanisms around their problem. Several fields in biology work at the gene or protein level, and Molecular Biology is one of those.
Annotations can occur for: - genome? - genes? - proteins?
    It actually occurs at all levels. However, the levels that people work on most actively are genes and proteins. The type of annotation also differs depending on the level you are at. For example, very little structural annotation exists for genes, but for proteins the structural annotation is essential, since structure determines how proteins function.
Hope that helps...

gene dimensions and tree map possibilities

It helps a lot. Assuming I understand, here are my assumptions. Again, correct as necessary. Any given community is typically going to have one genome of interest. There may be exceptions, but this is generally true. A few of these genomes will be fully mapped and we'll be able to load the "complete" genome information. Complete meaning that the existence of each gene is known, even though the details of that gene and its associated proteins still require a lot more research and that information is evolving. But most genomes will only have partial information. Is this also available in a central db? Do scientists keep it up to date? In other words, should we only perform periodic batch uploads from that source? Or should we allow one-by-one maintenance of the genomic data by the community? Annotations, references and comments should be maintained by the community and can occur at the following levels:
  • genome
  • gene
    • protein
Is there any other scientific DB tracking this today? I don't want to build a competing tool. Are there others related to our idea of a genomic reference map?
Again, a tree map is likely applicable here. I know somebody at Oracle who can give us guidance in this area. She says that The Hive Group has some decent software. But she also says that people who create tree maps (see SmartMoney's market map) tend to need something custom. I could envision a partnership with them whereby you provide them specs for bio-applications. They give us the software for free usage on The Synaptic Leap and perhaps PlasmoDB (public domain), and then they offer the same package to other biotechnology companies for private use. The other approach is to talk a tree map expert into helping you out for free as an open source project. My friend at Oracle says that Jonathan Helfman is one expert developer she knows in this space. (Afraid I don't know Jonathan.) A long shot would be to try and recruit him to help. Think about it. It doesn't have to be v1 of the gene poll. But if the application is as right as I think it is, then it would be a nice visual exploration tool.
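For a feel of what a tree map does under the hood, here is a minimal one-level "slice" layout in Python. It is a sketch under stated assumptions, not The Hive Group's software or anyone's production algorithm: the function name `slice_layout`, the gene names, and the use of "number of libraries" as the weight are all hypothetical. A real tree map would apply this recursively (genome, then gene, then protein) and would probably use a layout that keeps the rectangles closer to square.

```python
def slice_layout(weights, x, y, width, height):
    """Split one rectangle into strips whose areas are proportional to `weights`.

    `weights` maps a label to a positive number. The rectangle is cut along
    its longer side. Returns {label: (x, y, width, height)}.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    rects = {}
    offset = 0.0
    cut_along_width = width >= height
    for label, weight in weights.items():
        if cut_along_width:
            strip = width * weight / total
            rects[label] = (x + offset, y, strip, height)
        else:
            strip = height * weight / total
            rects[label] = (x, y + offset, width, strip)
        offset += strip
    return rects


# Hypothetical weights: how many user libraries contain each gene.
library_counts = {"TP53": 3, "BRCA1": 2, "HBB": 1}
layout = slice_layout(library_counts, 0, 0, 600, 100)
```

Here the most widely collected gene gets the largest rectangle, which is the visual cue a gene poll replacement would rely on.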