Hi AWG members @AnimalAWG @MultiOmicsAWG,
We have been processing microarray datasets and came across some bacterial datasets that do not have BioMart support. Normally, we use BioMart to map the probeset-level output data to gene-level annotations. Since that mapping is not available for bacterial data, we can either:
(1) Omit annotations entirely in the outputs for bacteria datasets
(2) Use annotation data from the manufacturer
For the latter, note that the annotation data may be quite old. For example, Thermo Fisher provides the following annotation files, both created in 2016 (login required for below links):
We would like your feedback on whether or not these annotations from the manufacturer would still be useful to include in the output data. Alternatively, we are open to any other sources of annotations that you may know of.
Thank you!
Generally I think we'd want to provide some annotation even if it is outdated. Thus, unless anyone comes up with a good open source, the manufacturer annotation might be the best you can do.
Thank you for your feedback! We will make sure to include the annotations in that case. Our current outputs include the following annotation fields:
"ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID", "STRING_id", "GOSLIM_IDS"
Some of these fields might not be available in the manufacturer annotations, but we will include what is available.
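To make "include what is available" concrete, here is a minimal Python sketch of how the manufacturer file could be mapped onto those output fields. Everything in it is an assumption for illustration rather than a settled design: the pairing in `FIELD_MAP` (for example, `GENENAME` taken from `Gene Title`), the helper names `clean_cell` and `annotate`, the `---` placeholder for unpopulated cells, the `///` separator for multi-valued cells, and the `#`-prefixed metadata lines at the top of the file. `STRING_id` and `GOSLIM_IDS` have no obvious counterpart in the manufacturer columns, so the sketch simply leaves them empty.

```python
import csv

# Assumed pairing of our output fields with manufacturer column names.
# Illustrative only; not a confirmed design.
FIELD_MAP = {
    "ENSEMBL": "Ensembl",
    "SYMBOL": "Gene Symbol",
    "GENENAME": "Gene Title",
    "REFSEQ": "RefSeq Transcript ID",
    "ENTREZID": "Entrez Gene",
}
# Output fields with no obvious counterpart in the manufacturer file.
UNMAPPED_FIELDS = ("STRING_id", "GOSLIM_IDS")

MISSING = "---"    # assumed placeholder for unpopulated cells
MULTI_SEP = "///"  # assumed separator between multiple values in one cell


def clean_cell(value):
    """Return the values in one cell as a list; empty if unpopulated."""
    if value is None:
        return []
    parts = [p.strip() for p in value.split(MULTI_SEP)]
    return [p for p in parts if p and p != MISSING]


def annotate(csv_text):
    """Map each probe set ID to whichever output fields can be filled."""
    # Assumption: metadata lines before the header start with '#'.
    lines = [ln for ln in csv_text.splitlines() if not ln.startswith("#")]
    result = {}
    for row in csv.DictReader(lines):
        fields = {out: clean_cell(row.get(col)) for out, col in FIELD_MAP.items()}
        fields.update({out: [] for out in UNMAPPED_FIELDS})
        result[row["Probe Set ID"]] = fields
    return result


# Tiny in-memory example using values from the control probe sets.
sample = "\n".join([
    '"Probe Set ID","Gene Title","Gene Symbol","Ensembl","Entrez Gene","RefSeq Transcript ID"',
    '"AFFX-Athal_actin_at","actin 7","ACT7","---","830841","NM_121018"',
    '"AFFX-Athal_GAPDH_at","glyceraldehyde-3-phosphate dehydrogenase C subunit 1","GAPC1","---","819567","NM_111283"',
])
annotations = annotate(sample)
print(annotations["AFFX-Athal_actin_at"])
```

Keeping every field as a list (possibly empty) means downstream code does not need to special-case missing or multi-valued annotations.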
Are there any other fields besides the ones listed above that you would like to see in the output? Here are all the fields provided by Thermo Fisher for the bacterial species linked above (note: some fields might not actually be populated):
| Field | Row 1 | Row 2 |
|---|---|---|
| Probe Set ID | AFFX-Athal_actin_at | AFFX-Athal_GAPDH_at |
| GeneChip Array | P. aeruginosa Genome Array | P. aeruginosa Genome Array |
| Species Scientific Name | Pseudomonas aeruginosa | Pseudomonas aeruginosa |
| Annotation Date | 30-Mar-16 | 30-Mar-16 |
| Sequence Type | Control sequence | Control sequence |
| Sequence Source | GenBank | GenBank |
| Transcript ID(Array Design) | AFFX-Athal_actin | AFFX-Athal_GAPDH |
| Target Description | U37281 Arabidopsis thaliana actin-2 mRNA | M64116 Arabidopsis thaliana glyceraldehyde 3-phosphate dehydrogenase C subunit (GapC) gene |
| Representative Public ID | U37281 | M64116 |
| Archival UniGene Cluster | --- | --- |
| UniGene ID | At.23605 | At.22963 |
| Genome Version | --- | --- |
| Alignments | --- | --- |
| Gene Title | actin 7 | glyceraldehyde-3-phosphate dehydrogenase C subunit 1 |
| Gene Symbol | ACT7 | GAPC1 |
| Chromosomal Location | --- | --- |
| Unigene Cluster Type | --- | --- |
| Ensembl | --- | --- |
| Entrez Gene | 830841 | 819567 |
| SwissProt | P53492 | P25858 /// Q41949 /// Q56WW5 |
| EC | --- | --- |
| OMIM | --- | --- |
| RefSeq Protein ID | NP_196543 | NP_187062 |
| RefSeq Transcript ID | NM_121018 | NM_111283 |
| FlyBase | --- | --- |
| AGI | --- | --- |
| WormBase | --- | --- |
| MGI Name | --- | --- |
| RGD Name | --- | --- |
| SGD accession number | --- | --- |
| Gene Ontology Biological Process | 0006007 // glucose catabolic process // inferred from reviewed computational analysis /// 0006094 // gluconeogenesis // inferred from reviewed computational analysis /// 0007010 // cytoskeleton organization // inferred from reviewed computational analysis /// 0007010 // cytoskeleton organization // traceable author statement /// 0009416 // response to light stimulus // inferred from expression pattern /// 0009611 // response to wounding // inferred from expression pattern /// 0009733 // response to auxin // inferred from expression pattern /// 0009845 // seed germination // inferred from mutant phenotype /// 0010053 // root epidermal cell differentiation // inferred from mutant phenotype /// 0010498 // proteasomal protein catabolic process // inferred from reviewed computational analysis /// 0032880 // regulation of protein localization // inferred from reviewed computational analysis /// 0048364 // root development // inferred from mutant phenotype /// 0048767 // root hair elongation // inferred from mutant phenotype /// 0048767 // root hair elongation // inferred from reviewed computational analysis /// 0051301 // cell division // inferred from mutant phenotype | 0006006 // glucose metabolic process // inferred from electronic annotation /// 0006007 // glucose catabolic process // inferred from reviewed computational analysis /// 0006094 // gluconeogenesis // inferred from reviewed computational analysis /// 0006094 // gluconeogenesis // traceable author statement /// 0006096 // glycolytic process // inferred from direct assay /// 0006096 // glycolytic process // inferred from electronic annotation /// 0006096 // glycolytic process // inferred from sequence or structural similarity /// 0006096 // glycolytic process // inferred from reviewed computational analysis /// 0006096 // glycolytic process // traceable author statement /// 0006098 // pentose-phosphate shunt // inferred from reviewed computational analysis /// 0006833 // water transport // inferred from reviewed computational analysis /// 0006972 // hyperosmotic response // inferred from reviewed computational analysis /// 0006979 // response to oxidative stress // inferred from direct assay /// 0006979 // response to oxidative stress // inferred from expression pattern /// 0007010 // cytoskeleton organization // inferred from reviewed computational analysis /// 0007030 // Golgi organization // inferred from reviewed computational analysis /// 0009060 // aerobic respiration // inferred from reviewed computational analysis /// 0009266 // response to temperature stimulus // inferred from reviewed computational analysis /// 0009408 // response to heat // inferred from expression pattern /// 0009651 // response to salt stress // inferred from expression pattern /// 0009651 // response to salt stress // inferred from reviewed computational analysis /// 0009744 // response to sucrose // inferred from expression pattern /// 0010154 // fruit development // inferred from mutant phenotype /// 0010498 // proteasomal protein catabolic process // inferred from reviewed computational analysis /// 0034976 // response to endoplasmic reticulum stress // inferred from reviewed computational analysis /// 0042542 // response to hydrogen peroxide // inferred from direct assay /// 0046686 // response to cadmium ion // inferred from expression pattern /// 0046686 // response to cadmium ion // inferred from reviewed computational analysis /// 0048316 // seed development // inferred from mutant phenotype /// 0051775 // response to redox state // inferred from direct assay /// 0055114 // oxidation-reduction process // inferred from electronic annotation |
| Gene Ontology Cellular Component | 0005618 // cell wall // inferred from direct assay /// 0005730 // nucleolus // inferred from direct assay /// 0005737 // cytoplasm // not recorded /// 0005739 // mitochondrion // inferred from direct assay /// 0005829 // cytosol // inferred from direct assay /// 0005856 // cytoskeleton // inferred from sequence or structural similarity /// 0005886 // plasma membrane // inferred from direct assay /// 0009506 // plasmodesma // inferred from direct assay /// 0009570 // chloroplast stroma // inferred from direct assay /// 0009941 // chloroplast envelope // inferred from direct assay | 0005634 // nucleus // inferred from direct assay /// 0005737 // cytoplasm // not recorded /// 0005739 // mitochondrion // inferred from direct assay /// 0005740 // mitochondrial envelope // inferred from direct assay /// 0005774 // vacuolar membrane // inferred from direct assay /// 0005794 // Golgi apparatus // inferred from reviewed computational analysis /// 0005829 // cytosol // inferred from direct assay /// 0005829 // cytosol // traceable author statement /// 0005886 // plasma membrane // inferred from direct assay /// 0009507 // chloroplast // inferred from direct assay /// 0016020 // membrane // inferred from direct assay /// 0048046 // apoplast // inferred from direct assay |
| Gene Ontology Molecular Function | 0000166 // nucleotide binding // inferred from electronic annotation /// 0005200 // structural constituent of cytoskeleton // inferred from sequence or structural similarity /// 0005515 // protein binding // inferred from physical interaction /// 0005524 // ATP binding // inferred from electronic annotation | 0003677 // DNA binding // inferred from electronic annotation /// 0004365 // glyceraldehyde-3-phosphate dehydrogenase (NAD+) (phosphorylating) activity // traceable author statement /// 0005507 // copper ion binding // inferred from direct assay /// 0008886 // glyceraldehyde-3-phosphate dehydrogenase (NADP+) (non-phosphorylating) activity // inferred from direct assay /// 0016491 // oxidoreductase activity // inferred from electronic annotation /// 0016620 // oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor // inferred from electronic annotation /// 0050661 // NADP binding // inferred from electronic annotation /// 0051287 // NAD binding // inferred from electronic annotation |
| Pathway | --- | --- |
| InterPro | --- | --- |
| Trans Membrane | --- | --- |
| QTL | --- | --- |
| Annotation Description | This probe set was annotated using the Design Representative Id based pipeline to a Entrez Gene identifier using 1 transcripts. // false // Design Representative Id // R | This probe set was annotated using the Design Representative Id based pipeline to a UniGene identifier using 1 transcripts. // false // Design Representative Id // R |
| Annotation Transcript Cluster | U37281 | M64116 |
| Transcript Assignments | AFFX-Athal_actin // --- // unknown // --- // --- /// U37281 // U37281 Arabidopsis thaliana actin-2 mRNA // gb // --- // --- | AFFX-Athal_GAPDH // --- // unknown // --- // --- /// M64116 // M64116 Arabidopsis thaliana glyceraldehyde 3-phosphate dehydrogenase C subunit (GapC) gene // gb // --- // --- |
| Annotation Notes | --- | --- |
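
The Gene Ontology cells above pack several terms into one string, with `///` between entries and `//` between the parts of an entry (ID, term name, evidence). If GO information from the manufacturer file were carried into the output, something like the following Python sketch could unpack it. The function names `parse_go_cell` and `unique_go_ids` are hypothetical, the `---` check for unpopulated cells is an assumption, and the `GO:` prefix is added on the assumption that the numeric part is a standard seven-digit GO accession.

```python
def parse_go_cell(cell):
    """Split one GO annotation cell into (go_id, term, evidence) tuples."""
    if cell is None or cell.strip() in ("", "---"):
        return []  # unpopulated cell
    entries = []
    for entry in cell.split("///"):
        parts = [p.strip() for p in entry.split("//")]
        if len(parts) != 3:
            continue  # skip anything that is not ID // term // evidence
        number, term, evidence = parts
        entries.append(("GO:" + number.zfill(7), term, evidence))
    return entries


def unique_go_ids(cell):
    """Return GO IDs in order of first appearance, without duplicates."""
    # The same term can be listed once per evidence type, so deduplicate.
    return list(dict.fromkeys(go_id for go_id, _, _ in parse_go_cell(cell)))


example = (
    "0007010 // cytoskeleton organization // inferred from reviewed computational analysis"
    " /// 0007010 // cytoskeleton organization // traceable author statement"
    " /// 0009416 // response to light stimulus // inferred from expression pattern"
)
print(unique_go_ids(example))  # ['GO:0007010', 'GO:0009416']
```

Deduplicating by ID matters here because the same term is often repeated with different evidence types, as with cytoskeleton organization in the first row.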
We would appreciate any feedback on this topic by Friday, August 9th. Thank you all!