Embedding Models
taxotagger.models
ModelFactory
Factory class to get the embedding model for the given model identifier.
get_model
staticmethod
get_model(model_id: str, config: ProjectConfig) -> EmbedModelBase
Get the embedding model for the given model identifier.
Parameters:
- model_id (str) – The identifier of the model to load.
- config (ProjectConfig) – The configuration for the project.
Returns:
- EmbedModelBase – The embedding model instance for the given model identifier.
Examples:
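As a minimal sketch of how an identifier-to-model factory of this kind can be organized, the block below maps ids to classes through a registry. It is not taxotagger's implementation: every name in it (DemoEmbedModelBase, DemoCNNModel, DemoBERTModel, DemoModelFactory, the ids "demo-cnn" and "demo-bert", and the plain dict used as a config) is hypothetical, and the identifiers actually accepted by ModelFactory.get_model are not listed in this section.

```python
from abc import ABC, abstractmethod


class DemoEmbedModelBase(ABC):
    """Hypothetical stand-in for an embedding model base class."""

    def __init__(self, config: dict) -> None:
        # A plain dict stands in for the real project configuration object.
        self.config = config

    @abstractmethod
    def embed(self, fasta_file: str) -> dict:
        """Return embeddings for the sequences in a FASTA file."""


class DemoCNNModel(DemoEmbedModelBase):
    def embed(self, fasta_file: str) -> dict:
        return {"phylum": []}  # placeholder result, no real model is loaded


class DemoBERTModel(DemoEmbedModelBase):
    def embed(self, fasta_file: str) -> dict:
        return {"phylum": []}  # placeholder result, no real model is loaded


class DemoModelFactory:
    """Hypothetical factory: looks up a model class by its identifier."""

    _registry = {"demo-cnn": DemoCNNModel, "demo-bert": DemoBERTModel}

    @staticmethod
    def get_model(model_id: str, config: dict) -> DemoEmbedModelBase:
        try:
            model_cls = DemoModelFactory._registry[model_id]
        except KeyError:
            raise ValueError(f"Unknown model id: {model_id!r}") from None
        return model_cls(config)


model = DemoModelFactory.get_model("demo-cnn", {})
```

Keeping the lookup in one place means callers depend only on the base-class interface, so adding a model is a one-line change to the registry.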
Source code in src/taxotagger/models.py
MycoAICNNEmbedModel
MycoAICNNEmbedModel(config: ProjectConfig)
Bases: EmbedModelBase
Embedding model for the pretrained MycoAI-CNN.
embed
Calculate the embeddings for the given FASTA file.
Parameters:
- fasta_file (str) – The path to the FASTA file to embed.
Returns:
- dict[str, list[dict[str, Any]]] – A dictionary of embeddings for each taxonomy level. The keys are the taxonomy levels, and each value is a list of dictionaries containing the id, embedding, and metadata for each sequence.
Each list has length n_samples, where n_samples is the number of sequences.
The keys of the inner dictionaries are id, vector, the taxonomy levels (e.g. phylum, class, order, family, genus, species), and any other metadata fields present in the FASTA header.
Each vector has shape (n_features), where n_features is the number of features in the embedding. The number of features for each taxonomy level is:
- phylum: 18
- class: 70
- order: 231
- family: 791
- genus: 3695
- species: 14742
The returned data looks like:
{
    "phylum": [{"id": "seq1", "vector": [0.1, 0.2, ...], "phylum": "Basidiomycota", ...}, ...],
    "class": [{"id": "seq1", "vector": [0.5, 0.6, ...], "class": "Agaricomycetes", ...}, ...],
    "order": [{"id": "seq1", "vector": [0.9, 0.8, ...], "order": "Corticiales", ...}, ...],
    "family": [{"id": "seq1", "vector": [0.3, 0.4, ...], "family": "Corticiaceae", ...}, ...],
    "genus": [{"id": "seq1", "vector": [0.7, 0.8, ...], "genus": "Waitea", ...}, ...],
    "species": [{"id": "seq1", "vector": [0.5, 0.6, ...], "species": "Circinata", ...}, ...]
}
Examples:
>>> config = ProjectConfig()
>>> model = MycoAICNNEmbedModel(config)
>>> embeddings = model.embed("dna1.fasta")
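The documented return structure can be checked against the per-level feature counts listed for this method. The following is a minimal sketch, not part of taxotagger: check_embedding_dims and EXPECTED_DIMS are hypothetical names, and the input is fabricated stand-in data with the documented shape rather than real model output.

```python
from typing import Any

# Feature counts per taxonomy level, copied from the description of embed().
EXPECTED_DIMS = {
    "phylum": 18,
    "class": 70,
    "order": 231,
    "family": 791,
    "genus": 3695,
    "species": 14742,
}


def check_embedding_dims(embeddings: dict[str, list[dict[str, Any]]]) -> dict[str, int]:
    """Return the number of sequences per level; raise on a wrong vector length."""
    counts = {}
    for level, records in embeddings.items():
        expected = EXPECTED_DIMS[level]
        for record in records:
            if len(record["vector"]) != expected:
                raise ValueError(
                    f"{record['id']}: {level} vector has {len(record['vector'])} "
                    f"features, expected {expected}"
                )
        counts[level] = len(records)
    return counts


# Fabricated data in the documented shape: one sequence per taxonomy level.
fake_embeddings = {
    level: [{"id": "seq1", "vector": [0.0] * dim, level: "unknown"}]
    for level, dim in EXPECTED_DIMS.items()
}
counts = check_embedding_dims(fake_embeddings)
```

In real use, the dictionary returned by model.embed("dna1.fasta") would take the place of fake_embeddings.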
parse_and_encode_fasta
Parse headers and encode the sequences in the given FASTA file.
The sequences are encoded using the encoders defined in the pretrained model.
Parameters:
- fasta_file (str) – The path to the FASTA file.
Returns:
- tuple[list[list[str]], TensorData] – A tuple containing the headers and the encoded data for the sequences in the FASTA file.
The shape of the headers is (n_samples, n_headers), where n_samples is the number of sequences and n_headers is the number of metadata fields parsed from each header (9).
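The headers half of the returned tuple is a nested list, so its documented (n_samples, n_headers) shape can be verified with a few lines of plain Python. This is a sketch under assumptions: headers_shape is a hypothetical helper, and the demo rows contain placeholder strings rather than real parsed header fields.

```python
def headers_shape(headers: list[list[str]], n_headers: int = 9) -> tuple[int, int]:
    """Return (n_samples, n_headers) after checking every row has n_headers fields."""
    for i, row in enumerate(headers):
        if len(row) != n_headers:
            raise ValueError(f"row {i} has {len(row)} fields, expected {n_headers}")
    return (len(headers), n_headers)


# Placeholder rows: three sequences, each with nine metadata fields.
demo_headers = [[f"field{j}" for j in range(9)] for _ in range(3)]
shape = headers_shape(demo_headers)
```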
MycoAIBERTEmbedModel
MycoAIBERTEmbedModel(config: ProjectConfig)
Bases: EmbedModelBase
Embedding model for the pretrained MycoAI-BERT.
embed
Calculate the embeddings for the given FASTA file.
Parameters:
- fasta_file (str) – The path to the FASTA file to embed.
Returns:
- dict[str, list[dict[str, Any]]] – A dictionary of embeddings for each taxonomy level. The keys are the taxonomy levels, and each value is a list of dictionaries containing the id, embedding, and metadata for each sequence.
Each list has length n_samples, where n_samples is the number of sequences.
The keys of the inner dictionaries are id, vector, the taxonomy levels (e.g. phylum, class, order, family, genus, species), and any other metadata fields present in the FASTA header.
Each vector has shape (n_features), where n_features is the number of features in the embedding. The number of features for each taxonomy level is:
- phylum: 18
- class: 70
- order: 231
- family: 791
- genus: 3695
- species: 14742
The returned data looks like:
{
    "phylum": [{"id": "seq1", "vector": [0.1, 0.2, ...], "phylum": "Basidiomycota", ...}, ...],
    "class": [{"id": "seq1", "vector": [0.5, 0.6, ...], "class": "Agaricomycetes", ...}, ...],
    "order": [{"id": "seq1", "vector": [0.9, 0.8, ...], "order": "Corticiales", ...}, ...],
    "family": [{"id": "seq1", "vector": [0.3, 0.4, ...], "family": "Corticiaceae", ...}, ...],
    "genus": [{"id": "seq1", "vector": [0.7, 0.8, ...], "genus": "Waitea", ...}, ...],
    "species": [{"id": "seq1", "vector": [0.5, 0.6, ...], "species": "Circinata", ...}, ...]
}
Examples:
>>> config = ProjectConfig()
>>> model = MycoAIBERTEmbedModel(config)
>>> embeddings = model.embed("dna1.fasta")
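Because the result is keyed by taxonomy level first, collecting every vector that belongs to one sequence takes a small regrouping step. The sketch below shows one way to do it; group_by_sequence is a hypothetical helper, not a taxotagger function, and the demo vectors are shortened to two values for readability.

```python
from collections import defaultdict
from typing import Any


def group_by_sequence(
    embeddings: dict[str, list[dict[str, Any]]],
) -> dict[str, dict[str, list[float]]]:
    """Regroup {level: [record, ...]} into {sequence id: {level: vector}}."""
    grouped: dict[str, dict[str, list[float]]] = defaultdict(dict)
    for level, records in embeddings.items():
        for record in records:
            grouped[record["id"]][level] = record["vector"]
    return dict(grouped)


# Shortened stand-in data in the documented shape (real vectors are much longer).
demo = {
    "phylum": [{"id": "seq1", "vector": [0.1, 0.2], "phylum": "Basidiomycota"}],
    "class": [{"id": "seq1", "vector": [0.5, 0.6], "class": "Agaricomycetes"}],
}
by_sequence = group_by_sequence(demo)
```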
parse_and_encode_fasta
Parse headers and encode the sequences in the given FASTA file.
The sequences are encoded using the encoders defined in the pretrained model.
Parameters:
- fasta_file (str) – The path to the FASTA file.
Returns:
- tuple[list[list[str]], TensorData] – A tuple containing the headers and the encoded data for the sequences in the FASTA file.
The shape of the headers is (n_samples, n_headers), where n_samples is the number of sequences and n_headers is the number of metadata fields parsed from each header (9).