Welcome to the second edition of Bioinformatics in Rust!

There was a great reception for the first-ever edition of this newsletter last month! I hope to keep that momentum going and continue bringing you exciting updates about Rust and bioinformatics.


Spotlight

Rust-bert

crates.io | docs.rs | source

Rust-bert is a Rust port of Hugging Face’s transformers library. It offers a wide range of powerful natural language processing (NLP) features, such as:

  • Named Entity Recognition (NER)
  • Part-of-Speech (POS) tagging
  • Keyword extraction
  • Masked Language Modeling
  • And more

Together with its extensive test suite, this makes it a strong choice for integrating modern NLP capabilities into a fast, reliable Rust application.

Talea

source

Talea is a Rust-based project that uses NLP techniques to extract key information from input files. It is designed for users who lack the technical expertise to build their own data processing pipelines, and it aims to make that extraction as seamless and user-friendly as possible.


Research Highlights

Quantum Natural Language Processing

Summary:

Natural language processing (NLP) is a subfield of artificial intelligence that aims to enable computers to understand human language. Quantum natural language processing (QNLP) applies these ideas to quantum computing. Its applications within bioinformatics include detection of non-coding RNA, prediction of protein structure and function, biomedical literature mining, drug discovery and design, and genomic sequence analysis.

Pallavi, G., & Kumar, R.P. (2025). Quantum natural language processing and its applications in bioinformatics: a comprehensive review of methodologies, concepts, and future directions. Frontiers in Computer Science, 7, Article 1464122. DOI


Monthly Challenge

Natural language processing (NLP) is built from many powerful components. Try implementing some of them on your own!

Some core tasks you can implement include:

  • Text Tokenization – Splitting text into smaller units called tokens (words, phrases, or symbols).
  • Word Lemmatization – Reducing words to their base or dictionary form.
  • Part-of-Speech Tagging – Assigning grammatical categories (noun, verb, adjective, etc.) to each token.
  • Named Entity Recognition (NER) – Detecting and classifying proper nouns like people, places, or organizations.
  • Coreference Resolution – Identifying when different words refer to the same entity within a text.
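
If you are not sure where to start, tokenization is the easiest entry point. Here is a minimal sketch using only the standard library. The `tokenize` function name and its rules are assumptions made for illustration: words are runs of alphanumeric characters, everything is lowercased, and each punctuation character becomes its own token. Real tokenizers treat contractions, hyphens, and Unicode segmentation far more carefully.

```rust
/// Split text into lowercase word tokens and single-character punctuation tokens.
/// Assumption (illustrative only): a word is a run of alphanumeric characters,
/// whitespace is dropped, and any other character is a token of its own.
pub fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();

    for ch in text.chars() {
        if ch.is_alphanumeric() {
            // Extend the current word, normalising case as we go.
            current.extend(ch.to_lowercase());
        } else {
            // Any other character ends the current word.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            // Keep punctuation as a token; drop whitespace.
            if !ch.is_whitespace() {
                tokens.push(ch.to_string());
            }
        }
    }
    // Flush the final word if the text did not end in punctuation or whitespace.
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

Calling `tokenize("BRCA1 binds DNA.")` yields the tokens `brca1`, `binds`, `dna`, and `.`. From here, lemmatization and tagging can operate on the token list instead of the raw string.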

Bonus Challenge

Combine multiple NLP components into a pipeline that performs more sophisticated tasks.

Example: Grammar Correction Pipeline

Create a system that:

  • Tokenizes the input.
  • Tags each word with its part of speech.
  • Detects grammar issues (e.g., verb tense mismatches, subject-verb disagreement).
  • Suggests corrections using lemmatization and rules (or a small ML model).
  • Reconstructs the corrected sentence.
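
To give a feel for how the stages fit together, here is a deliberately tiny sketch of such a pipeline. Everything in it is an assumption for illustration: the word lists, the `Tag` enum, and the `tag`, `agree`, and `correct` functions are hypothetical, and a hard-coded table of verb forms stands in for real lemmatization. It only checks one rule, namely that a verb agrees in number with the noun or pronoun directly before it.

```rust
/// Coarse part-of-speech tags for this toy example.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Tag {
    SingularNoun,
    PluralNoun,
    Verb,
    Other,
}

// Hypothetical mini-lexicon; a real tagger would use a trained model.
const SINGULAR_NOUNS: [&str; 5] = ["he", "she", "it", "gene", "protein"];
const PLURAL_NOUNS: [&str; 4] = ["they", "we", "genes", "proteins"];
// (third-person singular form, plural form) pairs, standing in for a lemmatizer.
const VERBS: [(&str, &str); 3] = [("is", "are"), ("binds", "bind"), ("encodes", "encode")];

/// Tag a single token by lexicon lookup (case-insensitive).
pub fn tag(word: &str) -> Tag {
    let w = word.to_lowercase();
    if SINGULAR_NOUNS.iter().any(|n| *n == w) {
        Tag::SingularNoun
    } else if PLURAL_NOUNS.iter().any(|n| *n == w) {
        Tag::PluralNoun
    } else if VERBS.iter().any(|(s, p)| *s == w || *p == w) {
        Tag::Verb
    } else {
        Tag::Other
    }
}

/// Return the verb form that agrees with the subject, if both are known.
fn agree(verb: &str, subject: Tag) -> Option<&'static str> {
    let v = verb.to_lowercase();
    let (singular, plural) = VERBS.iter().copied().find(|(s, p)| *s == v || *p == v)?;
    match subject {
        Tag::SingularNoun => Some(singular),
        Tag::PluralNoun => Some(plural),
        _ => None,
    }
}

/// Run the whole pipeline: tokenize, tag, detect, correct, reconstruct.
pub fn correct(sentence: &str) -> String {
    // 1. Tokenize on whitespace (punctuation stays attached to its word).
    let tokens: Vec<&str> = sentence.split_whitespace().collect();
    // 2. Tag every token.
    let tags: Vec<Tag> = tokens.iter().map(|t| tag(t)).collect();
    // 3 + 4. Find verbs that disagree with the preceding noun and swap the form.
    let mut out: Vec<String> = tokens.iter().map(|t| t.to_string()).collect();
    for i in 1..tokens.len() {
        if tags[i] == Tag::Verb {
            if let Some(fixed) = agree(tokens[i], tags[i - 1]) {
                if !tokens[i].eq_ignore_ascii_case(fixed) {
                    out[i] = fixed.to_string();
                }
            }
        }
    }
    // 5. Reconstruct the sentence.
    out.join(" ")
}
```

With this lexicon, `correct("The genes binds DNA")` returns `The genes bind DNA`, while a sentence that already agrees is returned unchanged. Keeping each stage in its own function makes it easy to replace the lookup tables with a real tagger or lemmatizer later.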

How to Share


Ferris in Bioinformatics

Have a tool, paper, dataset, or idea you’d like featured? Have suggestions for the website? Want to submit your challenge solution for a chance to be featured in next month’s newsletter? Discord