#2 Natural Language Processing

Welcome to the second edition of Bioinformatics in Rust!

There was a great reception for the first-ever edition of this newsletter last month! I hope to keep that momentum going and continue bringing you exciting updates about Rust and bioinformatics.

Spotlight

Rust-bert

crates.io | docs.rs | source

Rust-bert is a Rust port of Hugging Face’s transformers library. It offers a wide range of powerful natural language processing (NLP) features, such as:

Named Entity Recognition (NER)
Part-of-Speech (POS) tagging
Keyword extraction
Masked Language Modeling
And more

These features, together with its extensive test suite, make it a strong choice for integrating modern NLP capabilities into a fast, reliable Rust application.

ttaw

crates.io | docs.rs | source

Talking to a wall (ttaw) is described by its author as a piecemeal natural language processing library. The latest version at the time of writing, v0.3.0, can detect rhymes, detect alliteration, and produce phonetic transcriptions.

Functionality:

Determine if two words rhyme using the Double Metaphone phonetic encoding
Determine if two words rhyme using CMUdict phonetic encoding
Determine if two words alliterate using the Double Metaphone phonetic encoding
Determine if two words alliterate using CMUdict phonetic encoding
Get the CMUdict phonetic encoding of a word
Get the Double Metaphone phonetic encoding of a word (port of the words/double-metaphone library)
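
To make the CMUdict-based checks above more concrete, here is a minimal sketch of the general idea. It is not ttaw's implementation or API. It assumes a tiny hand-entered dictionary of ARPAbet pronunciations, and the names `toy_dict`, `rhyme_part`, `rhymes`, and `alliterates` are hypothetical. Two words are treated as rhyming when their phonemes match from the last primary-stressed vowel onward, and as alliterating when their first phonemes match.

```rust
use std::collections::HashMap;

/// A hypothetical, hand-entered stand-in for CMUdict (ARPAbet phonemes).
/// Vowels carry a stress digit; `1` marks primary stress.
fn toy_dict() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("far", vec!["F", "AA1", "R"]),
        ("tar", vec!["T", "AA1", "R"]),
        ("cat", vec!["K", "AE1", "T"]),
        ("fat", vec!["F", "AE1", "T"]),
    ])
}

/// Returns the phonemes from the last primary-stressed vowel to the end.
fn rhyme_part<'a>(phonemes: &'a [&'a str]) -> Option<&'a [&'a str]> {
    let start = phonemes.iter().rposition(|p| p.ends_with('1'))?;
    Some(&phonemes[start..])
}

/// Two words rhyme here if their rhyme parts are identical.
/// Words missing from the dictionary never rhyme.
fn rhymes(dict: &HashMap<&str, Vec<&str>>, a: &str, b: &str) -> bool {
    match (dict.get(a), dict.get(b)) {
        (Some(pa), Some(pb)) => match (rhyme_part(pa), rhyme_part(pb)) {
            (Some(ra), Some(rb)) => ra == rb,
            _ => false,
        },
        _ => false,
    }
}

/// Two words alliterate here if they start with the same phoneme.
fn alliterates(dict: &HashMap<&str, Vec<&str>>, a: &str, b: &str) -> bool {
    match (dict.get(a), dict.get(b)) {
        (Some(pa), Some(pb)) => pa.first().is_some() && pa.first() == pb.first(),
        _ => false,
    }
}

fn main() {
    let dict = toy_dict();
    println!("far/tar rhyme: {}", rhymes(&dict, "far", "tar"));
    println!("far/cat rhyme: {}", rhymes(&dict, "far", "cat"));
    println!("far/fat alliterate: {}", alliterates(&dict, "far", "fat"));
}
```

A real dictionary has many more entries and multiple pronunciations per word, so a production check would need to compare every pronunciation pair rather than a single one.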

Research Highlights

Quantum Natural Language Processing

Summary:

Natural Language Processing is a subfield of artificial intelligence that aims to enable computers to understand human language. Quantum Natural Language Processing applies this concept to quantum computing. Its applications in bioinformatics include detection of non-coding RNA, prediction of protein structure and function, biomedical literature mining, drug discovery and design, and genomic sequence analysis.

Pallavi, G., & Kumar, R.P. (2025). Quantum natural language processing and its applications in bioinformatics: a comprehensive review of methodologies, concepts, and future directions. Frontiers in Computer Science, 7, Article 1464122. DOI

Monthly Challenge

Natural language processing (NLP) consists of many powerful components. Try programming one of them on your own!

Some core tasks you can implement include:

Text Tokenization – Splitting text into smaller units called tokens (words, phrases, or symbols).
Word Lemmatization – Reducing words to their base or dictionary form.
Part-of-Speech Tagging – Assigning grammatical categories (noun, verb, adjective, etc.) to each token.
Named Entity Recognition (NER) – Detecting and classifying proper nouns like people, places, or organizations.
Coreference Resolution – Identifying when different words refer to the same entity within a text.
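
As a possible starting point for the first task in the list above, here is a minimal tokenizer sketch using only the standard library. The function name `tokenize` and its rules are assumptions made for illustration: words are lowercased, apostrophes stay inside words, and every other non-whitespace character becomes its own token. Real tokenizers handle many more cases (hyphens, URLs, Unicode segmentation).

```rust
/// Splits text into lowercase word tokens and single-character
/// punctuation tokens. Whitespace is dropped.
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();

    for ch in text.chars() {
        if ch.is_alphanumeric() || ch == '\'' {
            // Still inside a word: accumulate the lowercase form.
            current.extend(ch.to_lowercase());
        } else {
            // A word boundary: flush the word collected so far.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            // Keep punctuation and symbols as their own tokens.
            if !ch.is_whitespace() {
                tokens.push(ch.to_string());
            }
        }
    }
    // Flush a trailing word that was not followed by a boundary.
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

fn main() {
    let tokens = tokenize("The BRCA1 gene isn't silent.");
    println!("{:?}", tokens);
}
```

The token list produced here can feed the later tasks, such as lemmatization or part-of-speech tagging.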

Bonus Challenge

Combine multiple NLP components into a pipeline that performs more sophisticated tasks.

Example: Grammar Correction Pipeline

Create a system that:

Tokenizes the input.
Tags each word with its part of speech.
Detects grammar issues (e.g., verb tense mismatches, subject-verb disagreement).
Suggests corrections using lemmatization and rules (or a small ML model).
Reconstructs the corrected sentence.
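
The five steps above can be sketched as a toy rule-based pipeline. This is one possible shape under strong assumptions, not a reference solution: the lexicon `VERB_LEMMAS`, the `Tag` enum, and all function names are hypothetical, only present-tense agreement between a pronoun and the verb directly after it is checked, and the output is lowercased.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tag {
    ThirdSingular, // he, she, it
    OtherPronoun,  // i, you, we, they
    Verb,
    Other,
}

// Hypothetical mini-lexicon of verb base forms.
const VERB_LEMMAS: [&str; 4] = ["walk", "run", "read", "align"];

// Step 1: tokenize on whitespace and lowercase each token.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

// Naive lemmatizer: strip a trailing "s" if that yields a known verb.
fn lemmatize(word: &str) -> &str {
    match word.strip_suffix('s') {
        Some(stem) if VERB_LEMMAS.contains(&stem) => stem,
        _ => word,
    }
}

// Step 2: tag each word using the tiny lexicon.
fn tag(word: &str) -> Tag {
    match word {
        "he" | "she" | "it" => Tag::ThirdSingular,
        "i" | "you" | "we" | "they" => Tag::OtherPronoun,
        w if VERB_LEMMAS.contains(&lemmatize(w)) => Tag::Verb,
        _ => Tag::Other,
    }
}

// Steps 3 to 5: detect disagreement, fix the verb, rebuild the sentence.
fn correct(text: &str) -> String {
    let tokens = tokenize(text);
    let tags: Vec<Tag> = tokens.iter().map(|t| tag(t)).collect();
    let mut out = tokens.clone();

    for i in 1..tokens.len() {
        if tags[i] != Tag::Verb {
            continue;
        }
        let lemma = lemmatize(&tokens[i]);
        match tags[i - 1] {
            // Third-person singular subjects take the "-s" form.
            Tag::ThirdSingular => out[i] = format!("{}s", lemma),
            // Other pronouns take the base form.
            Tag::OtherPronoun => out[i] = lemma.to_string(),
            _ => {}
        }
    }
    out.join(" ")
}

fn main() {
    println!("{}", correct("She walk to the lab"));
    println!("{}", correct("They runs home"));
}
```

Swapping the lexicon lookup for a trained tagger, or the suffix rule for a real lemmatizer, would keep the same pipeline structure while handling far more sentences.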

How to Share

Visit the official Challenge GitHub page
Share your entry on the Discord

Have a tool, paper, dataset, or idea you’d like featured? Have suggestions for the website? Want to submit your answer to the challenge for a chance to be featured in next month’s newsletter?
