Beating Siri at its Text Game

by Philip Hopkins, December 14th, 2024

Too Long; Didn't Read

Beating Apple and Siri at the game of transcribing and sending texts


I’ve had an iPhone for ten years, and I love it. Unlike some people, I really enjoy Siri and use it frequently. But after ten years, Siri still hasn't figured out that when it transcribes my texts, my wife's name is not Aaron, it's Erin. Now that Apple Intelligence has been released, these sorts of issues are rarer, but many of us are still using older iPhones. In this article I propose an easy fix for mis-transcriptions that could be rolled out to older phones.

Speech-to-text systems often struggle with homophones: words that sound the same but have different spellings and meanings. These errors can be frustrating, especially when they affect personal names or commonly used terms. The key to fixing this problem lies not in overhauling the speech recognition engine but in a lightweight, post-transcription text processing layer that adapts to user corrections over time. Here’s the Python code I designed to address this, using the Hugging Face transformers library (which typically runs on PyTorch) for context analysis.

This isn’t a manual patch for a problem that should be solved solely with transformer-scale learning. It creates a feature that could be used as an input to any model, including LLMs. It also doesn’t have to wait for a new phone release: it could ship in the next update Apple releases for my iPhone and make life better for me right away.

The Core Idea

This approach focuses on three main elements:

  • Correction History: Stores previous user corrections, prioritizing words the user has explicitly fixed before.
  • Frequent Contacts: Tracks frequently used words or names, assigning a higher likelihood to those more commonly used.
  • Contextual Analysis: Uses Natural Language Processing (NLP) to analyze the surrounding text for clues that help disambiguate homophones.

The system calculates a likelihood score for each homophone candidate based on these three factors and selects the most likely correction. Below is the Python implementation broken into sections with explanations.

Loading the Homophones Database

The first step is loading a database of homophones, and updating it smartly. These are word pairs (or groups) that are likely to be confused during transcription.


# Homophones database
homophones_db = {
    "Aaron": ["Erin"],
    "bare": ["bear"],
    "phase": ["faze"],
    "affect": ["effect"],
}

This is a simple dictionary where the key is the incorrectly transcribed word, and the value is a list of homophone alternatives. For example, "phase" can be confused with "faze". Later, this database will be queried when an ambiguous word is encountered.
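
The dictionary above is one-directional and case-sensitive: "bare" points to "bear", but "bear" has no entry of its own. As a minimal sketch, assuming we want lookups to work from either spelling, the helper below (build_symmetric_db is a hypothetical name, not part of the original code) expands each entry into a lowercase group where every member points to the others.

```python
from collections import defaultdict

# Same one-directional entries as the article's homophones_db.
homophones_db = {
    "Aaron": ["Erin"],
    "bare": ["bear"],
    "phase": ["faze"],
    "affect": ["effect"],
}

def build_symmetric_db(db):
    """Return a lowercase map where each word points to the other members of its group."""
    symmetric = defaultdict(set)
    for word, alternatives in db.items():
        group = {word.lower(), *(alt.lower() for alt in alternatives)}
        for member in group:
            symmetric[member] |= group - {member}
    # Sorted lists keep the output stable and easy to read.
    return {word: sorted(alts) for word, alts in symmetric.items()}

symmetric_db = build_symmetric_db(homophones_db)
print(symmetric_db["erin"])  # ['aaron']
print(symmetric_db["bear"])  # ['bare']
```

Whether to lowercase everything is a design choice; names would need their capitalization restored before being written back into a message.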

Tracking Correction History

The code tracks user corrections in a dictionary where each key is a tuple of (original_word, corrected_word) and the value is the count of times the user corrected that error.

# Correction history tracker
correction_history = {
    ("phase", "Faye's"): 3,
    ("bear", "bare"): 2,
}


If the user corrects "phase" to "Faye’s" three times, the system prioritizes this correction for future transcriptions.
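
To make the idea of prioritizing a past correction concrete, here is a small sketch that looks up the fix a user has applied most often for a given word. The function name most_common_correction is an assumption introduced for illustration; the dictionary has the same shape as correction_history above.

```python
# Same shape as the article's correction_history: (original, corrected) -> count.
correction_history = {
    ("phase", "Faye's"): 3,
    ("bear", "bare"): 2,
}

def most_common_correction(word, history):
    """Return (corrected_word, count) for the most frequent fix of `word`, or None."""
    matches = {
        corrected: count
        for (original, corrected), count in history.items()
        if original == word
    }
    if not matches:
        return None
    best = max(matches, key=matches.get)
    return best, matches[best]

print(most_common_correction("phase", correction_history))  # ("Faye's", 3)
print(most_common_correction("one", correction_history))    # None
```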

Frequent Contacts

Another factor influencing homophone selection is how often a particular word is used. This could be personal names or terms the user frequently types.

# Frequent contact tracker
frequent_contacts = {
    "faye": 15,
    "phase": 5,
    "erin": 10,
    "aaron": 2,
}

The system gives more weight to frequently used words when disambiguating homophones. For instance, if "faye" appears 15 times but "phase" appears only 5 times, "faye" will be preferred.
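
One detail worth noting: the keys in frequent_contacts are lowercase, while the homophones database uses capitalized names such as "Erin". A minimal sketch of a case-insensitive comparison follows; usage_count and more_frequent are hypothetical helpers, not part of the original code.

```python
frequent_contacts = {"faye": 15, "phase": 5, "erin": 10, "aaron": 2}

def usage_count(word, contacts):
    """Look up how often a word was used, ignoring capitalization."""
    return contacts.get(word.lower(), 0)

def more_frequent(word_a, word_b, contacts):
    """Return whichever word the user has used more often (ties keep word_a)."""
    if usage_count(word_b, contacts) > usage_count(word_a, contacts):
        return word_b
    return word_a

print(more_frequent("Aaron", "Erin", frequent_contacts))  # Erin
print(more_frequent("phase", "faze", frequent_contacts))  # phase
```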

Contextual Analysis

Context clues are extracted from the surrounding sentence to further refine the selection. For example, if the sentence contains the pronoun "she", the system might favor "Erin" over "Aaron".

from transformers import pipeline

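# Load an NLP model for context analysis (requires the transformers package).
# Note: detect_context below relies on a simple pronoun list and does not call this model yet.
context_analyzer = pipeline("fill-mask", model="bert-base-uncased")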

def detect_context(sentence):
    """Detect context-specific clues in the sentence."""
    pronouns = ["he", "she", "his", "her", "their"]
    tokens = sentence.lower().split()
    return [word for word in tokens if word in pronouns]

This function scans the sentence for gender-specific pronouns or other clues that might indicate the intended meaning of the word.
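
Because detect_context splits on whitespace, a pronoun followed by punctuation (for example "she,") is not matched. The sketch below is one possible variant, assuming a regular-expression tokenizer is acceptable; detect_context_tokens is a hypothetical name chosen to avoid clashing with the original function.

```python
import re

PRONOUNS = {"he", "she", "his", "her", "their"}

def detect_context_tokens(sentence):
    """Return the pronouns found in the sentence, ignoring case and punctuation."""
    # Keep runs of letters and apostrophes; everything else acts as a separator.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [token for token in tokens if token in PRONOUNS]

print(detect_context_tokens("Tell Erin that she, not he, won."))  # ['she', 'he']
```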

Calculating Likelihood Scores

Each homophone candidate is assigned a likelihood score based on:

  1. Past Corrections: Higher weight (e.g., 3x).
  2. Frequent Usage: Medium weight (e.g., 2x).
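  3. Context Matching: Lower weight (e.g., 1x).

def calculate_likelihood(word, candidate, sentence):
    """Calculate a likelihood score for a homophone candidate."""
    # Past corrections carry the most weight (3x).
    correction_score = correction_history.get((word, candidate), 0) * 3
    # frequent_contacts keys are lowercase, so normalize the candidate's case (2x weight).
    frequency_score = frequent_contacts.get(candidate.lower(), 0) * 2
    # Context clues count once each (1x). Note: homophones_db stores alternative
    # spellings rather than pronouns, so with the data above this sum is always 0.
    context = detect_context(sentence)
    context_clues = homophones_db.get(candidate, [])
    context_score = sum(1 for clue in context if clue in context_clues)
    return correction_score + frequency_score + context_score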

This score combines the three factors to determine the most likely homophone.
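
As a worked example of the weighting, the sketch below scores "Erin" and "Aaron" for a sentence containing "she". The pronoun_hints mapping is an assumption introduced here so that the context term has something to match against; it does not appear in the original code.

```python
correction_history = {("phase", "Faye's"): 3, ("bear", "bare"): 2}
frequent_contacts = {"faye": 15, "phase": 5, "erin": 10, "aaron": 2}
# Hypothetical mapping (not in the article): pronouns that hint at each name.
pronoun_hints = {"erin": {"she", "her"}, "aaron": {"he", "his"}}

def score(word, candidate, pronouns_in_sentence):
    """Combine corrections (3x), usage (2x) and context matches (1x)."""
    correction = correction_history.get((word, candidate), 0) * 3
    frequency = frequent_contacts.get(candidate.lower(), 0) * 2
    hints = pronoun_hints.get(candidate.lower(), set())
    context = sum(1 for pronoun in pronouns_in_sentence if pronoun in hints)
    return correction + frequency + context

print(score("Aaron", "Erin", ["she"]))   # 0 + 20 + 1 = 21
print(score("Aaron", "Aaron", ["she"]))  # 0 + 4 + 0 = 4
```

With these numbers the frequency term dominates, which suggests the weights may need tuning once real usage counts grow large.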

Disambiguating Homophones

With the likelihood scores calculated, the system selects the homophone with the highest score.

def prioritize_homophones(word, candidates, sentence):
    """Prioritize homophones based on their likelihood scores."""
    likelihoods = {
        candidate: calculate_likelihood(word, candidate, sentence) for candidate in candidates
    }
    return max(likelihoods, key=likelihoods.get)

def disambiguate_homophone(word, sentence):
    """Disambiguate homophones using likelihood scores."""
    candidates = homophones_db.get(word, [])
    if not candidates:
        return word
    return prioritize_homophones(word, candidates, sentence)


This process ensures the most appropriate word is chosen based on history, frequency, and context.
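
As written, candidates come only from homophones_db, so the transcribed word itself and any past corrections are never compared. One possible extension, sketched under that assumption, builds a wider candidate list. The names gather_candidates, simple_score and disambiguate_with_history are hypothetical, and the context term is left out to keep the example short.

```python
homophones_db = {"Aaron": ["Erin"], "bare": ["bear"], "phase": ["faze"], "affect": ["effect"]}
correction_history = {("phase", "Faye's"): 3, ("bear", "bare"): 2}
frequent_contacts = {"faye": 15, "phase": 5, "erin": 10, "aaron": 2}

def gather_candidates(word):
    """Original word first, then listed homophones, then past corrections of this word."""
    candidates = [word]
    candidates += homophones_db.get(word, [])
    candidates += [fixed for (orig, fixed) in correction_history if orig == word]
    return list(dict.fromkeys(candidates))  # de-duplicate while keeping order

def simple_score(word, candidate):
    """Corrections weighted 3x, usage 2x; context is omitted in this sketch."""
    return (correction_history.get((word, candidate), 0) * 3
            + frequent_contacts.get(candidate.lower(), 0) * 2)

def disambiguate_with_history(word):
    """Pick the best-scoring candidate; ties keep the earliest candidate."""
    candidates = gather_candidates(word)
    return max(candidates, key=lambda c: simple_score(word, c))

print(disambiguate_with_history("bear"))   # bare
print(disambiguate_with_history("Aaron"))  # Erin
print(disambiguate_with_history("phase"))  # phase
```

In this variant "phase" is kept, because its usage score (5 x 2 = 10) edges out the correction score for "Faye's" (3 x 3 = 9). Keeping the original word as a candidate lets the system leave a correct transcription alone.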

Processing Full Transcriptions

The system processes an entire sentence, applying the disambiguation logic to each word.

def process_transcription(transcription):
    """Process the transcription to correct homophones."""
    words = transcription.split()
    corrected_words = [disambiguate_homophone(word, transcription) for word in words]
    return " ".join(corrected_words)
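
Splitting on whitespace leaves punctuation attached to words, so "Aaron," would not match the "Aaron" key. A minimal sketch of a punctuation-preserving alternative follows; process_with_punctuation and demo_disambiguate are hypothetical names, with the latter standing in for disambiguate_homophone so the example runs on its own.

```python
import re

def process_with_punctuation(transcription, disambiguate):
    """Apply `disambiguate` to each word while leaving punctuation and spacing untouched."""
    return re.sub(r"[A-Za-z']+", lambda m: disambiguate(m.group(0)), transcription)

# Hypothetical stand-in for the article's disambiguate_homophone, for demonstration only.
def demo_disambiguate(word):
    return {"Aaron": "Erin"}.get(word, word)

print(process_with_punctuation("Text Aaron, please.", demo_disambiguate))
# Text Erin, please.
```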


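Updating Feedback

When the user corrects a mistake, the correction history and frequent contacts are updated to improve future predictions. This function needs to be defined before the example workflow runs.

def update_correction_history(original, corrected):
    """Update correction history and frequent contacts."""
    # Count one more occurrence of this (original, corrected) pair.
    correction_history[(original, corrected)] = correction_history.get((original, corrected), 0) + 1
    # Reward the corrected word and penalize the rejected one, never dropping below zero.
    frequent_contacts[corrected] = frequent_contacts.get(corrected, 0) + 1
    frequent_contacts[original] = max(0, frequent_contacts.get(original, 0) - 1)

Full Example Workflow

# Example transcription and correction
raw_transcription = "This is phase one plan."
corrected_transcription = process_transcription(raw_transcription)

print("Original Transcription:", raw_transcription)
print("Corrected Transcription:", corrected_transcription)

# Simulate user feedback
update_correction_history("phase", "faye")
print("Updated Correction History:", correction_history)
print("Updated Frequent Contacts:", frequent_contacts)

Running the workflow with the dictionaries defined above prints:

Original Transcription: This is phase one plan.
Corrected Transcription: This is faze one plan.
Updated Correction History: {('phase', "Faye's"): 3, ('bear', 'bare'): 2, ('phase', 'faye'): 1}
Updated Frequent Contacts: {'faye': 16, 'phase': 4, 'erin': 10, 'aaron': 2}

The transcription comes back as "faze" because that is the only candidate homophones_db lists for "phase", and the original word is never scored against it. The simulated feedback adds a new ("phase", "faye") entry instead of incrementing ("phase", "Faye's"), since the two spellings are different keys.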
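
For corrections to carry over between sessions, the history has to be stored somewhere. JSON cannot use tuples as object keys, so the sketch below, which assumes JSON is the storage format, flattens each entry into a record. The names dump_history and load_history are hypothetical.

```python
import json

def dump_history(history):
    """Serialize {(original, corrected): count} as a JSON list of records."""
    records = [
        {"original": original, "corrected": corrected, "count": count}
        for (original, corrected), count in history.items()
    ]
    return json.dumps(records)

def load_history(text):
    """Rebuild the tuple-keyed dictionary from the JSON produced by dump_history."""
    return {
        (record["original"], record["corrected"]): record["count"]
        for record in json.loads(text)
    }

history = {("phase", "Faye's"): 3, ("bear", "bare"): 2}
restored = load_history(dump_history(history))
print(restored == history)  # True
```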

Conclusion

This lightweight text-processing layer enhances the accuracy of speech-to-text applications by learning from user corrections, leveraging frequent usage, and analyzing context. It’s compact enough to run on mobile devices and adaptable to individual user needs, offering a smarter alternative to traditional static models. With minimal effort, Apple—or any other company—could integrate this functionality to make virtual assistants like Siri more responsive and personalized.