Core Concepts in Information Retrieval

A guide to the foundational algorithms of search and text mining.

IR Cheat Sheet

Foundations

Heaps' Law: Predicts vocabulary size (M) from token count (T). Formula: M = kT^b.

Permuterm Index: For wildcard queries. Rotate term + `$`. (e.g., `hello` -> `hello$`, `$hello`, `o$hell`, etc.). Query `hel*o` becomes `o$hel*`.

Classification

Document Representation: Use Vector Space Model (documents as vectors). BoW for term counts, TF-IDF for weighted importance.

Feature Selection: Reduce dimensions. Remove rare/common words; use Mutual Information to find informative terms.

Rocchio Classification: Assigns document to class of nearest centroid (average vector).

Naïve Bayes: Probabilistic. Uses Bayes' theorem with assumption of feature independence.

Evaluating Classification: Use a Confusion Matrix. Precision = TP / (TP+FP), Recall = TP / (TP+FN), F1-Score = 2 * (Prec*Rec) / (Prec+Rec).

Clustering

Agglomerative Clustering: Bottom-up approach. Start with individual docs, merge closest clusters until one remains. Visualized with a dendrogram.

Web & Multimedia

PageRank: Ranks pages by importance based on incoming links. Models a "random surfer".

HITS: Identifies Hubs (good link lists) and Authorities (good content). Hubs point to authorities.

Collaborative Filtering: Recommends items based on user similarity (user-based) or item similarity (item-based).

Multimedia Search: Metadata-based (searches tags, descriptions) vs. Content-based (CBIR) (searches visual features like color, texture).

Question-Topic Mapping

This mapping connects the exam question distribution to the core topics covered in the modular content structure.

Q1: Module 4.1, 4.2

  • Text & Vector Space Classification

Q2: Module 4.1, 4.2, 5.2

  • Text & Vector Space Classification
  • Evaluation in ranked retrieval sets

Q3: Module 4.3

  • Text Clustering

Q4: Module 4.3, 6.2

  • Text Clustering
  • Web Crawlers and Indexes

Q5: Module 6.3, 6.1

  • Web Search Basics & Link Analysis

Q6: Module 7, 8, 9, 2

  • Basic Information Retrieval Concepts
  • Cross Lingual & Multimedia Retrieval
  • Recommender Systems

Foundations: Processing & Indexing Text

Heaps' Law: Predicting Vocabulary Size

Heaps' Law estimates the number of unique words (vocabulary size, M) in a document collection based on its total word count (T). It's an empirical law that describes the sublinear growth of vocabulary size with respect to the text length.

In the formula M = kT^b, k and b are constants specific to the text or corpus. Typically, k is between 10 and 100, and b is between 0.4 and 0.6. This law is crucial for estimating memory requirements when indexing large text collections.

Example Calculation

Suppose we have a document collection where:

  • When T = 1,000 tokens, M = 1,000 terms.
  • When T = 100,000 tokens, M = 10,000 terms.

Let's estimate the vocabulary size for a collection of 10^9 tokens.

Step 1: Set up equations using M = kT^b

Equation 1: 1,000 = k * (1,000)^b

Equation 2: 10,000 = k * (100,000)^b

Step 2: Solve for 'b'

Divide Equation 2 by Equation 1:

10,000 / 1,000 = (100,000 / 1,000)^b

10 = 100^b

Taking log base 10 on both sides:

log10(10) = b * log10(100)

1 = b * 2

So, b = 0.5

Step 3: Solve for 'k'

Substitute b = 0.5 into Equation 1:

1,000 = k * (1,000)^0.5

1,000 = k * 31.62

So, k ≈ 31.62

Step 4: Estimate vocabulary size for 109 tokens

Using M = kT^b with k = 31.62 and b = 0.5:

M = 31.62 * (10^9)^0.5

M = 31.62 * 10^4.5

M = 31.62 * 31,622.77 ≈ 1,000,000

The estimated vocabulary size for a 10^9 token collection is approximately 1 million terms.
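
As a quick sanity check, the fit above can be reproduced in a few lines of Python. This is a minimal sketch that assumes exactly the two (tokens, vocabulary) observations from the worked example.

```python
import math

# Heaps' law: M = k * T^b. Fit k and b from two observations, then extrapolate.
t1, m1 = 1_000, 1_000
t2, m2 = 100_000, 10_000

b = math.log10(m2 / m1) / math.log10(t2 / t1)   # 1 / 2 = 0.5
k = m1 / (t1 ** b)                              # 1,000 / 31.62 ≈ 31.62

estimate = k * (10 ** 9) ** b                   # ≈ 1,000,000 terms
print(f"b = {b:.2f}, k = {k:.2f}, M(10^9 tokens) ≈ {estimate:,.0f}")
```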

Text Normalization: Case Folding

A simple but crucial step where all text is converted to a single case (usually lowercase). This ensures that words like Apple and apple are treated as the same term, preventing vocabulary inflation and improving search recall.

Tolerant Retrieval: Permuterm Indexes

To support wildcard queries (e.g., el*pse), a permuterm index stores all possible rotations of a term. A special character ($) marks the end of the word.

For the term ellipse, we add a `$` and rotate:

ellipse$, llipse$e, lipse$el, ipse$ell, pse$elli, se$ellip, e$ellips, $ellipse

A query like el*pse is rewritten to pse$el*, which can be found efficiently in the sorted index.
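
A minimal Python sketch of the same idea follows. The helper names (`permuterm_rotations`, `wildcard_key`) are illustrative, and it assumes queries contain a single `*`.

```python
def permuterm_rotations(term: str) -> list[str]:
    """All rotations of term + '$', the end-of-word marker."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def wildcard_key(query: str) -> str:
    """Rewrite a single-'*' wildcard query so the '*' moves to the end.
    Example: 'el*pse' -> 'pse$el'."""
    prefix, suffix = query.split("*")
    return suffix + "$" + prefix

# Toy permuterm index: rotation -> original term.
index = {rotation: "ellipse" for rotation in permuterm_rotations("ellipse")}
key = wildcard_key("el*pse")                                    # 'pse$el'
matches = {term for rotation, term in index.items() if rotation.startswith(key)}
print(key, matches)                                             # pse$el {'ellipse'}
```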

Text Mining: Classification & Clustering

Document Representation and Feature Selection

Before classification, documents must be converted into a numerical format that algorithms can process. This is known as document representation. A crucial part of this is feature selection, where we choose the most informative terms to include in our representation.

Document Representation: Vector Space Model

The most common representation is the Vector Space Model (VSM). In VSM, documents and queries are represented as vectors of term weights in a high-dimensional space, where each dimension corresponds to a unique term in the vocabulary of the entire document collection.

The magnitude of a term's weight in a document vector indicates the importance of the term in representing the document's content. The geometric orientation of vectors in the space reveals semantic relationships between documents and queries.

Term Weighting: TF-IDF

A popular and effective method for weighting terms in the Vector Space Model is TF-IDF (Term Frequency-Inverse Document Frequency). This scheme assigns a weight to each term in a document based on its frequency in that document and its rarity across the entire collection.

The TF-IDF weight for a term t in a document d is the product of its Term Frequency (TF) and its Inverse Document Frequency (IDF): W(t, d) = TF(t, d) × IDF(t).

Term Frequency (TF)

Term Frequency measures how often a term appears in a document. To prevent a bias towards longer documents, the raw count is often logarithmically scaled: TF(t, d) = 1 + log10(tf(t, d)), where tf(t, d) is the number of occurrences of t in d.

If a term does not appear in a document, its TF is 0.

Inverse Document Frequency (IDF)

Inverse Document Frequency measures how much information a term provides, i.e., whether it is common or rare across all documents. It is a logarithmically scaled inverse fraction of the documents that contain the term: IDF(t) = log10(N / df(t)).

Where N is the total number of documents in the collection, and df(t) is the number of documents containing the term t.

Algorithm: Creating a TF-IDF Vector Space Model
  1. Collect Documents: Gather the corpus of documents to be represented.
  2. Tokenize and Build Vocabulary: Tokenize the text of all documents and create a vocabulary of all unique terms.
  3. Calculate Term Frequencies (TF): For each term in the vocabulary, calculate its frequency in each document.
  4. Calculate Inverse Document Frequencies (IDF): For each term in the vocabulary, count the number of documents it appears in (df) and calculate its IDF.
  5. Compute TF-IDF Weights: For each document, compute the TF-IDF weight for every term in the vocabulary. This results in a vector for each document, where each component is the TF-IDF weight of a term.
  6. Normalize Vectors: Normalize the document vectors to unit length. This is important for cosine similarity calculations, as it removes the bias of document length.
Example: Calculating TF-IDF Weights

Consider a collection of 3 documents:

  • Doc 1: "the cat sat on the mat"
  • Doc 2: "the dog sat on the log"
  • Doc 3: "the cat chased the dog"

Let's calculate the TF-IDF weight for the term "cat" in Doc 1.

Step 1: Calculate TF for "cat" in Doc 1

The term "cat" appears once in Doc 1.

TF("cat", Doc 1) = 1 + log10(1) = 1

Step 2: Calculate IDF for "cat"

The term "cat" appears in 2 documents (Doc 1, Doc 3) out of 3 total documents.

IDF("cat") = log10(3 / 2) = 0.176

Step 3: Compute TF-IDF Weight

W("cat", Doc 1) = TF("cat", Doc 1) × IDF("cat") = 1 × 0.176 = 0.176

Feature Selection

In text classification, the number of features (terms) can be very large. Feature selection helps by removing non-informative terms, which reduces computational complexity and can improve accuracy. Common techniques include:

  • Frequency-Based Selection: Removing very rare terms (likely typos or noise) and very common terms (like stop words, which offer little discriminatory power).
  • Mutual Information: Selecting terms that have a high mutual information with the class labels, meaning they provide a lot of information about whether a document belongs to a certain class.

Evaluating Classification

To measure the performance of a classifier, we use several metrics derived from a confusion matrix. The confusion matrix shows the number of correct and incorrect predictions for each class.

Actual Class | Predicted: Positive | Predicted: Negative
Positive | True Positives (TP) | False Negatives (FN)
Negative | False Positives (FP) | True Negatives (TN)

Key Evaluation Metrics

  • Accuracy: The proportion of correct predictions. Accuracy = (TP + TN) / (TP + TN + FP + FN).
  • Precision: Of all the documents the classifier predicted as positive, what fraction were actually positive? Precision = TP / (TP + FP).
  • Recall (Sensitivity): Of all the documents that were actually positive, what fraction did the classifier correctly predict? Recall = TP / (TP + FN).
  • F1-Score: The harmonic mean of precision and recall, providing a single score that balances both. F1 = 2 * (Precision * Recall) / (Precision + Recall).
  • Specificity: Of all the documents that were actually negative, what fraction did the classifier correctly predict? Specificity = TN / (TN + FP).
  • False Positive Rate (FPR): Of all the documents that were actually negative, what fraction did the classifier incorrectly predict as positive? FPR = FP / (FP + TN).
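
For reference, all of these metrics follow directly from the four confusion-matrix counts. The sketch below is a minimal illustration; the example counts are borrowed from the unranked-retrieval example later in this guide.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard metrics derived from a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
    }

# TP=6, FP=2, FN=4, TN=88 reproduces precision 0.75, recall 0.60, F1 ≈ 0.67.
print(classification_metrics(tp=6, fp=2, fn=4, tn=88))
```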

Vector Space Classification: Rocchio Algorithm

This algorithm classifies documents by assuming they belong to the class with the nearest "centroid" or average document vector.

Algorithm Steps

  • Represent Documents: Convert all training documents and the query document into numerical vectors (e.g., normalized term frequency or tf-idf).
  • Compute Centroids: Calculate the average vector (centroid) for each class c as μ(c) = (1 / |Cc|) * Σ v(d), summed over all documents d in Cc, where Cc is the set of documents in class c and v(d) is the vector of document d.
  • Classify: Calculate the distance (e.g., Euclidean distance) between the query document's vector and each class centroid. Assign the document to the class with the smallest distance.

Example: Classifying "executive suite"

Given the following training documents and their assigned classes, use the Rocchio algorithm to classify the new query document: "executive suite".

  • Class 1 (World News): "China election", "Maldives executive", "Election executive"
  • Class 2 (Business): "Chief executive", "Maldives executive suite"

Step 1: Create Normalized Term Frequency Vectors

Vocabulary: (china, election, maldives, executive, suite, chief)

Doc | Class | Raw TF Vector | Normalized Vector
d1 | World | (1,1,0,0,0,0) | (0.71, 0.71, 0, 0, 0, 0)
d2 | World | (0,0,1,1,0,0) | (0, 0, 0.71, 0.71, 0, 0)
d3 | World | (0,1,0,1,0,0) | (0, 0.71, 0, 0.71, 0, 0)
d4 | Business | (0,0,0,1,0,1) | (0, 0, 0, 0.71, 0, 0.71)
d5 | Business | (0,0,1,1,1,0) | (0, 0, 0.58, 0.58, 0.58, 0)

Step 2: Compute Centroids

Centroid C1 (World) = (d1+d2+d3)/3 = <0.24, 0.47, 0.24, 0.47, 0, 0>

Centroid C2 (Business) = (d4+d5)/2 = <0, 0, 0.29, 0.64, 0.29, 0.35>

Step 3: Classify Query

The query "executive suite" (dq) has a normalized vector: <0, 0, 0, 0.71, 0.71, 0>

Euclidean Distance(C1, dq) ≈ 0.94

Euclidean Distance(C2, dq) ≈ 0.62

Conclusion: Since the distance to the Business centroid is smaller, the query is classified as Business.
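
The three steps translate directly into Python. This is a minimal sketch that hard-codes the five training vectors over the vocabulary (china, election, maldives, executive, suite, chief).

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def centroid(vectors):
    """Component-wise average of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Raw term-frequency vectors in vocabulary order (china, election, maldives, executive, suite, chief).
world = [normalize(v) for v in ([1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 1, 0, 1, 0, 0])]
business = [normalize(v) for v in ([0, 0, 0, 1, 0, 1], [0, 0, 1, 1, 1, 0])]
centroids = {"World News": centroid(world), "Business": centroid(business)}

query = normalize([0, 0, 0, 1, 1, 0])      # "executive suite"
predicted = min(centroids, key=lambda c: euclidean(centroids[c], query))
print(predicted)                           # Business
```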

Supervised Learning: Naïve Bayes Classifier

A probabilistic classifier that uses Bayes' theorem, with the "naïve" assumption that all features (words) are independent of each other given the class.

Algorithm Steps

  • Calculate Prior Probabilities: For each class, find the probability P(c) = (docs in class c) / (total docs).
  • Calculate Conditional Probabilities: For each term `t` and class `c`, find the probability P(t|c) = (count of term t in class c) / (total terms in class c). Use smoothing (e.g., add-one) to avoid zero probabilities.
  • Classify: For a new document, calculate a score for each class by multiplying the class's prior probability by the conditional probabilities of each term in the document. Assign the document to the class with the highest score.

Example

Using the same corpus, we estimate the model parameters.

Prior Probabilities:

P(World News) = 3 documents / 5 total = 0.6

P(Business) = 2 documents / 5 total = 0.4

Conditional Probabilities (sample):

Total terms in World News: 6. Total terms in Business: 5.

P(election | World News) = 2 / 6 = 0.33

P(executive | Business) = 2 / 5 = 0.4

To Classify: A new document's words are used with these probabilities to find the most likely class.
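
A minimal multinomial Naïve Bayes sketch for the same corpus is shown below. Unlike the hand calculation above, it applies add-one smoothing (as the algorithm steps recommend) and works in log space to avoid underflow; it still classifies "executive suite" as Business.

```python
import math
from collections import Counter

training = [
    ("china election", "World News"),
    ("maldives executive", "World News"),
    ("election executive", "World News"),
    ("chief executive", "Business"),
    ("maldives executive suite", "Business"),
]

doc_counts = Counter(label for _, label in training)
term_counts = {label: Counter() for label in doc_counts}
for text, label in training:
    term_counts[label].update(text.split())
vocabulary = {term for counts in term_counts.values() for term in counts}

def log_score(text: str, label: str) -> float:
    """log P(c) + sum of log P(t|c), with add-one (Laplace) smoothing."""
    score = math.log(doc_counts[label] / len(training))
    total_terms = sum(term_counts[label].values())
    for term in text.split():
        score += math.log((term_counts[label][term] + 1) / (total_terms + len(vocabulary)))
    return score

query = "executive suite"
print(max(doc_counts, key=lambda label: log_score(query, label)))   # Business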

Unsupervised Learning: Text Clustering

Text clustering is an unsupervised machine learning task that automatically groups similar documents together into clusters without any prior knowledge of their categories. The goal is to discover inherent structures in the data, where documents within a cluster are more similar to each other than to those in other clusters. This is particularly useful for organizing large document collections, topic discovery, and improving search results.

Key Steps in Text Clustering

  1. Document Representation: Convert text documents into numerical vectors (e.g., TF-IDF, Word Embeddings).
  2. Similarity Measurement: Define a metric to quantify the similarity or distance between document vectors (e.g., Cosine Similarity, Euclidean Distance).
  3. Clustering Algorithm: Apply an algorithm to group documents based on their similarity.
  4. Evaluation: Assess the quality of the generated clusters (e.g., using internal or external metrics).

Agglomerative (Hierarchical) Clustering

Agglomerative clustering is a "bottom-up" hierarchical method. It starts by treating each document as a single cluster and then iteratively merges the closest pairs of clusters until all documents are part of a single, large cluster, or a stopping criterion is met. The results are often visualized as a dendrogram.

Linkage Criteria

The choice of how to measure the "distance" or "similarity" between two clusters is crucial and defined by the linkage criterion:

  • Single Linkage: The distance between two clusters is the minimum distance between any single data point in the first cluster and any single data point in the second cluster. Tends to produce long, "chain-like" clusters.
  • Complete Linkage: The distance between two clusters is the maximum distance between any single data point in the first cluster and any single data point in the second cluster. Tends to produce more compact, spherical clusters.
  • Average Linkage: The distance between two clusters is the average distance between all data points in the first cluster and all data points in the second cluster. Offers a compromise between single and complete linkage.

Dendrogram Visualization

(Dendrogram figure over documents d1, d2, d4, d5, d3.)

A dendrogram visually represents the hierarchy of clusters. The height at which two clusters are merged indicates their dissimilarity. Cutting the dendrogram at a certain height yields a partitioning of the data into a specific number of clusters.
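
A minimal single-linkage sketch follows. The pairwise similarities are made-up toy values (matching the agglomerative-clustering question at the end of this guide), and the loop simply stops once the requested number of clusters remains.

```python
from itertools import combinations

def single_link_clustering(docs, sim, k):
    """Bottom-up merging with single linkage: the similarity of two clusters is
    the maximum similarity over any cross-cluster pair of documents."""
    clusters = [frozenset({d}) for d in docs]
    while len(clusters) > k:
        c1, c2 = max(
            combinations(clusters, 2),
            key=lambda pair: max(sim(a, b) for a in pair[0] for b in pair[1]),
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters

# Toy pairwise similarities; unlisted pairs default to 0.05.
pairwise = {frozenset({"D1", "D2"}): 0.9, frozenset({"D4", "D5"}): 0.8,
            frozenset({"D2", "D3"}): 0.7, frozenset({"D1", "D3"}): 0.1}
sim = lambda a, b: pairwise.get(frozenset({a, b}), 0.05)

print(single_link_clustering(["D1", "D2", "D3", "D4", "D5"], sim, k=2))
# Merges D1+D2 (0.9), then D4+D5 (0.8), then {D1,D2}+D3 (single-link similarity 0.7).
```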

K-Means Clustering

K-Means is a partition-based clustering algorithm that aims to partition n documents into k clusters, where each document belongs to the cluster with the nearest mean (centroid). It requires the number of clusters (k) to be specified beforehand.

Algorithm Steps
  1. Initialization: Randomly select k documents from the dataset as initial cluster centroids.
  2. Assignment Step: Assign each document to the closest centroid based on a distance metric (e.g., Euclidean distance, Cosine distance).
  3. Update Step: Recalculate the centroids of the clusters by taking the mean of all documents assigned to each cluster.
  4. Iteration: Repeat the assignment and update steps until the cluster assignments no longer change or a maximum number of iterations is reached.
Example: K-Means for Text

Imagine we have documents about "sports" and "politics". We want to cluster them into k=2 groups.

Initial State:

Documents are represented as TF-IDF vectors. Two random documents are chosen as initial centroids.

Iteration 1:

  • Each document is assigned to the closest centroid. Documents about "football" and "basketball" might gravitate towards one centroid, while "election" and "government" documents gravitate towards another.
  • New centroids are calculated as the average of the documents in each cluster. The "sports" centroid moves closer to the center of sports documents, and the "politics" centroid moves closer to politics documents.

Subsequent Iterations:

Documents might shift clusters if a new centroid becomes closer. The centroids continue to move until they stabilize, meaning documents no longer change their cluster assignments. The final clusters will ideally separate sports-related documents from politics-related documents.
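
The four algorithm steps can be sketched in plain Python. The 2-D vectors below are stand-ins for TF-IDF vectors of "sports" and "politics" documents, and the fixed iteration count is a simplification (a production implementation would test for convergence).

```python
import random

def k_means(vectors, k, iterations=20, seed=0):
    """Minimal k-means with Euclidean distance on dense vectors."""
    random.seed(seed)
    centroids = [list(v) for v in random.sample(vectors, k)]   # 1. initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # 2. Assignment step: attach each vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])))
            clusters[nearest].append(v)
        # 3. Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(column) / len(members) for column in zip(*members)]
    return centroids, clusters

docs = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),   # "sports"-like vectors
        (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]   # "politics"-like vectors
centroids, clusters = k_means(docs, k=2)
print(centroids)
```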

Evaluating Text Clustering

Evaluating the quality of clustering is challenging because there are no ground truth labels (as it's unsupervised). However, several metrics can be used:

  • Purity: Measures the extent to which a cluster contains documents from a single class (if ground truth labels are available for evaluation purposes).
  • Silhouette Score: Measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). A higher silhouette score indicates better-defined clusters.
  • Davies-Bouldin Index: Measures the ratio of within-cluster scatter to between-cluster separation. Lower values indicate better clustering.

Evaluation in Information Retrieval

Evaluating the effectiveness of an Information Retrieval (IR) system is crucial to determine how well it satisfies user information needs. Evaluation metrics quantify the quality of search results by comparing them against a known ground truth (relevance judgments).

Evaluation in Unranked Retrieval Sets

For systems that return a set of documents without a specific order (e.g., Boolean retrieval), common metrics focus on the proportion of relevant documents retrieved.

Confusion Matrix for IR

Similar to classification, IR evaluation can be understood through a confusion matrix, where "Positive" refers to a relevant document and "Negative" to a non-relevant document.

Actual Relevance | Retrieved | Not Retrieved
Relevant | True Positives (TP): relevant documents retrieved | False Negatives (FN): relevant documents not retrieved (missed)
Non-Relevant | False Positives (FP): non-relevant documents retrieved (noise) | True Negatives (TN): non-relevant documents not retrieved
Key Metrics for Unranked Sets
  • Precision (P): The fraction of retrieved documents that are relevant: P = TP / (TP + FP). It answers: "Of the documents the system returned, how many were actually good?"
  • Recall (R): The fraction of relevant documents in the collection that were retrieved by the system: R = TP / (TP + FN). It answers: "Of all the good documents out there, how many did the system find?"
  • F-measure (F1-Score): The harmonic mean of Precision and Recall: F1 = 2PR / (P + R). Useful when you need to consider both precision and recall.
  • Accuracy: The proportion of correct decisions (relevant retrieved + non-relevant not retrieved) out of all documents: (TP + TN) / (TP + FP + FN + TN). Less commonly used in IR because it can be misleading when the number of non-relevant documents is very high.
  • Fall-out: The proportion of non-relevant documents that are retrieved: FP / (FP + TN). Also known as the False Positive Rate.
Example: Unranked Retrieval

Consider a query for which there are 10 relevant documents in a collection of 100 documents. An IR system returns 8 documents, 6 of which are relevant.

TP (True Positives): 6 (relevant documents retrieved)

FP (False Positives): 2 (non-relevant documents retrieved)

FN (False Negatives): 4 (relevant documents not retrieved, 10 total relevant - 6 retrieved relevant)

TN (True Negatives): 88 (non-relevant documents not retrieved, 100 total - 10 total relevant - 2 FP)

Precision: 6 / (6 + 2) = 6 / 8 = 0.75

Recall: 6 / (6 + 4) = 6 / 10 = 0.60

F1-Score: 2 * (0.75 * 0.60) / (0.75 + 0.60) = 2 * 0.45 / 1.35 = 0.90 / 1.35 ≈ 0.67

Evaluation in Ranked Retrieval Sets

For systems that return documents in a ranked list (e.g., web search engines), the position of relevant documents in the ranking is important. Metrics for ranked retrieval consider both relevance and rank.

Key Metrics for Ranked Sets
  • Precision at K (P@K): The precision calculated only on the top K documents retrieved. Useful for evaluating systems where users typically only look at the first few results.
  • Mean Average Precision (MAP): A single-figure measure of quality across recall levels. It is the mean of the Average Precision (AP) scores for each query, where AP is the average of the precision values at each point where a relevant document is retrieved: AP = (1 / R) * Σ (P(k) × rel(k)), summed over the ranks k of the result list. MAP is a very common and effective metric for ranked retrieval.

    Where R is the number of relevant documents for the query, P(k) is precision at cut-off k, and rel(k) is 1 if the k-th document is relevant, 0 otherwise.

  • Discounted Cumulative Gain (DCG) and Normalized DCG (NDCG): These metrics are used when documents have graded relevance (e.g., highly relevant, moderately relevant, non-relevant). DCG accumulates the relevance scores of retrieved documents, with a penalty for lower ranks; a common form is DCG_p = Σ (rel_i / log2(i + 1)) for i = 1 to p. NDCG normalizes DCG against a perfect ranking, allowing comparison across queries: NDCG_p = DCG_p / IDCG_p.

    Where rel_i is the relevance score of the document at rank i.

    Where IDCG_p is the Ideal DCG, i.e., the DCG of the same documents re-sorted into a perfect ranking.

Example: Ranked Retrieval (P@K)

Consider a query with 5 relevant documents in the collection. A system returns the following ranked list (R = Relevant, N = Non-Relevant):

  • Rank 1: R
  • Rank 2: N
  • Rank 3: R
  • Rank 4: R
  • Rank 5: N

P@1: 1/1 = 1.0

P@2: 1/2 = 0.5

P@3: 2/3 ≈ 0.67

P@4: 3/4 = 0.75

P@5: 3/5 = 0.60
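
These ranked-list metrics can be sketched as small helper functions. The binary relevance list reproduces the P@K example above, the AP helper assumes 5 relevant documents exist for the query, and the DCG form used (a log2(rank + 1) discount) is one common variant.

```python
import math

def precision_at_k(relevance, k):
    """relevance is a list of 0/1 flags in rank order; P@K = relevant in top K / K."""
    return sum(relevance[:k]) / k

def average_precision(relevance, total_relevant):
    """Average of P@k over the ranks k where a relevant document was retrieved."""
    precisions = [precision_at_k(relevance, k)
                  for k, rel in enumerate(relevance, start=1) if rel]
    return sum(precisions) / total_relevant

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    return dcg(gains) / dcg(sorted(gains, reverse=True))

ranking = [1, 0, 1, 1, 0]   # R, N, R, R, N from the example above
print([round(precision_at_k(ranking, k), 2) for k in range(1, 6)])   # [1.0, 0.5, 0.67, 0.75, 0.6]
print(round(average_precision(ranking, total_relevant=5), 2))        # (1.0 + 0.67 + 0.75) / 5 ≈ 0.48
print(round(ndcg([3, 0, 2, 1, 0]), 2))                               # graded-relevance example
```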

Web & Multimedia Retrieval

Web Search: Link Analysis

These algorithms rank pages based on the web's link structure.

PageRank

PageRank is a link analysis algorithm developed by Google to measure the importance or authority of web pages. It operates on the principle that more important pages are likely to receive more links from other important pages. It models the behavior of a "random surfer" who randomly clicks on links.

PageRank Formula & Iterative Process

The PageRank of a page A (PR(A)) is calculated iteratively using PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where:

  • d (damping factor): A probability (typically 0.85) that the random surfer will continue clicking links. (1-d) is the probability of "teleporting" to a random page.
  • Ti: Any page that links to page A.
  • PR(Ti): The current PageRank of page Ti.
  • C(Ti): The number of outgoing links from page Ti.

The algorithm starts with an initial PageRank (e.g., 1/N for N pages) for all pages and iteratively updates scores until they converge.

Example Walkthrough

Consider a simple web graph with pages A, B, C, D and links: A→B, A→C, B→C, C→A, D→C. Let d = 0.85.

Initial Scores (Iteration 0):

PR(A)=0.25, PR(B)=0.25, PR(C)=0.25, PR(D)=0.25 (since N=4 pages)

Iteration 1:

  • PR(A) = (1-0.85) + 0.85 * (PR(C)/C(C)) = 0.15 + 0.85 * (0.25/1) = 0.15 + 0.2125 = 0.3625
  • PR(B) = (1-0.85) + 0.85 * (PR(A)/C(A)) = 0.15 + 0.85 * (0.25/2) = 0.15 + 0.10625 = 0.25625
  • PR(C) = (1-0.85) + 0.85 * (PR(A)/C(A) + PR(B)/C(B) + PR(D)/C(D)) = 0.15 + 0.85 * (0.25/2 + 0.25/1 + 0.25/1) = 0.15 + 0.85 * (0.125 + 0.25 + 0.25) = 0.15 + 0.85 * 0.625 = 0.15 + 0.53125 = 0.68125
  • PR(D) = (1-0.85) + 0.85 * (0) = 0.15

Iteration 2 (using PageRank scores from Iteration 1):

  • PR(A) = 0.15 + 0.85 * (PR(C)/C(C)) = 0.15 + 0.85 * (0.68125/1) = 0.15 + 0.5790625 = 0.7290625
  • PR(B) = 0.15 + 0.85 * (PR(A)/C(A)) = 0.15 + 0.85 * (0.3625/2) = 0.15 + 0.85 * 0.18125 = 0.15 + 0.1540625 = 0.3040625
  • PR(C) = 0.15 + 0.85 * (PR(A)/C(A) + PR(B)/C(B) + PR(D)/C(D)) = 0.15 + 0.85 * (0.3625/2 + 0.25625/1 + 0.15/1) = 0.15 + 0.85 * (0.18125 + 0.25625 + 0.15) = 0.15 + 0.85 * 0.5875 = 0.15 + 0.499375 = 0.649375
  • PR(D) = 0.15 + 0.85 * (0) = 0.15

With further iterations, the PageRank scores would converge, indicating the relative importance of each page in the graph.
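
The iterative update is compact to code. The sketch below uses the same non-normalized formula and damping factor as the walkthrough, updates all scores simultaneously from the previous iteration's values, and simply runs a fixed number of iterations rather than testing for convergence.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Uses PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    pr = {p: 1 / len(pages) for p in pages}               # initial scores (1/N)
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(pr[src] / len(targets)
                                    for src, targets in links.items() if page in targets)
            for page in pages
        }
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print({page: round(score, 3) for page, score in pagerank(graph).items()})
# C and A converge to the highest scores; D stays at 0.15 (no incoming links).
```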

HITS Algorithm (Hyperlink-Induced Topic Search)

HITS is a link analysis algorithm that identifies two types of pages: Hubs and Authorities. It operates on the principle that good hubs point to many good authorities, and good authorities are pointed to by many good hubs. This creates a mutually reinforcing relationship.



  • Authorities: Pages with high-quality, relevant content on a specific topic. They are pointed to by many good hubs.
  • Hubs: Pages that serve as good directories or resource lists, pointing to many authoritative pages.
Iterative Update Process

The algorithm iteratively updates Authority (a) and Hub (h) scores for each page until convergence:

  • Initialization: All pages start with an initial Authority score of 1 and a Hub score of 1.
  • Authority Update: A page's Authority score is the sum of the Hub scores of all pages that link to it.
  • Hub Update: A page's Hub score is the sum of the Authority scores of all pages it links to.
  • Normalization: After each iteration, the scores are normalized (e.g., by dividing by the sum of squares) to prevent them from growing indefinitely.
Example Walkthrough

Consider a small web graph with pages A, B, C, D, E and links: A→B, A→C, B→C, C→D, D→E, E→A.

Initial Scores (Iteration 0):

a(A)=1, a(B)=1, a(C)=1, a(D)=1, a(E)=1

h(A)=1, h(B)=1, h(C)=1, h(D)=1, h(E)=1

Iteration 1:

Authority Update:

  • a(A) = h(E) = 1
  • a(B) = h(A) = 1
  • a(C) = h(A) + h(B) = 1 + 1 = 2
  • a(D) = h(C) = 1
  • a(E) = h(D) = 1

Hub Update:

  • h(A) = a(B) + a(C) = 1 + 2 = 3
  • h(B) = a(C) = 2
  • h(C) = a(D) = 1
  • h(D) = a(E) = 1
  • h(E) = a(A) = 1

Normalization (example: sum of squares):

Norm Factor for Authorities = √(1² + 1² + 2² + 1² + 1²) = √8 ≈ 2.83

Normalized a(A)=0.35, a(B)=0.35, a(C)=0.71, a(D)=0.35, a(E)=0.35

Norm Factor for Hubs = √(3² + 2² + 1² + 1² + 1²) = √16 = 4

Normalized h(A)=0.75, h(B)=0.5, h(C)=0.25, h(D)=0.25, h(E)=0.25

Iteration 2 (using Normalized scores from Iteration 1):

Authority Update:

  • a(A) = h(E) = 0.25
  • a(B) = h(A) = 0.75
  • a(C) = h(A) + h(B) = 0.75 + 0.5 = 1.25
  • a(D) = h(C) = 0.25
  • a(E) = h(D) = 0.25

Hub Update:

  • h(A) = a(B) + a(C) = 0.75 + 1.25 = 2
  • h(B) = a(C) = 1.25
  • h(C) = a(D) = 0.25
  • h(D) = a(E) = 0.25
  • h(E) = a(A) = 0.25

Normalization (example: sum of squares):

Norm Factor for Authorities = √(0.25² + 0.75² + 1.25² + 0.25² + 0.25²) = √(0.0625 + 0.5625 + 1.5625 + 0.0625 + 0.0625) = √2.3125 ≈ 1.52

Normalized a(A)=0.16, a(B)=0.49, a(C)=0.82, a(D)=0.16, a(E)=0.16

Norm Factor for Hubs = √(2² + 1.25² + 0.25² + 0.25² + 0.25²) = √(4 + 1.5625 + 0.0625 + 0.0625 + 0.0625) = √5.75 ≈ 2.39

Normalized h(A)=0.84, h(B)=0.52, h(C)=0.10, h(D)=0.10, h(E)=0.10

The scores continue to converge with further iterations, highlighting C as a strong authority and A as a strong hub.
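
A minimal sketch of the iterations follows, using the same update order as the walkthrough: authorities from the previous hubs, hubs from the freshly updated authorities, then both normalized by the square root of the sum of squares.

```python
import math

def hits(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking in.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        # Hub update: sum of (freshly updated) authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors by the root of the sum of squares.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values()))
            for p in scores:
                scores[p] /= norm
    return auth, hub

graph = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": ["E"], "E": ["A"]}
auth, hub = hits(graph)
print({p: round(s, 2) for p, s in auth.items()})   # C converges to the strongest authority
print({p: round(s, 2) for p, s in hub.items()})    # A converges to the strongest hub
```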

Recommender Systems: Collaborative Filtering

Collaborative Filtering (CF) is a powerful technique that makes recommendations by leveraging the preferences or behaviors of a community of users. The core idea is that if users had similar tastes in the past, they will likely have similar tastes in the future. CF systems find patterns in user-item interactions (e.g., ratings, purchases, views) to predict what a user might like.

Types of Collaborative Filtering

  • User-Based Collaborative Filtering: Identifies users who are similar to the active user (based on their past ratings/interactions). It then recommends items that these "similar users" liked but the active user hasn't yet experienced.

    Pros: Can recommend novel items; handles new items well. Cons: Scales poorly with many users; "cold start" for new users.

  • Item-Based Collaborative Filtering: Identifies items that are similar to items the active user has already liked. Similarity between items is calculated based on how users have rated them.

    Pros: Scales better with many users; item similarities are relatively stable and can be precomputed. Cons: Less diverse recommendations; struggles with new items ("cold start" for items).

Similarity Calculation (Cosine Similarity)

A common metric to quantify similarity between users or items is Cosine Similarity, which measures the cosine of the angle between two vectors in a multi-dimensional space. A value closer to 1 indicates higher similarity.

Example: Item-Based Recommendation

Consider the following user-item rating matrix (1-5 scale, 0 for unrated):

User | Movie A | Movie B | Movie C | Movie D
Alice | 5 | 0 | 4 | 0
Bob | 0 | 4 | 0 | 5
Charlie | 4 | 0 | 5 | 0

Let's recommend a movie for Bob, who has not rated Movie A or Movie C.

Step 1: Calculate Item Similarities (e.g., between Movie A and Movie C)

Ratings for Movie A: Alice(5), Bob(0), Charlie(4)

Ratings for Movie C: Alice(4), Bob(0), Charlie(5)

Consider only the users who rated both movies (Alice, Charlie):

Vector A_co = [5, 4]

Vector C_co = [4, 5]

sim(Movie A, Movie C) = (5*4 + 4*5) / (√(5²+4²) * √(4²+5²))

= (20 + 20) / (√41 * √41) = 40 / 41 ≈ 0.976

Step 2: Generate Recommendation for Bob

Bob has rated Movie B (4) and Movie D (5). We want to predict Bob's rating for Movie A.

We need similarities between Movie A and movies Bob has rated (Movie B, Movie D).

Assume sim(Movie A, Movie B) = 0.2 and sim(Movie A, Movie D) = 0.1 (these would be calculated similarly to above).

Predicted Rating for Movie A (for Bob) = (Rating(Bob,B)*sim(A,B) + Rating(Bob,D)*sim(A,D)) / (sim(A,B) + sim(A,D))

= (4*0.2 + 5*0.1) / (0.2 + 0.1) = (0.8 + 0.5) / 0.3 = 1.3 / 0.3 ≈ 4.33

Conclusion: Based on item-based collaborative filtering, Bob is predicted to rate Movie A around 4.33, suggesting it would be a good recommendation.
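
A small sketch of the similarity computation and the weighted-average prediction is shown below. Since the toy matrix has no users who co-rated Movie A with Movie B or Movie D, the prediction step reuses the assumed similarities from the example (sim(A,B) = 0.2, sim(A,D) = 0.1).

```python
import math

ratings = {  # user -> {movie: rating}; missing entries are unrated
    "Alice":   {"Movie A": 5, "Movie C": 4},
    "Bob":     {"Movie B": 4, "Movie D": 5},
    "Charlie": {"Movie A": 4, "Movie C": 5},
}

def item_similarity(item1, item2):
    """Cosine similarity over the users who rated both items."""
    common = [u for u in ratings if item1 in ratings[u] and item2 in ratings[u]]
    if not common:
        return 0.0
    v1 = [ratings[u][item1] for u in common]
    v2 = [ratings[u][item2] for u in common]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

print(round(item_similarity("Movie A", "Movie C"), 3))    # ≈ 0.976

# Predicted rating for Bob on Movie A, using the assumed similarities from the text.
assumed_sims = {"Movie B": 0.2, "Movie D": 0.1}
bob = ratings["Bob"]
prediction = sum(bob[m] * s for m, s in assumed_sims.items()) / sum(assumed_sims.values())
print(round(prediction, 2))                               # ≈ 4.33
```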

Basic Multimedia Search Technologies

Searching for multimedia content like images, videos, or audio files requires different strategies than searching for text. The two primary approaches are Metadata-Based Retrieval and Content-Based Retrieval, which address the challenge of understanding and matching non-textual data.

1. Metadata-Based Retrieval

This is the traditional approach, which relies on searching textual information associated with the multimedia file rather than the file itself. This metadata acts as a surrogate for the content, allowing standard text-based search techniques to be applied.

Types of metadata include:

  • Manual Annotations: User-generated tags, titles, captions, or descriptions (e.g., tagging a photo with "Paris," "Eiffel Tower," "vacation").
  • Implicit Data: Information from the context in which the media appears, such as text on the surrounding web page or comments from users.
  • File Properties: Automatically generated information like filename, date created, file format, camera settings (EXIF data for images), or artist and album info (ID3 tags for audio).

Strength:

Fast and can be highly precise if the metadata is accurate, comprehensive, and consistent. It allows for searching abstract concepts (e.g., "happiness") that are difficult to extract from content alone.

Weakness:

Heavily reliant on the quality and availability of metadata. Manual tagging is labor-intensive, expensive, subjective, and often incomplete. This can lead to poor recall if relevant items are not tagged with the query terms.

2. Content-Based Retrieval

This technique analyzes the actual content (pixels, frequencies, waveforms) of the multimedia file to find matches. It avoids the need for manual tagging by extracting intrinsic features directly from the data. The most common application is Content-Based Image Retrieval (CBIR).

The general workflow involves:

  • Feature Extraction: The system analyzes a file and extracts a set of numerical features, creating a "feature vector" that serves as a compact summary of its content.
  • Query by Example: A user provides a sample image, video, or audio clip. The system extracts its feature vector and searches a database for files with the most similar vectors.
  • Feature-based Search: Users can specify low-level features directly, such as "find images that are mostly blue" or "find songs with a tempo of 120 BPM."

Strength:

Does not depend on manual tagging, making it scalable and objective. It can discover similar items that might be described with different keywords and can be applied to any collection without prior annotation effort.

Weakness:

Can be computationally expensive. Its main challenge is the "semantic gap"—the difference between the low-level features the computer understands (e.g., color distribution, texture) and the high-level semantic meaning a human perceives (e.g., "a birthday party").

In-Depth: Content-Based Image Retrieval (CBIR)

CBIR allows searching for images using their visual content instead of relying on manual text annotations. The core idea is to translate the visual information of an image into a numerical feature vector and then perform searches over a database of these vectors.

The CBIR Workflow

  1. Feature Extraction: For every image in the database, the system automatically calculates and extracts key visual features. These features are stored as a vector in a separate database.
  2. Query Processing: When a user provides a query image, the system runs the same feature extraction process on it to create a query vector.
  3. Similarity Matching: The query vector is compared against all vectors in the database. A distance metric (like Euclidean distance) is used to calculate the similarity between the query and each database image.
  4. Result Ranking: The system returns the images with the smallest distance (i.e., highest similarity) to the query image, ranked in order of similarity.

Key Visual Features

The effectiveness of CBIR depends on the quality of its feature extraction. Common features include:

Color Histograms

Summarizes the distribution of colors in an image. It's robust to changes in orientation and small changes in viewpoint but ignores spatial layout.

Texture Patterns

Measures the visual patterns and their spatial arrangement, such as coarseness, directionality, and regularity. Useful for identifying materials like wood, fabric, or water.

Shapes & Edges

Identifies object boundaries, outlines, and shapes. This is more complex and often requires segmentation to isolate objects from the background first.
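
To make the color-histogram idea concrete, here is a minimal sketch over raw (r, g, b) pixel tuples. A real CBIR system would decode actual image files and typically combine several feature types, but the quantize-count-compare pattern is the same.

```python
import math
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Quantize RGB pixels (values 0-255) into coarse bins; return a normalized histogram."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {color_bin: count / total for color_bin, count in counts.items()}

def histogram_distance(h1, h2):
    """Euclidean distance between two histograms (smaller means more similar)."""
    bins = set(h1) | set(h2)
    return math.sqrt(sum((h1.get(b, 0) - h2.get(b, 0)) ** 2 for b in bins))

# Two toy "images": one mostly blue, one mostly green.
blue_image  = [(10, 20, 200)] * 90 + [(30, 200, 40)] * 10
green_image = [(30, 200, 40)] * 85 + [(10, 20, 200)] * 15

distance = histogram_distance(color_histogram(blue_image), color_histogram(green_image))
print(round(distance, 3))   # large distance: very different color distributions
```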

The Semantic Gap

A major challenge in CBIR is the "semantic gap"—the difference between the low-level visual features the computer extracts (color, texture) and the high-level semantic concepts a human understands (e.g., "a happy family," "a cityscape at dusk"). While an image of a sunset and a picture of an orange might have similar color histograms, they are semantically very different. Bridging this gap is a key area of research in multimedia retrieval.

Problem Solving: Test Your Knowledge

Foundations: Processing & Indexing

Q1: Using Heaps' Law

A document collection has 1,000 tokens and a vocabulary of 1,000 terms. When the collection grows to 100,000 tokens, the vocabulary is 10,000 terms. Use Heaps' law to estimate the vocabulary size for an entire 10^9 token collection.

Show Answer

1. Set up equations:
Using Heaps' Law, M = kT^b:
Eq 1: 1,000 = k * (1,000)^b
Eq 2: 10,000 = k * (100,000)^b

2. Solve for 'b':
Divide Eq 2 by Eq 1: 10 = (100,000 / 1,000)^b, so 10 = 100^b.
Taking log base 10 on both sides: log10(10) = b * log10(100), so 1 = b * 2 and b = 0.5.

3. Solve for 'k':
Substitute b = 0.5 into Eq 1: 1,000 = k * (1,000)^0.5, so 1,000 = k * 31.62 and k ≈ 31.62.

4. Estimate for the full collection:
M = 31.62 * (10^9)^0.5 = 31.62 * 10^4.5 ≈ 1,000,000.
The estimated vocabulary size is 1 million terms.

Q2: Permuterm Index

What is a permuterm index and what type of queries does it handle? Show the permuterm index for the term retrieval.

Show Answer

A permuterm index is a special type of index used for tolerant retrieval, specifically to handle wildcard queries (e.g., queries containing a `*` character, like `hel*o`).

To create the index, a special symbol (like `$`) is appended to the term to mark its end.

Then, generate all rotations:

retrieval$, etrieval$r, trieval$re, rieval$ret, ieval$retr, eval$retri, val$retrie, al$retriev, l$retrieva, $retrieval

Q3: Case Folding

What is case folding? With a suitable example, explain when and how case folding is done in information retrieval.

Show Answer

Case folding is a text normalization technique in information retrieval where all characters in a text are converted to a uniform case, typically lowercase. The primary goal is to treat words that differ only in capitalization as the same term, thereby improving recall and reducing the vocabulary size.

When is it done? Case folding is usually performed during the preprocessing phase of text indexing and querying. It's a fundamental step before tokenization or stemming, ensuring consistency in the representation of terms.

How is it done? It involves converting all uppercase letters to their lowercase equivalents. For most languages, this is a straightforward conversion (e.g., 'A' to 'a'). However, for some languages (e.g., German 'ß' to 'ss'), it can involve more complex rules.

Example:

  • Original text: "The quick Brown fox jumps over the Lazy Dog."
  • After case folding: "the quick brown fox jumps over the lazy dog."

Without case folding, "The", "the", "Brown", "brown", "Lazy", and "lazy" would be treated as distinct terms, leading to missed matches in search queries. For instance, a query for "the quick brown fox" would not match a document containing "The quick Brown fox" if case folding were not applied.

Text Mining: Classification & Clustering

Q1: Rocchio Classification

Given the following documents and classes, classify the query "executive suite" using the Rocchio algorithm.

  • Class 1 (World News): "China election", "Maldives executive", "Election executive"
  • Class 2 (Business): "Chief executive", "Maldives executive suite"
Show Answer

1. Compute Centroids: Calculate the average vector for each class after normalizing term frequencies.
Centroid C1 (World) ≈ <0.24, 0.47, 0.24, 0.47, 0, 0>
Centroid C2 (Business) ≈ <0, 0, 0.29, 0.64, 0.29, 0.35>

2. Vectorize Query: The normalized vector for the query "executive suite" (d6) is <0, 0, 0, 0.71, 0.71, 0>.

3. Calculate Distance: Compute the Euclidean distance from the query vector to each centroid.
Distance(C1, d6) ≈ 0.94
Distance(C2, d6) ≈ 0.62

Conclusion: Since the distance to the Business centroid (C2) is smaller, the query is classified as Business.

Q2: Agglomerative Clustering

Given a similarity matrix, you perform single-link agglomerative clustering. If the highest similarity is 0.9 between D1 and D2, and the next is 0.8 between D4 and D5, what is the next merge if Sim(D2, D3) = 0.7 and Sim(D1, D3) = 0.1?

Show Answer

1. Initial Clusters: After the first two steps, we have clusters {D1, D2}, {D4, D5}, and {D3}.

2. Find Next Highest Similarity: With single-linkage, the similarity between a merged cluster and another point is the maximum similarity between any member of the cluster and that point.

Sim({D1, D2}, D3) = max(Sim(D1,D3), Sim(D2,D3)) = max(0.1, 0.7) = 0.7

Conclusion: Since 0.7 is the next highest similarity, the next step is to merge cluster {D1, D2} with {D3}.

Web & Multimedia Retrieval

Q1: HITS Algorithm

For a web graph where A→B, B→C, and C→(A,B), run two iterations of the HITS algorithm. Start with all hub and authority scores = 1.

Show Answer

Iteration 1:
Authority Update: a(p) = ∑ h(q) for all q→p
a(A)=h(C)=1; a(B)=h(A)+h(C)=2; a(C)=h(B)=1
Hub Update: h(p) = ∑ a(q) for all p→q
h(A)=a(B)=1; h(B)=a(C)=1; h(C)=a(A)+a(B)=2

Iteration 2:
Authority Update (using Itr 1 hub scores):
a(A)=h(C)=2; a(B)=h(A)+h(C)=3; a(C)=h(B)=1
Hub Update (using Itr 1 authority scores):
h(A)=a(B)=2; h(B)=a(C)=1; h(C)=a(A)+a(B)=3

Q2: Content-Based Image Retrieval

If a CBIR system uses only color histograms, what is a major weakness of this approach? Give an example.

Show Answer

Weakness: Color histograms completely ignore all spatial information. They only capture the proportion of different colors, not where those colors are located in the image.

Example: An image of a blue sky above a green field would have the exact same color histogram as an image of a green sky above a blue field, or a checkerboard of blue and green squares. The system would see these semantically different images as identical.