COMP3211 Advanced Databases - Last-Hour Cheatsheet

Based on 2021-22, 2022-23, 2023-24, 2024-25 papers. Highest-repeat areas: IR, access structures, RA/query processing, logging/concurrency, distributed/parallel.

1. Information Retrieval

Appears every year. Learn the routine.

Precision / Recall

Precision = relevant retrieved / retrieved
Recall = relevant retrieved / total relevant

For ranked answers, scan left to right. Each time a relevant doc appears, update precision and recall. Standard recall levels are usually 0%, 10%, 20%, ... until unattainable.

Term Selection

Binary Term Vector

Make one dimension per selected term. Document has 1 if term appears, else 0. Query vector uses same dimensions.

TF-IDF

tf-idf(t,d) = tf(t,d) * idf(t)
tf(t,d) = count(t in d) / total selected terms in d
idf(t) = log2(N / df(t))

If term is absent, weight is 0. If it appears once in a 4-term title, tf=1/4, so weight is idf/4. Use the paper's exact term set after stop-word removal/stemming.

Rocchio Feedback

q' = alpha*q + beta/|Dr| * sum(d in Dr) d - gamma/|Dnr| * sum(d in Dnr) d

Relevant documents push query towards their terms. Irrelevant documents push query away. If constants not given, explain qualitatively or use symbolic alpha/beta/gamma.

2. Access Structures

Core Q1 topic: B/B+ trees, indexes, hashing, block maths.

Record / Block Maths

records per block = floor(usable block bytes / record bytes)
blocks = ceil(tuple count / records per block)

Use 4000 bytes usable when stated. Records do not span blocks.

Encoding Attributes

Choose minimum fixed bytes that can represent distinct values/range. Common rule: bytes needed = smallest b where 2^(8b) >= values.

Dense / Sparse Index

B+Tree Capacity

primary leaf capacity lp = floor((b - p) / record_len)
secondary leaf capacity ls = floor((b - p) / (key_len + p))
non-leaf fanout i = floor((b - p) / (key_len + p))
effective capacity = floor(capacity * occupancy), e.g. occupancy 0.6

In the lecture arithmetic, primary-index leaves hold records; secondary-index leaves hold key + record-pointer entries. Non-leaf entries use key + block pointer, with one extra pointer accounted for by b-p.

B+Tree Order n Rules

non-leaf max pointers = n+1, max keys = n
non-leaf min pointers = ceil((n+1)/2), min keys = ceil((n+1)/2)-1
leaf max data pointers = n+1, min data pointers = floor((n+1)/2)
root minimum = 1 key/pointer unless empty

B/B+Tree Operations

Extensible Hashing

Bitmap Index

One bitmap per value/category. Query uses Boolean ops: AND for combined conditions, OR for alternatives, NOT for negation.

3. Relational Algebra + Query Processing

Repeated in 2022-23, 2023-24, 2024-25.

Canonical SQL to RA Tree

  1. Start with Cartesian product of all FROM relations.
  2. Put full WHERE predicate as selection above product.
  3. Put projection at top for SELECT attributes.

Optimise RA Tree

Useful Cardinality Estimates

Selection A=c: T(R) / V(R,A)
Range selection: use given fraction or reasonable assumption
Join R.A=S.B: T(R)*T(S) / max(V(R,A), V(S,B))

For key-foreign-key join, result often equals size of foreign-key side if every FK matches.

Nested Loop Join Cost

Block nested loop: B(outer) + ceil(B(outer)/(M-2)) * B(inner)

Choose smaller relation as outer to minimise cost. M is buffer blocks.

Logical vs Physical Plan

System Catalogue

Stores schemas, indexes, constraints, tuple counts, block counts, distinct value counts. Optimiser needs it for cost/cardinality estimates.

4. Transactions, Concurrency, Logging

Very important in recent papers.

ACID

Concurrency Problems

Serialisability

Non-serial schedule is OK if equivalent to some serial order. For conflict serialisability, build precedence graph; cycle means not serialisable.

Timestamp Ordering

Undo Logging

Redo Logging

Checkpointing

Reduces how far back recovery must scan. Non-quiescent checkpoint records active transactions; recovery starts from checkpoint plus active transactions.

5. Distributed + Parallel Databases

Often appears as a big applied question.

Fragmentation

Parallel Join Patterns

Distributed Join

2PC

Parallel Cost Assumptions

Read the question carefully. Past paper assumptions include free scan/union, aggregation cost Card(R), and indexed join cost using smaller indexed side.

6. Fast Exam Strategy

Question Choice

You usually answer 3 of 4. Pick the questions with the most mechanical marks first:

Common Mark Traps

One-Hour Revision Order

MinutesDo this
0-10Memorise IR formulas and Rocchio routine.
10-25Practise block maths, B+Tree capacity, dense/sparse/secondary indexes.
25-38Review RA tree rewrite steps and join cost formula.
38-50Review ACID, timestamp rules, undo/redo recovery.
50-60Skim distributed/parallel joins, semijoin, 2PC states.

Low Priority Unless Needed

Non-relational databases, decentralised databases, broad intro material. Conjunctive query containment appeared in 2021-22 but is not obvious in the current lecture list.