Based on 2021-22, 2022-23, 2023-24, 2024-25 papers. Highest-repeat areas: IR, access structures, RA/query processing, logging/concurrency, distributed/parallel.
1. Information Retrieval
Appears every year. Learn the routine.
Precision / Recall
Precision = relevant retrieved / retrieved
Recall = relevant retrieved / total relevant
For ranked answers, scan left to right. Each time a relevant doc appears, update precision and recall. Standard recall levels are usually 0%, 10%, 20%, ... until unattainable.
Term Selection
- Stop-word removal: remove common words: the, of, and, in.
- Stemming/lemmatisation: reduce variants: stars/star, databases/database.
- Case folding: UNIX/unix treated the same unless stated otherwise.
Binary Term Vector
Make one dimension per selected term. Document has 1 if term appears, else 0. Query vector uses same dimensions.
TF-IDF
tf-idf(t,d) = tf(t,d) * idf(t)
tf(t,d) = count(t in d) / total selected terms in d
idf(t) = log2(N / df(t))
If term is absent, weight is 0. If it appears once in a 4-term title, tf=1/4, so weight is idf/4. Use the paper's exact term set after stop-word removal/stemming.
Rocchio Feedback
q' = alpha*q + beta/|Dr| * sum(d in Dr) d - gamma/|Dnr| * sum(d in Dnr) d
Relevant documents push query towards their terms. Irrelevant documents push query away. If constants not given, explain qualitatively or use symbolic alpha/beta/gamma.
2. Access Structures
Core Q1 topic: B/B+ trees, indexes, hashing, block maths.
Record / Block Maths
records per block = floor(usable block bytes / record bytes)
blocks = ceil(tuple count / records per block)
Use 4000 bytes usable when stated. Records do not span blocks.
Encoding Attributes
Choose minimum fixed bytes that can represent distinct values/range. Common rule: bytes needed = smallest b where 2^(8b) >= values.
Dense / Sparse Index
- Index benefits: fewer disk accesses for lookup/range queries, can support ordering, joins, uniqueness checks.
- Clustering: physically store records near/sorted by search key, so related records need fewer block reads.
- Dense: one key/pointer pair for every record in the data file; needed for secondary indexes.
- Sparse: one entry per data block; requires ordered/clustered file.
- Secondary non-key: usually key -> bucket of record pointers.
B+Tree Capacity
primary leaf capacity lp = floor((b - p) / record_len)
secondary leaf capacity ls = floor((b - p) / (key_len + p))
non-leaf fanout i = floor((b - p) / (key_len + p))
effective capacity = floor(capacity * occupancy), e.g. occupancy 0.6
In the lecture arithmetic, primary-index leaves hold records; secondary-index leaves hold key + record-pointer entries. Non-leaf entries use key + block pointer, with one extra pointer accounted for by b-p.
B+Tree Order n Rules
non-leaf max pointers = n+1, max keys = n
non-leaf min pointers = ceil((n+1)/2), min keys = ceil((n+1)/2)-1
leaf max data pointers = n+1, min data pointers = floor((n+1)/2)
root minimum = 1 key/pointer unless empty
B/B+Tree Operations
- Insert into correct leaf. If overflow, split and promote separator.
- If parent overflows, split parent recursively.
- B+Tree leaves are linked by sequence pointers for efficient range scans.
- Deletion: if underfull, borrow from sibling if possible, otherwise merge and update parent.
Extensible Hashing
- Directory uses global depth
i bits of hash.
- Bucket has local depth. On overflow, split bucket.
- If local depth equals global depth, double directory first.
Bitmap Index
One bitmap per value/category. Query uses Boolean ops: AND for combined conditions, OR for alternatives, NOT for negation.
3. Relational Algebra + Query Processing
Repeated in 2022-23, 2023-24, 2024-25.
Canonical SQL to RA Tree
- Start with Cartesian product of all FROM relations.
- Put full WHERE predicate as selection above product.
- Put projection at top for SELECT attributes.
Optimise RA Tree
- Push selections down as far as possible.
- Push projections down, keeping join attributes.
- Replace product + join predicate with join.
- Do most selective joins/selections early.
Useful Cardinality Estimates
Selection A=c: T(R) / V(R,A)
Range selection: use given fraction or reasonable assumption
Join R.A=S.B: T(R)*T(S) / max(V(R,A), V(S,B))
For key-foreign-key join, result often equals size of foreign-key side if every FK matches.
Nested Loop Join Cost
Block nested loop: B(outer) + ceil(B(outer)/(M-2)) * B(inner)
Choose smaller relation as outer to minimise cost. M is buffer blocks.
Logical vs Physical Plan
- Logical: algebraic operators: select, project, join.
- Physical: concrete algorithms/access paths: scan, index lookup, nested loop, hash join, sort-merge.
System Catalogue
Stores schemas, indexes, constraints, tuple counts, block counts, distinct value counts. Optimiser needs it for cost/cardinality estimates.
4. Transactions, Concurrency, Logging
Very important in recent papers.
ACID
- Atomicity: all or nothing.
- Consistency: constraints preserved.
- Isolation: concurrent result equivalent to serial execution.
- Durability: committed effects survive crashes.
Concurrency Problems
- Lost update: two transactions read same old value; later write overwrites earlier update.
- Dirty read: reads uncommitted data.
- Unrepeatable read: same read gives different values.
Serialisability
Non-serial schedule is OK if equivalent to some serial order. For conflict serialisability, build precedence graph; cycle means not serialisable.
Timestamp Ordering
- Read by T allowed if
TS(T) >= WTS(X); then set RTS(X)=max(RTS(X), TS(T)).
- Write by T allowed if
TS(T) >= RTS(X) and TS(T) >= WTS(X); then set WTS(X)=TS(T).
- Otherwise abort/restart T.
Undo Logging
- Logging mainly supports atomicity and durability; isolation is handled by concurrency control.
- Log old values:
<T, X, old>.
- Log record must reach disk before changed data item.
- Commit record written after all changed data has reached disk.
- Recovery: undo transactions without commit, scanning backwards.
Redo Logging
- Log new values:
<T, X, new>.
- Commit record reaches disk before changed data pages.
- Recovery: redo committed transactions, usually scanning forwards.
Checkpointing
Reduces how far back recovery must scan. Non-quiescent checkpoint records active transactions; recovery starts from checkpoint plus active transactions.
5. Distributed + Parallel Databases
Often appears as a big applied question.
Fragmentation
- Distributed DBMS design dimensions: distribution, heterogeneity, autonomy.
- Horizontal: split rows using predicates.
- Vertical: split columns; include key to reconstruct.
- Fragment reduction/localisation: use query predicate to ignore irrelevant fragments.
Parallel Join Patterns
- Direct/local join: join matching fragments where they already live.
- Broadcast join: send smaller relation/fragment to all nodes holding larger relation fragments.
- Repartitioned join: hash/range repartition both relations on join key, then local joins.
- Union: collect partial results from workers.
Distributed Join
- Ship whole: send R to S site or S to R site; choose lower transfer + join cost.
- Semijoin: send projected join attributes, filter remote relation, send only matching tuples back.
- Semijoin is bad if projected keys are large or most remote tuples match anyway.
2PC
- Coordinator asks participants to vote.
- If all vote yes, coordinator commits; otherwise aborts.
- Coordinator timeout in WAIT: may abort because a missing vote means it cannot safely commit.
- Coordinator timeout in COMMIT/ABORT: resend the global decision to participants without acknowledgements.
- Participant timeout in INITIAL: may unilaterally abort. Participant timeout in READY: blocked unless another participant/coordinator knows the decision.
- Presumed abort reduces logging/messages for aborts/read-only; presumed commit optimises commits.
Parallel Cost Assumptions
Read the question carefully. Past paper assumptions include free scan/union, aggregation cost Card(R), and indexed join cost using smaller indexed side.
6. Fast Exam Strategy
Question Choice
You usually answer 3 of 4. Pick the questions with the most mechanical marks first:
- IR: almost always highly template-based.
- Access structures: good if you practise block/B-tree maths.
- RA/query processing: good if you can draw trees and show working.
- Logging/concurrency: good if you know algorithms exactly.
Common Mark Traps
- Do not forget
floor for entries per block and ceil for number of blocks.
- Always state assumptions if selectivity/range distribution is ambiguous.
- For RA, keep join attributes when pushing projections.
- For B+Tree, remember leaf sequence pointers support range queries.
- For recovery, separate committed from uncommitted transactions before deciding redo/undo.
One-Hour Revision Order
| Minutes | Do this |
| 0-10 | Memorise IR formulas and Rocchio routine. |
| 10-25 | Practise block maths, B+Tree capacity, dense/sparse/secondary indexes. |
| 25-38 | Review RA tree rewrite steps and join cost formula. |
| 38-50 | Review ACID, timestamp rules, undo/redo recovery. |
| 50-60 | Skim distributed/parallel joins, semijoin, 2PC states. |
Low Priority Unless Needed
Non-relational databases, decentralised databases, broad intro material. Conjunctive query containment appeared in 2021-22 but is not obvious in the current lecture list.