
Machine Learning Techniques

· 4 min read
Hinny Tsang
Data Scientist @ Pollock Asset Management

Notes for MLE interviews.

One-Hot Encoding

Doesn't work well with high-cardinality categorical features, or with tree-based models like XGBoost and LightGBM, because the resulting sparse matrix contains too many zeros.
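
A minimal sketch with pandas; the `city` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"city": ["NY", "HK", "NY", "LDN"]})

# One 0/1 indicator column per distinct category.
one_hot = pd.get_dummies(df["city"], prefix="city")
print(one_hot)

# With thousands of distinct values this matrix becomes very wide and mostly
# zeros, which is what hurts tree-based models and memory usage.
```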

Mean Encoding

Say we have the sample below.

| Age | Income |
|-----|--------|
| 18  | 60,000 |
| 18  | 50,000 |
| 18  | 40,000 |
| 19  | 66,000 |
| 19  | 51,000 |
| 19  | 42,000 |

Mean encoding encodes each age by the mean income of that age group, i.e.

| Age | Income | Mean Encoding |
|-----|--------|---------------|
| 18  | 60,000 | 50,000        |
| 18  | 50,000 | 50,000        |
| 18  | 40,000 | 50,000        |
| 19  | 66,000 | 53,000        |
| 19  | 51,000 | 53,000        |
| 19  | 42,000 | 53,000        |

Be aware of label leakage. Additive smoothing can be applied to make the encoding more robust.
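
A minimal sketch of mean encoding with additive smoothing on the toy table above, using pandas; the smoothing strength `m` is an assumed hyperparameter.

```python
import pandas as pd

df = pd.DataFrame({
    "Age":    [18, 18, 18, 19, 19, 19],
    "Income": [60_000, 50_000, 40_000, 66_000, 51_000, 42_000],
})

global_mean = df["Income"].mean()
stats = df.groupby("Age")["Income"].agg(["mean", "count"])

# Additive (Bayesian) smoothing: shrink each category mean toward the global
# mean. m is an assumed smoothing strength; larger m -> more shrinkage for
# rare categories.
m = 2
stats["smoothed"] = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df["Age_mean_enc"] = df["Age"].map(stats["smoothed"])
print(df)

# In practice, compute the encoding on the training fold only (or out-of-fold)
# to avoid label leakage.
```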

Feature Hashing

Maps categories to a fixed number of features via a hash function. One problem is collisions if the hash size is too small.
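
A minimal sketch of the idea using a stable hash and a modulo into a fixed number of buckets; the function name and bucket count are illustrative.

```python
import hashlib

def hash_feature(value: str, n_buckets: int = 16) -> int:
    """Map a categorical value to one of n_buckets columns via a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Distinct values may collide into the same bucket when n_buckets is too small.
for v in ["red", "green", "blue", "turquoise"]:
    print(v, "->", hash_feature(v))
```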

Cross Feature

Join categorical features together. For example, if we have two categorical features, A and B, we can create a new feature C that is the concatenation of A and B. This can help capture interactions between the two features.
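
A minimal pandas sketch; the columns `A` and `B` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical columns A and B; the crossed feature captures their interaction.
df = pd.DataFrame({"A": ["mobile", "desktop"], "B": ["US", "HK"]})
df["A_x_B"] = df["A"] + "_" + df["B"]   # e.g. "mobile_US"
print(df)

# The crossed column can then be one-hot encoded, hashed, or embedded like any
# other categorical feature; its cardinality is up to |A| * |B|.
```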

Embedding

A way to represent categorical features as continuous vectors. This is often used in deep learning models, where we can learn the embeddings during training. Embeddings can capture complex relationships between categories and are particularly useful for high cardinality features.

  • Continuous Bag of Words (CBOW): use word[t-n], ..., word[t-1], word[t+1], ..., word[t+n] to predict word[t].
  • Skip-gram: use word[t] to predict the surrounding words (see the sketch below).
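
A small pure-Python sketch of how skip-gram training pairs are generated; the function name and window size are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: word[t] predicts word[t-n] .. word[t+n]."""
    pairs = []
    for t, center in enumerate(tokens):
        for offset in range(-window, window + 1):
            j = t + offset
            if offset != 0 and 0 <= j < len(tokens):
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"]))
# CBOW flips the direction: the context words jointly predict the center word.
```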
note

Rule of thumb: $d = D^{1/4}$, where $D$ is the number of categories and $d$ is the dimension of the embedding (see the sketch at the end of this section).

tip

Pre-trained word embeddings (e.g. word2vec, GloVe) are powerful.

  • word2vec ("word to vector") learns these embeddings with the CBOW or skip-gram objectives above.
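
A minimal PyTorch sketch of a learned embedding table sized with the $d = D^{1/4}$ rule of thumb from the note above; the vocabulary size and ids are made up.

```python
import torch
import torch.nn as nn

D = 10_000                       # number of categories (vocabulary size)
d = max(1, round(D ** 0.25))     # rule of thumb: d ~ D^(1/4) -> 10 here

embedding = nn.Embedding(num_embeddings=D, embedding_dim=d)

# Look up dense vectors for a batch of category ids; the weights are learned
# jointly with the rest of the model during training.
ids = torch.tensor([3, 17, 4096])
vectors = embedding(ids)         # shape: (3, d)
print(vectors.shape)
```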

Attention Mechanism

Given an embedding $E_i$ for each word $w_i$ (a minimal sketch follows the numbered steps):

  1. Query: $Q_i = W_q E_i$ for a specific query.

  2. Key: $K_i = W_k E_i$.

  3. Check the similarity between query and key, $S_{ij} = Q_i \cdot K_j$. High similarity means $K_j$ attends to $Q_i$!

  4. Normalize $S_{ij}$ over $j$ with softmax to get the attention pattern $A_{ij} = \mathrm{softmax}_j(S_{ij})$.

    • Masking: set $S_{ij}$ to $-\infty$ for $j > i$ to prevent attending to future words.
  5. Value: $V_i = W_v E_i$, the change to the embedding contributed by word $i$.

  6. Update the embedding by the sum of values weighted by the attention pattern: $E_i' = \sum_j A_{ij} V_j$.
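
A minimal NumPy sketch of the steps above for a single head with a causal mask; matrix names follow the notation in the list and all shapes are illustrative. (In practice the scores are also scaled by $1/\sqrt{d_{\text{head}}}$ before the softmax.)

```python
import numpy as np

def self_attention(E, Wq, Wk, Wv, causal=True):
    """Single-head self-attention following steps 1-6 above.
    E: (T, d_model) embeddings; Wq, Wk, Wv: (d_model, d_head) projections."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv              # steps 1, 2 and 5
    S = Q @ K.T                                   # step 3: S_ij = Q_i . K_j
    if causal:                                    # mask future positions j > i
        S = np.where(np.tril(np.ones_like(S)) == 1, S, -np.inf)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)             # step 4: row-wise softmax
    return A @ V                                  # step 6: E'_i = sum_j A_ij V_j

rng = np.random.default_rng(0)
T, d_model, d_head = 4, 8, 8
E = rng.normal(size=(T, d_model))
Wq, Wk, Wv = [rng.normal(size=(d_model, d_head)) for _ in range(3)]
print(self_attention(E, Wq, Wk, Wv).shape)        # (4, 8)
```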

  • Cross-attention: $Q_i$ comes from one sequence, while $K_j$ and $V_j$ come from another sequence, e.g. for translation.

  • Multi-head attention: use multiple sets of $W_q$, $W_k$, and $W_v$ to capture different aspects of the input, then concatenate the outputs from each head.

    • Pros: allows the model to focus on different parts of the input simultaneously (parallel processing).

More layers: e.g. Attention -> Multi-layer perceptron (MLP) -> Attention -> ..., stacked to capture more complex relationships.

Transformer

Setting: predict the next token given the previous tokens.

Step-by-step structure (a minimal block sketch follows the list):

  1. Input embedding: Convert input tokens into continuous vectors.

  2. Positional encoding: Add positional information to the input embeddings to capture the order of tokens.

  3. Multi-head self-attention: Compute attention scores for each token with respect to all other tokens in the sequence.

  4. Feed-forward neural network: Apply a feed-forward network to each token independently.

  5. Layer normalization: Normalize the output of the feed-forward network.

  6. Residual connection: Add the input of the layer to the output to help with gradient flow.

  7. Output layer: Convert the final representations back to token probabilities.
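
A minimal PyTorch sketch of one such block (post-norm, decoder-style with a causal mask). The input embedding, positional encoding, and output layer from steps 1, 2, and 7 are only indicated in comments, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One post-norm decoder-style block: multi-head self-attention and a
    feed-forward network, each followed by a residual connection and layer
    normalization (steps 3-6 above)."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True = not allowed to attend (future positions).
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)  # step 3
        x = self.norm1(x + attn_out)                      # steps 5-6
        x = self.norm2(x + self.mlp(x))                   # step 4 + residual + norm
        return x

# Token embedding plus positional encoding would produce x before this block,
# and a linear layer + softmax over the vocabulary would follow it (step 7).
x = torch.randn(2, 10, 64)                  # (batch, tokens, d_model)
print(TransformerBlock()(x).shape)          # torch.Size([2, 10, 64])
```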