Machine Learning Techniques
Notes for MLE interviews.
One-Hot Encoding
Doesn't work well with high-cardinality categorical features, or with tree-based models like XGBoost and LightGBM, because the resulting sparse matrix is mostly zeros.
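A minimal sketch with pandas; the `city` column and its values are made-up examples:

```python
import pandas as pd

# One-hot encode a toy "city" column.
df = pd.DataFrame({"city": ["NYC", "SF", "NYC", "LA"]})
one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)
print(one_hot)
# Each category becomes its own 0/1 column; with thousands of
# categories this matrix becomes very wide and mostly zeros.
```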
Mean Encoding
Say the sample data is as below.
| Age | Income |
|---|---|
| 18 | 60,000 |
| 18 | 50,000 |
| 18 | 40,000 |
| 19 | 66,000 |
| 19 | 51,000 |
| 19 | 42,000 |
Mean encoding encodes age by the mean income of that age group, i.e.
| Age | Income | Mean Encoding |
|---|---|---|
| 18 | 60,000 | 50,000 |
| 18 | 50,000 | 50,000 |
| 18 | 40,000 | 50,000 |
| 19 | 66,000 | 53,000 |
| 19 | 51,000 | 53,000 |
| 19 | 42,000 | 53,000 |
Be aware of label leakage (compute the encoding only on training data). Additive smoothing can make it more robust.
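A minimal sketch of mean encoding with additive smoothing on the toy table above; the smoothing strength `alpha` is a made-up hyperparameter, and in practice the encoding should be fit on training folds only to avoid leakage:

```python
import pandas as pd

# Toy data from the table above.
df = pd.DataFrame({
    "age": [18, 18, 18, 19, 19, 19],
    "income": [60_000, 50_000, 40_000, 66_000, 51_000, 42_000],
})

# Plain mean encoding: replace each age with the mean income of that age group.
df["age_mean_enc"] = df.groupby("age")["income"].transform("mean")

# Additive smoothing: shrink small categories toward the global mean.
alpha = 10.0  # hypothetical smoothing strength; tune on validation data
global_mean = df["income"].mean()
stats = df.groupby("age")["income"].agg(["mean", "count"])
smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)
df["age_mean_enc_smoothed"] = df["age"].map(smoothed)
print(df)
```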
Feature Hashing
Maps categories to a fixed number of features via a hash function. One problem is collisions when the hash size is too small.
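A minimal sketch using scikit-learn's `FeatureHasher`; the feature strings and the hash size of 8 are made-up examples:

```python
from sklearn.feature_extraction import FeatureHasher

# Hash string features into a fixed-size space of 8 columns.
hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([["user=alice", "city=NYC"], ["user=bob", "city=SF"]])
print(X.toarray())
# Different raw values can land in the same column (a collision),
# which becomes more likely as n_features shrinks.
```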
Cross Feature
Join categorical features together. For example, if we have two categorical features, A and B, we can create a new feature C that is the concatenation of A and B. This can help capture interactions between the two features.
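A minimal sketch of a crossed feature with pandas; the `city` and `device` columns are made-up examples:

```python
import pandas as pd

# Cross "city" and "device" into one combined categorical feature.
df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF"],
    "device": ["ios", "android", "ios"],
})
df["city_x_device"] = df["city"] + "_" + df["device"]
print(df["city_x_device"].tolist())  # ['NYC_ios', 'NYC_android', 'SF_ios']
# The crossed feature is itself high-cardinality, so it is often
# combined with feature hashing or embeddings.
```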
Embedding
A way to represent categorical features as continuous vectors. This is often used in deep learning models, where we can learn the embeddings during training. Embeddings can capture complex relationships between categories and are particularly useful for high cardinality features.
- Continuous Bag of Words (CBOW): use the context words word[t-n], ..., word[t-1], word[t+1], ..., word[t+n] to predict word[t].
- Skip-gram: use word[t] to predict the surrounding context words.
Rule of thumb: $d \approx \sqrt[4]{N}$, where $N$ is the number of categories and $d$ is the dimension of the embedding.
Pre-trained word embeddings (e.g. word2vec, GloVe) are powerful.
- word2vec ("word to vector") is trained with the CBOW or skip-gram objectives above.
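A minimal sketch of a learned embedding table in PyTorch, sized with the fourth-root rule of thumb above; the category count is a made-up example:

```python
import torch
import torch.nn as nn

# Suppose a categorical feature has 10,000 distinct values.
num_categories = 10_000
emb_dim = round(num_categories ** 0.25)  # rule-of-thumb dimension (~10)

embedding = nn.Embedding(num_embeddings=num_categories, embedding_dim=emb_dim)

# Look up the vectors for a batch of category ids; these weights are
# learned jointly with the rest of the model during training.
ids = torch.tensor([3, 42, 9_999])
vectors = embedding(ids)  # shape: (3, emb_dim)
print(vectors.shape)
```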
Attention Mechanism
Given an embedding $E_i$ for each word $i$:
- Query $Q_i = W_Q E_i$: what word $i$ is asking about.
- Key $K_i = W_K E_i$: what word $i$ offers in response.
- Check similarity between query and key with the dot product $Q_i \cdot K_j$. High similarity means word $i$ attends to word $j$!
- Normalize with softmax into the attention pattern $A_{ij} = \mathrm{softmax}_j\!\left(\frac{Q_i \cdot K_j}{\sqrt{d_k}}\right)$.
- Masking: set $Q_i \cdot K_j$ to $-\infty$ if $j > i$ to prevent attending to future words.
- Value $V_j = W_V E_j$: the change that embedding $j$ contributes to other embeddings.
- Update each embedding by adding the values weighted by the attention pattern, $E_i' = E_i + \sum_j A_{ij} V_j$ (see the sketch at the end of this section).
- Cross attention: $Q$ comes from one sequence, while $K$ and $V$ come from another sequence, e.g. for translation.
- Multi-head attention: use multiple sets of $W_Q$, $W_K$, and $W_V$ to capture different aspects of the input, then concatenate the outputs from each head.
- Pros: Allows the model to focus on different parts of the input simultaneously (Parallel processing).
More layers: e.g. Attention -> Multi-layer perceptron (MLP) -> Attention -> ... in order to capture more complex relationships.
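A minimal NumPy sketch of a single attention head with causal masking, following the equations above; the shapes and weights are made-up/random:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(E, W_Q, W_K, W_V, causal=True):
    """One attention head: E is (seq_len, d_model); the W_* are learned projections."""
    Q = E @ W_Q                        # queries  (seq_len, d_k)
    K = E @ W_K                        # keys     (seq_len, d_k)
    V = E @ W_V                        # values   (seq_len, d_model)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # similarity Q_i . K_j, scaled
    if causal:
        # Mask future positions (j > i) with -inf before the softmax.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    A = softmax(scores, axis=-1)       # attention pattern, rows sum to 1
    return E + A @ V                   # residual update: E_i' = E_i + sum_j A_ij V_j

# Toy usage with random weights.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
E = rng.normal(size=(seq_len, d_model))
W_Q, W_K = rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_model))
print(single_head_attention(E, W_Q, W_K, W_V).shape)  # (5, 16)
```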
Transformer
Setting: predict the next token given the previous tokens.
Steps:
- Input embedding: Convert input tokens into continuous vectors.
- Positional encoding: Add positional information to the input embeddings to capture the order of tokens.
- Multi-head self-attention: Compute attention scores for each token with respect to all other tokens in the sequence.
- Feed-forward neural network: Apply a feed-forward network to each token independently.
- Layer normalization: Normalize the output of the feed-forward network.
- Residual connection: Add the input of the layer to the output to help with gradient flow.
- Output layer: Convert the final representations back to token probabilities.
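A minimal PyTorch sketch of a decoder-style Transformer language model following the steps above; all sizes (`vocab_size`, `d_model`, etc.) are made-up examples, and real models stack many such blocks:

```python
import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, max_len=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)        # input embedding
        self.pos_emb = nn.Embedding(max_len, d_model)             # learned positional encoding
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(                                 # feed-forward network
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.out = nn.Linear(d_model, vocab_size)                 # output layer -> token logits

    def forward(self, tokens):                                    # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(pos)            # embeddings + positions
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                       device=tokens.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)        # masked self-attention
        x = self.norm1(x + attn_out)                              # residual + layer norm
        x = self.norm2(x + self.ffn(x))                           # residual + layer norm
        return self.out(x)                                        # next-token logits

logits = TinyTransformerLM()(torch.randint(0, 1000, (2, 16)))     # (2, 16, 1000)
print(logits.shape)
```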
