
Machine Learning Techniques

· 4 min read
Hinny Tsang
Data Scientist @ Pollock Asset Management

Notes for MLE interviews.

One-Hot Encoding

Doesn't work well with high-cardinality categorical features, or with tree-based models like XGBoost and LightGBM, because the resulting sparse matrix contains too many zeros.
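
A minimal sketch with pandas; the `city` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"city": ["NY", "HK", "NY", "LDN"]})

# One 0/1 indicator column per distinct category.
one_hot = pd.get_dummies(df["city"], prefix="city")
print(one_hot)

# With thousands of distinct values this matrix becomes very wide and mostly
# zeros, which is what hurts tree-based models and memory usage.
```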

Mean Encoding

Say we have the sample below.

| Age | Income |
|-----|--------|
| 18  | 60,000 |
| 18  | 50,000 |
| 18  | 40,000 |
| 19  | 66,000 |
| 19  | 51,000 |
| 19  | 42,000 |

Mean encoding encodes each age by the mean income of that age group, i.e.

| Age | Income | Mean Encoding |
|-----|--------|---------------|
| 18  | 60,000 | 50,000        |
| 18  | 50,000 | 50,000        |
| 18  | 40,000 | 50,000        |
| 19  | 66,000 | 53,000        |
| 19  | 51,000 | 53,000        |
| 19  | 42,000 | 53,000        |

Be aware of label leakage. Additive smoothing can be applied to make the encoding more robust.
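
A minimal sketch of mean encoding with additive smoothing on the toy table above, using pandas; the smoothing strength `m` is an assumed hyperparameter.

```python
import pandas as pd

df = pd.DataFrame({
    "Age":    [18, 18, 18, 19, 19, 19],
    "Income": [60_000, 50_000, 40_000, 66_000, 51_000, 42_000],
})

global_mean = df["Income"].mean()
stats = df.groupby("Age")["Income"].agg(["mean", "count"])

# Additive (Bayesian) smoothing: shrink each category mean toward the global
# mean. m is an assumed smoothing strength; larger m -> more shrinkage for
# rare categories.
m = 2
stats["smoothed"] = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df["Age_mean_enc"] = df["Age"].map(stats["smoothed"])
print(df)

# In practice, compute the encoding on the training fold only (or out-of-fold)
# to avoid label leakage.
```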

Feature Hashing

Maps categories to a fixed number of features via a hash function. One problem is collisions if the hash size is too small.
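
A minimal sketch of the idea using a stable hash and a modulo into a fixed number of buckets; the function name and bucket count are illustrative.

```python
import hashlib

def hash_feature(value: str, n_buckets: int = 16) -> int:
    """Map a categorical value to one of n_buckets columns via a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Distinct values may collide into the same bucket when n_buckets is too small.
for v in ["red", "green", "blue", "turquoise"]:
    print(v, "->", hash_feature(v))
```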

Cross Feature

Join categorical features together. For example, if we have two categorical features, A and B, we can create a new feature C that is the concatenation of A and B. This can help capture interactions between the two features.
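
A minimal pandas sketch; the columns `A` and `B` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical columns A and B; the crossed feature captures their interaction.
df = pd.DataFrame({"A": ["mobile", "desktop"], "B": ["US", "HK"]})
df["A_x_B"] = df["A"] + "_" + df["B"]   # e.g. "mobile_US"
print(df)

# The crossed column can then be one-hot encoded, hashed, or embedded like any
# other categorical feature; its cardinality is up to |A| * |B|.
```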

Embedding

A way to represent categorical features as continuous vectors. This is often used in deep learning models, where we can learn the embeddings during training. Embeddings can capture complex relationships between categories and are particularly useful for high cardinality features.

  • Continuous Bag of Words (CBOW): use word[t-n], ..., word[t-1], word[t+1], ..., word[t+n] to predict word[t].
  • Skip-gram: use word[t] to predict the surrounding words (see the sketch below).
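
A small pure-Python sketch of how skip-gram training pairs are generated; the function name and window size are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: word[t] predicts word[t-n] .. word[t+n]."""
    pairs = []
    for t, center in enumerate(tokens):
        for offset in range(-window, window + 1):
            j = t + offset
            if offset != 0 and 0 <= j < len(tokens):
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"]))
# CBOW flips the direction: the context words jointly predict the center word.
```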
note

Rule of thumb: $d = D^{1/4}$, where $D$ is the number of categories and $d$ is the dimension of the embedding (see the sketch at the end of this section).

tip

Pre-trained word embeddings (e.g. word2vec, GloVe) are powerful.

  • word2vec ("word to vector") learns these embeddings with the CBOW or skip-gram objectives above.
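
A minimal PyTorch sketch of a learned embedding table sized with the $d = D^{1/4}$ rule of thumb from the note above; the vocabulary size and ids are made up.

```python
import torch
import torch.nn as nn

D = 10_000                       # number of categories (vocabulary size)
d = max(1, round(D ** 0.25))     # rule of thumb: d ~ D^(1/4) -> 10 here

embedding = nn.Embedding(num_embeddings=D, embedding_dim=d)

# Look up dense vectors for a batch of category ids; the weights are learned
# jointly with the rest of the model during training.
ids = torch.tensor([3, 17, 4096])
vectors = embedding(ids)         # shape: (3, d)
print(vectors.shape)
```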

Attention Mechanism

Given an embedding $E_i$ for each word $w_i$ (a minimal sketch follows the numbered steps):

  1. Query: $Q_i = W_q E_i$ for a specific query.

  2. Key: $K_i = W_k E_i$.

  3. Check the similarity between query and key, $S_{ij} = Q_i \cdot K_j$. High similarity means $K_j$ attends to $Q_i$!

  4. Normalize $S_{ij}$ over $j$ with softmax to get the attention pattern $A_{ij} = \mathrm{softmax}_j(S_{ij})$.

    • Masking: set $S_{ij}$ to $-\infty$ for $j > i$ to prevent attending to future words.
  5. Value: $V_i = W_v E_i$, the change to the embedding contributed by word $i$.

  6. Update the embedding by the sum of values weighted by the attention pattern: $E_i' = \sum_j A_{ij} V_j$.
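
A minimal NumPy sketch of the steps above for a single head with a causal mask; matrix names follow the notation in the list and all shapes are illustrative. (In practice the scores are also scaled by $1/\sqrt{d_{\text{head}}}$ before the softmax.)

```python
import numpy as np

def self_attention(E, Wq, Wk, Wv, causal=True):
    """Single-head self-attention following steps 1-6 above.
    E: (T, d_model) embeddings; Wq, Wk, Wv: (d_model, d_head) projections."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv              # steps 1, 2 and 5
    S = Q @ K.T                                   # step 3: S_ij = Q_i . K_j
    if causal:                                    # mask future positions j > i
        S = np.where(np.tril(np.ones_like(S)) == 1, S, -np.inf)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)             # step 4: row-wise softmax
    return A @ V                                  # step 6: E'_i = sum_j A_ij V_j

rng = np.random.default_rng(0)
T, d_model, d_head = 4, 8, 8
E = rng.normal(size=(T, d_model))
Wq, Wk, Wv = [rng.normal(size=(d_model, d_head)) for _ in range(3)]
print(self_attention(E, Wq, Wk, Wv).shape)        # (4, 8)
```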

  • Cross-attention: $Q_i$ comes from one sequence, while $K_j$ and $V_j$ come from another sequence, e.g. for translation.

  • Multi-head attention: use multiple sets of $W_q$, $W_k$, and $W_v$ to capture different aspects of the input, then concatenate the outputs from each head.

    • Pros: allows the model to focus on different parts of the input simultaneously (parallel processing).

More layers: e.g. Attention -> Multi-layer perceptron (MLP) -> Attention -> ..., stacked to capture more complex relationships.

Transformer

Setting: predict the next token given the previous tokens.

Step-by-step structure (a minimal block sketch follows the list):

  1. Input embedding: Convert input tokens into continuous vectors.

  2. Positional encoding: Add positional information to the input embeddings to capture the order of tokens.

  3. Multi-head self-attention: Compute attention scores for each token with respect to all other tokens in the sequence.

  4. Feed-forward neural network: Apply a feed-forward network to each token independently.

  5. Layer normalization: Normalize the output of the feed-forward network.

  6. Residual connection: Add the input of the layer to the output to help with gradient flow.

  7. Output layer: Convert the final representations back to token probabilities.
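
A minimal PyTorch sketch of one such block (post-norm, decoder-style with a causal mask). The input embedding, positional encoding, and output layer from steps 1, 2, and 7 are only indicated in comments, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One post-norm decoder-style block: multi-head self-attention and a
    feed-forward network, each followed by a residual connection and layer
    normalization (steps 3-6 above)."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True = not allowed to attend (future positions).
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)  # step 3
        x = self.norm1(x + attn_out)                      # steps 5-6
        x = self.norm2(x + self.mlp(x))                   # step 4 + residual + norm
        return x

# Token embedding plus positional encoding would produce x before this block,
# and a linear layer + softmax over the vocabulary would follow it (step 7).
x = torch.randn(2, 10, 64)                  # (batch, tokens, d_model)
print(TransformerBlock()(x).shape)          # torch.Size([2, 10, 64])
```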