[Paper Review] [NLP] Distributed Representations of Words and Phrases and their Compositionality

23학번이수현 2025. 3. 29. 21:34

0. Reference

Distributed Representations of Words and Phrases and their Compositionality

The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extens

arxiv.org

1. Introduction

- When words are represented as distributed representations in a vector space,

- semantically similar words cluster close together, which the paper reports is a great help to the performance of NLP algorithms.

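- The Skip-gram model obtains its probabilities through a probability function (softmax),

- but this softmax is too expensive to compute, because its normalization term sums over every word in the vocabulary, so the cost grows with the vocabulary size V.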

- To reduce this cost, one of two alternatives is used.

i) Hierarchical Softmax: uses a Huffman tree to cut the softmax computation from O(V) to about O(log_2 V).

ii) Negative Sampling: skips the full softmax, samples only a few words, and trains as if it were a binary classification problem.
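- To make the cost argument concrete, here is a minimal sketch of the full softmax in plain Python. The vocabulary size, embedding dimension, and random vectors are hypothetical toy values, not anything from the paper. The point is that a single probability needs a dot product with every one of the V output vectors.

```python
import math
import random

def softmax_distribution(center_vec, output_vecs):
    """Full softmax over the vocabulary: one dot product per vocabulary word."""
    scores = [sum(c * o for c, o in zip(center_vec, out)) for out in output_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    z = sum(exp_scores)      # normalization term: a sum over all V words
    return [e / z for e in exp_scores]

random.seed(0)
V, dim = 1000, 16  # hypothetical vocabulary size and embedding dimension
output_vecs = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]
center_vec = [random.uniform(-0.5, 0.5) for _ in range(dim)]

probs = softmax_distribution(center_vec, output_vecs)
print(len(probs), sum(probs))  # V probabilities that sum to (approximately) 1
```

- Both alternatives listed above avoid this loop over all V words for each training pair.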

2. Hierarchical Softmax

- Please refer to the post below.

https://ceulkun04.tistory.com/245
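- As a rough illustration only (this is a sketch written for this review, not code from the linked post or from the paper), hierarchical softmax can be pictured as follows: build a Huffman tree from word counts, give each inner node a vector, and compute a word's probability as a product of binary sigmoid decisions along the path from the root to its leaf. The word counts, dimensions, and helper names are hypothetical.

```python
import heapq
import itertools
import math
import random

def build_huffman_paths(counts):
    """Return {word: [(inner_node_id, branch), ...]} from the root to each leaf."""
    tie = itertools.count()  # tie-breaker so the heap never compares nodes directly
    heap = [(c, next(tie), ("leaf", w)) for w, c in counts.items()]
    heapq.heapify(heap)
    inner_ids = itertools.count()
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        node = ("inner", next(inner_ids), left, right)
        heapq.heappush(heap, (c1 + c2, next(tie), node))
    root = heap[0][2]

    paths = {}

    def walk(node, path):
        if node[0] == "leaf":
            paths[node[1]] = path
        else:
            _, node_id, left, right = node
            walk(left, path + [(node_id, +1)])   # +1 means "go left"
            walk(right, path + [(node_id, -1)])  # -1 means "go right"

    walk(root, [])
    return paths

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_prob(word, center_vec, inner_vecs, paths):
    """P(word | center) as a product of binary decisions along the tree path."""
    prob = 1.0
    for node_id, branch in paths[word]:
        score = sum(a * b for a, b in zip(inner_vecs[node_id], center_vec))
        prob *= sigmoid(branch * score)
    return prob

random.seed(0)
counts = {"the": 50, "king": 8, "queen": 7, "banana": 3, "throne": 2}  # toy counts
paths = build_huffman_paths(counts)
dim = 8
# A binary tree with V leaves has V - 1 inner nodes, one vector each.
inner_vecs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(len(counts) - 1)]
center_vec = [random.uniform(-1, 1) for _ in range(dim)]

probs = {w: hs_prob(w, center_vec, inner_vecs, paths) for w in counts}
print({w: len(p) for w, p in paths.items()})  # frequent words get shorter paths
print(sum(probs.values()))                    # still a valid distribution
```

- Because sigmoid(x) + sigmoid(-x) = 1 at every inner node, the leaf probabilities sum to 1 without ever computing a normalization term over the whole vocabulary.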

3. Negative Sampling

- In addition to the positive sample that we are targeting,

- negative sampling draws negative samples and trains the model to tell them apart from the positive one.

- For example, if 'queen' appears near the word 'king', that pair is positive.

- 'banana', on the other hand, does not usually appear near 'king', so words like this are deliberately sampled and used as negatives during training.

- In other words, the model effectively learns to do the following.

--> Positive pair: raise the score

--> Negative pair: lower the score

- This is efficient because, instead of running a softmax over every word, the model computes scores only for a small number of negative samples.

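- Applying Negative Sampling to the Skip-gram model in Word2Vec replaces the softmax loss with a sum of binary logistic terms: one for the positive pair and one for each of the k sampled negative pairs.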
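- For reference, the objective the paper gives for a single (input, output) pair, which is maximized during training, is written below. Here w_I is the input (center) word, w_O is the observed context word, v and v' are the input and output vectors, k is the number of negative samples, and P_n(w) is the noise distribution.

```latex
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right)
+ \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
\left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
```

- The loss to minimize is the negative of this expression. The paper suggests k in the range 5 to 20 for small training sets and 2 to 5 for large ones.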

- That is, the model is repeatedly trained toward 1 for positive pairs and toward 0 for negative pairs.

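- In addition, the distribution that negatives are sampled from is adjusted: the paper uses the unigram distribution raised to the 3/4 power rather than the raw word frequencies.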
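- In the paper this noise distribution is P_n(w) = U(w)^(3/4) / Z, where U(w) is the unigram distribution and Z normalizes the result. The sketch below shows the effect on toy counts. The counts, function names, and the choice to exclude the positive pair's words from the draw are assumptions made for illustration.

```python
import random

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, normalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: wt / z for w, wt in weights.items()}

def sample_negatives(noise, k, exclude, rng):
    """Draw k negative words from the noise distribution, skipping `exclude`."""
    words = [w for w in noise if w not in exclude]
    weights = [noise[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

counts = {"the": 1000, "king": 50, "queen": 40, "banana": 10}  # toy counts
unigram = noise_distribution(counts, power=1.0)
smoothed = noise_distribution(counts, power=0.75)

rng = random.Random(0)
negatives = sample_negatives(smoothed, k=5, exclude={"king", "queen"}, rng=rng)

print(unigram["the"], smoothed["the"])        # the frequent word loses mass
print(unigram["banana"], smoothed["banana"])  # the rare word gains mass
print(negatives)
```

- Raising the counts to the 3/4 power flattens the distribution, so rare words are drawn as negatives more often than their raw frequency would allow.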

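- In addition, very frequent words such as the, a, and of are subsampled: each occurrence is randomly discarded from training with a probability that grows with the word's frequency.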
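- The paper's discard probability is P(w_i) = 1 - sqrt(t / f(w_i)), where f(w_i) is the word's frequency and t is a threshold, typically around 1e-5. A minimal sketch follows. The tiny corpus and the much larger toy threshold t = 0.05 are assumptions chosen so that the effect is visible on 1,000 tokens, and clamping the probability at 0 for rare words is an implementation choice.

```python
import math
import random

def discard_prob(freq, t=1e-5):
    """Probability of dropping one occurrence of a word with relative frequency `freq`."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, t=1e-5, rng=None):
    """Return the tokens that survive frequency-based subsampling."""
    rng = rng or random.Random(0)
    total = len(tokens)
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    kept = []
    for tok in tokens:
        if rng.random() >= discard_prob(counts[tok] / total, t):
            kept.append(tok)
    return kept

tokens = ["the"] * 900 + ["king"] * 60 + ["queen"] * 40  # toy corpus
kept = subsample(tokens, t=0.05, rng=random.Random(0))   # toy threshold

print(kept.count("the"), kept.count("king"), kept.count("queen"))
```

- Words whose frequency is at or below t are never discarded, while a word like "the" loses most of its occurrences.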

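- In short, instead of looking at the whole world, negative sampling learns by picking out only the real and the fake and telling them apart.

- It is like how we cannot know every person, yet we can still tell a friend from a stranger.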
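- The "raise the positive score, lower the negative score" behaviour described in this section can be sketched as a single training step in plain Python. The three-dimensional vectors, the learning rate, and the function names are hypothetical toy choices, and this is a simplified sketch rather than the reference word2vec implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_loss(center, positive, negatives):
    """Negative-sampling loss for one positive pair and its sampled negatives."""
    loss = -math.log(sigmoid(dot(positive, center)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, center)))
    return loss

def sgd_step(center, positive, negatives, lr=0.1):
    """One in-place gradient step: label 1 for the positive, 0 for each negative."""
    grad_center = [0.0] * len(center)
    for vec, label in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
        g = sigmoid(dot(vec, center)) - label  # derivative of the loss w.r.t. the score
        for i in range(len(center)):
            grad_center[i] += g * vec[i]       # accumulate using the old output vector
            vec[i] -= lr * g * center[i]       # update the output vector
    for i in range(len(center)):
        center[i] -= lr * grad_center[i]       # update the input vector last

king = [0.1, -0.2, 0.3]    # input vector of the center word (toy values)
queen = [0.0, 0.1, -0.1]   # output vector of the positive context word
banana = [0.2, 0.1, 0.0]   # output vector of one sampled negative

loss_before = neg_loss(king, queen, [banana])
for _ in range(20):
    sgd_step(king, queen, [banana])
loss_after = neg_loss(king, queen, [banana])

print(loss_before, loss_after)
print(dot(queen, king), dot(banana, king))  # positive score vs negative score
```

- Each step touches only 1 + k output vectors, here two, instead of all V, which is where the efficiency gain over the full softmax comes from.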