Bigram probabilities in R

The function below, biGram(), computes the bigram log-probability of a word given a reference corpus. The result is the sum of the log-probabilities of the individual bigrams in the word, including the word-initial and word-final bigrams. The function takes two arguments: a word (x) and a corpus, given as a string or a list of words. A few lines in the function are specific to the Portuguese Stress Lexicon. The function requires the tidyverse package.
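
Written out in symbols, the computation looks roughly as follows. This is a reading of the function rather than a definition taken from a reference, and the notation is an assumption made for illustration: the word is w = c_1 ... c_n, padded with a start marker ^ as c_0 and an end marker $ as c_{n+1}, and count(.) is the number of matches in the corpus.

```latex
\log P(w) = \sum_{i=1}^{n+1} \log P(c_i \mid c_{i-1}),
\qquad
P(c_i \mid c_{i-1}) = \frac{\mathrm{count}(c_{i-1} c_i)}{\mathrm{count}(c_{i-1})},
\qquad
c_0 = \texttt{\^{}},\quad c_{n+1} = \texttt{\$}
```

A three-letter word therefore contributes four terms to the sum: one word-initial bigram, two word-internal bigrams, and one word-final bigram.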

Tip

For working with n-grams in general, I recommend the excellent ngram package.

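library(tidyverse)  # loaded once, outside the function; provides the str_*() helpers

biGram = function(x, corpus){
  
  # Split the corpus into words, then remove hyphens and apostrophes
  words = unlist(str_split(corpus, pattern = " "))
  words = str_replace_all(string = words, pattern = "-", replacement = "")
  words = str_replace_all(string = words, pattern = "'", replacement = "")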
  
  # Split the target word into individual characters
  x1 = str_split(x, pattern = "")[[1]]
  
  # Adding word-initial bigram
  bigrams = c()
  bigrams[1] = paste("^", x1[1], sep = "")
  
  # Adding word-internal bigrams
  # (seq_len() skips the loop for one-letter words, where 1:0 would run twice)
  for(i in seq_len(length(x1)-1)){
    bigrams[length(bigrams)+1] = str_c(x1[i], x1[i+1], sep = "")
  }
  
  # Adding word-final bigram
  bigrams[length(bigrams)+1] = str_c(x1[length(x1)], "$", sep = "")
  
  # Variable for all probabilities
  probs = c()
  
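  # P(second | first) = count(bigram) / count(first symbol).
  # Bigrams are used as regular expressions, so "^" and "$" anchor the word edges.
  for(bigram in bigrams){
    probs[length(probs)+1] = sum(str_count(words, bigram)) / 
      sum(str_count(words, str_sub(bigram, 1, 1)))
  }
  
  # Summing the logs avoids the underflow that log(prod(probs)) can hit on long words
  return(sum(log(probs)))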
}
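
To see the arithmetic on a small case, here is a minimal base-R sketch of the same idea. It is not the function above: bigram_logprob() and its helper pairs() are hypothetical names introduced for illustration. It pads every corpus word with "^" and "$" and compares bigrams as literal strings, and it skips the hyphen and apostrophe cleanup. Because it counts overlapping bigrams and does not treat characters as regular expressions, its results may differ from biGram() on words with repeated letters or special characters.

```r
# Hypothetical helper for illustration; base R only, no tidyverse needed.
bigram_logprob = function(x, corpus){
  # Split the corpus on spaces and mark word edges with "^" and "$"
  words = unlist(strsplit(corpus, " ", fixed = TRUE))
  words = words[nchar(words) > 0]
  padded = paste0("^", words, "$")

  # Return every pair of adjacent characters in one string
  pairs = function(w){
    chars = strsplit(w, "", fixed = TRUE)[[1]]
    paste0(chars[-length(chars)], chars[-1])
  }

  # All bigrams in the corpus, and the first symbol of each
  all_bigrams = unlist(lapply(padded, pairs))
  all_firsts = substr(all_bigrams, 1, 1)

  # Bigrams of the target word, including the two edge bigrams
  target = pairs(paste0("^", x, "$"))

  # count(bigram) and count(first symbol) for each target bigram
  num = vapply(target, function(b) sum(all_bigrams == b), numeric(1))
  den = vapply(substr(target, 1, 1), function(s) sum(all_firsts == s), numeric(1))

  # Sum of log conditional probabilities
  sum(log(num / den))
}

# Toy corpus of three words (made up for this example)
bigram_logprob("casa", "casa cama saco")
```

With this toy corpus the five bigrams of "casa" have probabilities 2/3, 2/3, 1/5, 1 and 2/5, so the result is log(8/225). A bigram that never occurs in the corpus has probability 0, and the log-probability becomes -Inf.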