Translating Quarto (and other markdown files) into Any Language

Translating the language of the text, not the programming language!
translation
Author

Frank Aragona

Published

August 21, 2024

Modified

November 13, 2024

Multilingual Quarto

I made a simple workflow for translating the text of a qmd file (or really any md file) into a new language. I’ll use English to Spanish as an example.

🤗 Hugging Face Transformers

The first step is to find a model that can translate into whatever language we want. I looked into a few different APIs (like Google and DeepL), but they all required a credit card on file (even for the free versions) and an API token, and they all use a tiered approach where you can only make so many API calls.

I wanted to simplify everything and avoid putting my credit card into any browser. Cue Hugging Face Transformers.

Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as:

  • 📝 Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
  • 🖼️ Computer Vision: image classification, object detection, and segmentation.
  • 🗣️ Audio: automatic speech recognition and audio classification.
  • 🐙 Multimodal: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.

Code to use Transformers

There is an R library for Hugging Face, but I think it requires conda to install some Python libraries, and I had conda issues, so I made a simpler package that uses pip to install the Python dependencies. This package is really only for translation and for this demo.

I originally did all of this in Python, but decided to convert everything I could into R because parsing an md file is surprisingly (or unsurprisingly?) way more straightforward in R, and that’s where the real magic happens.

  • first, install reticulate
  • then install the package
  • and then install the Python dependencies

reticulate:

# you need reticulate to use the python code
renv::install('reticulate')

package:

# install my package
remotes::install_github('edenian-prince/translatemd')

python dependencies:

This will install a separate virtual environment called r-transformers, but you can rename it and specify the location of the venv if you want. It will then pip install torch, transformers, and sentencepiece.

translatemd::install_transformers()
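For a rough idea of what an install step like this involves, here is a minimal Python sketch that creates a virtual environment and pip installs the same three packages. I have not checked how install_transformers() does it internally, so treat this as an approximation: the helper names (venv_python, pip_install_command, create_env_and_install) and the choice to create the environment in the current directory are assumptions for illustration.

```python
import os
import subprocess
import venv
from pathlib import Path

# The three packages named above.
PACKAGES = ["torch", "transformers", "sentencepiece"]


def venv_python(env_dir):
    """Path to the Python interpreter inside a virtual environment."""
    # Windows puts the interpreter in Scripts/, other systems in bin/.
    subdir = "Scripts" if os.name == "nt" else "bin"
    exe = "python.exe" if os.name == "nt" else "python"
    return Path(env_dir) / subdir / exe


def pip_install_command(env_dir, packages=PACKAGES):
    """Build the pip command that installs packages into the environment."""
    return [str(venv_python(env_dir)), "-m", "pip", "install", *packages]


def create_env_and_install(env_dir="r-transformers"):
    """Create the environment (hypothetical location) and install into it."""
    venv.create(env_dir, with_pip=True)
    subprocess.run(pip_install_command(env_dir), check=True)
```

Nothing is installed until create_env_and_install() is called. torch is a large download, so expect that call to take a while.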

When I was getting started, this video helped a ton!

Here you need to find an NLP translation model from Hugging Face. I recommend the Helsinki models.

When you find a model you want, copy the entire model name and paste it as a string into the function below, like this for the Spanish model:

translator <- translatemd::select_lang("Helsinki-NLP/opus-mt-en-es")
Note

Model names follow a from-to pattern (xx-yy), so in this case en-es means English to Spanish.

You can pass text to the translate function and it will translate English to Spanish:

translatemd::translate('Hello, my name is Frank')
[1] "Hola, mi nombre es Frank."
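If you are curious what a call like this looks like in plain Python, the sketch below uses the Transformers pipeline API directly. This is not the translatemd source, just an approximation: split_model_name and make_translator are hypothetical helper names, and split_model_name only understands the simple opus-mt-xx-yy naming pattern. The first call to make_translator downloads the model weights from the Hugging Face Hub.

```python
def split_model_name(model_name):
    """Return (source, target) language codes from an opus-mt model name.

    Only handles the simple 'opus-mt-xx-yy' pattern; anything else
    raises a ValueError.
    """
    short_name = model_name.split("/")[-1]
    prefix = "opus-mt-"
    if not short_name.startswith(prefix):
        raise ValueError(f"not an opus-mt model name: {model_name}")
    parts = short_name[len(prefix):].split("-")
    if len(parts) != 2:
        raise ValueError(f"expected 'opus-mt-xx-yy', got: {model_name}")
    return parts[0], parts[1]


def make_translator(model_name):
    """Build a function that translates one string with a Hugging Face model."""
    # Imported here so the helper above works without transformers installed.
    from transformers import pipeline

    pipe = pipeline("translation", model=model_name)

    def translate(text):
        # The pipeline returns a list with one dict per input string.
        return pipe(text)[0]["translation_text"]

    return translate
```

With the dependencies installed, make_translator("Helsinki-NLP/opus-mt-en-es") gives you a function you can call on an English string, much like the R call above.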

Translate an Entire QMD File

  • Parse the QMD
  • Apply the translate function to the text
  • Re-write the translated QMD into a new document
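To make the three steps concrete before bringing in lightparser, here is a small Python sketch of the same idea. It is not how lightparser works internally. The function names (split_markdown, translate_markdown, translate_file) are hypothetical, it only recognizes YAML front matter at the very top of the file and triple-backtick code fences, and translate stands for any function that takes a string and returns a string.

```python
from pathlib import Path


def split_markdown(text):
    """Split a document into (kind, chunk) pairs: 'yaml', 'code' or 'text'."""
    blocks = []
    kind = "text"
    for i, line in enumerate(text.splitlines(keepends=True)):
        stripped = line.strip()
        if i == 0 and stripped == "---":
            # Front matter opens on the first line of the file.
            kind = "yaml"
            blocks.append(["yaml", line])
        elif kind == "yaml":
            blocks[-1][1] += line
            if stripped == "---":
                kind = "text"
        elif kind == "code":
            blocks[-1][1] += line
            if stripped.startswith("```"):
                kind = "text"
        elif stripped.startswith("```"):
            # A fence opens a code block; it is copied through untouched.
            kind = "code"
            blocks.append(["code", line])
        elif blocks and blocks[-1][0] == "text":
            blocks[-1][1] += line
        else:
            blocks.append(["text", line])
    return [tuple(block) for block in blocks]


def translate_markdown(text, translate):
    """Translate only the text blocks, one non-blank line at a time."""
    out = []
    for kind, chunk in split_markdown(text):
        if kind != "text":
            out.append(chunk)
            continue
        for line in chunk.splitlines(keepends=True):
            body = line.rstrip("\r\n")
            newline = line[len(body):]
            out.append((translate(body) if body.strip() else body) + newline)
    return "".join(out)


def translate_file(src, dst, translate):
    """Read src, translate its text, and write the result to dst."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(translate_markdown(text, translate), encoding="utf-8")
```

Translating line by line keeps the sketch short, but it hands the model each line without its neighbors, so a sentence wrapped across lines would be translated in pieces.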

md parse

The lightparser package is fantastic (and light!). It will take the qmd or rmd and return a tibble of its elements.

(parsed <- lightparser::split_to_tbl('_english.qmd'))
It seems you are currently knitting a Rmd/Qmd file. The parsing of the file will be done in a new R session.
# A tibble: 8 × 8
  type    label           params       text  code  heading heading_level section
  <chr>   <chr>           <list>       <nam> <lis> <chr>           <dbl> <chr>  
1 yaml    <NA>            <named list> <lgl> <lgl> <NA>               NA <NA>   
2 inline  <NA>            <lgl [1]>    <chr> <lgl> <NA>               NA <NA>   
3 heading <NA>            <lgl [1]>    <chr> <lgl> Quarto              1 Quarto 
4 inline  <NA>            <lgl [1]>    <chr> <lgl> <NA>               NA Quarto 
5 heading <NA>            <lgl [1]>    <chr> <lgl> Automa…             1 Automa…
6 inline  <NA>            <lgl [1]>    <chr> <lgl> <NA>               NA Automa…
7 block   unnamed-chunk-1 <named list> <lgl> <chr> <NA>               NA Automa…
8 inline  <NA>            <lgl [1]>    <chr> <lgl> <NA>               NA Automa…

md translate

Unnest the text and apply the translate function:

parsed_es <- parsed |>
  tidyr::unnest(cols = text) |>
  dplyr::mutate(text_es = 
    purrr::map(text,translatemd::translate)
  )

Let’s see what it looks like.

parsed_es |>
  tidyr::unnest(cols = text_es) |>
  dplyr::select(type,text) |>
  head()
parsed_es |>
  tidyr::unnest(cols = text_es) |>
  dplyr::select(type,text_es) |>
  head()
# A tibble: 6 × 2
  type    text                                                                  
  <chr>   <chr>                                                                 
1 heading # Quarto                                                              
2 inline  Quarto enables you to weave together content and executable code into…
3 inline  To create the release cycle in your repo you may want to use Conventi…
4 inline  Conventional Commits are a way to format and standardize your commit …
5 inline  -   The word `feat:` can trigger a Github Action to add that commit t…
6 inline  -   and it will up-version the minor release version number.          
# A tibble: 6 × 2
  type    text_es                                                               
  <chr>   <chr>                                                                 
1 heading # Quarto                                                              
2 inline  Quarto le permite entretejer el contenido y el código ejecutable en u…
3 inline  Para crear el ciclo de lanzamiento en su repo es posible que desee ut…
4 inline  Commits convencionales son una forma de formatear y estandarizar sus …
5 inline  - La palabra `feat:` puede activar una acción de Github para añadir q…
6 inline  - y subirá el número de versión de lanzamiento menor.                 

write to new qmd

Clean up and write to a new file:

parsed_es_to_qmd <- parsed_es |>
  dplyr::select(-text) |>
  dplyr::rename(text = text_es)

# output to qmd
lightparser::combine_tbl_to_file(
  parsed_es_to_qmd,
  "_spanish.qmd"
)

Results

And now you have a document in English and another in Spanish.

Bugs

I’ve caught a few bugs with this approach, and you may have even noticed some!

  1. A # got removed in translation. Look at the section called Automating the Release Cycle: since the # got removed, it is no longer a header ☹️
  2. The lightparser package has a reported bug with Quarto chunk YAML parameters. Here it converted #| eval: false into #| eval: no, even though #| eval: false is a chunk option and should not be treated as text. Hopefully this is fixed.

I recommend going through the document and looking for bugs like these! Some manual edits to the translated qmd file may be necessary.
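One possible guard against the first bug is to split off the markdown marker before translating and glue it back on afterwards, so the model never sees the #. The Python sketch below shows the idea. It is not part of translatemd or lightparser: translate_line is a hypothetical helper, translate stands for any string-to-string function, and only heading, bullet, and numbered-list markers are handled.

```python
import re

# Optional leading marker: heading (#, ##, ...), bullet (-, *, +) or "1." style.
MARKER = re.compile(r"^(\s*(?:#{1,6}\s+|[-*+]\s+|\d+\.\s+)?)(.*)$")


def translate_line(line, translate):
    """Translate one line of markdown while keeping its leading marker."""
    if line.lstrip().startswith("#|"):
        # Quarto chunk options are settings, not prose: leave them alone.
        return line
    prefix, body = MARKER.match(line).groups()
    if not body:
        # Blank or marker-only line: nothing to translate.
        return line
    return prefix + translate(body)
```

For example, with a stand-in translator that just upper-cases its input, translate_line("# Quarto", str.upper) keeps the leading "# " in place and only changes the word after it. This does not fix the #| eval: false to #| eval: no conversion, which happens when the file is parsed and rewritten rather than in the translation step.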

Full Script Example

translate.R
# install the python dependencies
translatemd::install_transformers()

# select language
translator <- translatemd::select_lang("Helsinki-NLP/opus-mt-en-es")

# parse your qmd
(parsed <- lightparser::split_to_tbl('_english.qmd'))

# translate the qmd
parsed_es <- parsed |>
  tidyr::unnest(cols = text) |>
  dplyr::mutate(text_es = purrr::map(text,translatemd::translate))

# write to a new qmd
parsed_es_to_qmd <- parsed_es |>
  dplyr::select(-text) |>
  dplyr::rename(text = text_es)

# output to qmd
lightparser::combine_tbl_to_file(
  parsed_es_to_qmd,
  "_spanish.qmd"
)