Multilingual Quarto
I made a simple workflow for translating the text of a qmd file (or really any md file) into a new language. I’ll use English to Spanish as an example.
🤗 Hugging Face Transformers
The first step is to find a model that can translate into whatever language we want. I looked into a few different apis (like Google, DeepL), but they all required a credit card on file (even free versions), an api token, and they all have a tier approach where you can only make so many api calls.
I wanted to simplfy everything and avoid putting my credit card into any browser. Queue Hugging Face Transformers
Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as:
📝 Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation. 🖼️ Computer Vision: image classification, object detection, and segmentation. 🗣️ Audio: automatic speech recognition and audio classification. 🐙 Multimodal: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.
Code to use Transformers
There is an R library for hugging face, but I think it requires conda to install some python libraries and I had conda issues, so I just made a more simple package that uses pip to install the python dependencies. And this package is really only for translation and for this demo.
I originally did all of this in Python, but decided to convert everything I could into R because parsing an md file is surprisingly (or unsurprisingly?) way more straightforward in R, and that’s where the real magic happens.
- first install reticulate,
- then the package
- and then you need to install the python dependencies
reticulate:
# you need reticulate to use the python code
::install('reticulate') renv
package:
# install my package
::install_github('edenian-prince/translatemd') remotes
python dependencies:
This will install a separated virtual environment called r-transformers
but you can rename it and specify the location of the venv if you want. It will then pip install torch
, transformers
and sentencepiece
::install_transformers() translatemd
Getting started, this video helped a ton!
Here you need to find a NLP translation model from Hugging Face. I recommend the Helsinki models
When you find a model you want, copy the entire model name and paste it as a string into the function below, like this for the spanish model:
<- translatemd::select_lang("Helsinki-NLP/opus-mt-en-es") translator
Note that the models are written like from xx to yy so in this case it’s en-es which is english to spanish
You can input text in the translate function and it will translate english to spanish:
::translate('Hello, my name is Frank') translatemd
[1] "Hola, mi nombre es Frank."
Translate an Entire QMD File
- Parse the QMD
- Apply the translate function to the text
- Re-write the translated QMD into a new document
md parse
The lightparser
package is fantastic (and light!). It will take the qmd or rmd and return a tibble of its elements.
<- lightparser::split_to_tbl('_english.qmd')) (parsed
It seems you are currently knitting a Rmd/Qmd file. The parsing of the file will be done in a new R session.
# A tibble: 8 × 8
type label params text code heading heading_level section
<chr> <chr> <list> <nam> <lis> <chr> <dbl> <chr>
1 yaml <NA> <named list> <lgl> <lgl> <NA> NA <NA>
2 inline <NA> <lgl [1]> <chr> <lgl> <NA> NA <NA>
3 heading <NA> <lgl [1]> <chr> <lgl> Quarto 1 Quarto
4 inline <NA> <lgl [1]> <chr> <lgl> <NA> NA Quarto
5 heading <NA> <lgl [1]> <chr> <lgl> Automa… 1 Automa…
6 inline <NA> <lgl [1]> <chr> <lgl> <NA> NA Automa…
7 block unnamed-chunk-1 <named list> <lgl> <chr> <NA> NA Automa…
8 inline <NA> <lgl [1]> <chr> <lgl> <NA> NA Automa…
md translate
unnest the text and apply the translate function
<- parsed |>
parsed_es ::unnest(cols = text) |>
tidyr::mutate(text_es =
dplyr::map(text,translatemd::translate)
purrr )
let’s see what it looks like.
|>
parsed_es ::unnest(cols = text_es) |>
tidyr::select(type,text) |>
dplyrhead()
|>
parsed_es ::unnest(cols = text_es) |>
tidyr::select(type,text_es) |>
dplyrhead()
# A tibble: 6 × 2
type text
<chr> <chr>
1 heading # Quarto
2 inline Quarto enables you to weave together content and executable code into…
3 inline To create the release cycle in your repo you may want to use Conventi…
4 inline Conventional Commits are a way to format and standardize your commit …
5 inline - The word `feat:` can trigger a Github Action to add that commit t…
6 inline - and it will up-version the minor release version number.
# A tibble: 6 × 2
type text_es
<chr> <chr>
1 heading # Quarto
2 inline Quarto le permite entretejer el contenido y el código ejecutable en u…
3 inline Para crear el ciclo de lanzamiento en su repo es posible que desee ut…
4 inline Commits convencionales son una forma de formatear y estandarizar sus …
5 inline - La palabra `feat:` puede activar una acción de Github para añadir q…
6 inline - y subirá el número de versión de lanzamiento menor.
write to new qmd
clean up and write to new
<- parsed_es |>
parsed_es_to_qmd ::select(-text) |>
dplyr::rename(text = text_es)
dplyr
# output to qmd
::combine_tbl_to_file(
lightparser
parsed_es_to_qmd,"_spanish.qmd"
)
Results
And now you have a document in English and another in Spanish
Bugs
I’ve caught a few bugs to this approach and you maybe even noticed some!
- A
#
got removed in the translate - look at the section calledAutomating the Release Cycle
. Since the#
got removed it is no longer a header ☹️ - The
lightparser
package has a reported bug with quarto chunk yaml parameters. Here it converted#| eval: false
into#| eval: no
, but we know that the#| eval: false
should not be treated as text. Hopefully this is fixed
I recommend going through the document and looking for bugs like these! Some manual edits to the translated qmd file may be necessary.
Full Script Example
translate.R
# install
::install_transformers()
translatemd
# select language
<- translatemd::select_lang("Helsinki-NLP/opus-mt-en-es")
translator
# parse your qmd
<- lightparser::split_to_tbl('english.qmd'))
(parsed
# translate the qmd
<- parsed |>
parsed_es ::unnest(cols = text) |>
tidyr::mutate(text_es = purrr::map(text,translatemd::translate))
dplyr
# write to a new qmd
<- parsed_es |>
parsed_es_to_qmd ::select(-text) |>
dplyr::rename(text = text_es)
dplyr
# output to qmd
::combine_tbl_to_file(
lightparser
parsed_es_to_qmd,"_spanish.qmd"
)