Data pipelines in {rixpress} often require controlling how objects are stored and restored, especially when dealing with non-RDS formats (CSV, {qs} compressed files, etc.). This vignette focuses on encoding and decoding in R, and on transferring data between R and Python using rxp_py2r() and rxp_r2py().
By default, {rixpress} uses saveRDS() and readRDS(). You can override this to handle different formats or complex objects:
library(rixpress)

# Encode output as CSV instead of RDS
d2 <- rxp_r(
  mtcars_head,
  my_head(mtcars_am, 100),
  user_functions = "my_head.R",
  nix_env = "default.nix",
  encoder = write.csv
)

# Encode as qs, decode input from CSV
d3 <- rxp_r(
  mtcars_tail,
  my_tail(mtcars_head),
  user_functions = "my_tail.R",
  nix_env = "default2.nix",
  encoder = qs::qsave,
  decoder = read.csv
)

# Decode multiple upstream objects with different decoders
d4 <- rxp_r(
  mtcars_mpg,
  full_join(mtcars_tail, mtcars_head),
  nix_env = "default2.nix",
  decoder = c(
    mtcars_tail = "qs::qread",
    mtcars_head = "read.csv"
  )
)
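The named decoder vector in the last step pairs each upstream object with its own reader. The plain-R sketch below illustrates that idea only; it is not how {rixpress} is implemented. The temporary files, the paths and decoders vectors, and the use of readRDS() in place of qs::qread() (to stay within base R) are all assumptions made for this illustration.

```r
# Illustration only: a base R stand-in for per-input decoders.
# Nothing here calls {rixpress}; all names are hypothetical.

# Pretend these files are the stored outputs of two upstream steps
out_dir <- tempfile("upstream_")
dir.create(out_dir)
paths <- c(
  mtcars_tail = file.path(out_dir, "mtcars_tail.rds"),
  mtcars_head = file.path(out_dir, "mtcars_head.csv")
)
saveRDS(tail(mtcars, 3), paths[["mtcars_tail"]])
write.csv(head(mtcars, 3), paths[["mtcars_head"]])

# One decoder per upstream object, given as function names
decoders <- c(
  mtcars_tail = "readRDS",
  mtcars_head = "read.csv"
)

# Resolve each name to a function and apply it to the matching file
inputs <- lapply(
  setNames(names(decoders), names(decoders)),
  function(nm) match.fun(decoders[[nm]])(paths[[nm]])
)

# Each upstream object is now available under its own name
str(inputs, max.level = 1)
```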
Key points:
- encoder controls how this step’s output is stored.
- decoder specifies how to read inputs from upstream derivations.
As shown in the examples above, you can pass a function or a string representation of the function to encoder and decoder.
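Because write.csv() also writes row names by default, reading its output back with read.csv() yields an extra column, so a thin wrapper can be more convenient than the bare function. The sketch below assumes, as the examples above suggest, that an encoder is called with the object and a file path and that a decoder is called with a file path. encode_csv() and decode_csv() are hypothetical helpers, not part of {rixpress}.

```r
# Hypothetical helpers (not part of {rixpress}).
# Assumed calling convention: encoder(object, path), decoder(path).
encode_csv <- function(object, path) {
  # Drop row names so the file contains only the data columns
  write.csv(object, path, row.names = FALSE)
}

decode_csv <- function(path) {
  read.csv(path)
}

original <- head(mtcars[, c("mpg", "cyl", "am")], 5)

# Bare write.csv(): the row names come back as an extra first column
plain_path <- tempfile(fileext = ".csv")
write.csv(original, plain_path)
plain_names <- names(read.csv(plain_path))
print(plain_names)

# Wrapper pair: the columns round-trip unchanged
clean_path <- tempfile(fileext = ".csv")
encode_csv(original, clean_path)
restored <- decode_csv(clean_path)
print(names(restored))
```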
Encoding an object in a cross-language format makes it possible to pass it to another language. For example, you can read a CSV file with Julia, encode it as Arrow, and then read the result in R:
library(rixpress)

list(
  rxp_jl_file(
    mtcars,
    # Assume here that mtcars.csv is separated by "|" instead of ","
    path = "data/mtcars.csv",
    read_function = "read_csv",
    user_functions = "functions.jl",
    encoder = "write_arrow"
    # read_csv and write_arrow are both
    # defined in the functions.jl script
    # and look like this:
    #function write_arrow(df::DataFrame, filename::String)
    #    Arrow.write(filename, df)
    #end
    #function read_csv(path::String)
    #    df = CSV.read(path, DataFrame; delim="|")
    #    return df
    #end
  ),
  rxp_r(
    mtcars2,
    select(mtcars, am, cyl, mpg),
    decoder = "read_feather"
  )
) |>
  rxp_populate()
You can find this example here. The same approach works for transferring data to Python, and more generally from and to any of the three supported languages (R, Python, and Julia).
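The R side of such an exchange can be tried in isolation. The sketch below assumes that the {arrow} R package is installed and that the read_feather decoder in the pipeline above refers to arrow::read_feather(), which the text does not state explicitly. It writes a data frame to an Arrow file and reads it back; the temporary path and the restored variable are illustrative.

```r
# Sketch under assumptions: requires the {arrow} package; it is skipped
# when the package is not installed.
restored <- NULL

if (requireNamespace("arrow", quietly = TRUE)) {
  arrow_path <- tempfile(fileext = ".arrow")

  # Encode: write the data frame in Arrow (Feather) format
  arrow::write_feather(mtcars[, c("am", "cyl", "mpg")], arrow_path)

  # Decode: read the Arrow file back into R
  restored <- arrow::read_feather(arrow_path)
  print(head(restored))
}
```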
In the specific case of transferring objects (data, lists, vectors, arrays, etc.) between R and Python, it is also possible to use {reticulate}’s built-in conversion through rxp_py2r() and rxp_r2py(). These functions enable seamless movement of objects between R and Python:
library(rixpress)

# Python step producing pandas DataFrame
d1 <- rxp_py(
  name = mtcars_pl_am,
  expr = "mtcars_pl.filter(polars.col('am') == 1).to_pandas()"
)

# Transfer Python -> R
d2 <- rxp_py2r(
  name = mtcars_am,
  expr = mtcars_pl_am
)

# R step processing the data
d3 <- rxp_r(
  name = mtcars_head,
  expr = my_head(mtcars_am),
  user_functions = "functions.R"
)

# Transfer R -> Python
d3_1 <- rxp_r2py(
  name = mtcars_head_py,
  expr = mtcars_head
)
For this to work, you need to add {reticulate} to the pipeline’s execution environment.
- Use encoder/decoder for non-RDS objects (CSV, {qs}, Keras models) and to pass data to and from different languages.
- Use rxp_py2r() and rxp_r2py() if you want to re-use {reticulate}’s built-in conversion (useful for more complex objects).