--- title: "Built on tidyverse" author: Daniel Russ date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Built on tidyverse} %\VignetteEngine{knitr::rmarkdown} %\VignetteDepends{ggplot2} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Occupational data is often nice and tabular, but sometimes you have data that is not so nice. If we have soccer results, we may only want to look at the top 3, and we may have 3 human coder results. Instead of using having columns soc2010_1,score_1,soc2010_2,score_2... soc2010_n,score_n you may want to simply using List columns. This is supported by data.frame and tibble. This allows us to put vectors as values. One drawback is the you need to map when you mutate instead of having a simple function. If your are using socR, codingsystem and xwalks are built on top of tibbles. I implemented some of the dplyr verbs on codingsystems and crosswalk, but if a verb does not exist you can get the underlying tibble using as_tibble(). Lets make some fake data ```{r setup, echo=TRUE,results='hide',message=FALSE} library(socR) library(dplyr) library(purrr) library(ggplot2) ``` ```{r} results <- tibble(id=c("job-1","job-2","job-3","job-4"), JobTitle=c("tv repairman","home plumber","fire fighter","President"), JobTask =c("fix tvs","various plumbing work","put out fires, EMS","owns and operates"), soc2010_1 = c("49-2097","47-2152","33-2011","11-1021"), soc2010_2 = c("49-9099","47-3015","33-1021","11-1011"), soc2010_3 = c("49-9031","49-9071","55-3019","11-9021"), score_1 = c(0.9199,0.9866,0.9903,0.8812), score_2 = c(0.0888,0.1003,0.0132,0.7980), score_3 = c(0.0159,0.0337,0.0113,0.2286) ) results ``` Let combine the soccer results and the soccer scores. ```{r} results <- results |> mutate(soc2010_soccer=pmap(pick(starts_with("soc2010_")),\(...) c(...) ), scores_soccer =pmap(pick(starts_with("score_")),\(...) c(...) ) ) |> select(id:JobTask,soc2010_soccer,scores_soccer) results ``` Now lets make some fake human coder results ```{r} coder <- tibble(id=c("job-1","job-2","job-3","job-4"), JobTitle=c("tv repairman","home plumber","fire fighter","President"), JobTask =c("fix tvs","various plumbing work","put out fires, EMS","owns and operates"), soc2010_1 = c("49-2097","47-2152","33-2011","11-1011"), soc2010_2 = c("","46-130",NA,NA), soc2010_3 = c(NA,NA,NA,NA) ) coder ``` Lets make sure that the human coder gave real codes. The socR package comes with a predefined function is_valid_6digit_soc2010, let's not use it but reinvent it. We will use the soc2010 coding system that comes with socR ```{r} is_valid_6digit_soc2010 <- valid_code( filter(soc2010_all,Level==6) ) coder |> mutate( across(c(starts_with("soc2010_")),is_valid_6digit_soc2010)) ``` Lets combine all the results so that the codes are valid ```{r} coder <- coder |> mutate( soc2010_coder= pmap( pick(starts_with("soc2010_")),\(...){ codes = c(...) codes = codes[is_valid_6digit_soc2010(codes)] } ) ) |> select(id,soc2010_coder) ``` Now we can join the soccer results and the coder results. ```{r} combined_results <- left_join(results,coder,by="id") combined_results ``` Lets see how soccer did with these jobs. Since there are 3 SOCcer results the rank can be 1,2,3,or 4+. Let's make this an ordered factor with levels "1","2","3","4+" ```{r,fig.width=7} factor_levels <- 1:4 factor_labels <- c(1:3,"4+") dta <- combined_results |> mutate(rank= map2_int(soc2010_coder,soc2010_soccer,\(x,y)min(match(x,y)) ), rank=ordered(rank,factor_levels,factor_labels)) dta dta |> group_by(rank) |> summarise(n=n()) |> mutate(pct=cumsum(n)/sum(n) ) |> ggplot(aes(x=rank,y=pct,group = 1)) + geom_point() + geom_line()+ scale_y_continuous("Percent Agreement",limits = c(0,1),labels = scales::percent) + labs(title="SOCcerNET: Agreement with Expert Coding at different ranks", subtitle = "This is an example, not indicative of real data") ``` By default SOCcer returns the top-10 scoring codes, in this example only 3 are given and for the 4 jobs all are successfully coded by soc2010_1 or soc2010_2 (rank=1 or rank=2). From this small and very easy to code set of jobs, you can see that SOCcer agreed with the experts 75% at the most detailed level and 100 of the jobs agreed if you include two codes (Do not expect results this good.). There is no point in continuing the line beyond 2. I hope you see how we can use the list columns and the purrr::map family to handle the case of multiple results. If you do not like map, you may be able to use `dplyr::rowwise` to handle the lists. When you are ready to write out the data, you probably don't want to use csv format. I suggest using `jsonlite::write_json` or if you are adventurous use `arrow::write_feather`