| Title: | Useful functions for working with occupation coding |
|---|---|
| Description: | A set of functions that I find useful in my research into occupational coding and codes. |
| Authors: | Daniel E. Russ [aut, cre] (ORCID: <https://orcid.org/0000-0003-4040-4416>) |
| Maintainer: | Daniel E. Russ <[email protected]> |
| License: | file LICENSE |
| Version: | 0.8.0 |
| Built: | 2026-06-09 15:48:18 UTC |
| Source: | https://github.com/danielruss/socR |
Create a coding system from a data frame
as_codingsystem(x, name = "", ...)
## S3 method for class 'data.frame'
as_codingsystem(x, name = "", ...)
## S3 method for class 'codingsystem'
as_codingsystem(x, name = "", ...)
x |
the data frame containing columns "code" and "title" |
name |
coding system name |
... |
additional parameters |
a codingsystem object.
The bin_center() function takes a vector of scores between 0 and 1 and
a number of bins, and returns the center of the bin each score falls in.
bin_center(score, n_bins)
score |
The scores (values from 0-1) that are being binned |
n_bins |
the number of bins used to bin the scores |
the center of the score bin for all scores
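The binning described above can be sketched in base R. This is only an illustration, not socR's implementation: the helper name `bin_center_sketch` is hypothetical, and it assumes equal-width bins covering [0, 1] with a score of exactly 1 placed in the last bin.

```r
# Hypothetical helper, not part of socR.
# Assumption: n_bins equal-width bins that cover the interval [0, 1].
bin_center_sketch <- function(score, n_bins) {
  # 0-based index of the bin; a score of exactly 1 is kept in the last bin.
  idx <- pmin(floor(score * n_bins), n_bins - 1)
  # The center sits half a bin width above the lower edge of the bin.
  (idx + 0.5) / n_bins
}

bin_center_sketch(c(0, 0.26, 0.5, 1), n_bins = 4)
# 0.125 0.375 0.625 0.875
```

How the real function treats scores that fall exactly on a bin edge is not stated in the documentation, so the edge handling here is a guess.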
This codes one job at a time. To code multiple jobs, create a tibble (data frame) and use purrr::pmap_dfr to produce a results tibble similar to the web-based version of SOCcer (https://soccer.nci.nih.gov).
codeJobHistory(title, task = "", industry = "", ..., n = 10)
title |
The job title |
task |
tasks performed on the job |
industry |
industry (SIC 1987 code) |
... |
(not used) |
n |
the number of SOC codes to return (default 10) |
Please use the web-based version for handling large jobs.
a tibble consisting of the title/task/industry and the top n SOCcer results and scores
## Not run:
soccer_results <- codeJobHistory("epidemiologist")
jobs <- tibble::tibble(title=c("chemist","farmer","data scientist"), task=rep("",3),industry=rep("",3))
soccer_results_3 <- purrr::pmap_dfr(jobs,codeJobHistory,n=20)
## End(Not run)
Returns all the codes in the column code_column.
codes(x, code_column)
x |
A crosswalk of class xwalk |
code_column |
The column names for the desired codes. |
Checks whether a vector of codes is in a vector of reviewer codes.
codesAgree(codes, reviewer)
codes |
codes to compare |
reviewer |
reviewer's code – "gold" standard |
Particularly useful when combined with purrr::map_lgl
TRUE if the codes are in the reviewer codes, otherwise FALSE
x <- '11-1011'
y <- c('11-1011','11-1031')
codesAgree(x,c("11-1011","11-1021"))
codesAgree(y,c("11-1021","11-1031"))
codesAgree(x,c("13-1011","11-1021"))
codesAgree(y,c("13-1011","11-1021"))
Constructor that creates a coding system S3 class.
codingsystem(codes, titles, ..., name = "")
codes |
a vector of codes, a data frame containing the columns "code" (with codes) and "title" (with titles), or a URL/file path of a csv file containing the codes and titles, with a header row containing at least "code" and "title". Other columns may be present. |
titles |
vector of titles |
... |
additional parameters passed into rio::import |
name |
coding system name |
the codingsystem object
url <- "https://danielruss.github.io/codingsystems/naics2022_all.csv"
naic2022 <- codingsystem(url,name = "naics2022", colClasses=c(rep("character",2),"integer",rep("character",5)))
Takes two concordance tables (xw1 and xw2), where xw1 goes from coding system one to an intermediary coding system, and xw2 goes from the intermediary coding system to coding system two. The goal is to make one table that goes from coding system one to coding system two.
combine_crosswalks(xw1, xw2)
xw1 |
- crosswalk 1, either an xwalk object or a data.frame |
xw2 |
- crosswalk 2, either an xwalk object or a data.frame |
# the noc_isco example has an extra column that confuses the parser,
# so I have to specify the parts or skip the last column.
noc_isco <- xwalk("https://danielruss.github.io/codingsystems/noc2011_isco2008.csv", col_types = "cccc-")
isco_soc <- xwalk("https://danielruss.github.io/codingsystems/isco2008_soc2010.csv")
combine_crosswalks(noc_isco,isco_soc)
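The idea of chaining two concordance tables through an intermediary coding system can be sketched with plain data frames and base R's merge(). The toy tables and the column names (`code_a`, `code_mid`, `code_b`) are hypothetical, and this is not necessarily how combine_crosswalks is implemented.

```r
# Toy crosswalks; the codes and column names are made up for illustration.
xw1 <- data.frame(code_a   = c("A1", "A2"),
                  code_mid = c("M1", "M2"),
                  stringsAsFactors = FALSE)
xw2 <- data.frame(code_mid = c("M1", "M1", "M2"),
                  code_b   = c("B1", "B2", "B3"),
                  stringsAsFactors = FALSE)

# Join on the intermediary code, keep only the two outer systems,
# and drop duplicate pairs that arise from different intermediary routes.
joined   <- merge(xw1, xw2, by = "code_mid")
combined <- unique(joined[, c("code_a", "code_b")])
combined
```

In this toy case A1 maps to both B1 and B2 because its intermediary code M1 has two targets in the second table.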
creates a multihot encoder from a list of labels
createMultiHotEncoder(allLabels)
allLabels |
The complete set of labels |
a function that performs multihot encoding
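As an illustration of what a multihot encoder built from a fixed label set might look like, here is a base R closure. The name `make_multihot_sketch`, the list-of-character-vectors input, and the integer matrix output are all assumptions; the interface of the function returned by createMultiHotEncoder is not documented here.

```r
# Hypothetical factory, not socR's createMultiHotEncoder.
make_multihot_sketch <- function(all_labels) {
  function(labels) {
    # labels: a list with one character vector of labels per observation.
    # Each observation becomes a 0/1 vector over all_labels.
    m <- t(vapply(labels,
                  function(l) as.integer(all_labels %in% l),
                  integer(length(all_labels))))
    colnames(m) <- all_labels
    m
  }
}

encode <- make_multihot_sketch(c("11-1011", "11-1021", "11-1031"))
encoded <- encode(list(c("11-1011", "11-1031"), "11-1021"))
encoded
```

Fixing the label set when the encoder is created keeps the column order identical across calls, which is the usual reason for returning a function rather than encoding directly.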
Use the concordance table (crosswalk) to convert from one coding system to another.
crosswalk(codes, xwalk, invert = FALSE, unlist = FALSE)
codes |
the vector of codes that will be crosswalked |
xwalk |
the concordance table. |
invert |
by default the crosswalk goes from codes1 to codes2; setting invert to TRUE makes the crosswalk go from codes2 to codes1 |
unlist |
instead of returning a list, return an unnamed vector; use it when crosswalking a data frame column with mutate |
an unnamed list of codes in the resulting coding system
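The lookup behaviour described above, including the invert option and the unnamed list result, can be sketched in base R. The helper `crosswalk_sketch` and the toy table are hypothetical; the real function works on an xwalk object and may treat unmatched codes differently.

```r
# Toy concordance table; codes and column names are made up.
xw <- data.frame(codes1 = c("A", "A", "B"),
                 codes2 = c("X", "Y", "Z"),
                 stringsAsFactors = FALSE)

# Hypothetical helper, not socR's crosswalk().
crosswalk_sketch <- function(codes, xw, invert = FALSE) {
  from <- if (invert) xw$codes2 else xw$codes1
  to   <- if (invert) xw$codes1 else xw$codes2
  lookup <- split(to, from)   # one vector of target codes per source code
  unname(lookup[codes])       # in this sketch, unmatched codes give NULL
}

crosswalk_sketch(c("A", "B"), xw)         # list(c("X", "Y"), "Z")
crosswalk_sketch("Z", xw, invert = TRUE)  # list("B")
```

A list is the natural return type because one input code can map to several output codes.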
If you have a data frame of data with multiple columns that need to be crosswalked, use this in a pipe.
crosswalk_columns(.data, xwalk, new_column_name, ..., unnest_results = TRUE)
.data |
job data |
xwalk |
crosswalk going from coding system 1 to coding system 2 |
new_column_name |
the column name for the results. If the results are unnested, the results will be colname_1, colname_2 ... colname_n; otherwise the results are a list column named new_column_name. |
... |
Columns that need to be crosswalked |
unnest_results |
default=TRUE, should the results be separated into individual columns or else as a single list column |
a crosswalked tibble.
## Not run:
a <- tibble::tibble(id=c("job-1","job-2"),soc2010_1=c("11-1011","11-2011"), soc2010_2=c("11-1021",NA))
xw <- socR::xwalk("https://danielruss.github.io/codingsystems/soc2010_soc2018.csv")
a |> crosswalk_columns(xw,soc2018_xw,soc2010_1,soc2010_2)
a |> crosswalk_columns(xw,soc2018_xw,soc2010_1,soc2010_2,unnest_results=FALSE)
## End(Not run)
Retrieve or set the dimension of an object.
## S3 method for class 'codingsystem'
dim(x)
x |
an R object, for example a matrix, array or data frame. |
The functions dim and dim<- are internal generic
primitive functions.
dim has a method for data.frames, which returns
the lengths of the row.names attribute of x and
of x (as the numbers of rows and columns respectively).
For an array (and hence in particular, for a matrix) dim retrieves
the dim attribute of the object. It is NULL or a vector
of mode integer.
The replacement method changes the "dim" attribute (provided the
new value is compatible) and removes any "dimnames" and
"names" attributes.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
x <- 1:12 ; dim(x) <- c(3,4)
x

# simple versions of nrow and ncol could be defined as follows
nrow0 <- function(x) dim(x)[1]
ncol0 <- function(x) dim(x)[2]
Takes valid 1980 standardized codes (the ones in the book) and extends them so that unit codes are always the most detailed (even if it is exactly the same as the parent code.)
extend_standard_soc1980_codes(codes)
codes |
The codes we are extending |
extended codes.
These methods allow you to work with the underlying data within a crosswalk as if it were a tibble.
## S3 method for class 'xwalk'
filter(.data, ..., .by = NULL, .preserve = FALSE)
## S3 method for class 'xwalk'
arrange(.data, ..., .by_group = FALSE)
## S3 method for class 'xwalk'
as_tibble(
  x,
  ...,
  .rows = NULL,
  .name_repair = c("check_unique", "unique", "universal", "minimal"),
  rownames = pkgconfig::get_config("tibble::rownames", NULL)
)
.data |
The crosswalk |
... |
< |
.by |
< |
.preserve |
Relevant when the |
.by_group |
If |
x |
A data frame, list, matrix, or other object that could reasonably be coerced to a tibble. |
.rows |
The number of rows, useful to create a 0-column tibble or just as an additional check. |
.name_repair |
Treatment of problematic column names:
This argument is passed on as |
rownames |
How to treat existing row names of a data frame or matrix:
Read more in rownames. |
formats a codingsystem
## S3 method for class 'codingsystem'
format(x, ...)
x |
- the codingsystem |
... |
not currently used |
a formatted character vector
Get a list of codes from a coding system
get_codes(.codingsystem)
.codingsystem |
either a codingsystem or a tibble that has a column named "code". |
a vector of codes
Returns the first or last parts of a vector, matrix, table, data frame
or function. Since head() and tail() are generic
functions, they may also have been extended to other classes.
## S3 method for class 'codingsystem'
head(x, ...)
x |
an object |
... |
arguments to be passed to or from other methods. |
For vector/array based objects, head() (tail()) returns
a subset of the same dimensionality as x, usually of
the same class. For historical reasons, by default they select the
first (last) 6 indices in the first dimension ("rows") or along the
length of a non-dimensioned vector, and the full extent (all indices)
in any remaining dimensions. head.matrix() and
tail.matrix() are exported.
The default and array(/matrix) methods for head() and
tail() are quite general. They will work as is for any class
which has a dim() method, a length() method (only
required if dim() returns NULL), and a [ method
(that accepts the drop argument and can subset in all
dimensions in the dimensioned case).
For functions, the lines of the deparsed function are returned as character strings.
When x is an array(/matrix) of dimensionality two and more,
tail() will add dimnames similar to how they would appear in a
full printing of x for all dimensions k where
n[k] is specified and non-missing and dimnames(x)[[k]]
(or dimnames(x) itself) is NULL. Specifically, the
form of the added dimnames will vary for different dimensions as follows:
k=1 (rows): "[n,]" (right justified with
whitespace padding)
k=2 (columns): "[,n]" (with no whitespace
padding)
k>2 (higher dims): "n", i.e., the indices as
character values
Setting keepnums = FALSE suppresses this behaviour.
As data.frame subsetting (‘indexing’) keeps
attributes, so do the head() and tail()
methods for data frames.
An object (usually) like x but generally smaller. Hence, for
arrays, the result corresponds to x[.., drop=FALSE].
For ftable objects x, a transformed format(x).
For array inputs the output of tail when keepnums is TRUE,
any dimnames vectors added for dimensions >2 are the original
numeric indices in that dimension as character vectors. This
means that, e.g., for 3-dimensional array arr,
tail(arr, c(2,2,-1))[ , , 2] and
tail(arr, c(2,2,-1))[ , , "2"] may both be valid but have
completely different meanings.
Patrick Burns, improved and corrected by R-Core. Negative argument added by Vincent Goulet. Multi-dimension support added by Gabriel Becker.
head(letters)
head(letters, n = -6L)

head(freeny.x, n = 10L)
head(freeny.y)

head(iris3)
head(iris3, c(6L, 2L))
head(iris3, c(6L, -1L, 2L))

tail(letters)
tail(letters, n = -6L)

tail(freeny.x)
## the bottom-right "corner" :
tail(freeny.x, n = c(4, 2))
tail(freeny.y)

tail(iris3)
tail(iris3, c(6L, 2L))
tail(iris3, c(6L, -1L, 2L))

## iris with dimnames stripped
a3d <- iris3 ; dimnames(a3d) <- NULL
tail(a3d, c(6, -1, 2)) # keepnums = TRUE is default here!
tail(a3d, c(6, -1, 2), keepnums = FALSE)

## data frame w/ a (non-standard) attribute:
treeS <- structure(trees, foo = "bar")
(n <- nrow(treeS))
stopifnot(exprs = { # attribute is kept
    identical(htS <- head(treeS), treeS[1:6, ])
    identical(attr(htS, "foo") , "bar")
    identical(tlS <- tail(treeS), treeS[(n-5):n, ])
    ## BUT if I use "useAttrib(.)", this is *not* ok, when n is of length 2:
    ## --- because [i,j]-indexing of data frames *also* drops "other" attributes ..
    identical(tail(treeS, 3:2), treeS[(n-2):n, 2:3] )
})

tail(library) # last lines of function
head(stats::ftable(Titanic))

## 1d-array (with named dim) :
a1 <- array(1:7, 7); names(dim(a1)) <- "O2"
stopifnot(exprs = {
  identical( tail(a1, 10), a1)
  identical( head(a1, 10), a1)
  identical( head(a1, 1), a1 [1 , drop=FALSE] ) # was a1[1] in R <= 3.6.x
  identical( tail(a1, 2), a1[6:7])
  identical( tail(a1, 1), a1 [7 , drop=FALSE] ) # was a1[7] in R <= 3.6.x
})
Check if a value is a URL by looking for http(s)://. Works with vectors.
is_url(x)
x |
String to check |
logical vector: TRUE if x is a URL, FALSE otherwise
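A vectorised check of this kind can be written with a single regular expression. The sketch below is an assumption about the approach, not socR's code: the name `is_url_sketch` is hypothetical, and anchoring the pattern at the start of the string is a choice made here.

```r
# Hypothetical helper, not socR's is_url().
# Assumption: a URL is any string that starts with http:// or https://.
is_url_sketch <- function(x) grepl("^https?://", x)

is_url_sketch(c("https://danielruss.github.io/codingsystems/soc2010_all.csv",
                "soc2010_all.csv"))
# TRUE FALSE
```

Because grepl() is vectorised, the same call works for a single string or a whole column of paths.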
Check if a set of codes are valid for a coding system
is_valid(code, system)
code |
vector of codes to check |
system |
the coding system |
boolean vector corresponding to whether the codes are in the coding system
Is this object a coding system
is.codingsystem(x)
x |
object to test |
Is this object a crosswalk
is.xwalk(x)
x |
A crosswalk of class xwalk |
Gets the levels for a vector of codes from a codingsystem. The type returned depends on the data.
level(data, codes)
## S3 method for class 'codingsystem'
level(data, codes)
data |
- a codingsystem |
codes |
- a vector of codes to check |
a vector of Levels
level(soc1980_all,"99-99") # "division"
level(soc2010_all,c("11-1011","11-2010")) # c(6,5)
load the data from a SOCAssign SQLite database
load_socassign_db(fname, addSrc = FALSE)
fname |
the SOCAssign db file |
addSrc |
should I add the file name as an src column |
tibble with coder results
Look up code
lookup_code(x, system)
x |
list of codes to lookup |
system |
the coding system |
a vector of titles for the codes
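Looking up titles for a vector of codes amounts to a positional match against the "code" column. The sketch below uses a plain data frame with "code" and "title" columns as a stand-in for a codingsystem; `lookup_sketch` and `toy_system` are hypothetical names, and returning NA for unknown codes is an assumption about the sketch only.

```r
# Stand-in for a coding system: two SOC 2010 codes and their titles.
toy_system <- data.frame(
  code  = c("11-1011", "11-1021"),
  title = c("Chief Executives", "General and Operations Managers"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not socR's lookup_code().
# match() gives the row of each code; unknown codes give NA.
lookup_sketch <- function(x, system) system$title[match(x, system$code)]

lookup_sketch(c("11-1021", "99-9999"), toy_system)
# "General and Operations Managers" NA
```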
Convert a list column of codes to a vector of strings for display.
make_code_str(x)
x |
codes column |
a vector of strings, each concatenating all the codes in one row
df <- tibble::tibble(soc2010_codes = list(c("11-1011","11-1021"),c("11-1000")))
df <- dplyr::mutate(df,code_str=make_code_str(soc2010_codes))
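Collapsing each element of a list column into one display string can be done in base R with vapply() and paste(). The separator ", " below is an assumption; the documentation does not say which separator make_code_str uses.

```r
# A list of code vectors, as would be stored in a list column.
codes <- list(c("11-1011", "11-1021"), "11-1000")

# Collapse each element into a single string.
# Assumption: codes are joined with ", ".
code_str <- vapply(codes, paste, character(1), collapse = ", ")
code_str
# "11-1011, 11-1021" "11-1000"
```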
Returns the user assigned name of the coding system
name(system)
system |
coding system |
the name of the coding system (may be blank)
Canadian 4 digit National Occupational Classification (NOC) 2011
noc2011_4digit
a 4-digit code formatted like '0011'; be careful, it must be a string, not an integer
a short definition of the code
https://danielruss.github.io/codingsystems/noc_2011_4d.csv
https://www.statcan.gc.ca/eng/subjects/standard/noc/2011/index
Canadian 4 digit National Occupational Classification (NOC) 2011
noc2011_all
a 1- to 4-digit code formatted like '0011'; be careful, it must be a string, not an integer
a short definition of the code
Unofficial name for the level in the hierarchy (number of digits) for the code, 1, 2, 3, or 4
Official name for the level in the hierarchy
the 1-digit noc code associated with the code
the 2-digit noc code associated with the code, is NA for 1-digit codes
the 3-digit noc code associated with the code, is NA for 1- or 2-digit codes
the 4-digit noc code associated with the code, is NA for 1-, 2-, or 3-digit codes
https://danielruss.github.io/codingsystems/noc_2011_4d.csv
https://www.statcan.gc.ca/eng/subjects/standard/noc/2011/index
prints a codingsystem
## S3 method for class 'codingsystem'
print(x, ...)
x |
- the codingsystem |
... |
parameter for format, not currently used |
These methods allow you to use the codingsystem like a tibble. When using select, make sure you keep the code/title or else you can break the functionality of the codingsystem.
## S3 method for class 'codingsystem'
select(.data, ...)
## S3 method for class 'codingsystem'
filter(.data, ..., .by = NULL, .preserve = FALSE, name = NULL)
## S3 method for class 'codingsystem'
mutate(.data, ...)
## S3 method for class 'codingsystem'
arrange(.data, ..., .by_group = FALSE)
## S3 method for class 'codingsystem'
as_tibble(x, ..., .rows = NULL, .name_repair = NULL, rownames = NULL)
## S3 method for class 'codingsystem'
count(x, ..., wt = NULL, sort = FALSE, name = NULL)
.data |
the coding system |
... |
parts of the coding system |
.by |
passed to dplyr::filter |
.preserve |
passed to dplyr::filter |
name |
name for the filtered coding system |
.by_group |
passed to dplyr::arrange |
x |
the coding system |
.rows |
passed to dplyr::as_tibble |
.name_repair |
passed to dplyr::as_tibble |
rownames |
passed to dplyr::as_tibble |
a new codingsystem
US SOC 1980 classification system
soc1980_all
the n-digit soc 1980
a short definition of the code
https://danielruss.github.io/codingsystems/soc1980.csv
The US SOC 1980 classification system can have higher-level codes (major or minor codes) without any children. This data contains all the most detailed codes regardless of the code level.
soc1980_detailed
the soc 1980 code
a short definition of the code
the level of the soc 1980 code
the parent of the soc 1980 code, note: at the division level, the parent is 0000
for any soc 1980 code, what is the division code
for any soc 1980 code, what is the major code. Is NA for division codes.
for any soc 1980 code, what is the minor code. Is NA for division and major codes.
for any soc 1980 code, what is the unit code. Is NA for non-unit codes.
https://danielruss.github.io/codingsystems/soc1980_most_detailed.csv
The US SOC 1980 classification system can have higher-level codes (major or minor codes) without any children. We extended the SOC 1980 classification system to require every major code (2-digit code) to have at least 1 minor code (3-digit code), and every minor code to have at least 1 unit code (4-digit code). The most detailed code is now always a unit code.
soc1980_extended
the soc 1980 code
a short definition of the code
the level of the soc 1980 code
the parent of the soc 1980 code, note: at the division level, the parent is 0000
for any soc 1980 code, what is the division code
for any soc 1980 code, what is the major code. Is NA for division codes.
for any soc 1980 code, what is the minor code. Is NA for division and major codes.
for any soc 1980 code, what is the unit code. Is NA for non-unit codes.
https://danielruss.github.io/codingsystems/soc_1980_extended.csv
Downloaded by Daniel Russ
soc2010_6digit
a 6-digit code formatted like '11-1011'
a short definition of the code
https://danielruss.github.io/codingsystems/soc_2010_6digit.csv
https://www.bls.gov/soc/2010/2010_major_groups.htm
The complete US SOC 2010 classification system. This data contains all the codes regardless of the code level.
soc2010_all
the soc 2010 code
a short definition of the code
The number of significant digits in the code
The name of the level
the parent of the SOC code; note: 2-digit SOC codes don't have parents
for any soc code, what is the 2-digit code
for any soc code, what is the 3-digit code. Is NA for 2-digit codes.
for any soc code, what is the 5-digit code. Is NA for 2- and 3-digit codes.
for any soc code, what is the 6-digit code. Is NA for 2-, 3-, and 5-digit codes.
https://danielruss.github.io/codingsystems/soc2010_all.csv
The complete US SOC 2018 classification system. This data contains all the codes regardless of the code level.
soc2018_all
the soc 2018 code
a short definition of the code
The number of significant digits in the code
The name of the level
the parent of the SOC code; note: 2-digit SOC codes don't have parents
for any soc code, what is the 2-digit code
for any soc code, what is the 3-digit code. Is NA for 2-digit codes.
for any soc code, what is the 5-digit code. Is NA for 2- and 3-digit codes.
for any soc code, what is the 6-digit code. Is NA for 2-, 3-, and 5-digit codes.
https://danielruss.github.io/codingsystems/soc2018_all.csv
A simple deterministic mechanism for splitting data into training, development, and test data based on the MD5 hash of an unused string parameter.
split_data(x, pTrain = 0.9, pDev = 0.09, pTest = 0.01)
x |
unused string data used to split the data |
pTrain |
approximate percent of the training split |
pDev |
approximate percent of the development split |
pTest |
approximate percent of the test split |
a vector of factors (Train,Dev,Test) denoting the data split
split_data(rownames(mtcars))
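One way a hash-based deterministic split can work is to turn each string's MD5 digest into a number in [0, 1) and compare it with the cumulative split proportions. The sketch below is an illustration under that assumption and is not socR's implementation: `md5_fraction` and `split_sketch` are hypothetical names, and the strings are hashed by writing them to temporary files because `tools::md5sum()` operates on files.

```r
# Hypothetical helper: map each string to a pseudo-uniform number in [0, 1).
md5_fraction <- function(x) {
  vapply(x, function(s) {
    f <- tempfile()
    on.exit(unlink(f))
    writeBin(charToRaw(s), f)            # hash the exact bytes of the string
    h <- unname(tools::md5sum(f))        # 32 hexadecimal characters
    # The first 7 hex digits fit in an R integer; scale them to [0, 1).
    strtoi(substr(h, 1, 7), base = 16L) / 16^7
  }, numeric(1), USE.NAMES = FALSE)
}

# Hypothetical helper, not socR's split_data().
# Assumption: whatever is left after pTrain and pDev goes to Test.
split_sketch <- function(x, pTrain = 0.9, pDev = 0.09) {
  u <- md5_fraction(x)
  label <- ifelse(u < pTrain, "Train",
                  ifelse(u < pTrain + pDev, "Dev", "Test"))
  factor(label, levels = c("Train", "Dev", "Test"))
}

table(split_sketch(rownames(mtcars)))
```

Because the assignment depends only on the string, the same record always lands in the same split, which is why the proportions are only approximate for small inputs.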
US SOC 1980 codes are often written in non-standard form (e.g. 4600 instead of 46-47). This function attempts to standardize some of the ways SOC 1980 codes are written.
standardize_soc1980_codes(codes)
codes |
vector of US SOC 1980 codes |
The function trims leading and trailing zeros (up to 2 trailing zeros; 20 is a valid SOC code).
standardized US SOC 1980 codes
standardize_soc1980_codes(c("2000",'7600'))
Returns the first or last parts of a vector, matrix, table, data frame
or function. Since head() and tail() are generic
functions, they may also have been extended to other classes.
## S3 method for class 'codingsystem'
tail(x, ...)
x |
an object |
... |
arguments to be passed to or from other methods. |
For vector/array based objects, head() (tail()) returns
a subset of the same dimensionality as x, usually of
the same class. For historical reasons, by default they select the
first (last) 6 indices in the first dimension ("rows") or along the
length of a non-dimensioned vector, and the full extent (all indices)
in any remaining dimensions. head.matrix() and
tail.matrix() are exported.
The default and array(/matrix) methods for head() and
tail() are quite general. They will work as is for any class
which has a dim() method, a length() method (only
required if dim() returns NULL), and a [ method
(that accepts the drop argument and can subset in all
dimensions in the dimensioned case).
For functions, the lines of the deparsed function are returned as character strings.
When x is an array(/matrix) of dimensionality two and more,
tail() will add dimnames similar to how they would appear in a
full printing of x for all dimensions k where
n[k] is specified and non-missing and dimnames(x)[[k]]
(or dimnames(x) itself) is NULL. Specifically, the
form of the added dimnames will vary for different dimensions as follows:
k=1 (rows): "[n,]" (right justified with
whitespace padding)
k=2 (columns): "[,n]" (with no whitespace
padding)
k>2 (higher dims): "n", i.e., the indices as
character values
Setting keepnums = FALSE suppresses this behaviour.
As data.frame subsetting (‘indexing’) keeps
attributes, so do the head() and tail()
methods for data frames.
An object (usually) like x but generally smaller. Hence, for
arrays, the result corresponds to x[.., drop=FALSE].
For ftable objects x, a transformed format(x).
For array inputs the output of tail when keepnums is TRUE,
any dimnames vectors added for dimensions >2 are the original
numeric indices in that dimension as character vectors. This
means that, e.g., for 3-dimensional array arr,
tail(arr, c(2,2,-1))[ , , 2] and
tail(arr, c(2,2,-1))[ , , "2"] may both be valid but have
completely different meanings.
Patrick Burns, improved and corrected by R-Core. Negative argument added by Vincent Goulet. Multi-dimension support added by Gabriel Becker.
head(letters)
head(letters, n = -6L)

head(freeny.x, n = 10L)
head(freeny.y)

head(iris3)
head(iris3, c(6L, 2L))
head(iris3, c(6L, -1L, 2L))

tail(letters)
tail(letters, n = -6L)

tail(freeny.x)
## the bottom-right "corner" :
tail(freeny.x, n = c(4, 2))
tail(freeny.y)

tail(iris3)
tail(iris3, c(6L, 2L))
tail(iris3, c(6L, -1L, 2L))

## iris with dimnames stripped
a3d <- iris3 ; dimnames(a3d) <- NULL
tail(a3d, c(6, -1, 2)) # keepnums = TRUE is default here!
tail(a3d, c(6, -1, 2), keepnums = FALSE)

## data frame w/ a (non-standard) attribute:
treeS <- structure(trees, foo = "bar")
(n <- nrow(treeS))
stopifnot(exprs = { # attribute is kept
    identical(htS <- head(treeS), treeS[1:6, ])
    identical(attr(htS, "foo") , "bar")
    identical(tlS <- tail(treeS), treeS[(n-5):n, ])
    ## BUT if I use "useAttrib(.)", this is *not* ok, when n is of length 2:
    ## --- because [i,j]-indexing of data frames *also* drops "other" attributes ..
    identical(tail(treeS, 3:2), treeS[(n-2):n, 2:3] )
})

tail(library) # last lines of function
head(stats::ftable(Titanic))

## 1d-array (with named dim) :
a1 <- array(1:7, 7); names(dim(a1)) <- "O2"
stopifnot(exprs = {
  identical( tail(a1, 10), a1)
  identical( head(a1, 10), a1)
  identical( head(a1, 1), a1 [1 , drop=FALSE] ) # was a1[1] in R <= 3.6.x
  identical( tail(a1, 2), a1[6:7])
  identical( tail(a1, 1), a1 [7 , drop=FALSE] ) # was a1[7] in R <= 3.6.x
})
A utility function for converting occupational codes to higher levels in the hierarchy.
to_level(codingsystem, level)
codingsystem |
The coding system we are using |
level |
The level in the coding system we want. Should be a column name in the codingsystem table. |
a function that converts a vector of codes from a lower level to the requested level.
to_soc2010_2d <- to_level(soc2010_all, soc2d)
to_soc2010_2d(c("11-1011","15-1110"))
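The pattern of returning a converter function for one level of the hierarchy can be sketched as a base R closure over a lookup table. Everything named here is hypothetical: `to_level_sketch`, the toy table, and the values in its `soc2d` column are illustrative only, and the sketch takes the level as a string whereas the example above passes it unquoted.

```r
# Toy stand-in for a coding system with one higher-level column.
toy <- data.frame(
  code  = c("11-0000", "11-1000", "11-1011", "15-0000", "15-1110"),
  soc2d = c("11-0000", "11-0000", "11-0000", "15-0000", "15-0000"),
  stringsAsFactors = FALSE
)

# Hypothetical factory, not socR's to_level().
to_level_sketch <- function(system, level) {
  # The returned function looks each code up and reads off the level column.
  function(codes) system[[level]][match(codes, system$code)]
}

to_2d <- to_level_sketch(toy, "soc2d")
to_2d(c("11-1011", "15-1110"))
# "11-0000" "15-0000"
```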
This function replaces a set of input columns that you pass in with a list column containing the values of the input columns on a row-by-row basis.
to_list_column(df, colname, ...)
df |
the data frame you are modifying |
colname |
the name of the new column |
... |
the columns you are combining into a list column |
The original data frame with a new list column 'colname' replacing the columns given
df <- tibble::tibble(a_1=1:3,a_2=2:4,a_3=3:5,b=4:6) |>
  to_list_column(a,a_1,a_2,a_3)
check whether a code is valid
valid_code(codeList)
is_valid_6digit_soc2010(code)
is_valid_soc1980(code)
is_most_detailed_soc1980(code)
is_valid_extended_soc1980(code)
is_most_detailed_extended_soc1980(code)
codeList |
a vector of valid codes |
code |
codes to compare |
valid_code is a functional that creates a function that checks whether a vector of codes is valid.
is_valid_4digit_soc1980, is_valid_6digit_soc2010 and is_valid_4digit_noc2011 were made using the valid_code functional.
valid_code returns a function. The functions (e.g. is_valid_soc2010) take a code or a vector of codes and return a logical vector indicating whether the codes are valid.
[standardize_soc1980_codes()]
is_valid_toy <- valid_code(c("A","B","C"))
is_valid_toy(c("X","A","Z","B"))
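The closure pattern behind a validator factory like this one fits in a few lines of base R. `valid_code_sketch` is a hypothetical name and the body is an assumption about the general approach (a membership test with `%in%`), not socR's source.

```r
# Hypothetical factory, not socR's valid_code().
valid_code_sketch <- function(code_list) {
  # The returned function remembers code_list and tests membership.
  function(code) code %in% code_list
}

is_valid_toy_sketch <- valid_code_sketch(c("A", "B", "C"))
is_valid_toy_sketch(c("X", "A", "Z", "B"))
# FALSE TRUE FALSE TRUE
```

Capturing the list once means each validator can be passed around and reused without carrying the coding system alongside it.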
Takes a data frame (the crosswalk), along with which columns hold the codes and titles, and creates an xwalk object that can perform crosswalks.
xwalk(
  dta,
  codes1,
  titles1,
  codes2,
  titles2,
  col_types = ifelse(grepl("\\.xlsx?$", dta), "text", "c"),
  ...
)
dta |
the data frame of the crosswalk, or the filename/URL of a csv crosswalk file or the filename of an excel file. |
codes1 |
Codes for the (Default) input coding system for crosswalking |
titles1 |
Titles for the (Default) input coding system. |
codes2 |
Codes for the (Default) output coding system for crosswalking |
titles2 |
Titles for the (Default) output coding system. |
col_types |
set the default col_type parameter for read_csv/read_excel |
... |
additional parameters passed to read_csv |
The more potential codes that a crosswalk allows an initial code to become, the higher the entropy. The entropy (S) is given by

$$S = -\sum_{i} \sum_{j=1}^{n_i} p_i \ln p_i$$

where the outer sum runs over the initial codes i, p_i = 1/n_i, and n_i is the number of potential codes that code i can map to; natural logs are used in the calculation. The inner summation is the sum of n_i identical terms, each weighted by 1/n_i, so the equation can be simplified to

$$S = \sum_{i} \ln n_i$$
xwalk_entropy(x)
x |
The crosswalk |
the entropy
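Reading the description above as a sum over input codes of -Σ p ln p with p = 1/n, each input code contributes ln(n), where n is the number of distinct codes it can map to. The base R sketch below computes that quantity for a toy table. The table, its column names, and the choice to sum (rather than average) over input codes are assumptions, so this may not match what xwalk_entropy returns.

```r
# Toy concordance table: A maps to 2 codes, B to 1, C to 3.
xw <- data.frame(codes1 = c("A", "A", "B", "C", "C", "C"),
                 codes2 = c("X", "Y", "Z", "X", "Y", "Z"),
                 stringsAsFactors = FALSE)

# Number of distinct target codes for each input code.
n_targets <- tapply(xw$codes2, xw$codes1, function(v) length(unique(v)))

# With p = 1/n, -sum(p * log(p)) over n equal terms reduces to log(n).
per_code_entropy <- log(n_targets)
total_entropy <- sum(per_code_entropy)

per_code_entropy
total_entropy   # log(2) + log(1) + log(3) = log(6)
```

A code with exactly one target contributes zero, so a strictly one-to-one crosswalk has zero entropy under this reading.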