Remove duplicate non-numeric columns #1356

Open
EmilHvitfeldt opened this issue Aug 7, 2024 · 1 comment
Labels
feature (a feature request or enhancement), new steps

Comments

@EmilHvitfeldt
Member

In general we don't want perfectly identical features in our model, because 1) a duplicate adds no information, and 2) it can break some models, since duplicated columns are exact linear combinations of each other.
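To make the second point concrete, here is a minimal base R sketch, assuming a small made-up data frame (the names `df`, `zone`, and `zone_copy` are illustrative and not from the issue). Copying a factor column produces a design matrix with more columns than its rank, and `lm()` then reports `NA` for the redundant coefficients.

```r
# Toy data: `zone_copy` is an exact duplicate of the factor `zone`.
df <- data.frame(
  zone = factor(c("a", "b", "c", "a", "b", "c")),
  y = c(1, 2, 3, 4, 5, 6)
)
df$zone_copy <- df$zone

# Dummy-encode both factors: intercept + 2 dummies per factor = 5 columns.
X <- model.matrix(~ zone + zone_copy, data = df)
n_columns <- ncol(X)
matrix_rank <- qr(X)$rank # smaller than n_columns, so X is rank deficient

# lm() cannot estimate the duplicated dummies and sets them to NA.
fit <- lm(y ~ zone + zone_copy, data = df)
aliased <- names(coef(fit))[is.na(coef(fit))]

print(c(columns = n_columns, rank = matrix_rank))
print(aliased)
```

`lm()` copes with this by dropping the aliased terms, but that tolerance is specific to `lm()` and should not be assumed of other model fitting code.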

Right now we only have step_lincomb() for exact linear combinations of numeric columns, and step_corr() for numeric columns that are close to identical (highly correlated). We don't have anything for non-numeric columns.

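In the example below the two factor columns have identical levels, but it would still be an issue if the levels were labelled differently. Note that step_corr() drops one of the duplicated numeric columns and leaves both copies of the factor in place.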

library(recipes)

data(ames, package = "modeldata")

ames <- ames[c(2, 3)]
ames$MS_Zoning_copy <- ames$MS_Zoning
ames$Lot_Frontage_copy <- ames$Lot_Frontage

ames
#> # A tibble: 2,930 × 4
#>    MS_Zoning                Lot_Frontage MS_Zoning_copy        Lot_Frontage_copy
#>    <fct>                           <dbl> <fct>                             <dbl>
#>  1 Residential_Low_Density           141 Residential_Low_Dens…               141
#>  2 Residential_High_Density           80 Residential_High_Den…                80
#>  3 Residential_Low_Density            81 Residential_Low_Dens…                81
#>  4 Residential_Low_Density            93 Residential_Low_Dens…                93
#>  5 Residential_Low_Density            74 Residential_Low_Dens…                74
#>  6 Residential_Low_Density            78 Residential_Low_Dens…                78
#>  7 Residential_Low_Density            41 Residential_Low_Dens…                41
#>  8 Residential_Low_Density            43 Residential_Low_Dens…                43
#>  9 Residential_Low_Density            39 Residential_Low_Dens…                39
#> 10 Residential_Low_Density            60 Residential_Low_Dens…                60
#> # ℹ 2,920 more rows

recipe(~ ., data = ames) |>
  step_corr(all_numeric_predictors()) |>
  prep() |>
  bake(NULL)
#> # A tibble: 2,930 × 3
#>    MS_Zoning                MS_Zoning_copy           Lot_Frontage_copy
#>    <fct>                    <fct>                                <dbl>
#>  1 Residential_Low_Density  Residential_Low_Density                141
#>  2 Residential_High_Density Residential_High_Density                80
#>  3 Residential_Low_Density  Residential_Low_Density                 81
#>  4 Residential_Low_Density  Residential_Low_Density                 93
#>  5 Residential_Low_Density  Residential_Low_Density                 74
#>  6 Residential_Low_Density  Residential_Low_Density                 78
#>  7 Residential_Low_Density  Residential_Low_Density                 41
#>  8 Residential_Low_Density  Residential_Low_Density                 43
#>  9 Residential_Low_Density  Residential_Low_Density                 39
#> 10 Residential_Low_Density  Residential_Low_Density                 60
#> # ℹ 2,920 more rows

Created on 2024-08-07 with reprex v2.1.0
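As a rough illustration of what detecting such columns could look like, here is a base R sketch. It is not an existing recipes step: `find_duplicate_nominal()` and the `toy` data are hypothetical names made up for this example. Each non-numeric column is recoded by order of first appearance, so two columns count as duplicates when they split the rows into the same groups, even if the level labels differ.

```r
# Hypothetical helper (not part of recipes): names of factor/character
# columns that repeat an earlier factor/character column.
find_duplicate_nominal <- function(data) {
  is_nominal <- vapply(
    data,
    function(x) is.factor(x) || is.character(x),
    logical(1)
  )
  nominal <- data[is_nominal]

  # Recode values as integers in order of first appearance, so that
  # relabelled copies produce the same codes as the original column.
  codes <- lapply(nominal, function(x) {
    x <- as.character(x)
    match(x, unique(x))
  })

  # duplicated() flags every column whose codes equal an earlier column's.
  names(nominal)[duplicated(codes)]
}

# Toy data: an exact copy, a relabelled copy, and an unrelated column.
toy <- data.frame(
  zone = factor(c("low", "high", "low", "mid")),
  frontage = c(141, 80, 81, 93)
)
toy$zone_copy <- toy$zone
toy$zone_relabelled <- factor(c("L", "H", "L", "M"))
toy$other <- c("x", "x", "y", "y")

dupes <- find_duplicate_nominal(toy)
print(dupes)

# Drop the flagged columns, keeping the first occurrence of each.
deduped <- toy[setdiff(names(toy), dupes)]
print(names(deduped))
```

Keeping the first occurrence is an arbitrary choice made for this sketch; an actual step would need to decide which column to keep and would store the flagged names at prep() time so that bake() removes the same columns from new data.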

@EmilHvitfeldt added the "new steps" and "feature" (a feature request or enhancement) labels on Aug 7, 2024
@nhward

nhward commented Nov 28, 2024

A variable may also appear twice, once as a number and again as a character. For example, see FIPS and FIPSNO below. Bizarre, but it happens.

> library(sf)
> dplyr::glimpse(st_read(system.file("shape/nc.shp", package = "sf")))
Reading layer `nc' from data source `/usr/lib/R/site-library/sf/shape/nc.shp' using driver `ESRI Shapefile'
Simple feature collection with 100 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS:  NAD27
Rows: 100
Columns: 15
$ AREA      <dbl> 0.114, 0.061, 0.143, 0.070, 0.153, 0.097, 0.062, 0.091, 0.118, 0.124, 0.114, 0.153, 0.143, 0.109, 0.072, 0.190, 0.053,…
$ PERIMETER <dbl> 1.442, 1.231, 1.630, 2.968, 2.206, 1.670, 1.547, 1.284, 1.421, 1.428, 1.352, 1.616, 1.663, 1.325, 1.085, 2.204, 1.171,…
$ CNTY_     <dbl> 1825, 1827, 1828, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1846, 1848, 1874, 1880, 1881…
$ CNTY_ID   <dbl> 1825, 1827, 1828, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1846, 1848, 1874, 1880, 1881…
$ NAME      <chr> "Ashe", "Alleghany", "Surry", "Currituck", "Northampton", "Hertford", "Camden", "Gates", "Warren", "Stokes", "Caswell"…
$ FIPS      <chr> "37009", "37005", "37171", "37053", "37131", "37091", "37029", "37073", "37185", "37169", "37033", "37157", "37077", "…
$ FIPSNO    <dbl> 37009, 37005, 37171, 37053, 37131, 37091, 37029, 37073, 37185, 37169, 37033, 37157, 37077, 37145, 37181, 37083, 37139,…
$ CRESS_ID  <int> 5, 3, 86, 27, 66, 46, 15, 37, 93, 85, 17, 79, 39, 73, 91, 42, 70, 97, 95, 72, 21, 6, 99, 35, 34, 41, 1, 8, 68, 32, 64,…
$ BIR74     <dbl> 1091, 487, 3188, 508, 1421, 1452, 286, 420, 968, 1612, 1035, 4449, 1671, 1556, 2180, 3608, 1638, 3146, 1323, 484, 751,…
$ SID74     <dbl> 1, 0, 5, 1, 9, 7, 0, 0, 4, 1, 2, 16, 4, 4, 4, 18, 3, 4, 1, 1, 1, 0, 1, 2, 10, 23, 13, 6, 4, 16, 8, 0, 10, 6, 0, 2, 16,…
$ NWBIR74   <dbl> 10, 10, 208, 123, 1066, 954, 115, 254, 748, 160, 550, 1243, 930, 613, 1179, 2365, 622, 200, 17, 230, 368, 4, 65, 736, …
$ BIR79     <dbl> 1364, 542, 3616, 830, 1606, 1838, 350, 594, 1190, 2038, 1253, 5386, 2074, 1790, 2753, 4463, 2275, 3725, 1775, 676, 899…
$ SID79     <dbl> 0, 3, 6, 2, 3, 5, 2, 2, 2, 5, 2, 5, 4, 4, 6, 17, 4, 7, 1, 0, 1, 0, 1, 0, 18, 38, 11, 5, 6, 22, 7, 2, 9, 9, 1, 1, 31, 2…
$ NWBIR79   <dbl> 19, 12, 260, 145, 1197, 1237, 139, 371, 844, 176, 597, 1369, 1058, 650, 1492, 2980, 933, 222, 33, 310, 491, 5, 76, 950…
$ geometry  <MULTIPOLYGON [°]> MULTIPOLYGON (((-81.47276 3..., MULTIPOLYGON (((-81.23989 3..., MULTIPOLYGON (((-80.45634 3..., MULTIPOLY…
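A sketch of how this number/character case might be detected in base R, under some assumptions: `find_numeric_text_twins()` is a hypothetical helper, and the small data frame only reuses the first few NAME, FIPS, FIPSNO, and AREA values from the output above rather than reading the shapefile. The text column is parsed to numeric, not the other way round, because `as.character()` on doubles can switch to scientific notation.

```r
# Hypothetical helper: pairs of (text column, numeric column) that hold the
# same values once the text column is parsed as numbers.
find_numeric_text_twins <- function(data) {
  is_num <- vapply(data, is.numeric, logical(1))
  is_txt <- vapply(
    data,
    function(x) is.character(x) || is.factor(x),
    logical(1)
  )

  twins <- character(0)
  for (txt in names(data)[is_txt]) {
    parsed <- suppressWarnings(as.numeric(as.character(data[[txt]])))
    # Simplification: skip text columns with missing or non-numeric values.
    if (anyNA(parsed)) next

    for (num in names(data)[is_num]) {
      # isTRUE() treats an NA comparison result as "not a match".
      if (isTRUE(all(parsed == data[[num]]))) {
        twins <- c(twins, paste(txt, num, sep = " ~ "))
      }
    }
  }
  twins
}

# First three rows of a few columns from the glimpse() output above.
nc_head <- data.frame(
  NAME = c("Ashe", "Alleghany", "Surry"),
  FIPS = c("37009", "37005", "37171"),
  FIPSNO = c(37009, 37005, 37171),
  AREA = c(0.114, 0.061, 0.143)
)

twins <- find_numeric_text_twins(nc_head)
print(twins)
```

Parsing to numeric means text codes with leading zeros would match their numeric twin as well, which is probably what is wanted here but is a design choice of this sketch.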
