-
Notifications
You must be signed in to change notification settings - Fork 28
10. The select:where: queries
SELECT is the most commonly used SQL statement that allows you to subset your data by applying filters to it using WHERE clause. The query language of DataFrame is designed to resemble SQL, so if you have some experience with relational databases, you should "feel like home".
The examples in this section will be using Iris dataset
df := DataFrame loadIris.
There are two things you need to specify in order to subset your data with select:where:
message:
- What features (columns) do you want to get
- What conditions should the observations (rows) satisfy in order to be selected
First argument of the select:where:
message should be an array of column names. They will not affect the selection of rows, but the resulting data frame will contain only these columns. Second argument should be a block with boolean conditions that will be applied to each row of data frame. Only those rows that make a block return true
will be selected. In your conditions you will be referencing the features of your observations. For example, in Iris dataset you might want to select those flowers that belong to #setosa
species and have the width of sepal equal to 3
. To make queries more readable, DataFrame provides a querying language that allows you to specify the columns which you are using in your conditions as arguments of the where-block, and use these arguments in your conditions. So, for example, a block [ :species | species = #setosa ]
passed to select:where:
message will be translated to [ :row | (row atKey: #species) = #setosa ]
and applied to every row of data frame. This means that all the arguments of the block you pass must correspond to the column names of your data frame.
Here is a query that selects petal_width
and petal_length
columns, and all the rows that satisfy the condition described in the previous paragraph
df select: #(petal_width petal_length)
where: [ :species :sepal_width |
species = #setosa and: sepal_width = 3 ].
If you rather want to select all the columns of a data frame, use the selectAllWhere:
message. It works in a same way as SELECT * WHERE
in SQL
df selectAllWhere: [ :species :sepal_width |
species = #setosa and: sepal_width = 3 ].
This query will return a data frame will all 5 columns of Iris dataset and 6 rows that satisfy the given condition.
| sepal_length sepal_width petal_length petal_width species
----+---------------------------------------------------------------
2 | 4.9 3 1.4 0.2 setosa
13 | 4.8 3 1.4 0.1 setosa
14 | 4.3 3 1.1 0.1 setosa
26 | 5 3 1.6 0.2 setosa
39 | 4.4 3 1.3 0.2 setosa
46 | 4.8 3 1.4 0.3 setosa
The previous query will return you only the petal_width
and petal_length
columns of this data frame. Try it yourself!