DISTINCT
function helps the user to identify unique values within a data field. DISTINCT
can also be used to identify unique combinations of multiple data fields. WHERE
is used to identify the beginning of filtering conditions being applied to the dataset. Within one table query section, there should only be one WHERE
and that should be located after the “From” statement after the name of the table. WHERE
should be followed immediately by the first filtering expression. AND
, OR
, NOT
are all supplementary expressions to help combine and refine filter expressions. AND
allows users to obtain the intersecting or overlapping sections of two filtered datasets; OR
allows users to obtain the union of two filtered datasets; NOT
allows users to obtain the complement of a filtered dataset;ORDER BY
allow the user to sort the table based on one or multiple fields. The default sorting mechanism is ascending, alphabetical for string fields and from smallest to largest for numeric fields. ORDER BY
can also be used in combination with DESC
to reverse the sorting order into descending. INSERT INTO
can be used to add new records into a table. In other words, this is equivalent to make a manual entry. Note that Insert Into is not a query operation, the processed result will just be more records added to the table. INSERT INTO
, users can choose to outline the value for every field;NULL
is a special word in SQL to identify truly empty field values, please note that null is not equivalent to an empty string (""). LIMIT
is used to select a requested number of sample records. Sometimes the table is very large and a simple select * statement can take longer to process. Hence during data exploration Limit can be used to get a small sample of the data. Limit is usually written at the very end of a query statement. Min
and Max
are mathematical operators in SQL that will allow the user to obtain the minimum or maximum value of a specified field or set. It can be used on numeric values as well as strings, where the strings are ranked in alphabetically ascending order. COUNT
, AVG
, and SUM
are mathematical operations that can be used to summarize field data. Note that COUNT
does not need to be applied to a numeric field, but AVG
and SUM
should. All three expressions need to be followed by brackets, “()”, and there can be further operations within the brackets. COUNT(*)
can be used to calculate the total number of records. LIKE
is usually used in string expressions to find close lookalikes when exact match strings are cumbersome to express. It is often used together with the character wildcard %
or _
. LIKE
to find strings that match a certain pattern. %
is the most popular wildcard. It can represent any or no character._
represent one single character, having multiple _
in a row will represent that exact number of missing characters.IN
can be used when specifying not one but multiple values, it can be useful in setting filters by combining several OR
expression into one single expression. It can also be combined with NOT
to form NOT IN
which will reflect the inverse of multiple exclusions. IN
always need to be followed by a bracket ()
where the selected matching values will be listed. BETWEEN
can be used to identify ranges of values, it can be applied to numeric values or strings, where strings are ranked alphabetically ascending. BETWEEN
needs to be followed by an AND
in between the specified upper and lower bound values. AS
. Or, you can simply skip the AS
and directly type the nickname after the full expression. JOIN
is equivalent to INNER JOIN
. Both expressions will work in a query. INNER JOIN
will bring two tables together on records that match the identified key field(s) only. In one sense, this means the resulting table will be intercept of the two source tables, unmatched records from either side of the table will be dropped. SELECT *
in a JOIN
statement will output all columns from both tables. If only selected columns are needed then the chosen columns should be specified after SELECT
. Each column name should have a prefix of the table name, which indicates the column is from which source table. LEFT JOIN
is equivalent to LEFT OUTER JOIN
, both can be used as keywords in a query. SELECT b.state_name,b.county,b.date,a.never FROM mask_use_by_county a
left join us_counties b on a.county_fips_code = b.county_fips_code;
RIGHT JOIN
is the opposite of LEFT JOIN
where the unmatched records from the second named or right table will be kept regardless but the unmatched records from the first table will be dropped. It is equivalent to RIGHT OUTER JOIN
and will both work as keywords in a query. FULL JOIN
is the no data loss join. In this case, unmatched records from both the left and right tables will be kept. That is, no records will be dropped from either side of the join. It is equivalent to FULL OUTER JOIN
. SELF JOIN
is an unconventional join statement expression since the keyword JOIN
is actually not found in the query anywhere. It is one way to obtain combinations within the same dataset. AS
to rename two tables and the outputted columns, so SQL will not show errors due to multiple columns with the same name. SELECT A.state_name as higher_state,B.state_name as lower_state,A.deaths as higher_state_deaths,B.deaths as lower_state_deaths
FROM us_states A, us_states B
where A.state_name <> B.state_name
and A.confirmed_cases > B.confirmed_cases;
JOIN
is the concept of merging multiple tables together horizontally (more added fields), UNION ALL
is the concept of appending two separate tables together vertically, by combing the total amount of records from both sides. In a union the user must be very careful with specifying the fields to bring from each table and make sure the correct fields are consolidating together. In a union expression, the keyword UNION ALL
is located in between two separate select statements. GROUP BY
is an extremely useful function in SQL to aggregate data to a higher level. It can also generate a pivot table on top of datasets. GROUP BY
statement needs to have two parts that are in sync. After the initial SELECT
, a set of columns that will be used as aggregation key needs to be listed, followed by the calculated metrics that will be calculated at the aggregated level. Filters, joins, and other operations can follow just like in any regular query. After filtering conditions, a GROUP BY
key word is needed followed by the exact same set of fields used as aggregation key. HAVING
is a very useful filtering mechanism that allow users to place filtering conditions on aggregated results after a group by query. In one sense this allows the user to do query on top of another query. This can be extremely useful in data exploration. HAVING
must be elements of the top select statement. CASE
is the SQL way of placing “If & else” logic into a query. It means generating a new field using case logic into the existing table, usually used to categorize records. Although the case statement itself can take several rows to complete, it is equivalent to one new field, therefore it usually goes immediately behind SELECT
parallel to the fields to be displayed and in front of WHERE
. COALESCE
can be used to embed a simple case logic when encountering null values, it can be very useful in calculations as null values may easily skew or null out the calculated results. COALESCE
is followed by a bracket with two parameters, the first parameter is the field where the null check will be performed, the second parameter is the value to replace NULL
with.