Monday, November 12, 2018

Understanding Relational Division in SQL



Here is the script to create the JobSkills database, with data, used in this example.

-- Database by Doug
-- Douglas Kline
-- 11/10/2018
-- Understanding Relational Division - a (relatively) simple example

USE JobSkills

-- show the diagram of the data model
-- explain

-- here are our applicants (note - 4 of them)

SELECT   *
FROM     Applicant

-- here are our skills (note - 4 of them)

SELECT   *
FROM     Skill

-- now what if every applicant had every skill
-- that would mean that there would be 16 (4 x 4)
-- records in the ApplicantSkill table
-- it would look something like this:

SELECT         *
FROM           Applicant
   CROSS JOIN  Skill
ORDER BY       Applicant.applicantID,
               Skill.skillID

-- if you haven't seen this before,
-- this is a cross join 
-- a cross join of two tables creates all possible combinations
-- of the records in the two tables
-- with no condition - no requirement on lining up FKs and PKs

-- this is known as the *cartesian product* of all records in both tables
-- and you might recall that *product* is another way of saying 
-- 'the result of multiplication'

-- in other words, CROSS JOIN is kind of *relational multiplication*
-- sort of (Applicant records) x (Skill records)

-- think of this in math terms a * b = c
-- where a is Applicants
-- b is Skills
-- and c is the result

-- now if you remember your algebra,
-- if a * b = c
-- that means that c / b = a

-- in our context 4 records times 4 records equals 16
-- so 16 divided by 4 should equal 4

-- let's try to do that!
-- I'm going to use a common table expression (CTE),
-- so you can tell there's nothing up my sleeve!
-- note that this is exactly the CROSS JOIN from above

; WITH [c] (applicantID, lastName, skillID, skillName)
AS
(
   SELECT         *
   FROM           Applicant
      CROSS JOIN  Skill
)
SELECT   applicantID,
         lastName
FROM     [c]
GROUP BY applicantID,
         lastName  
HAVING   COUNT(skillID) = (SELECT COUNT(*) FROM Skill)

-- this could also be rephrased as a subquery like this:

 SELECT   applicantID,
         lastName
FROM     (
            SELECT         *
            FROM           Applicant
               CROSS JOIN  Skill
          ) [c]
GROUP BY applicantID,
         lastName  
HAVING   COUNT(skillID) = (SELECT COUNT(*) FROM Skill)

-- the logic is that I'm showing all applicants
-- that have the same number of skills
-- as are in the skill table

-- so if CROSS JOIN is the "relational multiplication" operator
-- what is the "relational division" operator?

-- unfortunately, there is no relational division operator
-- that's why we are using HAVING and COUNT to get the logical equivalent

-- summary so far
-- 3 applicants CROSS JOIN 4 skills = 12 results, e.g., everyone has every skill
-- then 12 results divided by 4 skills = 3 applicants, the original applicants

-- but this is maybe an unrealistically simple example
-- not all applicants will have all skills

-- what we'll have in reality is a job that requires a certain set of skills
-- then we'll have applicants that each have a certain set of skills
-- and what we want is a set of applicants that whose skills *match* the requirements

-- here is the data model

-- let's start simply, with a single job that requires two skills
-- and a set of applicants that each have a set of skills

-- I've cooked some data for that...
-- examining jobskill...

SELECT   Job.jobID,
         Job.jobName,
         Skill.SkillID,
         Skill.SkillName
FROM     Job
   JOIN  JobSkill ON Job.jobID = JobSkill.jobID
   JOIN  Skill    ON JobSkill.skillID = Skill.skillID
WHERE    Job.JobID = 10
ORDER BY Job.jobID, 
         Skill.SkillID

-- this says that the software developer job 
-- requires the sql and javascript skills

-- now let's cook up some applicants with some skills

SELECT   Applicant.applicantID,
         Applicant.lastName,
         Skill.skillID,
         Skill.skillName
FROM     Applicant
   JOIN  ApplicantSkill ON Applicant.applicantID = ApplicantSkill.applicantID
   JOIN  Skill          ON ApplicantSkill.skillID = Skill.skillid
ORDER BY Applicant.applicantID,
         Skill.skillID

-- so it looks like viable candidates are Jones and Brown
-- Jones has exactly the right skills and no more
-- Brown has the right skills, plus one

-- so to find a candidate that satisfies our job...
-- a * b = c
-- candidateSkills = jobskills
-- candidates * skills = jobskills
-- candidates = jobskills / skills

-- this analogy is not perfect, in that * is not really a cross join
-- and candidateskills and jobskills are actual records, 
-- for which we are trying to find alignment via joins
--
-- we'll solve this with the COUNT method we used before

SELECT   Applicant.applicantID,
         Applicant.lastName,
         COUNT(jobSkill.SkillID) AS [MatchingSkillCount]
FROM     ApplicantSkill
   JOIN  JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID
   JOIN  Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID
WHERE    JobSkill.JobID = 10
GROUP BY Applicant.applicantID,
         Applicant.lastName
HAVING   COUNT(jobSkill.SkillID) = (SELECT COUNT(*) FROM JobSkill WHERE JobSkill.jobID = 10)
ORDER BY Applicant.applicantID

-- note that the outer query limits the applicant skill count to only
-- skills that match exactly the skills for this job
-- and the subquery does the same

-- let's look at this in a more general sense:

SELECT   Job.jobID,
         Job.jobName,
         Skill.SkillID,
         Skill.SkillName
FROM     Job
   JOIN  JobSkill ON Job.jobID = JobSkill.jobID
   JOIN  Skill    ON JobSkill.skillID = Skill.skillID
ORDER BY Job.jobID, 
         Skill.SkillID

-- two jobs, each of which require two skills
-- disjoint skill set

-- same set of applicants
SELECT   Applicant.applicantID,
         Applicant.lastName,
         Skill.skillID,
         Skill.skillName
FROM     Applicant
   JOIN  ApplicantSkill ON Applicant.applicantID = ApplicantSkill.applicantID
   JOIN  Skill          ON ApplicantSkill.skillID = Skill.skillid
ORDER BY Applicant.applicantID,
         Skill.skillID

-- and the number of matching applicant skills for each job

SELECT   Applicant.applicantID,
         Applicant.lastName,
         Job.JobID,
         Job.jobName,
         COUNT(ApplicantSkill.SkillID) AS [MatchingJobSkillCount]
FROM     ApplicantSkill
   JOIN  JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID -- this is *key*
   JOIN  Job      ON Job.jobID = JobSkill.JobID
   JOIN  Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID
GROUP BY Applicant.applicantID,
         Applicant.lastName,
         Job.JobID,
         Job.jobName
ORDER BY Applicant.applicantID,
         Job.JobID

-- now we'd like to limit this to the applicants with jobskill count
-- equal to the jobskill count of each job
-- the trick here is that the the HAVING subquery must be correlated

SELECT   Applicant.applicantID,
         Applicant.lastName,
         Job.JobID,
         Job.jobName,
         COUNT(ApplicantSkill.SkillID) AS [MatchingJobSkillCount]
FROM     ApplicantSkill
   JOIN  JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID -- this is *key*
   JOIN  Job      ON Job.jobID = JobSkill.JobID
   JOIN  Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID
GROUP BY Applicant.applicantID,
         Applicant.lastName,
         Job.JobID,
         Job.jobName
HAVING   COUNT(ApplicantSkill.SkillID) = (SELECT COUNT(js.skillID)
                                          FROM   JobSkill js
                                          WHERE  js.jobID = job.jobID)
ORDER BY Applicant.applicantID,
         Job.JobID

-- perhaps this is clearer as a CTE:
; WITH [JobWithSkillCount] (jobID, skillCount)
AS
(
   SELECT   jobID,
            COUNT(skillID)
   FROM     JobSkill
   GROUP BY jobID 
)
SELECT   Applicant.applicantID,
         Applicant.lastName,
         Job.JobID,
         Job.jobName,
         COUNT(ApplicantSkill.SkillID) AS [MatchingJobSkillCount]
FROM     ApplicantSkill
   JOIN  JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID -- this is *key*
   JOIN  Job      ON Job.jobID = JobSkill.JobID
   JOIN  Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID
   JOIN  [JobWithSkillCount] ON [JobWithSkillCount].jobID = job.jobID
GROUP BY Applicant.applicantID,
         Applicant.lastName,
         Job.JobID,
         Job.jobName,
         JobWithSkillCount.skillCount
HAVING   COUNT(ApplicantSkill.SkillID) = JobWithSkillCount.skillCount
ORDER BY Applicant.applicantID,
         Job.JobID

-- hopefully, you find this a helpful example for understanding relational division

-- A couple last comments

-- I've presented this from the employer point of view:
   -- applicants who match job requirements
-- but it is exactly equivalent form an applicant point of view:
-- jobs that match skillsets

-- I'm not showing the "remainder" of the division
-- for example, Brown actually has *extra* skills!
-- this example shows applicants with the exact matching skillsets
-- and some may have extra skills

-- this is not necessarily the *most efficient* way to
-- do relational division
-- my goal was to create an example for understanding relational division

-- an performant solution would of course depend on many things:
-- the data volume, its statistical distribution, available indexes,
-- the schema, etc.

-- I hope that you found this helpful!

-- Database by Doug
-- Douglas Kline
-- 11/10/2018
-- Understanding Relational Division - a (relatively) simple example



No comments:

Post a Comment

Followers