Here is the script to create the JobSkills database, with data, used in this example.
-- Database by Doug
-- Douglas Kline
-- 11/10/2018
-- Understanding Relational Division - a (relatively) simple example
USE JobSkills
-- show the diagram of the data model
-- explain
-- here are our applicants (note - 4 of them)
SELECT *
FROM Applicant
-- here are our skills (note - 4 of them)
SELECT *
FROM Skill
-- now what if every applicant had every skill
-- that would mean that there would be 16 (4 x 4)
-- records in the ApplicantSkill table
-- it would look something like this:
SELECT *
FROM Applicant
CROSS JOIN Skill
ORDER BY Applicant.applicantID,
Skill.skillID
-- if you haven't seen this before,
-- this is a cross join
-- a cross join of two tables creates all possible combinations
-- of the records in the two tables
-- with no condition - no requirement on lining up FKs and PKs
-- this is known as the *cartesian product* of all records in both tables
-- and you might recall that *product* is another way of saying
-- 'the result of multiplication'
-- in other words, CROSS JOIN is kind of *relational multiplication*
-- sort of (Applicant records) x (Skill records)
-- think of this in math terms a * b = c
-- where a is Applicants
-- b is Skills
-- and c is the result
-- now if you remember your algebra,
-- if a * b = c
-- that means that c / b = a
-- in our context 4 records times 4 records equals 16
-- so 16 divided by 4 should equal 4
-- let's try to do that!
-- I'm going to use a common table expression (CTE),
-- so you can tell there's nothing up my sleeve!
-- note that this is exactly the CROSS JOIN from above
; WITH [c] (applicantID, lastName, skillID, skillName)
AS
(
SELECT *
FROM Applicant
CROSS JOIN Skill
)
SELECT applicantID,
lastName
FROM [c]
GROUP BY applicantID,
lastName
HAVING COUNT(skillID) = (SELECT COUNT(*) FROM Skill)
-- this could also be rephrased as a subquery like this:
SELECT applicantID,
lastName
FROM (
SELECT *
FROM Applicant
CROSS JOIN Skill
) [c]
GROUP BY applicantID,
lastName
HAVING COUNT(skillID) = (SELECT COUNT(*) FROM Skill)
-- the logic is that I'm showing all applicants
-- that have the same number of skills
-- as are in the skill table
-- so if CROSS JOIN is the "relational multiplication" operator
-- what is the "relational division" operator?
-- unfortunately, there is no relational division operator
-- that's why we are using HAVING and COUNT to get the logical equivalent
-- summary so far
-- 3 applicants CROSS JOIN 4 skills = 12 results, e.g., everyone has every skill
-- then 12 results divided by 4 skills = 3 applicants, the original applicants
-- but this is maybe an unrealistically simple example
-- not all applicants will have all skills
-- what we'll have in reality is a job that requires a certain set of skills
-- then we'll have applicants that each have a certain set of skills
-- and what we want is a set of applicants that whose skills *match* the requirements
-- here is the data model
-- let's start simply, with a single job that requires two skills
-- and a set of applicants that each have a set of skills
-- I've cooked some data for that...
-- examining jobskill...
SELECT Job.jobID,
Job.jobName,
Skill.SkillID,
Skill.SkillName
FROM Job
JOIN JobSkill ON Job.jobID = JobSkill.jobID
JOIN Skill ON JobSkill.skillID = Skill.skillID
WHERE Job.JobID = 10
ORDER BY Job.jobID,
Skill.SkillID
-- this says that the software developer job
-- requires the sql and javascript skills
-- now let's cook up some applicants with some skills
SELECT Applicant.applicantID,
Applicant.lastName,
Skill.skillID,
Skill.skillName
FROM Applicant
JOIN ApplicantSkill ON Applicant.applicantID = ApplicantSkill.applicantID
JOIN Skill ON ApplicantSkill.skillID = Skill.skillid
ORDER BY Applicant.applicantID,
Skill.skillID
-- so it looks like viable candidates are Jones and Brown
-- Jones has exactly the right skills and no more
-- Brown has the right skills, plus one
-- so to find a candidate that satisfies our job...
-- a * b = c
-- candidateSkills = jobskills
-- candidates * skills = jobskills
-- candidates = jobskills / skills
-- this analogy is not perfect, in that * is not really a cross join
-- and candidateskills and jobskills are actual records,
-- for which we are trying to find alignment via joins
--
-- we'll solve this with the COUNT method we used before
SELECT Applicant.applicantID,
Applicant.lastName,
COUNT(jobSkill.SkillID) AS [MatchingSkillCount]
FROM ApplicantSkill
JOIN JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID
JOIN Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID
WHERE JobSkill.JobID = 10
GROUP BY Applicant.applicantID,
Applicant.lastName
HAVING COUNT(jobSkill.SkillID) = (SELECT COUNT(*) FROM JobSkill WHERE JobSkill.jobID = 10)
ORDER BY Applicant.applicantID
-- note that the outer query limits the applicant skill count to only
-- skills that match exactly the skills for this job
-- and the subquery does the same
-- let's look at this in a more general sense:
SELECT Job.jobID,
Job.jobName,
Skill.SkillID,
Skill.SkillName
FROM Job
JOIN JobSkill ON Job.jobID = JobSkill.jobID
JOIN Skill ON JobSkill.skillID = Skill.skillID
ORDER BY Job.jobID,
Skill.SkillID
-- two jobs, each of which require two skills
-- disjoint skill set
-- same set of applicants
SELECT Applicant.applicantID,
Applicant.lastName,
Skill.skillID,
Skill.skillName
FROM Applicant
JOIN ApplicantSkill ON Applicant.applicantID = ApplicantSkill.applicantID
JOIN Skill ON ApplicantSkill.skillID = Skill.skillid
ORDER BY Applicant.applicantID,
Skill.skillID
-- and the number of matching applicant skills for each job
SELECT Applicant.applicantID,
Applicant.lastName,
Job.JobID,
Job.jobName,
COUNT(ApplicantSkill.SkillID) AS [MatchingJobSkillCount]
FROM ApplicantSkill
JOIN JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID -- this is *key*
JOIN Job ON Job.jobID = JobSkill.JobID
JOIN Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID
GROUP BY Applicant.applicantID,
Applicant.lastName,
Job.JobID,
Job.jobName
ORDER BY Applicant.applicantID,
Job.JobID
-- now we'd like to limit this to the applicants with jobskill count
-- equal to the jobskill count of each job
-- the trick here is that the the HAVING subquery must be correlated
SELECT Applicant.applicantID,
Applicant.lastName,
Job.JobID,
Job.jobName,
COUNT(ApplicantSkill.SkillID) AS [MatchingJobSkillCount]
FROM ApplicantSkill
JOIN JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID -- this is *key*
JOIN Job ON Job.jobID = JobSkill.JobID
JOIN Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID
GROUP BY Applicant.applicantID,
Applicant.lastName,
Job.JobID,
Job.jobName
HAVING COUNT(ApplicantSkill.SkillID) = (SELECT COUNT(js.skillID)
FROM JobSkill js
WHERE js.jobID = job.jobID)
ORDER BY Applicant.applicantID,
Job.JobID
-- perhaps this is clearer as a CTE:
; WITH [JobWithSkillCount] (jobID, skillCount)
AS
(
SELECT jobID,
COUNT(skillID)
FROM JobSkill
GROUP BY jobID
)
SELECT Applicant.applicantID,
Applicant.lastName,
Job.JobID,
Job.jobName,
COUNT(ApplicantSkill.SkillID) AS [MatchingJobSkillCount]
FROM ApplicantSkill
JOIN JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID -- this is *key*
JOIN Job ON Job.jobID = JobSkill.JobID
JOIN Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID
JOIN [JobWithSkillCount] ON [JobWithSkillCount].jobID = job.jobID
GROUP BY Applicant.applicantID,
Applicant.lastName,
Job.JobID,
Job.jobName,
JobWithSkillCount.skillCount
HAVING COUNT(ApplicantSkill.SkillID) = JobWithSkillCount.skillCount
ORDER BY Applicant.applicantID,
Job.JobID
-- hopefully, you find this a helpful example for understanding relational division
-- A couple last comments
-- I've presented this from the employer point of view:
-- applicants who match job requirements
-- but it is exactly equivalent form an applicant point of view:
-- jobs that match skillsets
-- I'm not showing the "remainder" of the division
-- for example, Brown actually has *extra* skills!
-- this example shows applicants with the exact matching skillsets
-- and some may have extra skills
-- this is not necessarily the *most efficient* way to
-- do relational division
-- my goal was to create an example for understanding relational division
-- an performant solution would of course depend on many things:
-- the data volume, its statistical distribution, available indexes,
-- the schema, etc.
-- I hope that you found this helpful!
-- Database by Doug
-- Douglas Kline
-- 11/10/2018
-- Understanding Relational Division - a (relatively) simple example
No comments:
Post a Comment