Here is the script to create the JobSkills database, with data, used in this example.
-- Database by Doug -- Douglas Kline -- 11/10/2018 -- Understanding Relational Division - a (relatively) simple example USE JobSkills -- show the diagram of the data model -- explain -- here are our applicants (note - 4 of them) SELECT * FROM Applicant -- here are our skills (note - 4 of them) SELECT * FROM Skill -- now what if every applicant had every skill -- that would mean that there would be 16 (4 x 4) -- records in the ApplicantSkill table -- it would look something like this: SELECT * FROM Applicant CROSS JOIN Skill ORDER BY Applicant.applicantID, Skill.skillID -- if you haven't seen this before, -- this is a cross join -- a cross join of two tables creates all possible combinations -- of the records in the two tables -- with no condition - no requirement on lining up FKs and PKs -- this is known as the *cartesian product* of all records in both tables -- and you might recall that *product* is another way of saying -- 'the result of multiplication' -- in other words, CROSS JOIN is kind of *relational multiplication* -- sort of (Applicant records) x (Skill records) -- think of this in math terms a * b = c -- where a is Applicants -- b is Skills -- and c is the result -- now if you remember your algebra, -- if a * b = c -- that means that c / b = a -- in our context 4 records times 4 records equals 16 -- so 16 divided by 4 should equal 4 -- let's try to do that! -- I'm going to use a common table expression (CTE), -- so you can tell there's nothing up my sleeve! -- note that this is exactly the CROSS JOIN from above ; WITH [c] (applicantID, lastName, skillID, skillName) AS ( SELECT * FROM Applicant CROSS JOIN Skill ) SELECT applicantID, lastName FROM [c] GROUP BY applicantID, lastName HAVING COUNT(skillID) = (SELECT COUNT(*) FROM Skill) -- this could also be rephrased as a subquery like this: SELECT applicantID, lastName FROM ( SELECT * FROM Applicant CROSS JOIN Skill ) [c] GROUP BY applicantID, lastName HAVING COUNT(skillID) = (SELECT COUNT(*) FROM Skill) -- the logic is that I'm showing all applicants -- that have the same number of skills -- as are in the skill table -- so if CROSS JOIN is the "relational multiplication" operator -- what is the "relational division" operator? -- unfortunately, there is no relational division operator -- that's why we are using HAVING and COUNT to get the logical equivalent -- summary so far -- 3 applicants CROSS JOIN 4 skills = 12 results, e.g., everyone has every skill -- then 12 results divided by 4 skills = 3 applicants, the original applicants -- but this is maybe an unrealistically simple example -- not all applicants will have all skills -- what we'll have in reality is a job that requires a certain set of skills -- then we'll have applicants that each have a certain set of skills -- and what we want is a set of applicants that whose skills *match* the requirements -- here is the data model -- let's start simply, with a single job that requires two skills -- and a set of applicants that each have a set of skills -- I've cooked some data for that... -- examining jobskill... SELECT Job.jobID, Job.jobName, Skill.SkillID, Skill.SkillName FROM Job JOIN JobSkill ON Job.jobID = JobSkill.jobID JOIN Skill ON JobSkill.skillID = Skill.skillID WHERE Job.JobID = 10 ORDER BY Job.jobID, Skill.SkillID -- this says that the software developer job -- requires the sql and javascript skills -- now let's cook up some applicants with some skills SELECT Applicant.applicantID, Applicant.lastName, Skill.skillID, Skill.skillName FROM Applicant JOIN ApplicantSkill ON Applicant.applicantID = ApplicantSkill.applicantID JOIN Skill ON ApplicantSkill.skillID = Skill.skillid ORDER BY Applicant.applicantID, Skill.skillID -- so it looks like viable candidates are Jones and Brown -- Jones has exactly the right skills and no more -- Brown has the right skills, plus one -- so to find a candidate that satisfies our job... -- a * b = c -- candidateSkills = jobskills -- candidates * skills = jobskills -- candidates = jobskills / skills -- this analogy is not perfect, in that * is not really a cross join -- and candidateskills and jobskills are actual records, -- for which we are trying to find alignment via joins -- -- we'll solve this with the COUNT method we used before SELECT Applicant.applicantID, Applicant.lastName, COUNT(jobSkill.SkillID) AS [MatchingSkillCount] FROM ApplicantSkill JOIN JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID JOIN Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID WHERE JobSkill.JobID = 10 GROUP BY Applicant.applicantID, Applicant.lastName HAVING COUNT(jobSkill.SkillID) = (SELECT COUNT(*) FROM JobSkill WHERE JobSkill.jobID = 10) ORDER BY Applicant.applicantID -- note that the outer query limits the applicant skill count to only -- skills that match exactly the skills for this job -- and the subquery does the same -- let's look at this in a more general sense: SELECT Job.jobID, Job.jobName, Skill.SkillID, Skill.SkillName FROM Job JOIN JobSkill ON Job.jobID = JobSkill.jobID JOIN Skill ON JobSkill.skillID = Skill.skillID ORDER BY Job.jobID, Skill.SkillID -- two jobs, each of which require two skills -- disjoint skill set -- same set of applicants SELECT Applicant.applicantID, Applicant.lastName, Skill.skillID, Skill.skillName FROM Applicant JOIN ApplicantSkill ON Applicant.applicantID = ApplicantSkill.applicantID JOIN Skill ON ApplicantSkill.skillID = Skill.skillid ORDER BY Applicant.applicantID, Skill.skillID -- and the number of matching applicant skills for each job SELECT Applicant.applicantID, Applicant.lastName, Job.JobID, Job.jobName, COUNT(ApplicantSkill.SkillID) AS [MatchingJobSkillCount] FROM ApplicantSkill JOIN JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID -- this is *key* JOIN Job ON Job.jobID = JobSkill.JobID JOIN Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID GROUP BY Applicant.applicantID, Applicant.lastName, Job.JobID, Job.jobName ORDER BY Applicant.applicantID, Job.JobID -- now we'd like to limit this to the applicants with jobskill count -- equal to the jobskill count of each job -- the trick here is that the the HAVING subquery must be correlated SELECT Applicant.applicantID, Applicant.lastName, Job.JobID, Job.jobName, COUNT(ApplicantSkill.SkillID) AS [MatchingJobSkillCount] FROM ApplicantSkill JOIN JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID -- this is *key* JOIN Job ON Job.jobID = JobSkill.JobID JOIN Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID GROUP BY Applicant.applicantID, Applicant.lastName, Job.JobID, Job.jobName HAVING COUNT(ApplicantSkill.SkillID) = (SELECT COUNT(js.skillID) FROM JobSkill js WHERE js.jobID = job.jobID) ORDER BY Applicant.applicantID, Job.JobID -- perhaps this is clearer as a CTE: ; WITH [JobWithSkillCount] (jobID, skillCount) AS ( SELECT jobID, COUNT(skillID) FROM JobSkill GROUP BY jobID ) SELECT Applicant.applicantID, Applicant.lastName, Job.JobID, Job.jobName, COUNT(ApplicantSkill.SkillID) AS [MatchingJobSkillCount] FROM ApplicantSkill JOIN JobSkill ON ApplicantSkill.skillID = JobSkill.SkillID -- this is *key* JOIN Job ON Job.jobID = JobSkill.JobID JOIN Applicant ON Applicant.ApplicantID = ApplicantSkill.ApplicantID JOIN [JobWithSkillCount] ON [JobWithSkillCount].jobID = job.jobID GROUP BY Applicant.applicantID, Applicant.lastName, Job.JobID, Job.jobName, JobWithSkillCount.skillCount HAVING COUNT(ApplicantSkill.SkillID) = JobWithSkillCount.skillCount ORDER BY Applicant.applicantID, Job.JobID -- hopefully, you find this a helpful example for understanding relational division -- A couple last comments -- I've presented this from the employer point of view: -- applicants who match job requirements -- but it is exactly equivalent form an applicant point of view: -- jobs that match skillsets -- I'm not showing the "remainder" of the division -- for example, Brown actually has *extra* skills! -- this example shows applicants with the exact matching skillsets -- and some may have extra skills -- this is not necessarily the *most efficient* way to -- do relational division -- my goal was to create an example for understanding relational division -- an performant solution would of course depend on many things: -- the data volume, its statistical distribution, available indexes, -- the schema, etc. -- I hope that you found this helpful! -- Database by Doug -- Douglas Kline -- 11/10/2018 -- Understanding Relational Division - a (relatively) simple example
No comments:
Post a Comment