Missing Integers? 🙈

This is the “Missing Integer” algorithm, which finds the smallest positive integer that does not occur in a list while accounting for the list not being contiguous and having gaps between values.

At first thought, I think of a simple iteration, though if you approach this problem in the simple and intuitive way, you very quickly run into issues due to the nature of high-level programming languages like Python and everything falls apart.

https://app.codility.com/programmers/lessons/4-counting_elements/missing_integer/

Common Problems

We need to make sure that if no positive numbers actually exist, we just return 1. We don’t want to waste time by running our algorithm when we don’t need to. This would be called a filter.

After this, we want to look for the smallest number as we iterate through the list. For this we should implement both the for each and the best so far styles of algorithm solving. [3, Kay, Andrew [2018])

Algorithm Synthesis! (Attempt #1)

To implement this algorithm, the first thing that would jump to my mind is a loop starting at 1, checking if it occurs in the list and then counting upwards until we hit a number that does not exist. Simple loop, right! Easy-peasy. Not quite.

This becomes extremely performance exhaustive very quickly in Python, we get a time-complexity of O(N**2) or O(N^2), and sadly gets a performance score of just 25% within Codility. Darn.

This means that as our input list increases in size, the algorithm gets exponentially slower and slower because it has to go through the entire list over and over for each increment of i. This is not acceptable. [1, Gao, Albert [2016])

Here are the variables we’ve used and why:

Three failed performance tests as well as a failed correctness test! Oh dear, how should we fix this?

We also failed a test for the input of [1]! This is not good. This is indicative of a logical error in the design of this algorithm. Fortunately this was a simple fix and was from a mistaken less-than-or-equals competitor being used instead of a less-than sign for the filter’s predicate.

This meant that whenever [1] was input, it would return 1 which is incorrect, as 1 exists in the list. It’s not a missing integer.

Attempt #2

Let’s try a completely different approach to this. We don’t want to iterate through this list over and over, so how could we avoid that? Well, if we try to think about this algorithm almost backwards, there are a couple primary scenarios that can happen.

Sub-problems. (Three special cases!)
1) The input list is entirely negative, we should always return 1.
2) The input list has a bunch of “gaps” which make it difficult to know what to return. We typically want the first “gap”.
3) The input list is entirely contiguous and there exist no gaps. We should always return the maximum value within the list, plus one.

Something interesting we could do is try to use a series of filters instead and completely avoid needing to use iteration in the slightest for this. The 1st and 3rd problem is super easy to use filters with, but the 2nd problem is difficult. We don’t know what gaps will occur, if any!

Thankfully, we can use Python’s set objects to quickly find the difference between our input_list and a completely contiguous range of integers, a quickly generated dummy-list. [7, Semwal, Smitha. [n.d.])

Since you’re able to find the exclusive disjunction or the difference of two sets. This is incredibly fast too, because it’s all implemented in C/C++ and avoids using the slow Python interpreter as much as possible. [2, Weisstein, Eric W. [n.d.]) [8, Taylor, Courtney [2018])

Venn diagram of Exclusive or — Source – (Watchduck, Public Domain)
The exclusive disjunction of two sets within set theory.

Variables & Data-Structures:

With just three filters, we managed to get a linear big-O notation of O(N) which is significantly better than O(N^2)! 🎉
[1, Gao, Albert [2016])

Evaluation

From breaking down the problem into sub-problems, it was much easier to come up with an efficient solution for this algorithm, with three filters instead of requiring slow and inefficient Python iteration. [5, Watshon, Greg [2017])

With this method we keep things as speedy as possible by removing all iteration and keeping the intensive part of the code within a single line so that the Python interpreter is able to execute it in one go, similar to the advantage given by list comprehensions. [6, Strika, Luciano [2018])

“List comprehension is faster because it is optimized for the Python interpreter to spot a predictable pattern during looping. Besides the syntactic benefit of list comprehensions, they are often as fast or faster than equivalent use of map.”
Greg Watshon, New York University [NYU], 2017.
Direct Quotation from Source [5], Paragraph 8.

To get faster, I believe it would be necessary to switch to a lower-level language like Rust or a C-based language.

Appendix

Just for fun, I decided to benchmark 100 runs of each attempt against a list of 10,000,000 randomly generated integers. Attempt_1.5.py was cut due to word limit.) In total this benchmark took about 5 minutes for 300 runs total with 3 billion integers being processed.

My second attempt of the algorithm ended up being, on average, 33.16x faster than my first attempt and 65.75x faster than the attempt that was cut.

A well-designed algorithm, like Attempt_2.py is, can be the difference between your algorithm averaging 1.671 seconds or just 25.4 milliseconds, over 65x faster.

Sources & References

[1] Gao, Albert. “Everything you need to know about Big O notation”, Jun. 2, 2016, medium.com/@albertgao/everything-you-need-to-know-about-big-o-notation-5436d2702199
[2] Weisstein, Eric W. “Exclusive Disjunction.” California Institute of Technology [Caltech], Wolfram Research MathWorld. mathworld.wolfram.com/ExclusiveDisjunction.html
[3] Kay, Andrew. “Algorithmic ‘Plans’ ” Birmingham City University [BCU], Mar. 19, 2018, dsaa.werp.site/post/algorithmic-plans/
[4] Samual, Sam. “Find difference between two lists.” TutorialsPoint, Sep. 25, 2018, tutorialspoint.com/python-program-to-list-the-difference-between-two-lists
[5] Watshon, Greg. “Python | Optimizing Loops”, New York University [NYU], Jan. 27, 2017, nyu-cds.github.io/python-performance-tips/08-loops/
(git-blame available here: https://github.com/nyu-cds/python-performance-tips/commit/a2632e3cd1b008eebe99efd1399152dbfd029a85)
[6] Strika, Luciano. “Python’s List Comprehensions: Uses and Advantages”, Towards Data Science, Aug. 19, 2018, towardsdatascience.com/how-list-comprehensions-can-help-your-code-look-better-and-run-smoother-3cf8f87172ae
[7] Semwal, Smitha. “Python | Find numbers in a sorted list range” , GeekforGeeks, n.d. www.geeksforgeeks.org/python-find-missing-numbers-in-a-sorted-list-range/
[8] Taylor, Courtney. “What Is the Difference of Two Sets in Set Theory?” ThoughtCo, Jun. 27, 2018, thoughtco.com/difference-of-two-sets-3126580.

Images
[9] Cmglee, CC BY-SA 4.0 Creative Commons, commons.wikimedia.org/wiki/File:Comparison_computational_complexity.svg
[10] Watchduck, Public Domain, commons.wikimedia.org/wiki/File:Venn0110.svg
[…] Any images that have not been sourced are self-produced, with the exception of any emoji graphics. Twemoji is licensed under the CC-BY 4.0 license courtesy of Twitter, Inc and other contributors.

All of the code written and produced for this assignment is available on my personal GitHub account under the MIT Open-Source license. All code & commits are digitally signed with GPG and can be cryptographically verified to have been my own creation. https://github.com/Snuggle/MissingIntegers

This document is for academic & educational purposes only.

Missing Integers? 🙈

Common Problems

Algorithm Synthesis! (Attempt #1)

Attempt #2

Evaluation

Appendix

Sources & References

Word Count

Published by codeandsandwiches

Leave a comment Cancel reply

Common Problems

Algorithm Synthesis! (Attempt #1)

Attempt #2

Evaluation

Appendix

Sources & References

Word Count

Share this:

Related

Published by codeandsandwiches

Leave a comment Cancel reply