The Binary Chop!

What problem does the algorithm solve?

What is a binary search algorithm? It allows you to search for things incredibly quickly! It’s simple, really: when you need to find something in a massive, sorted list or collection of data, you can use a binary search algorithm to find it in very few steps.

How does it work?

The algorithm is similar in concept to what we call binary search trees, which are an incredibly popular way to organise file systems and databases to allow for fast lookup of data.

Many modern filesystems use an adaptation of this binary-tree style of organising data, usually called a B-tree. This includes Microsoft Windows’ NTFS, macOS’s APFS/HFS+ and Linux’s Ext4/Btrfs/ZFS, all of which (NTFS aside) have incredibly fast indexing and lookup for searching files. [1]

In a binary search tree, you look at the root node and do a comparison. Is my object (which could be pretty much any data: a string, an integer, an executable’s hash, whatever!) smaller or larger in value than the root? If smaller, you typically go to the left side; if larger, you typically go to the right side.

The tree is organised, sorted, in a way such that if you follow this simple check over and over, you will find the item you’re looking for. [8]
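The compare-and-descend procedure above can be sketched in a few lines of Python. The `Node` class and the example tree are my own illustrative assumptions, not code from any particular library:

```python
# A minimal sketch of looking up a value in a binary search tree.
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def bst_search(node, target):
    while node is not None:
        if target < node.value:      # smaller? go left
            node = node.left
        elif target > node.value:    # larger? go right
            node = node.right
        else:
            return True              # found it
    return False

# A balanced tree holding 1..7, with 4 at the root:
root = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5), Node(7)))
print(bst_search(root, 4))  # True — found at the root in one comparison
print(bst_search(root, 8))  # False — fell off the tree
```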

A visual representation of searching for the number “4” within a binary tree.

The correct item is found after three comparisons within a collection of nine items, which means we’ve found our item in nearly exactly ⌊log2(n)⌋ comparisons: log2(9) = ~3.17.

In fact, this is just a tiny bit “better” than the actual worst-case scenario for this algorithm, which happens to be log2(n) + 1 comparisons. Using this, we can find out that the maximum number of comparisons with nine items is log2(9) + 1 = ~4.17 (floored to 4), but for general trends we should always drop constants, so our Big-O notation ends up as O(log2(n)).
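As a quick sanity check of the arithmetic above, the worst-case comparison count for nine items works out like this (a throwaway sketch, not part of the algorithm itself):

```python
import math

n = 9
# Worst case is floor(log2(n)) + 1 whole comparisons:
worst_case = math.floor(math.log2(n)) + 1  # floor(3.17) + 1
print(worst_case)  # 4
```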

Visual representation of a binary search tree with 15 items, also showing the worst case scenarios.

In essence, a binary search algorithm does much the same thing as a binary search tree. You must start with sorted data, but then you find the midpoint (the middle value), do your comparison and follow it along, almost exactly like in the binary search tree!

Let’s have a look at some Python code:

def b(A,k):
 l,h=0,len(A)
 while l<h:
  m=(l+h)//2
  if A[m]>k:h=m
  elif A[m]<k:l=m+1
  else:return m
 return -1

(Code sourced from https://codegolf.stackexchange.com/questions/63726/iterative-binary-search/63727 — a modification had to be made since Python 3 requires floored division, // instead of /; otherwise m becomes a float and list indexing raises a TypeError.)

Uh… okay. That’s a little hard to decipher. Let’s try… hm. That’s also not easy to understand. This is unfortunately an algorithm that’s incredibly hard to write correctly, since you need to be very careful about off-by-one errors and make sure your while loop exits correctly. [2] [3]

Essentially, for each sequence presented, the algorithm splits it at the midpoint and checks whether the midpoint is smaller or larger than the item being searched for. It then adjusts the outer boundaries of the list using the first/last pointers, and repeats until it is finished. [7]

def binarySearch(alist, item):
    first = 0
    last = len(alist)-1
    found = False

    while first<=last and not found:
        midpoint = (first + last)//2
        if alist[midpoint] == item:
            found = True
        else:
            if item < alist[midpoint]:
                last = midpoint-1
            else:
                first = midpoint+1

    return found

Static/Dynamic Analysis

Using [4] and tweaking it slightly, I was able to do a running-time analysis of the two most common searching algorithms, linear & binary. (Python already implements a linear search through the built-in in operator.)

I collected the data and then plotted it to show the general trend.

#!/usr/bin/env python3
import time

# Vary this from 10**0 up to 10**9
range_limit = 1000000000
x = list(range(range_limit))

def Time_linear(alist,item):
    t1  = time.time()
    found = item in alist
    t2 = time.time()
    timer = t2 - t1  
    return timer

def Time_binarySearch(alist, item):
    first = 0
    last = len(alist)-1
    found = False 
    t1 = time.time()
    while first<=last and not found:
        midpoint = (first + last)//2
        if alist[midpoint] == item:
            found = True
        else:
            if item < alist[midpoint]:
                last = midpoint-1
            else:
                first = midpoint+1
    t2 = time.time()
    timer = t2 - t1
    return timer

print(f"binarySearch: {Time_binarySearch(x, range_limit//2)}")
print(f"linearSearch: {Time_linear(x, range_limit//2)}")
Example output from a run.

This makes a lot of sense if we look at our algorithm and attempt a static analysis! Our collection size is the limiting input. We divide n into two, giving n/2, and then do it over and over again. This gives us n/2^x = 1, which can be rearranged into n = 2^x. We can remove the exponent by taking a logarithm, which leaves us with log2(n) — and that happens to be the known worst-case performance.

If we look at the root node, and our item happens to match, we have a single sub-problem of size 1, nothing else happens since we’ve already found our item and we finish with a performance of O(1). This happens to be the best-case performance. [5]


Proof of Termination

while first<=last and not found:
    midpoint = (first + last)//2
    if alist[midpoint] == item:
        found = True
    else:
        if item < alist[midpoint]:
            last = midpoint-1
        else:
            first = midpoint+1
===== Simplify =====
while first<=last:
    if CONDITION:
        last = midpoint-1
    else:
        first = midpoint+1

# Assuming a sequence that is in order, the while loop *must* end, because either branch moves the first and last variables closer together until one surpasses the other.

This makes `first <= last` false, and the while loop ends.
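The shrinking-interval argument can be demonstrated by instrumenting the loop to record the width last - first on every iteration (the width tracking is my own illustrative addition):

```python
def binary_search_widths(alist, item):
    # Record the interval width (last - first) each iteration;
    # it strictly decreases, so the loop must terminate.
    first, last = 0, len(alist) - 1
    widths = []
    while first <= last:
        widths.append(last - first)
        midpoint = (first + last) // 2
        if alist[midpoint] == item:
            break
        elif item < alist[midpoint]:
            last = midpoint - 1
        else:
            first = midpoint + 1
    return widths

# Searching for 15 in [0..15]:
print(binary_search_widths(list(range(16)), 15))  # [15, 7, 3, 1, 0]
```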

Preconditions

  • A list/sequence must be supplied, as well as an item to look up (usually an integer).
  • The list/sequence MUST be sorted for the algorithm to work.

Invariants:
– The value I am looking for is always within the current split section or “sub-problem”, if it exists.

References

[1] Comer, D. (June 1979). “The Ubiquitous B-Tree”. Computing Surveys. https://en.wikipedia.org/wiki/B-tree
[2] TFeld (November 2015) Stack Overflow, Code Golf https://codegolf.stackexchange.com/questions/63726/iterative-binary-search/63727
[3] Miller, B., Ranum, D. (2014). “Problem Solving with Algorithms and Data Structures” Interactive Python https://interactivepython.org/runestone/static/pythonds/SortSearch/TheBinarySearch.html
[4] Gospodinov, N. (August 2015) Stack Overflow https://stackoverflow.com/questions/31936657/binarysearch-vs-in-unexpected-results-python
[5] Kay, A. (May 2018) “Static Analysis” DS&A Data Structures and Algorithms https://dsaa.werp.site/post/static-analysis/
[6] 2cupsOfTech, (October 2012) Stack Overflow https://stackoverflow.com/questions/2307283/what-does-olog-n-mean-exactly/13093274#13093274
[7] Khan Academy (n.d.), “Binary search” Computer Science Courses https://www.khanacademy.org/computing/computer-science/algorithms/binary-search/a/binary-search
[8] Javatpoint (n.d.), “Searching in Binary Search Tree” https://www.javatpoint.com/searching-in-binary-search-tree

Missing Integers? 🙈

This is the “Missing Integer” algorithm, which finds the smallest positive integer that does not occur in a list while accounting for the list not being contiguous and having gaps between values.

At first thought, a simple iteration springs to mind, but if you approach this problem in the simple, intuitive way, you very quickly run into performance issues due to the nature of high-level programming languages like Python, and everything falls apart.

https://app.codility.com/programmers/lessons/4-counting_elements/missing_integer/

Common Problems

We need to make sure that if no positive numbers exist at all, we just return 1. We don’t want to waste time running the full algorithm when we don’t need to. This is called a filter.

After this, we want to look for the smallest missing number as we iterate through the list. For this we should implement both the “for each” and the “best so far” styles of algorithmic plan. [3, Kay, Andrew (2018)]

Algorithm Synthesis! (Attempt #1)

To implement this algorithm, the first thing that would jump to my mind is a loop starting at 1, checking if it occurs in the list and then counting upwards until we hit a number that does not exist. Simple loop, right! Easy-peasy. Not quite.
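The original Attempt #1 code isn’t reproduced in this section, but the approach described reads roughly like the sketch below (the function name is my own; the real attempt lives in the linked GitHub repo):

```python
def missing_integer_naive(input_list):
    # Start at 1 and count upwards until we find a value that
    # does not occur in the list. Each `in` check is itself a
    # linear scan of the whole list, giving O(N^2) overall.
    i = 1
    while i in input_list:
        i += 1
    return i

print(missing_integer_naive([1, 3, 6, 4, 1, 2]))  # 5
print(missing_integer_naive([-1, -3]))            # 1
```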

This becomes extremely performance-intensive very quickly in Python: we get a time-complexity of O(N^2), which sadly earns a performance score of just 25% within Codility. Darn.

This means that as our input list increases in size, the algorithm gets quadratically slower, because it has to go through the entire list again for each increment of i. This is not acceptable. [1, Gao, Albert (2016)]

Here are the variables we’ve used and why:

Three failed performance tests as well as a failed correctness test! Oh dear, how should we fix this?

We also failed a test for the input of [1]! This is not good, and is indicative of a logical error in the design of this algorithm. Fortunately it was a simple fix: a mistaken less-than-or-equals comparator had been used instead of a less-than sign for the filter’s predicate.


This meant that whenever [1] was input, it would return 1 which is incorrect, as 1 exists in the list. It’s not a missing integer.

Attempt #2

Let’s try a completely different approach to this. We don’t want to iterate through this list over and over, so how could we avoid that? Well, if we try to think about this algorithm almost backwards, there are a couple primary scenarios that can happen.

Sub-problems. (Three special cases!)
1) The input list is entirely negative, we should always return 1.
2) The input list has a bunch of “gaps” which make it difficult to know what to return. We typically want the first “gap”.
3) The input list is entirely contiguous and there exist no gaps. We should always return the maximum value within the list, plus one.

Something interesting we could do is use a series of filters instead, completely avoiding iteration. The 1st and 3rd problems are super easy to handle with filters, but the 2nd is difficult: we don’t know what gaps will occur, if any!

Thankfully, we can use Python’s set objects to quickly find the difference between our input_list and a completely contiguous range of integers, a quickly generated dummy-list. [7, Semwal, Smitha (n.d.)]

This works because you’re able to find the exclusive disjunction, or the difference, of two sets. It’s incredibly fast too, because it’s all implemented in C and avoids the slower Python interpreter as much as possible. [2, Weisstein, Eric W. (n.d.)] [8, Taylor, Courtney (2018)]

Venn diagram of Exclusive or
Source – (Watchduck, Public Domain)
The exclusive disjunction of two sets within set theory.

Variables & Data-Structures:

With just three filters, we managed to get a linear big-O notation of O(N) which is significantly better than O(N^2)! 🎉
[1, Gao, Albert (2016)]

Source – (Cmglee, CC BY-SA 4.0 Creative Commons.)
Code (With Comments)
Code Minified (Without Comments)
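Since the full code is linked rather than embedded here, below is my own sketch of the three-filter, set-difference approach described above; the real Attempt_2.py in the GitHub repo may differ in detail:

```python
def missing_integer(input_list):
    # Filter 1: if there are no positive values at all, return 1.
    positives = set(x for x in input_list if x > 0)
    if not positives:
        return 1
    # Filters 2 & 3: build a contiguous dummy range 1..max+1 and
    # take the set difference. The smallest "gap" is the answer;
    # if the input is contiguous, the only gap is max + 1.
    gaps = set(range(1, max(positives) + 2)) - positives
    return min(gaps)

print(missing_integer([1, 3, 6, 4, 1, 2]))  # 5 (first gap)
print(missing_integer([-1, -3]))            # 1 (all negative)
print(missing_integer([1, 2, 3]))           # 4 (contiguous, max + 1)
```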

Evaluation

Breaking the problem down into sub-problems made it much easier to come up with an efficient solution for this algorithm: three filters instead of slow, inefficient Python iteration. [5, Watshon, Greg (2017)]

With this method we keep things as speedy as possible by removing all explicit iteration and keeping the intensive part of the code within a single line, so that the Python interpreter is able to execute it in one go, similar to the advantage given by list comprehensions. [6, Strika, Luciano (2018)]

“List comprehension is faster because it is optimized for the Python interpreter to spot a predictable pattern during looping. Besides the syntactic benefit of list comprehensions, they are often as fast or faster than equivalent use of map.”

Greg Watshon, New York University [NYU], 2017.
Direct Quotation from Source [5], Paragraph 8.

To get faster, I believe it would be necessary to switch to a lower-level language like Rust or a C-based language.


Appendix

GitHub Logomark
Please visit my GitHub to see all the code written for this assignment!

github.com/Snuggle/MissingIntegers

Just for fun, I decided to benchmark 100 runs of each attempt against a list of 10,000,000 randomly generated integers. (Attempt_1.5.py was cut due to the word limit.) In total this benchmark took about 5 minutes for 300 runs, with 3 billion integers being processed.

My second attempt of the algorithm ended up being, on average, 33.16x faster than my first attempt and 65.75x faster than the attempt that was cut.

A well-designed algorithm like Attempt_2.py can be the difference between your algorithm averaging 1.671 seconds and just 25.4 milliseconds: over 65x faster.

Sources & References

[1] Gao, Albert. “Everything you need to know about Big O notation”, Jun. 2, 2016, medium.com/@albertgao/everything-you-need-to-know-about-big-o-notation-5436d2702199
[2] Weisstein, Eric W. “Exclusive Disjunction.” California Institute of Technology [Caltech], Wolfram Research MathWorld. mathworld.wolfram.com/ExclusiveDisjunction.html
[3] Kay, Andrew. “Algorithmic ‘Plans’ ” Birmingham City University [BCU], Mar. 19, 2018, dsaa.werp.site/post/algorithmic-plans/
[4] Samual, Sam. “Find difference between two lists.” TutorialsPoint, Sep. 25, 2018, tutorialspoint.com/python-program-to-list-the-difference-between-two-lists
[5] Watshon, Greg. “Python | Optimizing Loops”, New York University [NYU], Jan. 27, 2017, nyu-cds.github.io/python-performance-tips/08-loops/
(git-blame available here: https://github.com/nyu-cds/python-performance-tips/commit/a2632e3cd1b008eebe99efd1399152dbfd029a85)
[6] Strika, Luciano. “Python’s List Comprehensions: Uses and Advantages”, Towards Data Science, Aug. 19, 2018, towardsdatascience.com/how-list-comprehensions-can-help-your-code-look-better-and-run-smoother-3cf8f87172ae
[7] Semwal, Smitha. “Python | Find numbers in a sorted list range” , GeekforGeeks, n.d. www.geeksforgeeks.org/python-find-missing-numbers-in-a-sorted-list-range/
[8] Taylor, Courtney. “What Is the Difference of Two Sets in Set Theory?” ThoughtCo, Jun. 27, 2018, thoughtco.com/difference-of-two-sets-3126580.

Images
[9] Cmglee, CC BY-SA 4.0 Creative Commons, commons.wikimedia.org/wiki/File:Comparison_computational_complexity.svg
[10] Watchduck, Public Domain, commons.wikimedia.org/wiki/File:Venn0110.svg
[…] Any images that have not been sourced are self-produced, with the exception of any emoji graphics. Twemoji is licensed under the CC-BY 4.0 license courtesy of Twitter, Inc and other contributors.

All of the code written and produced for this assignment is available on my personal GitHub account under the MIT Open-Source license. All code & commits are digitally signed with GPG and can be cryptographically verified to have been my own creation. https://github.com/Snuggle/MissingIntegers

This document is for academic & educational purposes only.

Word Count

(Not including the appendix, direct-quotations or the first main header.)

Array-Lists, Linked Lists, Hash Tables (Dictionaries)

Dynamic arrays, or array-lists, can scale depending on how much memory is required to store all their data. Since a plain array has a fixed size and cannot easily be resized (resizing requires recreating the array), the array-list is an abstract datatype that emulates a resizable list through getter/setter functions. [1. Kay, 2019]

A linked list is a different take on building a dynamically resizable list as an abstract datatype. Instead of aggregating larger arrays of fixed size, linked lists use a series of nodes, each with a pointer to the “next” (and possibly “previous”) node. Stepping through a linked list, you can have a theoretically unbounded number of items, and you can remove items or resize the list easily by just re-pointing nodes, perhaps even skipping some. The starting node, called the head, is the root of the list. The final node, often called the tail, has a NULL (or None in Python) pointer, which signifies the end of the linked list. [2. Kay, 2019] [3. Garg, n.d.]

Singly-linked lists only have one “next” pointer per node.
Doubly-linked lists have both a “next” and a “previous” pointer per node.
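The node-and-pointer idea above can be sketched minimally in Python (the `ListNode` class and helper are my own illustrative names):

```python
# A minimal singly-linked list: each node holds a value and a
# "next" pointer; the tail's next is None, marking the end.
class ListNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def to_list(head):
    # Walk from head to tail, following next pointers.
    out = []
    node = head
    while node is not None:
        out.append(node.value)
        node = node.next
    return out

head = ListNode(1, ListNode(2, ListNode(3)))
head.next = head.next.next  # remove node 2 just by re-pointing
print(to_list(head))  # [1, 3]
```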

A hash table works by using a hash function to generate a UID for each object that you’d like to store. To look a value up, you can simply recompute the hash to find the UID and access the item directly, wasting little time. (Hash functions tend to be somewhat computationally intensive; to counter this, a weaker hash function is usually used first for the quick lookup, followed by a stronger hash function with a much lower chance of collisions, to verify that it’s the correct object. The chance of a collision in both the weak and the strong hash is vanishingly small; an example is given in the next paragraph.) [4. Kay, 2019] [5. Garg, n.d.]

Image Source (CC-BY-SA 3.0)

From the research I have done into the Python programming language, it uses the last 15 bits of the object’s hash as the “weak” hashing step and then uses a stronger hashing algorithm afterwards. One side-effect of this is that you can get incredibly regular hash patterns for integers; the source code has a comment which I will paraphrase, condensing three paragraphs into one for brevity: [6. Peters, 2001]

Having a function that collides somewhat frequently is useful for filling up consecutive gaps in-between your objects. It’s also silly to slow your hash table down for rare occurrences, and best to keep the initial check dirt-cheap; hence we just take the last 15 bits as our first check, which is also 100% collision-free for contiguous integers. Failing this, collision resolution has two further ways to calculate hashes.

Python v3.8.0 Source Code
[python/cpython; /Objects/dictobject.c#L135]
Primary source found using ‘git blame’. [6. Peters, 2001]
As you can see in the example I’ve run on my machine, a range of integers [1, 2, 3, 4, …] maps directly to the hashes [1, 2, 3, 4, …]
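That regularity is easy to reproduce in CPython, where small integers hash to themselves:

```python
# In CPython, small integers hash to their own value, which is why
# a contiguous range produces perfectly regular, collision-free hashes:
nums = list(range(1, 6))
print([hash(n) for n in nums])  # [1, 2, 3, 4, 5]
```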

References & Sources

1. Kay, A. (2018) The ‘Array-List’ Data Structure. [online] Data Structures and Algorithms. Published by: Birmingham City University, Available at: https://dsaa.werp.site/post/the-array-list-data-structure/ [Accessed 22 Feb. 2019].
2. Kay, A. (2018) The ‘The Linked List’ Data Structure. [online] Data Structures and Algorithms. Published by: Birmingham City University, Available at: https://dsaa.werp.site/post/the-linked-list-data-structure/ [Accessed 22 Feb. 2019].
3. Garg, P. (n.d.). Singly Linked List | Data Structures. [online] HackerEarth. Available at: https://www.hackerearth.com/practice/data-structures/linked-list/singly-linked-list/tutorial/ [Accessed 22 Feb. 2019]
4. Kay, A. (2018) The ‘The Hash Map’ Data Structure. [online] Data Structures and Algorithms. Published by: Birmingham City University, Available at: https://dsaa.werp.site/post/the-hash-map-data-structure/ [Accessed 22 Feb. 2019].
5. Garg, P. (n.d.). Basics of Hash Tables | Data Structures. [online] HackerEarth. Available at: https://www.hackerearth.com/practice/data-structures/hash-tables/basics-of-hash-tables/tutorial/ [Accessed 22 Feb. 2019]
6. Peters, T; et al. (Open-source-software, Accessed: 2019-05-02). Python v2.2 – v3.8.0 Source Code [online] Python Software Foundation. Available at: https://github.com/python/cpython/blob/4631da1242fc96002a3c0462a87d087e567368aa/Objects/dictobject.c#L135 | Published with commit on 2001-05-27: https://github.com/python/cpython/commit/15d4929ae4ad5206e2f7fa947d632bfba9b8bfd1#diff-2131209d0deb0e50c93a88ec6c7b0d52R34


Integers and Binary

Computers are, at a fundamental level, very simple electrical machines whose signals can only be in one of two states.

“On” and “Off” / “True” and “False” / “1” and “0”.

Now this gets quite interesting relatively quickly. The purpose of a computer is to compute, to perform algorithms and implement mathematics to achieve a specific function.

We typically use a base-10 number system, where we have ten different symbols and every tenth increment carries into a new column. Computers only have two “symbols” to count with, so every second increment carries into the next column.
[1] (Base-X System by Faceprep Knowledgebase)

Base-10: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, [...]
Base-2: 0, 1, 10, 11, 100, 101, 110, 111, 1000, 1001, [...]

The increment function within the binary.py file builds upon this simple understanding. It starts with a basic for loop, which goes over each column (each “bit”) of the number to increment, in reverse. If the current bit position, i, is a 0, it is changed to a 1 and the function ends. If, however, the bit is a 1, it is changed to a 0, i is decremented, and the loop repeats until it finds the first 0 to turn into a 1.

If no remaining bit position contains a 0, the integer cannot be incremented safely and an exception should be raised. However, this does not happen in the library; instead an integer overflow occurs, where the value “wraps around” from 0b11111111 (0d255) to 0b00000000 (0d0). This is a common memory-management bug and can cause breakages and unexpected behaviour if not caught.
[2] (Binary.py file provided by BCU)
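The increment behaviour described above can be sketched like so; note this version raises an exception on overflow instead of silently wrapping around, and the real binary.py provided by BCU may differ in detail:

```python
def increment(bits):
    # Walk the bits from least-significant (rightmost) to most-
    # significant: turn trailing 1s into 0s until the first 0,
    # which becomes a 1.
    for i in reversed(range(len(bits))):
        if bits[i] == 0:
            bits[i] = 1
            return bits
        bits[i] = 0
    # Every bit was 1: raise rather than silently wrap to zero.
    raise OverflowError("cannot increment: all bits are set")

print(increment([0, 1, 1]))  # [1, 0, 0] — binary 3 becomes 4
```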

Acknowledgements:
[1] (Base-X System by Faceprep Knowledgebase) https://www.faceprep.in/aptipedia/knowledgebase/base-system/

[2] (Binary.py file provided by BCU)