Notes from Udacity Data Science Interview Preparation lesson.

Data Structure

The difference between Array and list

  • Array has index
  • List has order but no indices
  • Python list is built as an array

linked lists

  • will tell you where is the next list
  • the same as array is that store the value information
  • the difference with array is that they store the different information. array store the index inforation, and linked list store a reference to the next element in the list.

when coding:
it has two parts in one node. one of them is value of linked list(self.value), another is pointer(self.next) which point to next list.

1
2
3
4
5
6
7
8
9
10
11
12
13
lass Node():
__slots__=['_item','_next'] #限定Node实例的属性
def __init__(self,item):
self._item=item
self._next=None #Node的指针部分默认指向None
def getItem(self):
return self._item
def getNext(self):
return self._next
def setItem(self,newitem):
self._item=newitem
def setNext(self,newnext):
self._next=newnext

reference:
python数据结构——链表的实现 http://www.cnblogs.com/linxiyue/p/3551633.html

stack

  • like stack pancake, list-based data structure. But different from linked_list and Array
  • LIFO: last in first out.
  • PUSH:insert; POP:remove

Queues

  • FIFO: first in first out; list-based
  • dequeues or double-end queues can dequeue and enqueue in both head and tail
  • priority queue. dequeue with highest priority(1). if have same priority, dequeue the oldest one.

sets

  • the difference from list is that it don’t have orders, but can’t have repeated elements

Maps(dictionary)

  • like dictionary. Maps = <key, value>. a group of keys is a set.
  • Dictionaries are wonderfully flexible—you can store a wide variety of structures as values. You store another dictionary or a list:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Cities to add:
Bangalore (India, Asia)
Atlanta (USA, North America)
Cairo (Egypt, Africa)
Shanghai (China, Asia)"""

locations = {'North America': {'USA': ['Mountain View']}}
locations['North America']['USA'].append('Atlanta')
locations['Asia'] = {'Inida':['Bangalore']}
locations['Asia']['China']= ['Shanghai']
locations['Africa'] = {'Egypt': ['Cairo']}

for city in sorted(locations['North America']['USA']):
print city

m = []
for country, city in locations['Asia'].iteritems():
m.append(city[0]+'-'+ country)
for listm in sorted(m):
print listm

Hashing

Using a data structure that employs a hash function allows you to look up in constant time.

  • using the reminder as index, store this value at this place
  • if have collisions, two ways: 1, change hash function; 2, store all the value into one collection, the array will turn to ‘bucket’.
  • in bucket, still can apply hash function
  • For the load factor, you should divide the number of values by the number of buckets. $Load Factor = Number of Entries / Number of Buckets$
    https://classroom.udacity.com/courses/ud944/lessons/7118294395/concepts/79336691870923

Hash Maps

- In interview, you will be asked about creating "Hash table" to test your understand hashing. also need to know the upside and downside of a hash function

Hash for String Keys

  • string turn into ASCLL, A=65, B=66,C=67
- why use 31, it's more likely convention.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
example: Hash Value = (ASCII Value of First Letter * 100) + ASCII Value of Second Letter
class HashTable(object):
def __init__(self):
self.table = [None]*10000

def store(self, string):
hv = self.calculate_hash_value(string)
if hv != -1:
if self.table[hv] != None:
self.table[hv].append(string)
else:
self.table[hv] = [string]

def lookup(self, string):
hv = self.calculate_hash_value(string)
if hv != -1:
if self.table[hv] != None:
if string in self.table[hv]:
return hv
return -1

def calculate_hash_value(self, string):
value = ord(string[0])*100 + ord(string[1])
return value

Trees_Data Structure

  • tree is an extension of linked-list
  • the following figure, different color has its meaning.

Tree Traversal

  • DFS(depth first search):

    • pre-order

    • in-order

    • post-order (right first)

    • Q&A: D, F, E, B, C, A

  • BFS(Breadth first search): check the same level node first

  • Search and Delete, insert. perfect tree, each node has two child

Binary search tree(BST)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
class BinaryTree(object):
def __init__(self, root):
self.root = Node(root)

def search(self, find_val):
return self.preorder_search(tree.root, find_val)

def print_tree(self):
return self.preorder_print(tree.root, "")[:-1]

def preorder_search(self, start, find_val):
if start:
if start.value == find_val:
return True
else:
return self.preorder_search(start.left, find_val) or self.preorder_search(start.right, find_val)
return False

def preorder_print(self, start, traversal):
if start:
traversal += (str(start.value) + "-")
traversal = self.preorder_print(start.left, traversal)
traversal = self.preorder_print(start.right, traversal)
return traversal
  • rule to BSTS: right of node is larger than left

  • unbalanced BSTS

  • apply the rule to BSTS and pre-order Traversal. Search and insert

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    class BST(object):
    def __init__(self, root):
    self.root = Node(root)

    def insert(self, new_val):
    self.insert_helper(self.root, new_val)

    def insert_helper(self, current, new_val):
    if current.value < new_val:
    if current.right:
    self.insert_helper(current.right, new_val)
    else:
    current.right = Node(new_val)
    else:
    if current.left:
    self.insert_helper(current.left, new_val)
    else:
    current.left = Node(new_val)

    def search(self, find_val):
    return self.search_helper(self.root, find_val)

    def search_helper(self, current, find_val):
    if current:
    if current.value == find_val:
    return True
    elif current.value < find_val:
    return self.search_helper(current.right, find_val)
    else:
    return self.search_helper(current.left, find_val)
    return False

Heaps

  • also a kind of tree has its own rule
  • Max heaps and min heaps. max: parent always big than child; min vice versa. link
  • heaps can have several trees, not like binary
  • heaps binary trees search big 0 time is $O(n)$
  • heapify: insert in the child leaf first, then compare it to the parent, if bigger then swap their value. The big O time is $O(logn)$
  • heap implementation: heap often store as sorted array. as shown below, array only store value and index, but tree will store bunch of pointers, which mean array save space.

self-balanced tree(red-black)

  • definition: minimize the level of tree

  • Red Black-Tree an extension of Binary tree. 5 rules

    • every node is red or black
    • Every leaf (NULL) is black. Every node in your tree doesn’t otherwise have two leaves, must have null children(black)
    • If a node is red, then both its children are black.
    • the root node is black
    • Every simple path from a node to a descendant leaf contains the same number of black nodes.

still watching: https://classroom.udacity.com/courses/ud944/lessons/7122604912/concepts/78867246220923
https://www.youtube.com/watch?v=O5Yl-m0YbVA

Graphs(Networks)

- tree is a type of graph. - Node, edge

directions and cycles

  • directions like this figure show

  • avoid graphs has cycles, infinite loop

  • DAG: directed Acyclic(no cycles) Graph. type show up often

connectivity

- right seems has more connection than left one. - remove one connection in left group, whole group will destoryed(disconnected)

definition

  • disconnected:
    Disconnected graphs are very similar whether the graph’s directed or undirected—there is some vertex or group of vertices that have no connection with the rest of the graph.

  • weakly connected:
    A directed graph is weakly connected when only replacing all of the directed edges with undirected edges can cause it to be connected. Imagine that your graph has several vertices with one outbound edge, meaning an edge that points from it to some other vertex in the graph. There’s no way to reach all of those vertices from any other vertex in the graph, but if those edges were changed to be undirected all vertices would be easily accessible.

  • connected:
    Here we only use “connected graph” to refer to undirected graphs. In a connected graph, there is some path between one vertex and every other vertex.

  • Strongly Connected:
    Strongly connected directed graphs must have a path from every node and every other node. So, there must be a path from A to B AND B to A.

Graph Representations

vertex orbject can have a list of edges it’s connected to and vice versa.

There are other graph represent ways:

  • Edge list, could be 2D, and 3D

  • Adjacency list, store the node that adjacent to index

  • Adjacency Matrix, store a matrix, if node 0 connect to node 1, then show 1, otherwise show 0 in matrix.

code from udacity link

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
class Node(object):
def __init__(self, value):
self.value = value
self.edges = []

class Edge(object):
def __init__(self, value, node_from, node_to):
self.value = value
self.node_from = node_from
self.node_to = node_to

class Graph(object):
def __init__(self, nodes=[], edges=[]):
self.nodes = nodes
self.edges = edges

def insert_node(self, new_node_val):
new_node = Node(new_node_val)
self.nodes.append(new_node)

def insert_edge(self, new_edge_val, node_from_val, node_to_val):
from_found = None
to_found = None
for node in self.nodes:
if node_from_val == node.value:
from_found = node
if node_to_val == node.value:
to_found = node
if from_found == None:
from_found = Node(node_from_val)
self.nodes.append(from_found)
if to_found == None:
to_found = Node(node_to_val)
self.nodes.append(to_found)
new_edge = Edge(new_edge_val, from_found, to_found)
from_found.edges.append(new_edge)
to_found.edges.append(new_edge)
self.edges.append(new_edge)

def get_edge_list(self):
edge_list = []
for edge_object in self.edges:
edge = (edge_object.value, edge_object.node_from.value, edge_object.node_to.value)
edge_list.append(edge)
return edge_list

def get_adjacency_list(self):
max_index = self.find_max_index()
adjacency_list = [None] * (max_index + 1)
for edge_object in self.edges:
if adjacency_list[edge_object.node_from.value]:
adjacency_list[edge_object.node_from.value].append((edge_object.node_to.value, edge_object.value))
else:
adjacency_list[edge_object.node_from.value] = [(edge_object.node_to.value, edge_object.value)]
return adjacency_list

def get_adjacency_matrix(self):
max_index = self.find_max_index()
adjacency_matrix = [[0 for i in range(max_index + 1)] for j in range(max_index + 1)]
for edge_object in self.edges:
adjacency_matrix[edge_object.node_from.value][edge_object.node_to.value] = edge_object.value
return adjacency_matrix

def find_max_index(self):
max_index = -1
if len(self.nodes):
for node in self.nodes:
if node.value > max_index:
max_index = node.value
return max_index

graph = Graph()
graph.insert_edge(100, 1, 2)
graph.insert_edge(101, 1, 3)
graph.insert_edge(102, 1, 4)
graph.insert_edge(103, 3, 4)
# Should be [(100, 1, 2), (101, 1, 3), (102, 1, 4), (103, 3, 4)]
print graph.get_edge_list()
# Should be [None, [(2, 100), (3, 101), (4, 102)], None, [(4, 103)], None]
print graph.get_adjacency_list()
# Should be [[0, 0, 0, 0, 0], [0, 0, 100, 101, 102], [0, 0, 0, 0, 0], [0, 0, 0, 0, 103], [0, 0, 0, 0, 0]]
print graph.get_adjacency_matrix()

Graph Traversal

  • just like tree traversal
  • DFS(depth first search)
  • BFS(breadth first search)

Algorithm_Searching and Sorting

code visualization: https://visualgo.net/en

  • efficiency: $O(log(n))$

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    def binarySearch(listData, value):
    low = 0
    high = len(listData)-1
    while (low <= high):
    mid = (low+high)/2
    if (listData[mid] = value):
    return mid
    elif(listData[mid] < value):
    low = mid+1
    else:
    high = mid-1
    return -1

    Resource: https://www.cs.usfca.edu/~galles/visualization/Search.html

Recursion

  • 1, call it self; 2, base case(get out the loop); 3, alter the input parament
    1
    2
    3
    4
    5
    the Fibonacci Sequence.
    def get_fib(position):
    if position == 0 or position == 1:
    return position
    return get_fib(position - 1) + get_fib(position - 2)

Sorting

  • in plce or not, in place: lest space; not: lest time. Tradeoff

bubble Sort(in place)

Merge Sort

  • divide the arrray into one by one, then combine every two element except the first element. Then the combined two element compare first, then compare to others just as following images.

  • Merge sort efficiency: $O(n*logn)$

Quick Sort

  • random choose one of element, then compare and move as following figure show. - efficiency: the worst case is O(n^2), the average and the best case is $O(nlogn)$
1
2
3
4
5
6
7
for each (unsorted) partition
set first element as pivot
storeIndex = pivotIndex + 1
for i = pivotIndex + 1 to rightmostIndex
if element[i] < element[pivot]
swap(i, storeIndex); storeIndex++
swap(pivot, storeIndex - 1)

reference: http://visualgo.net/en/sorting/

some Resources

常用查找数据结构及算法(Python实现)