Data Structures and Algorithms (1)
- 1. Data Structure
- 2. Trees_Data Structure
- 3. Graphs(Networks)
- 4. Algorithm_Searching and Sorting
Notes from Udacity Data Science Interview Preparation lesson.
- Big O notation
- An array has indices.
- A list has order but no indices.
- A Python list is built as an array.
- A linked list node will tell you where the next element is.
- Like an array, a linked list stores values.
- The difference from an array is in what else they store: an array stores index information, while a linked list stores a reference to the next element in the list.
Each node has two parts: the node's value (self.value) and a pointer (self.next) that points to the next node.
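A minimal sketch of such a node and list (the class names `Node` and `LinkedList` are illustrative, not from the lesson):

```python
class Node:
    def __init__(self, value):
        self.value = value   # the data stored in this node
        self.next = None     # pointer to the next node (None = end of list)

class LinkedList:
    def __init__(self, head=None):
        self.head = head

    def append(self, new_node):
        # walk to the tail, then attach the new node
        if self.head is None:
            self.head = new_node
            return
        current = self.head
        while current.next:
            current = current.next
        current.next = new_node
```

Following `self.next` pointers from `head` visits every element in order, which is why access by position costs O(n) rather than an array's O(1).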
- A stack is like a stack of pancakes: a list-based data structure, but different from a linked list and an array.
- LIFO: last in, first out.
- PUSH: insert; POP: remove.
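Since a Python list already supports push and pop at its tail in constant time, the stack can be sketched as:

```python
stack = []           # a plain Python list used as a stack
stack.append("a")    # PUSH
stack.append("b")    # PUSH
top = stack.pop()    # POP: removes the most recently pushed item (LIFO)
```

After the pop, `top` is `"b"` and only `"a"` remains on the stack.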
- FIFO: first in, first out; list-based.
- Deques, or double-ended queues, can enqueue and dequeue at both the head and the tail.
- Priority queue: dequeue the element with the highest priority first; if elements share the same priority, dequeue the oldest one.
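Python's standard library provides a double-ended queue as `collections.deque`; a sketch of plain FIFO use plus a double-ended operation:

```python
from collections import deque

q = deque()
q.append("first")     # enqueue at the tail
q.append("second")
head = q.popleft()    # dequeue from the head (FIFO order)

# a deque is double-ended: it can also push (or pop) at the head
q.appendleft("urgent")
```

`head` comes out as `"first"`, and after `appendleft` the queue holds `"urgent"` then `"second"`.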
- A set differs from a list in that it has no order and cannot contain repeated elements.
- A map is like a dictionary: Map = <key, value>. The group of keys forms a set.
- Dictionaries are wonderfully flexible; you can store a wide variety of structures as values, e.g. another dictionary or a list:
Cities to add:
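A sketch of such nesting (the place names here are placeholders, not the lesson's exercise data):

```python
# values can themselves be dictionaries or lists
locations = {"North America": {"USA": ["Mountain View"]}}
locations["North America"]["USA"].append("Atlanta")   # extend an inner list
locations["Asia"] = {"India": ["Bangalore"]}          # add a nested dictionary
```

Lookups then chain naturally: `locations["Asia"]["India"]` returns the inner list of cities.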
Using a data structure that employs a hash function allows you to look up in constant time.
- Use the remainder as the index, and store the value at that place.
- If there are collisions, there are two options: 1) change the hash function; 2) store all colliding values in one collection, turning each array slot into a "bucket".
- Inside a bucket, you can still apply a hash function.
- For the load factor, divide the number of entries by the number of buckets: $\text{Load Factor} = \frac{\text{Number of Entries}}{\text{Number of Buckets}}$
- Strings are turned into ASCII codes: A = 65, B = 66, C = 67.
Example: Hash Value = (ASCII value of first letter × 100) + ASCII value of second letter.
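A minimal sketch of this scheme (the function names `hash_value` and `bucket_index` are mine, not the lesson's):

```python
def hash_value(word):
    # (ASCII value of first letter * 100) + ASCII value of second letter
    return ord(word[0]) * 100 + ord(word[1])

def bucket_index(value, num_buckets=10):
    # use the remainder as the array index (the slot, or "bucket")
    return value % num_buckets
```

For example, "AB" hashes to 65 * 100 + 66 = 6566, which lands in bucket 6566 % 10 = 6.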
- A tree is an extension of a linked list.
- In the following figure, each color has its own meaning.
DFS (depth-first search):
post-order: visit the left subtree, then the right subtree, then the node itself (the root comes last)
Q&A: D, F, E, B, C, A
BFS (breadth-first search): check all the nodes on the same level first
Search, insert, and delete. In a perfect tree, every node has two children.
Rule for BSTs: everything in a node's right subtree is larger than the node, and everything in its left subtree is smaller.
Apply the BST rule with a pre-order traversal for search and insert.
class Node:
    def __init__(self, value):
        self.value, self.left, self.right = value, None, None

class BST:
    def __init__(self, root):
        self.root = Node(root)

    def insert(self, new_val):
        self.insert_helper(self.root, new_val)

    def insert_helper(self, current, new_val):
        # larger values go right, smaller or equal go left
        if current.value < new_val:
            if current.right:
                self.insert_helper(current.right, new_val)
            else:
                current.right = Node(new_val)
        elif current.left:
            self.insert_helper(current.left, new_val)
        else:
            current.left = Node(new_val)

    def search(self, find_val):
        return self.search_helper(self.root, find_val)

    def search_helper(self, current, find_val):
        if current is None:
            return False        # fell off the tree: not present
        if current.value == find_val:
            return True
        if current.value < find_val:
            return self.search_helper(current.right, find_val)
        return self.search_helper(current.left, find_val)
- A heap is also a kind of tree with its own rule.
- Max heaps and min heaps. Max: a parent is always bigger than its children; min: vice versa.
- A heap need not be binary: a parent can have any number of children.
- Searching a binary heap takes $O(n)$ time in the worst case.
- Heapify: insert at the next leaf first, then compare with the parent and swap their values if the child is bigger (for a max heap). This takes $O(\log n)$ time.
- Heap implementation: heaps are often stored as arrays. As shown below, an array stores only values and indices, while a tree stores a bunch of pointers, which means the array saves space.
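A sketch of the array-stored max heap described above, assuming the children of index i live at indices 2i + 1 and 2i + 2 (the helper name `heap_insert` is mine):

```python
def heap_insert(heap, value):
    """Insert into a max heap stored as a flat list: append at the next
    leaf position, then swap upward while the child beats its parent."""
    heap.append(value)
    i = len(heap) - 1
    while i > 0:
        parent = (i - 1) // 2       # parent index in the array encoding
        if heap[i] > heap[parent]:
            heap[i], heap[parent] = heap[parent], heap[i]
            i = parent
        else:
            break                   # heap property restored
```

Each insert touches at most one node per level, which is where the $O(\log n)$ heapify cost comes from.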
Definition: a self-balancing tree minimizes the number of levels of the tree.
A red-black tree is an extension of the binary search tree, with 5 rules:
- every node is red or black
- Every leaf (NULL) is black. Every node that doesn't already have two children must be given NULL children, which count as black.
- If a node is red, then both its children are black.
- the root node is black
- Every simple path from a node to a descendant leaf contains the same number of black nodes.
Edge directions are as this figure shows.
Watch out for cycles in a graph: they can cause infinite loops during traversal.
DAG: directed acyclic (no cycles) graph. This type shows up often.
Disconnected graphs are very similar whether the graph’s directed or undirected—there is some vertex or group of vertices that have no connection with the rest of the graph.
A directed graph is weakly connected when only replacing all of the directed edges with undirected edges can cause it to be connected. Imagine that your graph has several vertices with one outbound edge, meaning an edge that points from it to some other vertex in the graph. There’s no way to reach all of those vertices from any other vertex in the graph, but if those edges were changed to be undirected all vertices would be easily accessible.
Here we only use “connected graph” to refer to undirected graphs. In a connected graph, there is some path between one vertex and every other vertex.
Strongly connected directed graphs must have a path from every node to every other node. So, there must be a path from A to B AND from B to A.
A vertex object can have a list of the edges it's connected to, and vice versa.
There are other ways to represent a graph:
Edge list: a list of edges, each a pair of vertices; could be 2D or 3D.
Adjacency list: the entry at each index stores the nodes adjacent to that vertex.
Adjacency matrix: a matrix where entry (i, j) is 1 if node i connects to node j, otherwise 0.
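The three representations above, sketched for the same small 3-vertex graph (edges 0–1 and 1–2):

```python
# edge list: one entry per edge
edge_list = [[0, 1], [1, 2]]

# adjacency list: index = vertex, value = that vertex's neighbours
adjacency_list = [[1], [0, 2], [1]]

# adjacency matrix: matrix[i][j] == 1 iff vertices i and j share an edge
adjacency_matrix = [[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]]
```

The edge list is compact for sparse graphs, the adjacency list makes "who are my neighbours?" fast, and the matrix answers "is i connected to j?" in constant time.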
code from udacity link
- just like tree traversal
- DFS(depth first search)
- BFS(breadth first search)
code visualization: https://visualgo.net/en
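The two traversals above can be sketched over an adjacency list; the graph and function names here are a hypothetical example, and the visit order depends on the order neighbours are listed:

```python
from collections import deque

def bfs(adjacency, start):
    """Breadth-first: visit all neighbours at the current level
    before moving one level deeper (uses a queue)."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                visited.append(neighbour)
                queue.append(neighbour)
    return visited

def dfs(adjacency, start, visited=None):
    """Depth-first: follow one branch to its end before backtracking
    (uses the call stack)."""
    if visited is None:
        visited = []
    visited.append(start)
    for neighbour in adjacency[start]:
        if neighbour not in visited:
            dfs(adjacency, neighbour, visited)
    return visited
```

On `adjacency = [[1, 2], [0, 3], [0], [1]]`, BFS from 0 visits level by level (0, 1, 2, 3), while DFS dives through 1 to 3 before coming back for 2.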
def binarySearch(listData, value):
    low = 0
    high = len(listData) - 1
    while low <= high:
        mid = (low + high) // 2       # integer division keeps mid an index
        if listData[mid] == value:
            return mid
        elif listData[mid] < value:
            low = mid + 1
        else:
            high = mid - 1
    return -1                         # not found
- A recursive function: 1) calls itself; 2) has a base case (so it can exit); 3) alters its input parameter on each call.
An example: the Fibonacci sequence.

def get_fib(position):
    if position == 0 or position == 1:
        return position     # base cases: fib(0) = 0, fib(1) = 1
    return get_fib(position - 1) + get_fib(position - 2)
- In place or not: an in-place sort uses the least extra space; an out-of-place sort can take the least time. It's a tradeoff.
Bubble sort's Big O is $(n-1)(n-1) \approx n^2$, i.e. $O(n^2)$; the best case is $O(n)$, when the array is already sorted.
Divide the array down to single elements, then merge elements in pairs: compare within each pair first, then compare the merged runs against each other, just as the following images show.
Merge sort efficiency: $O(n*logn)$
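A sketch of merge sort under that description: split down to single elements, then merge sorted halves by repeatedly taking the smaller front element.

```python
def merge_sort(data):
    if len(data) <= 1:
        return data                    # a single element is already sorted
    mid = len(data) // 2
    left = merge_sort(data[:mid])      # sort each half recursively
    right = merge_sort(data[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:        # take the smaller front element
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]   # append whichever half remains
```

The $O(n \log n)$ cost falls out directly: $\log n$ levels of splitting, with $O(n)$ work to merge at each level.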
- Randomly choose one element (the pivot), then compare and move elements as the following figure shows.
- Efficiency: the worst case is $O(n^2)$; the average and best cases are $O(n \log n)$.
for each (unsorted) partition
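A sketch of quicksort along those lines, recursing on each (unsorted) partition; for simplicity this version takes the first element as the pivot rather than a random one:

```python
def quicksort(data):
    if len(data) <= 1:
        return data                 # base case: nothing left to partition
    pivot = data[0]                 # simplification: first element as pivot
    smaller = [x for x in data[1:] if x < pivot]
    larger = [x for x in data[1:] if x >= pivot]
    # recurse on each unsorted partition, with the pivot fixed in between
    return quicksort(smaller) + [pivot] + quicksort(larger)
```

A fixed first-element pivot is what produces the $O(n^2)$ worst case on already-sorted input; choosing the pivot randomly makes that case unlikely.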