
Learn Data Science: How to use Python Lists, sets and dictionaries

by Ivan
Learn Data Science for free: How to use Python data structures Lists, sets, tuples and dictionaries Ivan Ocampo lists tuples sets dictionaries

To continue our journey to learn Data Science, we need to know how to use Python Data Structures: Lists, sets, tuples and dictionaries.

We will get to the exciting parts of data science soon, but first please familiarise yourself with these topics, as they form the foundation of your data science journey!

Learn Data Science

A thorough introduction to lists, sets, dictionaries and tuples in Python 3.

Prerequisite: Knowledge of basic Python syntax.

Please familiarise yourself with my first post on Learning Data Science: Python fundamentals

Python Data Structures: Lists, Tuples, Sets and Dictionaries

In this workshop you will be introduced to the four main data structures in Python: the list, tuple, set and dictionary. We will take a closer look at what makes these data structures unique, how we can use them, and their methods and functions, giving a thorough introduction to each one. Even if you are familiar with these data structures, this workshop might give you some insight into new methods or techniques, and into why certain data structures are used in certain situations.

Software Prerequisite

  • Python 3
  • Anaconda
  • Jupyter Notebook

Sequences: Lists, Tuples and dictionaries

For this, we will take a closer look at the methods, attributes, usage and some implementation details of these structures. Even if you are familiar with them, the workshop might give you useful insight into why things are done in a certain way, or teach you a new method that you could use in your own implementation.

For this, some key phrases are useful to understand:

  • Mutable: This means that the object can be changed after it has been created. The opposite is immutable, which means that once the object has been created it cannot be changed.
  • Ordered: This means that the items are stored in a defined order that does not change unless we change it, so an item can be accessed by knowing its position. The opposite is unordered, where items cannot be accessed by the order in which they were added.
  • Indexable: This means that items can be accessed by their position using an index. This is only applicable to ordered objects.
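The three terms above can be seen in a short sketch. The variable names (colours, point) are made up for illustration, and the try/except statement is used only to show the error without stopping the code:

```python
# A list is mutable, ordered and indexable.
colours = ["red", "green", "blue"]
colours[0] = "purple"          # changing an item in place is allowed
first_colour = colours[0]      # access by position (index 0 is the first item)

# A tuple is ordered and indexable, but immutable.
point = (3, 4)
try:
    point[0] = 10              # attempting to change an item raises TypeError
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(colours)        # ['purple', 'green', 'blue']
print(first_colour)   # purple
print(tuple_changed)  # False
```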

It is useful to note the key attributes of each data structure and what they mean for its usefulness, both in general Python programming and in data science applications. While the majority of data science applications will use some form of the pandas DataFrame, which you will learn about in a future workshop, it is nonetheless useful to know what can be done with these other data types.

List

The first data structure that we encounter is the list.

Lists can be used to store multiple items in a single variable and effectively act as you would expect a normal written list to behave. They are one of the four built-in data types in Python that can be used to store collections of data, alongside tuples, sets and dictionaries.

The key characteristics of lists are that they are:

  • Mutable
  • Ordered
  • Indexable

and they can contain duplicate records. These characteristics are important for how lists are actually used in a programming sense as we will see later on.

Firstly, we can understand how to create a list. As we have already encountered in the fundamentals lecture, we can assign variables using the = sign. To create a list, we can use two main approaches: using [] to enclose everything that we want it to contain, or using the list() constructor, as can be seen:

#create a list of fruit using the [] notation
fruit_list = ["Apple", "Banana", "Peach"]
#create a list of vegetables using the list() notation
vegetable_list = list(["Pepper", "Courgette", "Aubergine"])

Note that with list() we still had to use [], because the function takes at most one argument. Its real use is converting other data types to lists, for example other sequences or the results of functions.
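As a minimal sketch of that conversion, list() can be given a string, a range or a tuple. The variable names here are chosen purely for illustration:

```python
# list() accepts any iterable and builds a new list from its items.
letters = list("abc")            # a string becomes a list of characters
numbers = list(range(5))         # a range becomes a list of integers
from_tuple = list((1.5, 2.5))    # a tuple becomes a list

print(letters)     # ['a', 'b', 'c']
print(numbers)     # [0, 1, 2, 3, 4]
print(from_tuple)  # [1.5, 2.5]
```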

We can check the results of these to ensure that they are lists by using the type() function, and printing out the results of the lists themselves:

#examine the fruit list
print(type(fruit_list))
print(fruit_list)
#print a blank line to separate the output
print("\n")
#examine the vegetable list
print(type(vegetable_list))
print(vegetable_list)

We can see here that the type of both of these is given as list. We can also see that when we print the lists, they appear in square brackets and in the same order that we entered them, indicating that these are indeed lists and that they are ordered.

If you try to create a list without the [] brackets you will see that errors appear:

#create a list wrong
sandwhich_ingredients = list("Ham", "Egg", "Cheese")

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~AppDataLocalTemp/ipykernel_31012/2572559944.py in <module>
      1 #create a list wrong
----> 2 sandwhich_ingredients = list("Ham", "Egg", "Cheese")

TypeError: list expected at most 1 argument, got 3

We can see here that this printed out a TypeError, noting that it said the list function expected at most 1 argument, got 3. This is because square brackets were not used to tell it that all the items were part of a single list.

Another key aspect of lists is that they can contain many different data types. Although we only used strings above, we can also put integers or floats into lists:

#create a list of just integers
num_list = [1, 2, 3, 4]
#create a list of just floats
float_list = [1.2, 2.3, 4.5, 6.8]
#print the results
print(type(num_list))
print(num_list)
print("\n")
print(type(float_list))
print(float_list)

We can also mix different data types within a list; they don’t all have to be the same. We can even put lists within lists, as seen below:

#different list
random_list = ["Hello", 3, "Cheese", 6.2, [1,2,3]]
#print the result
print(type(random_list))
print(random_list)

An important part of lists is that they are ordered collections of data, and that the values within each list are changeable and duplicate values are allowed.

Firstly, in terms of order, we say lists are ordered in that they have a clearly defined order, and that order will not change unless we change it. If you decide to add things to a list, for example, they will be placed at the end.

This order allows us to access values from the list that we know are in a set position of that order. For example, if for our fruit list we ordered them by where we would pass them on our weekly shopping trip and we know the first fruit will appear first but not what it is, we can access this using the index of the list. Of course, since this is Python, everything begins with an index of 0, so that we can access the first item with the following notation:

#access the first item from the list
print(fruit_list[0])

For this we can see that square bracket notation was used, with the index number inside, as in [0], which is how we access the first item.

Following this example, and counting up, the second item in the list can be accessed using [1], with the third being accessed using [2]. So you must remember that when you want to access an item from the list, the index that you use is one less than the position of the item. For example:

#print the second fruit from the list
print(fruit_list[1])
#print the third fruit from the list
print(fruit_list[2])

For this, anything that resolves to an integer can be used to access something from a list, as long as that index exists in the list you are trying to access. In our example the fruit list only has three items, so if you tried to use the index 3 you would get:

third_fruit = fruit_list[3] 

An IndexError, as the list index is out of range. This is an informative error: if you get it, it tells you that your list is not as long as you think it is, or that something is missing from your list.
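To see the error without stopping the rest of your code, one option is to catch it. This sketch uses a try/except statement, which is not covered in this workshop, and the names fourth_fruit and message are chosen only for illustration:

```python
fruit_list = ["Apple", "Banana", "Peach"]

try:
    fourth_fruit = fruit_list[3]   # valid indexes are 0, 1 and 2
except IndexError as error:
    fourth_fruit = None            # nothing could be retrieved
    message = str(error)           # the text of the error

print(message)          # list index out of range
print(len(fruit_list))  # 3
```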

Something interesting about this however is that not only can we count forward, we can also count backward over lists. This means that not only can we access lists from the beginning using indexes we can also access lists from the end. For example, if you created a list that went up in scores but you were interested in the second largest score, you could access this as follows:

#create a list of scores
scores = [12,42,62,65,73,84,89,91,94]
#extract the second highest score
second_highest_score = scores[-2]
#print the result
print(second_highest_score)

Of course, when counting backward we do not start at 0, as that would create confusion as to whether you wanted the first or the last entry of the list. Instead you start from -1 for the last item, and the number becomes more negative the further from the end you go.

Finally, in terms of using indexes to access things in a list, you can also access more than one element at a time using a slice. A slice allows you to access a range of items in a list using the notation list[start index: end index], where it is important to note that the item at the end index is not included in the slice.

#second lowest to fifth lowest
print(scores[1:5])
#print second lowest
print(scores[1:2])
#print the sixth lowest to the highest
print(scores[5:])
#print the third highest to the highest
print(scores[-3:])
#print from beginning to end
print(scores[:])
#print every 2nd
print(scores[::2])
#print the list in reverse
print(scores[::-1])

We can see from above several different rules that apply to this:

  • When printing the second lowest item using [1:2], only one item was returned because the end index is not included in the result
  • When printing [5:], we did not specify an end index, so everything from index 5 to the end of the list was printed
  • When printing [::2], the second : allowed us to specify the step between indexes, which is why every 2nd item was shown

It is also important to note that a slice will always return a list, even if it is just containing one thing. This is important so that you know what type of output is produced and what you can then do with it.
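A short sketch of the difference between indexing and slicing, reusing the scores from earlier (the names single_item and single_slice are illustrative):

```python
scores = [12, 42, 62, 65, 73, 84, 89, 91, 94]

single_item = scores[1]     # indexing returns the item itself
single_slice = scores[1:2]  # slicing returns a list, even with one item

print(single_item, type(single_item))    # 42 <class 'int'>
print(single_slice, type(single_slice))  # [42] <class 'list'>
```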

Finally, in terms of accessing things from lists, what if we know what we want to access from a list but we don’t know where it is in the list? If we knew we had Banana in the fruit list but we forgot its position, we could find it using the index() method:

#find the index of banana
print(fruit_list.index("Banana"))
#find the index of peach
print(fruit_list.index("Peach"))

This can be useful if you forget the location of your item, but also if you have lists where the order is related to each other. For example, if the scores list was linked to a list of names, you could find the index of the name and then use that to access their score from another list.
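A minimal sketch of that idea, assuming two lists kept in the same order. The names and scores here are made up for illustration:

```python
names = ["Peter", "Geneva", "John"]   # hypothetical names
scores = [62, 84, 73]                 # scores in the same order as names

position = names.index("Geneva")      # find where the name is stored...
geneva_score = scores[position]       # ...and use that index in the other list

print(position)      # 1
print(geneva_score)  # 84
```

This only works while both lists stay in the same order, which is one reason dictionaries, covered later, are often a better fit for linked data.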

The only issue with this is that if you misspell the item, or the item is not in the list, the method will raise an error and stop the code from running:

print(fruit_list.index("Orange")) 

There are many ways around this, but one simple way is to use an if/else statement, which will be explained in later lectures. For now, this is done using:

if "Banana" in fruit_list:
    print("Banana is at index:", fruit_list.index("Banana"))
else:
    print("Banana not in list")

Beyond just accessing what is in a list there are many other things lists are good at, including multiple methods we can use in conjunction with lists to make them more useful.

One of the more important things about lists is that they are “mutable” which simply means that items within a list can be changed, including changing individual results, inserting new things into the middle of lists or even sorting them:

scores = [12,42,62,65,73,84,89,91,94]
#we can change the score at index 1 (the second item)
#print the second lowest score
print("Original score:", scores[1])
#reassign the score
scores[1] = 52
#check the reassignment
print("Changed score:", scores[1])

#we can add a new score at the end using the append method
print("Original scores", scores)
#add new score
scores.append(67)
#print the new scores
print("New scores", scores)
#or add new scores in a specific position
scores.insert(3, 48)
#print the newer scores
print("Newer scores", scores)

#we can remove a score from the list
print("Original scores", scores)
#remove the score of 89
scores.remove(89)
#print the new scores
print("New scores", scores)
#alternative methods for removal include:
# the pop() method removes the item at the specified index
# scores.pop(1)
# If you do not specify an index the pop() method removes the last item
# scores.pop()
# we can also completely clear the list
# scores.clear()

At this point, after changing scores, adding new ones and removing some, the list of scores is no longer the same as it was before. Its length has changed, and the scores are no longer ordered from smallest to largest.

We can check the first of these using the len() function, which tells us how long the list is:

print(len(scores)) 

So now we know we have ten scores, but they are no longer in order from smallest to largest. Again, we can rectify this using either the sort() method or the sorted() function:

#print current scores
print(scores)
#we can assign the sorted result to a new list as follows:
new_sorted_scores = sorted(scores)
print(new_sorted_scores)
#or we can sort the list itself
scores.sort()
print(scores)
#we can even sort it in descending order
scores.sort(reverse = True)
print(scores)
#or reverse the current order by using scores.reverse()

Finally, we can add lists together by simply using the + operator, or use the extend() method to add to an existing list:

Names1 = ["Peter", "Geneva", "John"]
Names2 = ["Katie", "Suzie", "Scott"]
#add lists together using +
added_names = Names1 + Names2
print(added_names)
#add lists together by extending one list
Names1.extend(Names2)
print(Names1)
#repeat a list by multiplying it by an integer
double_names = Names2 * 2
print(double_names)

Thus, what we now know about lists is that they are ordered, they are changeable and they can contain duplicate values. This makes lists very versatile, and they act as the basis for a lot of data storage methods.

Challenges

#create a list called names with: "Juliet", "James", "Steven", "Sarah", "Suzie"
#extract the third name from the list
#extract the second to fourth name using slicing
#extract the last name from the list
#print the length of the list
#add "Sasha" to the list

Tuples

The second sequence in the list that we will explore is that of the Tuple.

Tuples are similar to lists in that they are ordered and indexable, meaning that information in a tuple can be accessed in the same way as in a list. However, the main differences between tuples and lists are that tuples are immutable, meaning that they cannot be changed, and that they are created using () instead of [].

We can create our first tuple as follows:

#create the tuple
cars = ("Ford", "Hyundai", "Toyota", "Kia")
#create a second tuple
fruits_tuple = tuple(("Strawberry", "peach", "tomato"))
#create a third tuple
vegetable_tuple = tuple(["potato", "onion", "celery"])
#print the result
print(cars)
print(type(cars))
print(fruits_tuple)
print(type(fruits_tuple))
print(vegetable_tuple)
print(type(vegetable_tuple))

As already mentioned, like lists they are ordered and do allow duplicate values. This means that we can access information from tuples in the same way we would with lists, using the index:

#get the first item from the tuple
print(cars[0])
#get the last item from the tuple
print(cars[-1])
#get the second and third from the tuple
print(cars[1:3])
#get everything from index 1 onwards
print(cars[1:])
#get everything before index 3
print(cars[:3])

Also like lists, whenever a slice is taken the type of the slice will be the same as the type of object you are taking a slice of. Here, because we are taking a slice of a tuple, a tuple is returned.

Finally, since they are indexed, we can include duplicate values because we can identify them with the index value as follows:

cars2 = ("Ford", "Hyundai", "Toyota", "Kia", "Ford")
#print the first and last items, then the whole tuple
print(cars2[0])
print(cars2[-1])
print(cars2)

Again, like lists, if we don’t know the position but we know the value, we can find its position using the index() method:

#get the index for Hyundai
print(cars2.index("Hyundai"))
#get the index of the Ford
print(cars2.index("Ford"))

It is important to note, however, that when looking up a duplicate value, the index() method will only return the index of the first occurrence of that value.
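If you do need the later occurrences, one possible approach is sketched below. It relies on count() and on the optional start argument of index(), which tells the search where to begin; that argument is not otherwise covered in this workshop:

```python
cars2 = ("Ford", "Hyundai", "Toyota", "Kia", "Ford")

first_ford = cars2.index("Ford")                    # first occurrence
ford_count = cars2.count("Ford")                    # how many times it appears
second_ford = cars2.index("Ford", first_ford + 1)   # search after the first one

print(first_ford)   # 0
print(ford_count)   # 2
print(second_ford)  # 4
```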

Given that tuples are immutable, meaning that they cannot be changed once they have been created, there is one way around this: first convert the tuple into a list, update the value, and then convert it back to a tuple:

#print the tuple
print(cars)
#change it to a list
tuple_list = list(cars)
#change the value
tuple_list[0] = "Maserati"
#reassign back to the tuple
cars = tuple(tuple_list)
#print the result
print(cars)

Of course, if you wanted to do this then you should have created a list in the first place.

The only other way to change a tuple is to join two tuples together to form a new tuple. This of course means that the only change you can make to a tuple is adding things on at the end or the beginning, not changing any values inside the tuple itself:

#create new tuples
tuple1 = ("a", "b", "c")
tuple2 = (1,2,3)
#add together using the +
tuple3 = tuple1 + tuple2
print(tuple3)
#repeat an existing tuple by multiplying it
tuple4 = tuple1*2
print(tuple4)

Finally, we have several built-in functions and methods that work with tuples, just like we do for lists:

#print the length of the tuple
print(len(tuple1))
#print the count of a value within a tuple
print(tuple4.count("a"))
#print the maximum value from a tuple
print(max(tuple2))
#print the minimum value from a tuple
print(min(tuple2))

The tuple is clearly similar to a list in its nature in that it is ordered so that data can be accessed using the index value. However it is primarily different to a list in that it cannot be changed. This means that it can be used in instances where you don’t want any information to be changed after it has already been created. An example of this may be when you don’t want results from an experiment to be overwritten, for initial values to be changed or for security reasons.

Challenges

### Create a tuple called companies with: "Apple", "Microsoft", "Google", "Facebook", "Amazon"
### extract the second to the fourth from the tuple
### extract the length of the tuple
### multiply the tuple by 3

Sets

Sets are another data structure that you can use to store multiple items in a single variable, just like lists. However there are four main differences:

  • Sets are created using curly brackets or using the set() constructor
  • They are unordered
  • They are unindexed
  • They cannot allow duplicate values

This is important for how they are used. For example, let’s create a set of fruits:

#create a set using curly brackets
fruits = {"apple", "banana", "cherry"}
#create a set using the set constructor
vegetables = set(("courgette", "potato", "aubergine"))
#print the results
print(fruits)
print(vegetables)

From this we can see that we created the set using the {} notation. We can also see that when printing the set, it may not appear in the same order in which the data was entered. This relates to the fact that it is unordered, so the items in a set will not always appear in the same order.

This then brings us onto the fact that they are unindexed. This means that items cannot be accessed in the same way as with a list, because we have no guarantee that they will stay in the same position. Instead, there are two main ways to work with the items in a set, looping over them or checking whether a particular item is present:

#use a loop to iterate over the set
for x in fruits:
    print(x)
#or check whether the fruit you want is in the set
print("apple" in fruits)
#which acts the same way as if it were in a list

What this means is that, unlike with a list, we cannot change an individual item in a set in place, because we cannot access it by index. Sets themselves are still mutable, however, and the way to change a set is to add or remove items:

#we can add using the add method
fruits.add("orange")
#check the updated set
print(fruits)
#we can add another set to the original set
tropical = {"pineapple", "mango", "papaya"}
fruits.update(tropical)
#print the updated set
print(fruits)
#we can also use the update method to add any iterable object (tuples, lists, dictionaries etc.)
new_veg = ["onion", "celery"]
vegetables.update(new_veg)
print(vegetables)

There are also several ways of removing items from sets as well:

#we can use the remove method
fruits.remove("apple")
print(fruits)
#the issue with this is if the item does not exist remove() will raise an error
#or the discard method
fruits.discard("mango")
#this does not raise an error
print(fruits)
#finally we can also use the pop method
#but since a set is unordered it removes an arbitrary item,
#so we don't know which item will be removed
fruit_removed = fruits.pop()
print(fruit_removed)
print(fruits)
#finally we can clear the set using the clear method
fruits.clear()
print(fruits)
#or delete the set completely
del fruits
#printing it now raises a NameError, because fruits no longer exists
print(fruits)

Finally, the last important thing about sets is that they cannot contain duplicate values. This is beneficial when we don’t want to contain duplicates like names, and can be used to find the unique values contained within given information. If we try to add duplicates:

cars = {"Ford", "Chevrolet", "Toyota", "Hyundai", "Volvo", "Ford"}
print(cars)

It will simply remove the duplicate from the set and will show only unique items.
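The same behaviour gives a quick way of finding the unique values in a list. In this sketch the list of names is made up for illustration, and sorted() is used only to print the result in a predictable order:

```python
names = ["Katie", "Scott", "Katie", "Suzie", "Scott"]  # hypothetical data

unique_names = set(names)     # duplicates are dropped on conversion

print(len(names))             # 5
print(len(unique_names))      # 3
print(sorted(unique_names))   # ['Katie', 'Scott', 'Suzie']
```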

This has important implications for when we want to join two sets, and there are multiple methods of doing so:

set1 = {1, 2, 3}
set2 = {"one", "two", "three"}
#we can use union to return a new set with all items from both sets
set3 = set1.union(set2)
print(set3)
#or we can use update to insert items in set2 into set1
set1.update(set2)
print(set1)

In merging, we can also make sure we keep only the items that appear in both sets:

fruits = {"apple", "banana", "cherry"}
companies = {"google", "microsoft", "apple"}
#by creating a new set that contains only the shared items
both = fruits.intersection(companies)
print(both)
#or keep only items that are present in both sets
fruits.intersection_update(companies)
print(fruits)

Or we can do the reverse and extract anything but the duplicates:

fruits = {"apple", "banana", "cherry"}
companies = {"google", "microsoft", "apple"}
#by creating a new set that contains no shared items
both = fruits.symmetric_difference(companies)
print(both)
#or keep only items that are not present in both sets
fruits.symmetric_difference_update(companies)
print(fruits)

Thus we can see that sets are unique in that they are unordered, unindexed and do not allow duplicate values. The latter is an important characteristic as they can be used when we want to extract only the unique items from something, rather than having multiple instances of it such as names, but cannot be used when we may want to retain a certain order within the dataset.

Challenges

# Create a set called sports with: "Basketball", "Football", "Netball", "Baseball", "Ice Hockey"
# check to see if "Athletics" in set
# Add "Hockey" to the set

Dictionaries

Dictionaries are the final data structure that you can use to store information, like the lists, tuples and sets already introduced. They are a collection which is ordered (as of Python 3.7), changeable and does not allow duplicate values (at least in the keys).

The primary difference from the previous data structures is that data is stored in key:value pairs, written with curly brackets rather than square or round brackets. We can create a dictionary as follows:

new_dict = {"Name":"Peter Jones", "Age":28, "Occupation":"Data Scientist"}
print(new_dict)

What we can see here is that we have the “key” which can be used to access the “values”. For example, if we wanted to know the name of the person stored in this dictionary we can access it using the “key”:

#the first way is as we would with a list
print(new_dict["Name"])
#however we can also use .get()
print(new_dict.get("Name"))
#the difference between the two is that with get(), if the key
#does not exist, an error will not be triggered (None is returned),
#while with the first method a KeyError will be raised
#try for yourself:
print(new_dict.get("colour"))
#print(new_dict["colour"])

Accessing information this way means that we can’t have duplicate keys in a dictionary, as we wouldn’t know which value we would be accessing:

second_dict = {"Name":"William", "Name":"Jessica"}
print(second_dict["Name"])

As we can see here, we set two "Name" keys, and when trying to access the information it only prints the second value, not the first. This is because the second value overwrites the first value for that key.

As with lists and sets, but unlike tuples, dictionaries are mutable, meaning that we can change, add or remove items after the dictionary has been created. We can do this in a similar way to how we access individual items in a list. For example:

#create the dictionary
car1 = {"Make":"Ford", "Model":"Focus", "year":2012}
#print the original year
print(car1["year"])
#change the year
car1["year"] = 2013
#print the new car year
print(car1["year"])
#add new information key
car1["Owner"] = "Jake Hargreave"
#print updated car information
print(car1)
#or we can add another dictionary to the existing dictionary using the update method
#this will be added to the end of the existing dictionary
car1.update({"color":"yellow"})
#this can also be used to update an existing key:value pair
#print updated version
print(car1)

Thus, we can see that we can change the information contained in a dictionary. We can also remove information from a dictionary in a similar way to how we would for a list:

scores = {"Steve":68, "Juliet":74, "William":52, "Jessica":48, "Peter":82, "Holly":90}
#we can use the del statement
del scores["Steve"]
#although be careful, as if you don't specify the key you will delete the whole dictionary
print(scores)
#we can also use the pop method
scores.pop("William")
print(scores)
#or popitem removes the last item (although in versions before Python 3.7 it removes an arbitrary item)
scores.popitem()
print(scores)
#or we could empty the entire dictionary
scores.clear()
print(scores)

Dictionaries, like lists, can also contain any data type you want them to contain. As we’ve already seen, a value can be a string or an integer, but dictionaries can also hold floats, lists or even other dictionaries, along with different types within the same dictionary:

mixed_dict = {"number":52,
              "float":3.49,
              "string":"Hello world",
              "list":[12, "Cheese", "Orange", 52],
              "Dictionary":{"Name":"Jemma", "Age":23, "Job":"Scientist"}}
#can you figure out how to access each of these?
#accessing the float?
#accessing the second value in the list?
#accessing the age from the dictionary?

Finally, as with lists, we have methods that can be used for dictionaries as well:

dictionary = {"Score1":12, "Score2":53, "Score3":74, "Score4":62, "Score5":88, "Score6":34}
#access all the keys from the dictionary
print(dictionary.keys())
#access all the values from the dictionary
print(dictionary.values())
#access a tuple for each key value pair
print(dictionary.items())
#get the length of the dictionary
print(len(dictionary))

Thus, we have covered the main parts of a dictionary. The benefits of these are that you can assign information to them based on an individual key, for example if you had linked lists of names, scores and ages you could create a dictionary with each of these keys and lists for each. Alternatively if you had many cars and there was defined information for them you could create dictionaries for each of them with the keys representing the basic information. They also lay the foundation for more complex data storage methods such as pandas dataframes, JSON or others.
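As a rough sketch of the first idea, linked lists of names, scores and ages can be stored under one dictionary. The key names and the data here are made up for illustration:

```python
# Each key holds a list, and the lists share the same order.
people = {
    "Name": ["Peter", "Geneva", "John"],
    "Score": [62, 84, 73],
    "Age": [28, 35, 41],
}

position = people["Name"].index("Geneva")   # find the person's position
geneva_score = people["Score"][position]    # look up the linked values
geneva_age = people["Age"][position]

print(geneva_score)  # 84
print(geneva_age)    # 35
```

This layout, with one key per column and one list of values per key, is close to how tabular data is often organised.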
