Selecting Subsets of Data in Pandas
One of the most common tasks in the data analysis process is selecting subsets of the data. This means extracting specific rows, columns, or individual cells from a DataFrame or Series.
Selection Types:
You can select by rows only, by columns only, or by a combination of both.
The example I will use is a dataset that contains information about a group of students.
The main way to select subsets of data from a DataFrame is with one of three indexers:
- []
- loc
- iloc
- []
The first indexer, [], selects one or more columns. Passing a single column name returns a Series (Image A):
df['color']
(TIP) If you use double brackets, [[ ]], the result comes back as a DataFrame instead (Image B):
df[['color']]
df[['name', 'color']]
If you are selecting multiple columns, you must put the column names in a list and pass that list to the indexer. Writing df['name', 'color'] without the inner list raises a KeyError, because pandas looks for a single column labeled ('name', 'color').
You can also simplify this by assigning the list of columns to a variable and passing that variable to the indexer, like so:
cols = ['color', 'age', 'score']
df[cols]
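To make the difference between the single-bracket and double-bracket forms concrete, here is a minimal sketch. The student values are made up for illustration (the original dataset is not reproduced here), and it assumes the student names are used as the row labels.

```python
import pandas as pd

# Hypothetical student data; values are invented for illustration only.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

single = df["color"]      # one column name -> Series
framed = df[["color"]]    # list with one name -> one-column DataFrame

cols = ["color", "age", "score"]
subset = df[cols]         # list of names -> DataFrame with those columns

# Forgetting the inner list makes pandas look up the tuple ('color', 'age')
# as a single column label, which does not exist here.
try:
    df["color", "age"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(type(single).__name__, type(framed).__name__, list(subset.columns))
```

The columns in the result follow the order of the list you pass in, not the order they have in the original DataFrame.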
- loc
The loc indexer selects subsets by the labels of the rows and columns. With loc, you can select rows and columns at the same time.
df.loc[['Dean', 'Cornelia'], ['age', 'state', 'score']]
The rows and columns must be in separate lists, separated by a comma.
Again, we can clean this up using the same approach as above and get the same result:
rows = ['Dean', 'Cornelia']
cols = ['age', 'state', 'score']
df.loc[rows,cols]
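The loc selection above can be tried end to end with a small stand-in dataset. This is only a sketch: the values are invented, and it assumes the student names are the row index, which is what loc needs in order to look rows up by name.

```python
import pandas as pd

# Hypothetical student data; values are invented for illustration only.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

rows = ["Dean", "Cornelia"]
cols = ["age", "state", "score"]

# Row labels first, column labels second, separated by a comma.
picked = df.loc[rows, cols]

# A single row label and a single column label return one cell value.
dean_age = df.loc["Dean", "age"]

print(picked)
print(dean_age)
```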
When selecting subsets, we can also use Python's slice notation. This lets us select everything between two labels and optionally choose a step. Unlike ordinary Python slices, a loc slice includes the stop label.
For example:
'Niko':'Christina' ← starts at Niko and stops at Christina (inclusive); if no step is given, it defaults to 1.
cols = ['state', 'color']
df.loc['Jane':'Penelope', cols]
NOTE: a bare ':' selects everything along that axis, so the following returns all rows:
cols = ['food', 'color']
df.loc[:, cols]
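Label slicing with loc behaves a little differently from normal Python slicing, so a short sketch may help. The data is a made-up stand-in with student names as the row index (an assumption), and the row order shown here is what determines which names fall inside each slice.

```python
import pandas as pd

# Hypothetical student data; values are invented for illustration only.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

# Label slices include BOTH endpoints: Jane, Niko, Aaron, Penelope.
inclusive = df.loc["Jane":"Penelope", ["state", "color"]]

# A step of 2 keeps every other row between the two labels.
stepped = df.loc["Niko":"Christina":2]

# A bare colon selects every row; only the columns are restricted.
all_rows = df.loc[:, ["food", "color"]]

print(list(inclusive.index))
print(list(stepped.index))
print(all_rows.shape)
```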
- iloc
Finally, we have iloc, which works like loc but selects by integer location only (that is what the i stands for).
rows = [2, 4]
cols = [0, -1]
df.iloc[rows, cols]
In this case, we are getting only rows 2 and 4 (remember, Python starts counting from 0, so 2 is actually the 3rd name) and only columns 0 and -1, where 0 refers to the first column and -1 refers to the last column (-2 would be the 2nd-to-last column, and so on). Because these are lists rather than slices, nothing in between is included.
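Here is a minimal sketch of selecting with lists of integer positions. The data is invented, and the row and column order is an assumption chosen so the positions are easy to follow.

```python
import pandas as pd

# Hypothetical student data; values are invented for illustration only.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

rows = [2, 4]    # 3rd and 5th rows only, not the range between them
cols = [0, -1]   # first and last columns only

picked = df.iloc[rows, cols]

print(picked)
```

With this row order, positions 2 and 4 correspond to Aaron and Dean, and row 3 (Penelope) is skipped.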
We can also use slicing for iloc
rows = [1, 3, 5]
df.iloc[rows, 3:]
Here we are selecting rows 1, 3, and 5 and every column from integer position 3 (the 4th column) onwards. You could also select only the first 3 columns by using ':3' instead of '3:'. Unlike loc, an iloc slice excludes its stop position.
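The sketch below shows both slice directions with iloc, plus the exclusive stop. As before, the data is a made-up stand-in and the column order is an assumption.

```python
import pandas as pd

# Hypothetical student data; values are invented for illustration only.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

rows = [1, 3, 5]

tail_cols = df.iloc[rows, 3:]   # position 3 (the 4th column) to the end
head_cols = df.iloc[rows, :3]   # positions 0, 1, 2 (the first 3 columns)

# Integer slices exclude the stop position: this returns rows 2 and 3 only.
middle = df.iloc[2:4]

print(list(tail_cols.columns))
print(list(head_cols.columns))
print(list(middle.index))
```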
I hope this helps you in your data analysis work. If you liked what you read, feel free to follow my blog!