Selecting Subsets of Data from Pandas

Khalid Gharib
Nov 1, 2020


One of the most common tasks in data analysis is selecting subsets of your data. This means extracting specific rows, columns, or individual cells from a DataFrame or Series.

Selection types:

You can select by columns only, by rows only, or by a combination of both.

The example I will use is a dataset that contains information about a group of students.

The main ways to select subsets of data from a DataFrame are the three indexers:

  1. []
  2. loc
  3. iloc

- []

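We can use the first indexer to select a single column or multiple columns.

df['color']

Image A: df['color'] returns the column as a Series.

TIP: if you use double brackets, [[ ]], the result is returned as a DataFrame rather than a Series, as seen in Image B.

df[['color']]

Image B: df[['color']] returns the column as a one-column DataFrame.

To select several columns:

df[['name', 'color']]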

If you are selecting multiple columns, you must put the column names in a list first and then pass that list to the indexer. Without the inner list, as in df['name', 'color'], pandas looks for a single column labelled ('name', 'color') and raises a KeyError.

You can also tidy this up by assigning the list of columns to a variable and passing that variable to the indexer, like so:

cols = ['color', 'age', 'score']
df[cols]
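
To make the difference between the two return types concrete, here is a minimal sketch. The student values below are made up for illustration and are not the article's actual dataset; only the column names come from the examples above.

```python
import pandas as pd

# Hypothetical sample data (assumed values, not the article's real dataset).
df = pd.DataFrame(
    {
        "name": ["Jane", "Niko", "Dean"],
        "color": ["blue", "green", "red"],
        "age": [30, 2, 32],
        "score": [4.6, 8.3, 1.8],
    }
)

single = df["color"]       # one column label -> Series
as_frame = df[["color"]]   # list with one label -> one-column DataFrame

cols = ["color", "age", "score"]
subset = df[cols]          # list of labels -> DataFrame, columns in list order

print(type(single).__name__)
print(type(as_frame).__name__)
print(list(subset.columns))

# Without the inner list, pandas treats the labels as one tuple key
# and raises KeyError because no such column exists.
try:
    df["name", "color"]
except KeyError as err:
    print("KeyError:", err)
```

The columns come back in the order given in the list, so this is also a handy way to reorder columns.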

- loc

The loc indexer selects subsets by the labels of the rows or columns. With loc, you can select rows and columns simultaneously.

df.loc[['Dean', 'Cornelia'], ['age', 'state', 'score']]

The rows and columns must be in separate lists, separated by a comma.

Again, we can clean this up using the same method as above and get the same result:

rows = ['Dean', 'Cornelia']
cols = ['age', 'state', 'score']
df.loc[rows,cols]
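
As a rough, self-contained illustration of label-based selection, the sketch below assumes the student names are the row index. The values are invented for the example.

```python
import pandas as pd

# Hypothetical values; the student names are assumed to be the row index.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "AK", "TX"],
        "age": [30, 2, 32, 69],
        "score": [4.6, 8.3, 1.8, 2.2],
    },
    index=["Jane", "Niko", "Dean", "Cornelia"],
)

rows = ["Dean", "Cornelia"]
cols = ["age", "state", "score"]

# Rows first, then columns, separated by a comma.
result = df.loc[rows, cols]
print(result)

# Two scalar labels return a single cell value instead of a DataFrame.
single_value = df.loc["Dean", "age"]
print(single_value)
```

Both the rows and the columns are returned in the order listed, not in the order they appear in the original DataFrame.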

When selecting subsets, we can also use Python's slice notation. This allows us to select everything between two labels and, optionally, choose a step. Unlike ordinary Python slices, a loc slice includes the stop label.

For example:
'Niko':'Christina' ← starts at Niko and stops at Christina (Christina is included); if no step is given, it defaults to 1.

cols = ['state', 'color']
df.loc['Jane':'Penelope', cols]

NOTE: a bare ':' selects everything along that axis, which here means all the rows.

cols = ['food', 'color']
df.loc[:, cols]
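
The slicing behaviour can be checked with a small sketch. The data is again hypothetical, and the row order is an assumption made only so that the slice has something to span.

```python
import pandas as pd

# Hypothetical values; row order is assumed for the sake of the example.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK"],
        "color": ["blue", "green", "red", "white", "gray"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese"],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean"],
)

# Label slices include BOTH endpoints, so Penelope is part of the result.
by_label = df.loc["Jane":"Penelope", ["state", "color"]]
print(list(by_label.index))

# An optional third value is the step: every 2nd row from Jane to Penelope.
stepped = df.loc["Jane":"Penelope":2]
print(list(stepped.index))

# A bare ':' keeps every row.
all_rows = df.loc[:, ["food", "color"]]
print(all_rows.shape)
```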

- iloc

Finally, we have iloc, which works like loc but uses only integer positions (which is what the i stands for).

rows = [2, 4]
cols = [0, -1]
df.iloc[rows, cols]

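In this case, we are getting only the rows at positions 2 and 4 (remember Python starts counting from 0, so position 2 is actually the 3rd name), not the range from 2 to 4. Likewise, we are getting only columns 0 and -1, where 0 refers to the first column and -1 refers to the last column (-2 would be the 2nd-to-last column, and so on). To get a range of rows or columns, use a slice instead of a list.

We can also use slicing with iloc. As with ordinary Python slices, the stop position is excluded.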

rows = [1, 3, 5]
df.iloc[rows, 3:]

Here we are selecting the rows at positions 1, 3, and 5 and every column from position 3 (the 4th column) onwards. You could also select only the first 3 columns by using ':3' instead of '3:'.
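
To see how lists and slices of integer positions differ, here is a short sketch. It uses invented data with five columns, so the column positions are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical values; five columns so that positions 0..4 and -1 are meaningful.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon"],
        "age": [30, 2, 12, 4, 32, 33],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina"],
)

# A list picks exactly those positions: rows 2 and 4, first and last column.
picked = df.iloc[[2, 4], [0, -1]]

# A slice picks a range and excludes the stop: rows 2, 3, 4 and columns 0, 1.
ranged = df.iloc[2:5, 0:2]

# '3:' keeps position 3 onwards (the 4th and 5th columns).
tail_cols = df.iloc[[1, 3, 5], 3:]

# ':3' keeps the first three columns.
head_cols = df.iloc[[1, 3, 5], :3]

print(picked)
print(ranged)
print(list(tail_cols.columns))
print(list(head_cols.columns))
```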

I hope this helps you in your data analysis work. If you liked what you read, feel free to follow my blog!
