How to select a subset of data using index labels?

pandas, server side programming, programming. Updated 2025/6/25 10:07:17

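Pandas supports two kinds of selection: you can select a subset of data by index position or by index label. In this article, I will show you how to select a subset of data using index labels.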

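Recall that Python dictionaries and lists are built-in data structures that select data by key (label) and by position, respectively. A dictionary key can be any hashable object, such as a string, an integer, or a tuple, while a list must be indexed with an integer (position) or a slice object.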
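As a quick reminder of the two built-in styles, here is a minimal sketch. The names budget_by_title and titles are made up for illustration; the values are taken from rows shown later in this article.

```python
# Illustrative data only; the variable names are hypothetical.
budget_by_title = {"Avatar": 237000000, "Alien": 11000000}
titles = ["Alien", "Aliens", "Avatar"]

# A dictionary is selected by key (a label).
avatar_budget = budget_by_title["Avatar"]

# A list is selected by integer position or by a slice.
first_title = titles[0]
first_two = titles[0:2]  # the stop position (2) is excluded

print(avatar_budget, first_title, first_two)
```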

Pandas provides the .loc and .iloc attributes, each of which indexes in its own way. With .iloc, Pandas selects by position only, much like a Python list. With .loc, it selects by index label, much like a Python dictionary.

Selecting a subset of data by index label with .loc[]

The loc and iloc attributes are available on both Series and DataFrame.
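A small sketch of the difference, on both a DataFrame and a Series. The frame movies_demo is a hypothetical three-row stand-in for the real dataset, built from values that appear later in this article.

```python
import pandas as pd

# Hypothetical miniature of the movies data, indexed by title.
movies_demo = pd.DataFrame(
    {"budget": [11000000, 18500000, 237000000],
     "vote_average": [7.9, 7.7, 7.2]},
    index=pd.Index(["Alien", "Aliens", "Avatar"], name="title"),
)

# DataFrame: the same row, once by label and once by position.
row_by_label = movies_demo.loc["Aliens"]
row_by_position = movies_demo.iloc[1]
print(row_by_label.equals(row_by_position))

# Series: both attributes work the same way on a single column.
vote_average = movies_demo["vote_average"]
value_by_label = vote_average.loc["Avatar"]
value_by_position = vote_average.iloc[2]
print(value_by_label, value_by_position)
```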

Import the movies dataset, using the title as the index.

import pandas as pd
movies = pd.read_csv(
   "movies_data.csv",
   index_col="title",
   usecols=["title","budget","vote_average","vote_count"]
)

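I always recommend sorting the index, especially when it consists of strings. Label lookups and slices can be faster on a sorted index, and you will notice the difference on very large datasets.

Input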

movies.sort_index(inplace = True)
movies.head(3)

Output

title	budget	vote_average	vote_count
(500) Days of Summer	7500000	7.2	2904
10 Cloverfield Lane	15000000	6.8	2468
10 Days in a Madhouse	1200000	4.3	5

I sorted the index with sort_index, passing inplace=True so that movies itself is modified.
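To check whether an index is already sorted, one option is the is_monotonic_increasing property of the index. This is only a sketch on a hypothetical three-row frame, not the real dataset.

```python
import pandas as pd

# Hypothetical frame whose titles are deliberately out of order.
unsorted_movies = pd.DataFrame(
    {"vote_average": [7.2, 7.9, 7.7]},
    index=pd.Index(["Avatar", "Alien", "Aliens"], name="title"),
)
print(unsorted_movies.index.is_monotonic_increasing)

# Without inplace=True, sort_index returns a new, sorted DataFrame.
sorted_movies = unsorted_movies.sort_index()
print(sorted_movies.index.is_monotonic_increasing)
print(list(sorted_movies.index))
```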

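One interesting detail of the .loc syntax: it uses square brackets [] rather than parentheses (), because .loc is an indexer attribute rather than a method. I believe (though I may be wrong) this was chosen for consistency with plain [] indexing, which extracts rows on a Series and columns on a DataFrame.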
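The behaviour of plain [] described above can be seen in a short sketch. Again, movies_demo is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Hypothetical two-row frame indexed by title.
movies_demo = pd.DataFrame(
    {"budget": [11000000, 237000000], "vote_average": [7.9, 7.2]},
    index=pd.Index(["Alien", "Avatar"], name="title"),
)

# [] on a DataFrame selects a column, which is a Series.
budget_column = movies_demo["budget"]

# [] on a Series with a string label selects the entry for that label.
alien_budget = budget_column["Alien"]

# .loc[] on a DataFrame selects a row by its index label.
alien_row = movies_demo.loc["Alien"]

print(alien_budget)
print(list(alien_row.index))
```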

Input

# Extract "Spider-Man 3" (I'm not a big fan of Spidey)
movies.loc["Spider-Man 3"]

Output

budget 258000000.0
vote_average 5.9
vote_count 3576.0
Name: Spider-Man 3, dtype: float64

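Use a slice to extract multiple rows. I want to extract the movies I haven't watched yet. Because this is a label-based slice, we get every row between the two labels, including the end label "Avatar".

Remember that slicing a Python list excludes the stop value. A .loc slice works on labels rather than positions, so it includes both endpoints.

movies.loc["Alien":"Avatar"]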

title	budget	vote_average	vote_count
Alien	11000000	7.9	4470
Alien Zone	0	4.0	3
Alien: Resurrection	70000000	5.9	1365
Aliens	18500000	7.7	3220
Aliens in the Attic	45000000	5.3	244
...	...	...	...
Australia	130000000	6.3	694
Auto Focus	7000000	6.1	56
Automata	7000000	5.6	670
Autumn in New York	65000000	5.7	135
Avatar	237000000	7.2	11800

167 rows × 3 columns
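To see the inclusive and exclusive behaviour side by side, here is a minimal sketch. It assumes a small Series of vote counts with a sorted title index, built from rows shown in this article.

```python
import pandas as pd

# Hypothetical Series: vote counts indexed by title, already sorted.
titles = ["Alien", "Aliens", "Australia", "Avatar", "Avengers: Age of Ultron"]
votes = pd.Series([4470, 3220, 694, 11800, 6767], index=titles)

# Label slice with .loc: both "Alien" and "Avatar" are included.
label_slice = votes.loc["Alien":"Avatar"]

# Position slices with .iloc and with a plain list: the stop is excluded.
position_slice = votes.iloc[0:3]
list_slice = titles[0:3]

print(list(label_slice.index))
print(list(position_slice.index))
print(list_slice)
```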

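Can I select two or more movies that are not adjacent? Certainly, but it takes a little more effort: you need to pass a list of the titles you want.

In other words, you put square brackets inside the square brackets.

Input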

movies.loc[["Avatar", "Avengers: Age of Ultron"]]

title	budget	vote_average	vote_count
Avatar	237000000	7.2	11800
Avengers: Age of Ultron	280000000	7.3	6767

Can I change the order of the selection? Certainly: list the labels in the order you want.
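For instance, a sketch on a hypothetical two-row frame shows that the result follows the order of the list, not the order of the index.

```python
import pandas as pd

# Hypothetical frame; the index is sorted with "Avatar" first.
movies_demo = pd.DataFrame(
    {"budget": [237000000, 280000000], "vote_average": [7.2, 7.3]},
    index=pd.Index(["Avatar", "Avengers: Age of Ultron"], name="title"),
)

# Ask for the rows in the opposite order.
reordered = movies_demo.loc[["Avengers: Age of Ultron", "Avatar"]]
print(list(reordered.index))
```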

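Passing a list of labels looks neat, but do you know what happens if you misspell one? Pandas used to return a row of missing values (NaN) for the misspelled label. Those days are gone: since pandas 1.0, it raises a KeyError instead.

Input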

movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]]

---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-21-ebe975264840> in <module>
----> 1 movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1766
1767 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768 return self._getitem_axis(maybe_callable, axis=axis)
1769
1770 def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1952 raise ValueError("Cannot index with multidimensional key")
1953
-> 1954 return self._getitem_iterable(key, axis=axis)
1955
1956 # nested tuple slicing

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
1593 else:
1594 # A collection of keys
-> 1595 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1596 return self.obj._reindex_with_indexers(
1597 {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1550 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1551
-> 1552 self._validate_read_indexer(
1553 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1554 )

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1652 # just raising
1653 if not (ax.is_categorical() or ax.is_interval()):
-> 1654 raise KeyError(
1655 "Passing list-likes to .loc or [] with any missing labels "
1656 "is no longer supported, see "

KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'
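If you would rather handle the failure than let it stop your script, you can catch the KeyError. This is a sketch that assumes pandas 1.0 or later and uses a hypothetical two-row frame; the exact wording of the error message differs between pandas versions.

```python
import pandas as pd

# Hypothetical frame standing in for the movies dataset.
movies_demo = pd.DataFrame(
    {"budget": [237000000, 280000000], "vote_average": [7.2, 7.3]},
    index=pd.Index(["Avatar", "Avengers: Age of Ultron"], name="title"),
)
wanted = ["Avengers: Age of Ultron", "Avatar", "When is Avengers next movie?"]

try:
    selection = movies_demo.loc[wanted]
except KeyError as err:
    # At least one requested label is not in the index.
    selection = None
    print(f"KeyError: {err}")
```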

One way to handle this is to check directly whether the value is in the index.

Input

"When is Avengers next movie?" in movies.index

If you want to ignore the missing label and carry on, you can use the following approach.

movies.query("title in ('Avatar','When is Avengers next Movie?')")

title	budget	vote_average	vote_count
Avatar	237000000	7.2	11800
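Two other options are worth sketching, both on a hypothetical two-row frame with a unique index. The first keeps only the labels that actually exist before calling .loc. The second uses reindex, which returns a row of NaN for every missing label, much like the old .loc behaviour.

```python
import pandas as pd

# Hypothetical frame standing in for the movies dataset.
movies_demo = pd.DataFrame(
    {"budget": [237000000, 280000000], "vote_average": [7.2, 7.3]},
    index=pd.Index(["Avatar", "Avengers: Age of Ultron"], name="title"),
)
wanted = ["Avengers: Age of Ultron", "Avatar", "When is Avengers next movie?"]

# Option 1: drop the missing labels first, keeping the requested order.
present = [label for label in wanted if label in movies_demo.index]
found = movies_demo.loc[present]
print(list(found.index))

# Option 2: reindex keeps every requested label and fills gaps with NaN.
with_gaps = movies_demo.reindex(wanted)
print(with_gaps)
```

Option 1 is a good fit when missing titles should simply be skipped; option 2 is useful when you want to see which titles were not found.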
