Internal version of the take method that sets the _is_copy attribute to keep track of the parent dataframe (used in indexing for the SettingWithCopyWarning).
Construct and return axes if supplied in args/kwargs.
If require_all is set, raise if any axis argument is not supplied.
Return a tuple of (axes, kwargs).
sentinel specifies the default parameter when an axis is not
supplied; useful to distinguish when a user explicitly passes None
in scenarios where None has special meaning.
Create DataFrame from a list of arrays corresponding to the columns.
Parameters:
arrays (list-like of arrays) – Each array in the list corresponds to one column, in order.
columns (list-like, Index) – The column names for the resulting DataFrame.
index (list-like, Index) – The row labels for the resulting DataFrame.
dtype (dtype, optional) – Optional dtype to enforce for all arrays.
verify_integrity (bool, default True) – Validate and homogenize all input. If set to False, it is assumed
that all elements of arrays are already arrays in the form in which they will be
stored in a block (numpy ndarray or ExtensionArray), that they have the same
length as and are aligned with the index, and that columns and
index are already Index objects.
Test whether a key is a label reference for a given axis.
To be considered a label reference, key must be a string that:
(axis=0): Matches a column label
(axis=1): Matches an index label
Parameters:
key (Hashable) – Potential label name, i.e. Index entry.
axis (int, default 0) – Axis perpendicular to the axis that labels are associated with
(0 means search for column labels, 1 means search for index labels)
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to set the label. The value 0 or ‘index’ specifies index,
and the value 1 or ‘columns’ specifies columns.
inplace (bool, default False) – If True, do operation inplace and return None.
Returns:
The same type as the caller or None if inplace is True.
Internal version of the take method that sets the _is_copy
attribute to keep track of the parent dataframe (used in indexing
for the SettingWithCopyWarning).
See the docstring of take for full explanation of the parameters.
Get Addition of dataframe and other, element-wise (binary operator add).
Equivalent to dataframe+other, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, radd.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
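The effect of fill_value can be seen in a small sketch. The frames df and other below are made up for illustration; only DataFrame.add and its fill_value argument come from the description above.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not taken from the reference text).
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})
other = pd.DataFrame({"a": [10.0, 20.0], "b": [np.nan, 5.0]})

# Plain operator: any missing operand makes the result missing.
plain = df + other

# Flexible wrapper: a value missing on one side only is replaced by
# fill_value before the addition; missing on both sides stays missing.
filled = df.add(other, fill_value=0)

print(plain)
print(filled)
```

Position (0, 'b') is missing in both inputs, so it stays missing even with fill_value=0.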
Method to use for filling holes in reindexed Series:
pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.
limit (int, default None) – If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
fill_axis ({0 or 'index', 1 or 'columns'}, default 0) – Filling axis, method and limit.
broadcast_axis ({0 or 'index', 1 or 'columns'}, default None) – Broadcast values along this axis, if aligning two objects of
different dimensions.
Returns:
(left, right) – Aligned objects.
Return type:
(DataFrame, type of other)
Examples
>>> df = pd.DataFrame(
...     [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
... )
>>> other = pd.DataFrame(
...     [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
...     columns=["A", "B", "C", "D"],
...     index=[2, 3, 4],
... )
>>> df
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
>>> other
    A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900
Align on columns:
>>> left, right = df.align(other, join="outer", axis=1)
>>> left
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
>>> right
    A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN
We can also align on the index:
>>> left, right = df.align(other, join="outer", axis=0)
>>> left
     D    B    E    A
1  1.0  2.0  3.0  4.0
2  6.0  7.0  8.0  9.0
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
>>> right
       A      B      C      D
1    NaN    NaN    NaN    NaN
2   10.0   20.0   30.0   40.0
3   60.0   70.0   80.0   90.0
4  600.0  700.0  800.0  900.0
Finally, the default axis=None will align on both index and columns:
>>> left, right = df.align(other, join="outer", axis=None)
>>> left
     A    B   C    D    E
1  4.0  2.0 NaN  1.0  3.0
2  9.0  7.0 NaN  6.0  8.0
3  NaN  NaN NaN  NaN  NaN
4  NaN  NaN NaN  NaN  NaN
>>> right
       A      B      C      D   E
1    NaN    NaN    NaN    NaN NaN
2   10.0   20.0   30.0   40.0 NaN
3   60.0   70.0   80.0   90.0 NaN
4  600.0  700.0  800.0  900.0 NaN
Return whether all elements are True, potentially over an axis.
Returns True unless there is at least one element within a Series or
along a DataFrame axis that is False or equivalent (e.g. zero or
empty).
Parameters:
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced. For Series this parameter
is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the
original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the
original index.
None : reduce all axes, return a scalar.
bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything,
then use only boolean data. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is
True, then the result will be True, as for an empty row/column.
If skipna is False, then NA are treated as True, because these are not
equal to zero.
level (int or level name, default None) –
If the axis is a MultiIndex (hierarchical), count along a
particular level, collapsing into a Series.
Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for
compatibility with NumPy.
Returns:
If level is specified, a DataFrame is returned; otherwise, a Series
is returned.
Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element within a Series or
along a DataFrame axis that is True or equivalent (e.g. non-zero or
non-empty).
Parameters:
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced. For Series this parameter
is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the
original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the
original index.
None : reduce all axes, return a scalar.
bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything,
then use only boolean data. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is
True, then the result will be False, as for an empty row/column.
If skipna is False, then NA are treated as True, because these are not
equal to zero.
level (int or level name, default None) –
If the axis is a MultiIndex (hierarchical), count along a
particular level, collapsing into a Series.
Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for
compatibility with NumPy.
Returns:
If level is specified, a DataFrame is returned; otherwise, a Series
is returned.
Return type:
Series or DataFrame
See also
numpy.any
Numpy version of this method.
Series.any
Return whether any element is True.
Series.all
Return whether all elements are True.
DataFrame.any
Return whether any element is True over requested axis.
DataFrame.all
Return whether all elements are True over requested axis.
Examples
Series
For Series input, the output is a scalar indicating whether any element
is True.
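A minimal sketch of this behaviour, with a DataFrame added to show the axis options described in the parameters. The sample values are chosen for illustration only.

```python
import pandas as pd

# Series input: the result is a single boolean scalar.
s_any = pd.Series([False, False, True]).any()
s_none = pd.Series([False, False]).any()
s_empty = pd.Series([], dtype=bool).any()  # empty input gives False

# DataFrame input: reduce per column (default), per row, or over everything.
df = pd.DataFrame({"A": [True, False], "B": [False, False]})
by_column = df.any()           # Series indexed by the column labels
by_row = df.any(axis=1)        # Series indexed by the original index
overall = df.any(axis=None)    # single scalar

print(s_any, s_none, s_empty)
print(by_column)
print(by_row)
print(overall)
```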
Append rows of other to the end of caller, returning a new object.
Deprecated since version 1.4.0: Use concat() instead. For further details see
whatsnew_140.deprecations.frame_series_append
Columns in other that are not in the caller are added as new columns.
Parameters:
other (DataFrame or Series/dict-like object, or list of these) – The data to append.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
verify_integrity (bool, default False) – If True, raise ValueError on creating index with duplicates.
sort (bool, default False) –
Sort columns if the columns of self and other are not aligned.
Changed in version 1.0.0: Changed to not sort by default.
Returns:
A new DataFrame consisting of the rows of caller and the rows of other.
Return type:
DataFrame
See also
concat
General function to concatenate DataFrame or Series objects.
Notes
If a list of dict/series is passed and the keys are all contained in
the DataFrame’s index, the order of the columns in the resulting
DataFrame will be unchanged.
Iteratively appending rows to a DataFrame can be more computationally
intensive than a single concatenate. A better solution is to append
those rows to a list and then concatenate the list with the original
DataFrame all at once.
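The recommendation above can be sketched as follows. The column names and row values are invented for illustration; pd.concat is the replacement named in the deprecation note.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

# Collect the new rows in a plain Python list first ...
new_rows = []
for i in range(3):
    new_rows.append({"x": 10 + i, "y": f"row{i}"})

# ... then concatenate once, instead of appending inside the loop.
result = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
print(result)
```

With ignore_index=True the result is labeled 0, 1, …, n - 1, which mirrors the ignore_index parameter of append.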
Objects passed to the function are Series objects whose index is
either the DataFrame’s index (axis=0) or the DataFrame’s columns
(axis=1). By default (result_type=None), the final return type
is inferred from the return type of the applied function. Otherwise,
it depends on the result_type argument.
Parameters:
func (function) – Function to apply to each column or row.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Axis along which the function is applied:
0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.
raw (bool, default False) –
Determines if row or column is passed as a Series or ndarray object:
False : passes each row or column as a Series to the
function.
True : the passed function will receive ndarray objects
instead.
If you are just applying a NumPy reduction function this will
achieve much better performance.
result_type ({'expand', 'reduce', 'broadcast', None}, default None) –
These only act when axis=1 (columns):
‘expand’ : list-like results will be turned into columns.
‘reduce’ : returns a Series if possible rather than expanding
list-like results. This is the opposite of ‘expand’.
‘broadcast’ : results will be broadcast to the original shape
of the DataFrame, the original index and columns will be
retained.
The default behaviour (None) depends on the return value of the
applied function: list-like results will be returned as a Series
of those. However if the apply function returns a Series these
are expanded to columns.
args (tuple) – Positional arguments to pass to func in addition to the
array/series.
**kwargs – Additional keyword arguments to pass to
func.
Returns:
Result of applying func along the given axis of the
DataFrame.
Return type:
Series or DataFrame
See also
DataFrame.applymap
For elementwise operations.
DataFrame.aggregate
Only perform aggregating type operations.
DataFrame.transform
Only perform transforming type operations.
Notes
Functions that mutate the passed object can produce unexpected
behavior or errors and are not supported. See gotchas.udf-mutation
for more details.
Passing result_type='broadcast' will ensure the same shape
result, whether list-like or scalar is returned by the function,
and broadcast it along the axis. The resulting column names will
be the originals.
>>> df.apply(lambda x: [1, 2], axis=1, result_type='broadcast')
   A  B
0  1  2
1  1  2
2  1  2
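For comparison with the 'broadcast' case, here is a short sketch of the default behaviour and of result_type='expand'. The frame df is an assumption: a three-row frame with columns A and B, consistent with the output shown above.

```python
import pandas as pd

# Assumed sample frame: three identical rows, columns A and B.
df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

# Default (result_type=None): list-like results come back as a Series of lists.
as_lists = df.apply(lambda row: [1, 2], axis=1)

# 'expand': list-like results are turned into columns.
expanded = df.apply(lambda row: [1, 2], axis=1, result_type="expand")

# Returning a Series from the function also expands to columns,
# and the Series index supplies the new column names.
named = df.apply(lambda row: pd.Series([1, 2], index=["foo", "bar"]), axis=1)

print(as_lists)
print(expanded)
print(named)
```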
Returns the original data conformed to a new index with the specified
frequency.
If the index of this DataFrame is a PeriodIndex, the new index
is the result of transforming the original index with
PeriodIndex.asfreq (so the original index
will map one-to-one to the new index).
Otherwise, the new index will be equivalent to pd.date_range(start, end, freq=freq) where start and end are, respectively, the first and
last entries in the original index (see pandas.date_range()). The
values corresponding to any timesteps in the new index which were not present
in the original index will be null (NaN), unless a method for filling
such unknowns is provided (see the method parameter below).
The resample() method is more appropriate if an operation on each group of
timesteps (such as an aggregate) is necessary to represent the data at the new
frequency.
Parameters:
freq (DateOffset or str) – Frequency DateOffset or string.
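A small sketch of upsampling with asfreq, assuming a daily index converted to a 12-hour frequency. The data and the choice of method="bfill" are illustrative.

```python
import pandas as pd

# Daily observations (illustrative values).
index = pd.date_range("2000-01-01", periods=3, freq="D")
df = pd.DataFrame({"s": [0.0, 1.0, 2.0]}, index=index)

# Conform to a 12-hour frequency: timestamps that were not in the
# original index receive NaN.
upsampled = df.asfreq(freq="12h")

# Supplying a fill method populates the newly introduced timestamps.
filled = df.asfreq(freq="12h", method="bfill")

print(upsampled)
print(filled)
```

No aggregation happens here; when each group of timesteps has to be summarised, resample() is the method the text points to.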
Return the last row(s) without any NaNs before where.
The last row (for each element in where, if list) without any
NaN is taken.
In the case of a DataFrame, the last row without NaN is taken,
considering only the subset of columns (if not None).
If there is no good value, NaN is returned for a Series, or
a Series of NaN values for a DataFrame.
Parameters:
where (date or array-like of dates) – Date(s) before which the last row(s) are returned.
subset (str or array-like of str, default None) – For DataFrame, if not None, only use these columns to
check for NaNs.
Returns:
The return can be:
scalar : when self is a Series and where is a scalar
Series: when self is a Series and where is an array-like,
or when self is a DataFrame and where is a scalar
DataFrame : when self is a DataFrame and where is an
array-like
Return scalar, Series, or DataFrame.
Return type:
scalar, Series, or DataFrame
See also
merge_asof
Perform an asof merge. Similar to left join.
Notes
Dates are assumed to be sorted. Raises if this is not the case.
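The return shapes listed above can be illustrated with a sorted integer index standing in for dates. This is an assumption made to keep the sketch short; the values are invented.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40])

scalar = s.asof(20)          # Series + scalar where -> scalar
several = s.asof([5, 20])    # Series + list where -> Series
skipped = s.asof(30)         # the NaN at 30 is skipped, so 20's value is used

df = pd.DataFrame(
    {"a": [10.0, 20.0, 30.0], "b": [np.nan, np.nan, 300.0]},
    index=[1, 2, 3],
)
# Only column "a" is checked for NaN, so the row labelled 2 qualifies.
row = df.asof(2, subset=["a"])   # DataFrame + scalar where -> Series

print(scalar, skipped)
print(several)
print(row)
```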
Returns a new object with all original columns in addition to new ones.
Existing columns that are re-assigned will be overwritten.
Parameters:
**kwargs (dict of {str: callable or Series}) – The column names are keywords. If the values are
callable, they are computed on the DataFrame and
assigned to the new columns. The callable must not
change input DataFrame (though pandas doesn’t check it).
If the values are not callable, (e.g. a Series, scalar, or array),
they are simply assigned.
Returns:
A new DataFrame with the new columns in addition to
all the existing columns.
Return type:
DataFrame
Notes
Assigning multiple columns within the same assign is possible.
Later items in ‘**kwargs’ may refer to newly created or modified
columns in ‘df’; items are computed and assigned into ‘df’ in order.
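As a sketch of the ordering rule in the notes, the second keyword below refers to a column created by the first. The column names and temperatures are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"temp_c": [17.0, 25.0]}, index=["Portland", "Berkeley"])

# temp_k is computed from temp_f, which is created earlier in the same call.
result = df.assign(
    temp_f=lambda x: x["temp_c"] * 9 / 5 + 32,
    temp_k=lambda x: (x["temp_f"] + 459.67) * 5 / 9,
)

print(result)
print(df.columns.tolist())  # the original frame is left unchanged
```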
dtype (data type, or dict of column name -> data type) – Use a numpy.dtype or Python type to cast entire pandas object to
the same type. Alternatively, use {col: dtype, …}, where col is a
column label and dtype is a numpy.dtype or Python type to cast one
or more of the DataFrame’s columns to column-specific types.
copy (bool, default True) – Return a copy when copy=True (be very careful setting
copy=False as changes to values then may propagate to other
pandas objects).
errors ({'raise', 'ignore'}, default 'raise') –
Control raising of exceptions on invalid data for provided dtype.
raise : allow exceptions to be raised
ignore : suppress exceptions. On error return original object.
Returns:
casted
Return type:
same type as caller
See also
to_datetime
Convert argument to datetime.
to_timedelta
Convert argument to timedelta.
to_numeric
Convert argument to a numeric type.
numpy.ndarray.astype
Cast a numpy array to a specified type.
Notes
Deprecated since version 1.3.0: Using astype to convert from timezone-naive dtype to
timezone-aware dtype is deprecated and will raise in a
future version. Use Series.dt.tz_localize() instead.
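The two forms of the dtype argument described above, sketched on an invented two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# A single dtype casts the entire object.
all_int32 = df.astype("int32")

# A {column: dtype} mapping casts only the listed columns.
per_column = df.astype({"col1": "int32"})

print(all_int32.dtypes)
print(per_column.dtypes)
```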
Similar to reset_index, but the index is dropped if its name is None.
Parameters:
level (Optional[Union[Hashable, Sequence[Hashable]]]) – only remove the given levels from the index. Removes all
levels by default.
inplace (bool) – whether to modify the DataFrame in place rather than creating a new one.
col_level (Hashable) – if the columns have multiple levels, determines which
level the labels are inserted into. By default, it is inserted
into the first level.
col_fill (Hashable) – if the columns have multiple levels, determines how the
other levels are named. If None then the index name is repeated.
Returns:
DataFrame or None. Changed row labels or None if inplace=True.
Select values between particular times of the day (e.g., 9:00-9:30 AM).
By setting start_time to be later than end_time,
you can get the times that are not between the two times.
Parameters:
start_time (datetime.time or str) – Initial time as a time filter limit.
end_time (datetime.time or str) – End time as a time filter limit.
include_start (bool, default True) –
Whether the start time needs to be included in the result.
Deprecated since version 1.4.0: Arguments include_start and include_end have been deprecated
to standardize boundary inputs. Use inclusive instead, to set
each bound as closed or open.
include_end (bool, default True) –
Whether the end time needs to be included in the result.
Deprecated since version 1.4.0: Arguments include_start and include_end have been deprecated
to standardize boundary inputs. Use inclusive instead, to set
each bound as closed or open.
inclusive ({"both", "neither", "left", "right"}, default "both") – Include boundaries; whether to set each bound as closed or open.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to filter on the times in the index or in the columns.
For Series this parameter is unused and defaults to 0.
Returns:
Data from the original object filtered to the specified time range.
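A brief sketch of the selection rules, including the reversed-bounds case. The timestamps are invented, and the inclusive argument assumes pandas 1.4 or later, as the deprecation notes indicate.

```python
import pandas as pd

# Four timestamps on different days (illustrative).
index = pd.to_datetime(
    ["2018-04-09 00:00", "2018-04-10 00:20", "2018-04-11 00:40", "2018-04-12 01:00"]
)
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=index)

# Rows whose time of day falls between 00:15 and 00:45.
inside = ts.between_time("0:15", "0:45")

# start_time later than end_time selects the complement.
outside = ts.between_time("0:45", "0:15")

# Close only the left bound (assumes pandas >= 1.4 for `inclusive`).
left_closed = ts.between_time("0:20", "0:40", inclusive="left")

print(inside)
print(outside)
print(left_closed)
```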
Return the bool of a single element Series or DataFrame.
This must be a boolean scalar value, either True or False. It will raise a
ValueError if the Series or DataFrame does not have exactly 1 element, or that
element is not boolean (integer values 0 and 1 will also raise an exception).
Returns:
The value in the Series or DataFrame.
Return type:
bool
See also
Series.astype
Change the data type of a Series, including to boolean.
DataFrame.astype
Change the data type of a DataFrame, including to boolean.
numpy.bool_
NumPy boolean data type, used by pandas for boolean values.
Examples
The method will only work for single element objects with a boolean value:
Make a box-and-whisker plot from DataFrame columns, optionally grouped
by some other columns. A box plot is a method for graphically depicting
groups of numerical data through their quartiles.
The box extends from the Q1 to Q3 quartile values of the data,
with a line at the median (Q2). The whiskers extend from the edges
of the box to show the range of the data. By default, they extend no more than
1.5 * IQR (IQR = Q3 - Q1) from the edges of the box, ending at the farthest
data point within that interval. Outliers are plotted as separate dots.
For further details see
Wikipedia’s entry for boxplot.
Parameters:
column (str or list of str, optional) – Column name or list of names, or vector.
Can be any valid input to pandas.DataFrame.groupby().
by (str or array-like, optional) – Column in the DataFrame to pandas.DataFrame.groupby().
One box-plot will be done per value of columns in by.
ax (object of class matplotlib.axes.Axes, optional) – The matplotlib axes to be used by boxplot.
fontsize (float or str) – Tick label font size in points or as a string (e.g., large).
rot (int or float, default 0) – The rotation angle of labels (in degrees)
with respect to the screen coordinate system.
grid (bool, default True) – Setting this to True will show the grid.
figsize (A tuple (width, height) in inches) – The size of the figure to create in matplotlib.
layout (tuple (rows, columns), optional) – For example, (3, 5) will display the subplots
using 3 rows and 5 columns, starting from the top-left.
return_type ({'axes', 'dict', 'both'} or None, default 'axes') –
The kind of object to return. The default is axes.
‘axes’ returns the matplotlib axes the boxplot is drawn on.
‘dict’ returns a dictionary whose values are the matplotlib
Lines of the boxplot.
‘both’ returns a namedtuple with the axes and dict.
When grouping with by, a Series mapping columns to
return_type is returned.
If return_type is None, a NumPy array
of axes with the same shape as layout is returned.
backend (str, default None) –
Backend to use instead of the backend specified in the option
plotting.backend. For instance, ‘matplotlib’. Alternatively, to
specify the plotting.backend for the whole session, set
pd.options.plotting.backend.
New in version 1.0.0.
**kwargs – All other plotting keyword arguments to be passed to
matplotlib.pyplot.boxplot().
Returns:
See Notes.
Return type:
result
See also
Series.plot.hist
Make a histogram.
matplotlib.pyplot.boxplot
Matplotlib equivalent plot.
Notes
The return type depends on the return_type parameter:
‘axes’ : object of class matplotlib.axes.Axes
‘dict’ : dict of matplotlib.lines.Line2D objects
‘both’ : a namedtuple with structure (ax, lines)
For data grouped with by, return a Series of the above or a numpy
array:
Series
array (for return_type=None)
Use return_type='dict' when you want to tweak the appearance
of the lines after plotting. In this case a dict containing the Lines
making up the boxes, caps, fliers, medians, and whiskers is returned.
Examples
Boxplots can be created for every column in the dataframe
by df.boxplot() or indicating the columns to be used:
Boxplots of variables distributions grouped by the values of a third
variable can be created using the option by. For instance:
A list of strings (e.g. ['X', 'Y']) can be passed to boxplot
in order to group the data by combination of the variables in the x-axis:
The layout of boxplot can be adjusted giving a tuple to layout:
Additional formatting can be done to the boxplot, like suppressing the grid
(grid=False), rotating the labels in the x-axis (e.g. rot=45)
or changing the fontsize (e.g. fontsize=15):
The parameter return_type can be used to select the type of element
returned by boxplot. When return_type='axes' is selected,
the matplotlib axes on which the boxplot is drawn are returned:
Automatically join the dataframe with an input dataframe, using the index column names as foreign keys.
If cdf has a column with the same name as the index column of the self dataframe, that column is used as a
foreign key and the join is performed. Likewise, if the self dataframe has a column with the same name as the
index column of cdf, that column is used as a foreign key and the join is performed.
Parameters:
cdf (COCODataFrame) – the COCODataFrame to join with.
Perform column-wise combine with another DataFrame.
Combines a DataFrame with other DataFrame using func
to element-wise combine columns. The row and column indexes of the
resulting DataFrame will be the union of the two.
Parameters:
other (DataFrame) – The DataFrame to merge column-wise.
func (function) – Function that takes two Series as inputs and returns a Series or a
scalar. Used to merge the two DataFrames column by column.
fill_value (scalar value, default None) – The value to fill NaNs with prior to passing any column to the
merge func.
overwrite (bool, default True) – If True, columns in self that do not exist in other will be
overwritten with NaNs.
Returns:
Combination of the provided DataFrames.
Return type:
DataFrame
See also
DataFrame.combine_first
Combine two DataFrame objects and default to non-null values in frame calling the method.
Examples
Combine using a simple function that chooses the smaller column.
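The later examples call a helper named take_smaller whose definition does not appear in this text. A plausible sketch follows, assuming it keeps whichever column has the smaller sum; the frames df1 and df2 are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 0], "B": [4, 4]})
df2 = pd.DataFrame({"A": [1, 1], "B": [3, 3]})


def take_smaller(s1, s2):
    # Assumed definition: return the whole column with the smaller sum.
    return s1 if s1.sum() < s2.sum() else s2


# Column A comes from df1 (sum 0 < 2), column B from df2 (sum 6 < 8).
combined = df1.combine(df2, take_smaller)
print(combined)
```

The function receives whole columns (Series), so it picks one column over the other rather than comparing individual elements.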
Example using a true element-wise combine function.
>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, np.minimum)
   A  B
0  1  2
1  0  3
Using fill_value fills Nones prior to passing the column to the
merge function.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  4.0
However, if the same element in both dataframes is None, that None
is preserved:
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  3.0
Example that demonstrates the use of overwrite and behavior when
the axes differ between the dataframes.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2])
>>> df1.combine(df2, take_smaller)
     A    B     C
0  NaN  NaN   NaN
1  NaN  3.0 -10.0
2  NaN  3.0   1.0
>>> df1.combine(df2, take_smaller, overwrite=False)
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0
Demonstrating the preference of the passed in dataframe.
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2])
>>> df2.combine(df1, take_smaller)
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 NaN
2  NaN  3.0 NaN
>>> df2.combine(df1, take_smaller, overwrite=False)
     A    B    C
0  0.0  NaN  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
Update null elements with value in the same location in other.
Combine two DataFrame objects by filling null values in one DataFrame
with non-null values from other DataFrame. The row and column indexes
of the resulting DataFrame will be the union of the two. The resulting
DataFrame keeps the values of the ‘first’ DataFrame, which override the
values of the second wherever both first.loc[index, col] and
second.loc[index, col] are not missing values, upon calling
first.combine_first(second).
Parameters:
other (DataFrame) – Provided DataFrame to use to fill null values.
Returns:
The result of combining the provided DataFrame with the other object.
Return type:
DataFrame
See also
DataFrame.combine
Perform series-wise operation on two DataFrames using a given function.
Examples
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2)
     A    B
0  1.0  3.0
1  0.0  4.0
Null values still persist if the location of that null value
does not exist in other:
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
Compare to another DataFrame and show the differences.
New in version 1.1.0.
Parameters:
other (DataFrame) – Object to compare with.
align_axis ({0 or 'index', 1 or 'columns'}, default 1) –
Determine which axis to align the comparison on.
0, or ‘index’ : Resulting differences are stacked vertically
with rows drawn alternately from self and other.
1, or ‘columns’ : Resulting differences are aligned horizontally
with columns drawn alternately from self and other.
keep_shape (bool, default False) – If true, all rows and columns are kept.
Otherwise, only the ones with different values are kept.
keep_equal (bool, default False) – If true, the result keeps values that are equal.
Otherwise, equal values are shown as NaNs.
result_names (tuple, default ('self', 'other')) –
Set the dataframes names in the comparison.
New in version 1.5.0.
Returns:
DataFrame that shows the differences stacked side by side.
The resulting index will be a MultiIndex with ‘self’ and ‘other’
stacked alternately at the inner level.
Return type:
DataFrame
Raises:
ValueError – When the two DataFrames don’t have identical labels or shape.
See also
Series.compare
Compare with another Series and show differences.
DataFrame.equals
Test whether two objects contain the same elements.
Notes
Matching NaNs will not appear as a difference.
Can only compare identically-labeled
(i.e. same shape, identical row and column labels) DataFrames.
Examples
>>> df = pd.DataFrame(
...     {
...         "col1": ["a", "a", "b", "b", "a"],
...         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
...     },
...     columns=["col1", "col2", "col3"],
... )
>>> df
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0
>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0
Align the differences on columns
>>> df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
Assign result_names
>>> df.compare(df2, result_names=("left", "right"))
  col1       col3
  left right left right
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
Stack the differences on rows
>>> df.compare(df2, align_axis=0)
        col1  col3
0 self     a   NaN
  other    c   NaN
2 self   NaN   3.0
  other  NaN   4.0
Keep the equal values
>>> df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0
Keep all original rows and columns
>>> df.compare(df2, keep_shape=True)
  col1       col2       col3
  self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN
Keep all original rows and columns and also all original values
>>> df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0
Convert columns to best possible dtypes using dtypes supporting pd.NA.
New in version 1.0.0.
Parameters:
infer_objects (bool, default True) – Whether object dtypes should be converted to the best possible types.
convert_string (bool, default True) – Whether object dtypes should be converted to StringDtype().
convert_integer (bool, default True) – Whether, if possible, conversion can be done to integer extension types.
convert_boolean (bool, default True) – Whether object dtypes should be converted to BooleanDtype().
convert_floating (bool, default True) –
Whether, if possible, conversion can be done to floating extension types.
If convert_integer is also True, preference will be given to integer
dtypes if the floats can be faithfully cast to integers.
By default, convert_dtypes will attempt to convert a Series (or each
Series in a DataFrame) to dtypes that support pd.NA. By using the options
convert_string, convert_integer, convert_boolean and
convert_floating, it is possible to turn off individual conversions
to StringDtype, the integer extension types, BooleanDtype
or floating extension types, respectively.
For object-dtyped columns, if infer_objects is True, use the inference
rules as during normal Series/DataFrame construction. Then, if possible,
convert to StringDtype, BooleanDtype or an appropriate integer
or floating extension type, otherwise leave as object.
If the dtype is integer, convert to an appropriate integer extension type.
If the dtype is numeric, and consists of all integers, convert to an
appropriate integer extension type. Otherwise, convert to an
appropriate floating extension type.
Changed in version 1.2: Starting with pandas 1.2, this method also converts float columns
to the nullable floating extension type.
In the future, as new dtypes are added that support pd.NA, the results
of this method will change to support those new dtypes.
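The conversion rules above can be sketched on a small frame. The column names and values are invented, and the Float64 result assumes pandas 1.2 or later, per the version note.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ints_as_float": [1.0, 2.0, np.nan],   # whole numbers stored as float
        "floats": [1.5, np.nan, 3.0],          # genuine fractional values
        "flags": pd.Series([True, False, None], dtype=object),
    }
)

# All conversions enabled (the defaults).
converted = df.convert_dtypes()

# Turning off the integer conversion leaves whole-number floats as Float64.
floats_kept = df.convert_dtypes(convert_integer=False)

print(converted.dtypes)
print(floats_kept.dtypes)
```

Missing entries are preserved as pd.NA in the converted columns.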
Compute pairwise correlation of columns, excluding NA/null values.
Parameters:
method ({'pearson', 'kendall', 'spearman'} or callable) –
Method of correlation:
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: callable with input two 1d ndarrays
and returning a float. Note that the returned matrix from corr
will have 1 along the diagonals and will be symmetric
regardless of the callable’s behavior.
min_periods (int, optional) – Minimum number of observations required per pair of columns
to have a valid result. Currently only available for Pearson
and Spearman correlation.
numeric_only (bool, default True) –
Include only float, int or boolean data.
New in version 1.5.0.
Deprecated since version 1.5.0: The default value of numeric_only will be False in a future
version of pandas.
Returns:
Correlation matrix.
Return type:
DataFrame
See also
DataFrame.corrwith
Compute pairwise correlation with another DataFrame or Series.
Series.corr
Compute the correlation between two Series.
Notes
Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.
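As a hedged illustration of the method parameter, the sketch below compares the built-in Pearson method with a callable. The helper sign_agreement is a hypothetical similarity measure invented for this example, not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.0],
        "z": [4.0, 3.0, 2.0, 1.0],
    }
)

# Built-in method: standard correlation coefficient.
pearson = df.corr(method="pearson")


def sign_agreement(a, b):
    """Hypothetical measure: share of positions on the same side of the mean."""
    return float(np.mean(np.sign(a - a.mean()) == np.sign(b - b.mean())))


# A callable receives two 1d ndarrays and returns a float; the diagonal
# is set to 1 and the matrix is symmetric regardless of the callable.
custom = df.corr(method=sign_agreement)
print(pearson)
print(custom)
```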
Pairwise correlation is computed between rows or columns of
DataFrame with rows or columns of Series or DataFrame. DataFrames
are first aligned along both axes before computing the
correlations.
Parameters:
other (DataFrame, Series) – Object with which to compute correlations.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ to compute row-wise, 1 or ‘columns’ for
column-wise.
drop (bool, default False) – Drop missing indices from result.
method ({'pearson', 'kendall', 'spearman'} or callable) –
Method of correlation:
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: callable with input two 1d ndarrays
and returning a float.
numeric_only (bool, default True) –
Include only float, int or boolean data.
New in version 1.5.0.
Deprecated since version 1.5.0: The default value of numeric_only will be False in a future
version of pandas.
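A small sketch of corrwith, using made-up frames named left and right, shows alignment, the effect of drop, and correlation against a single Series.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, 3.0, 2.0, 1.0]})
right = pd.DataFrame(
    {
        "a": [2.0, 4.0, 6.0, 8.0],
        "b": [1.0, 2.0, 3.0, 4.0],
        "c": [0.0, 1.0, 0.0, 1.0],  # no matching column in `left`
    }
)

# Columns are aligned by label; 'c' has no partner and yields NaN.
by_column = left.corrwith(right)

# drop=True removes labels that are missing from either object.
dropped = left.corrwith(right, drop=True)

# A Series is correlated with every column of the frame.
with_series = left.corrwith(right["a"])
print(by_column, dropped, with_series, sep="\n")
```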
The values None, NaN, NaT, and optionally numpy.inf (depending
on pandas.options.mode.use_inf_as_na) are considered NA.
Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column.
If 1 or ‘columns’ counts are generated for each row.
level (int or str, optional) – If the axis is a MultiIndex (hierarchical), count along a
particular level, collapsing into a DataFrame.
A str specifies the level name.
numeric_only (bool, default False) – Include only float, int or boolean data.
Returns:
For each column/row the number of non-NA/null entries.
If level is specified returns a DataFrame.
Return type:
Series or DataFrame
See also
Series.count
Number of non-NA elements in a Series.
DataFrame.value_counts
Count unique combinations of columns.
DataFrame.shape
Number of DataFrame rows and columns (including NA elements).
DataFrame.isna
Boolean same-sized DataFrame showing places of NA elements.
Examples
Constructing DataFrame from a dictionary:
>>> df = pd.DataFrame({"Person":
...                    ["John", "Myla", "Lewis", "John", "Myla"],
...                    "Age": [24., np.nan, 21., 33, 26],
...                    "Single": [False, True, True, True, False]})
>>> df
   Person   Age  Single
0    John  24.0   False
1    Myla   NaN    True
2   Lewis  21.0    True
3    John  33.0    True
4    Myla  26.0   False
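Counting on a frame like the one above might look as follows. This sketch rebuilds the frame so that it is self-contained and counts along each axis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Person": ["John", "Myla", "Lewis", "John", "Myla"],
        "Age": [24.0, np.nan, 21.0, 33, 26],
        "Single": [False, True, True, True, False],
    }
)

# Non-NA entries per column (the default, axis=0).
per_column = df.count()

# Non-NA entries per row.
per_row = df.count(axis="columns")
print(per_column)
print(per_row)
```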
Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame.
The returned data frame is the covariance matrix of the columns
of the DataFrame.
Both NA and null values are automatically excluded from the
calculation. (See the note below about bias from missing values.)
A threshold can be set for the minimum number of
observations for each value created. Comparisons with observations
below this threshold will be returned as NaN.
This method is generally used for the analysis of time series data to
understand the relationship between different measures
across time.
Parameters:
min_periods (int, optional) – Minimum number of observations required per pair of columns
to have a valid result.
ddof (int, default 1) –
Delta degrees of freedom. The divisor used in calculations
is N-ddof, where N represents the number of elements.
New in version 1.1.0.
numeric_only (bool, default True) –
Include only float, int or boolean data.
New in version 1.5.0.
Deprecated since version 1.5.0: The default value of numeric_only will be False in a future
version of pandas.
Returns:
The covariance matrix of the series of the DataFrame.
Return type:
DataFrame
See also
Series.cov
Compute covariance with another Series.
core.window.ewm.ExponentialMovingWindow.cov
Exponential weighted sample covariance.
core.window.expanding.Expanding.cov
Expanding sample covariance.
core.window.rolling.Rolling.cov
Rolling sample covariance.
Notes
Returns the covariance matrix of the DataFrame’s time series.
The covariance is normalized by N-ddof.
For DataFrames that have Series that are missing data (assuming that
data is missing at random)
the returned covariance matrix will be an unbiased estimate
of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable
because the estimated covariance matrix is not guaranteed to be positive
semi-definite. This could lead to estimated correlations having
absolute values which are greater than one, and/or a non-invertible
covariance matrix. See Estimation of covariance matrices for more details.
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(1000, 5),
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795
Minimum number of periods
This method also supports an optional min_periods keyword
that specifies the required minimum number of non-NA observations for
each column pair in order to have a valid result:
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(20, 3),
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan
>>> df.loc[df.index[5:10], 'b'] = np.nan
>>> df.cov(min_periods=12)
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202
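The ddof parameter is not exercised above. A minimal sketch with a tiny made-up frame shows how the divisor N - ddof changes the result, and how min_periods blanks out pairs with too few shared observations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

# Default ddof=1: divide by N - 1 (sample covariance).
sample = df.cov()

# ddof=0: divide by N (population covariance).
population = df.cov(ddof=0)

# With one missing value, only 3 observations involve 'x', so any
# pair that includes 'x' falls below min_periods=4 and becomes NaN.
gappy = df.copy()
gappy.loc[0, "x"] = np.nan
limited = gappy.cov(min_periods=4)
print(sample, population, limited, sep="\n")
```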
Return cumulative maximum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative
maximum.
Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’.
For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result
will be NA.
*args – Additional keywords have no effect but might be accepted for
compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for
compatibility with NumPy.
Return cumulative minimum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative
minimum.
Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’.
For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result
will be NA.
*args – Additional keywords have no effect but might be accepted for
compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for
compatibility with NumPy.
Return cumulative product over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative
product.
Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’.
For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result
will be NA.
*args – Additional keywords have no effect but might be accepted for
compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for
compatibility with NumPy.
Return cumulative sum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative
sum.
Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’.
For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result
will be NA.
*args – Additional keywords have no effect but might be accepted for
compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for
compatibility with NumPy.
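The four cumulative methods share the same axis and skipna semantics. The sketch below, on made-up data, shows them side by side on a Series with one missing value and on a small frame along axis=1.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 5.0, -1.0, 0.0])

# With skipna=True (default) the NaN is skipped and stays NaN in place.
running = pd.DataFrame(
    {
        "cumsum": s.cumsum(),
        "cumprod": s.cumprod(),
        "cummax": s.cummax(),
        "cummin": s.cummin(),
    }
)

# With skipna=False everything from the first NaN onward is NaN.
strict = s.cumsum(skipna=False)

# axis=1 accumulates across the columns of each row.
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
across = frame.cumsum(axis=1)
print(running, strict, across, sep="\n")
```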
Descriptive statistics include those that summarize the central
tendency, dispersion and shape of a
dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well
as DataFrame column sets of mixed data types. The output
will vary depending on what is provided. Refer to the notes
below for more detail.
Parameters:
percentiles (list-like of numbers, optional) – The percentiles to include in the output. All should
fall between 0 and 1. The default is
[.25,.5,.75], which returns the 25th, 50th, and
75th percentiles.
include ('all', list-like of dtypes or None (default), optional) –
A white list of data types to include in the result. Ignored
for Series. Here are the options:
’all’ : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the
provided data types.
To limit the result to numeric types submit
numpy.number. To limit it instead to object columns submit
the numpy.object_ data type (or the builtin object). Strings
can also be used in the style of
select_dtypes (e.g. df.describe(include=['O'])). To
select pandas categorical columns, use 'category'.
None (default) : The result will include all numeric columns.
exclude (list-like of dtypes or None (default), optional,) –
A black list of data types to omit from the result. Ignored
for Series. Here are the options:
A list-like of dtypes : Excludes the provided data types
from the result. To exclude numeric types submit
numpy.number. To exclude object columns submit the data
type numpy.object_ (or the builtin object). Strings can also be used in the style of
select_dtypes (e.g. df.describe(exclude=['O'])). To
exclude pandas categorical columns, use 'category'.
None (default) : The result will exclude nothing.
datetime_is_numeric (bool, default False) –
Whether to treat datetime dtypes as numeric. This affects statistics
calculated for the column. For DataFrame input, this also
controls whether datetime columns are included by default.
New in version 1.1.0.
Returns:
Summary statistics of the Series or Dataframe provided.
Return type:
Series or DataFrame
See also
DataFrame.count
Count number of non-NA/null observations.
DataFrame.max
Maximum of the values in the object.
DataFrame.min
Minimum of the values in the object.
DataFrame.mean
Mean of the values.
DataFrame.std
Standard deviation of the observations.
DataFrame.select_dtypes
Subset of a DataFrame including/excluding columns based on their dtype.
Notes
For numeric data, the result’s index will include count,
mean, std, min, max as well as lower, 50 and
upper percentiles. By default the lower percentile is 25 and the
upper percentile is 75. The 50 percentile is the
same as the median.
For object data (e.g. strings or timestamps), the result’s index
will include count, unique, top, and freq. The top
is the most common value. The freq is the most common value’s
frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the
count and top results will be arbitrarily chosen from
among those with the highest count.
For mixed data types provided via a DataFrame, the default is to
return only an analysis of numeric columns. If the dataframe consists
only of object and categorical data without any numeric columns, the
default is to return an analysis of both the object and categorical
columns. If include='all' is provided as an option, the result
will include a union of attributes of each type.
The include and exclude parameters can be used to limit
which columns in a DataFrame are analyzed for the output.
The parameters are ignored when analyzing a Series.
Describing all columns of a DataFrame regardless of data type.
>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN
Describing a column from a DataFrame by accessing it as
an attribute.
Excluding object columns from a DataFrame description.
>>> df.describe(exclude=[object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
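The percentiles and include parameters can be combined with the defaults described in the notes. This sketch rebuilds a three-column frame (one categorical, one numeric, one of strings) so that it runs on its own; the exact set of percentile rows may vary between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "categorical": pd.Categorical(["d", "e", "f"]),
        "numeric": [1, 2, 3],
        "object": ["a", "b", "c"],
    }
)

# Default: only numeric columns are analyzed.
default = df.describe()

# Custom percentiles, given as fractions between 0 and 1.
tails = df.describe(percentiles=[0.1, 0.9])

# Restrict the summary to categorical columns.
categories_only = df.describe(include=["category"])

# Union of the attributes of every column type.
everything = df.describe(include="all")
print(default, tails, categories_only, everything, sep="\n")
```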
Calculates the difference of a DataFrame element compared with another
element in the DataFrame (default is element in previous row).
Parameters:
periods (int, default 1) – Periods to shift for calculating difference, accepts negative
values.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Take difference over rows (0) or columns (1).
Returns:
First differences of the DataFrame.
Return type:
DataFrame
See also
DataFrame.pct_change
Percent change over given number of periods.
DataFrame.shift
Shift index by desired number of periods with an optional time freq.
Series.diff
First discrete difference of object.
Notes
For boolean dtypes, this uses operator.xor() rather than
operator.sub().
The result is calculated according to current dtype in DataFrame,
however dtype of the result is always float64.
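A short sketch of diff on made-up data shows the effect of periods and axis, along with the boolean case mentioned in the notes.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 4, 7], "b": [1, 1, 2, 3]})

# Difference with the previous row (periods=1, axis=0).
prev = df.diff()

# Negative periods compare with the following row instead.
nxt = df.diff(periods=-1)

# axis=1 takes the difference with the previous column.
across = df.diff(axis=1)

# Boolean data is differenced with xor rather than subtraction.
flags = pd.Series([True, False, True])
flag_changes = flags.diff()
print(prev, nxt, across, flag_changes, sep="\n")
```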
Get floating division of DataFrame and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value
for missing data in one of the inputs. The reverse version is rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Get floating division of DataFrame and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value
for missing data in one of the inputs. The reverse version is rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
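The difference between the / operator and div with fill_value can be sketched as follows. The frames are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [4.0, 8.0, 12.0]})
other = pd.DataFrame({"a": [2.0, 2.0, np.nan], "b": [2.0, 4.0, 6.0]})

# Plain operator: any missing input gives a missing result.
plain = df / other

# fill_value replaces a value missing on one side only; where both
# sides are missing, the result stays missing.
filled = df.div(other, fill_value=1)

# Reverse version: computes 24 / df instead of df / 24.
reverse = df.rtruediv(24)

# Divide each row by the matching element of a Series (match on index).
by_row = df.div(pd.Series([1.0, 2.0, 4.0]), axis=0)
print(plain, filled, reverse, by_row, sep="\n")
```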
Compute the matrix multiplication between the DataFrame and other.
This method computes the matrix product between the DataFrame and the
values of another Series, DataFrame or a numpy array.
It can also be called using self@other in Python >= 3.5.
Parameters:
other (Series, DataFrame or array-like) – The other object to compute the matrix product with.
Returns:
If other is a Series, return the matrix product between self and
other as a Series. If other is a DataFrame or a numpy.array, return
the matrix product of self and other as a DataFrame.
Return type:
Series or DataFrame
See also
Series.dot
Similar method for Series.
Notes
The dimensions of DataFrame and other must be compatible in order to
compute the matrix multiplication. In addition, the column names of
DataFrame and the index of other must contain the same values, as they
will be aligned prior to the multiplication.
The dot method for Series computes the inner product, instead of the
matrix product here.
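A minimal sketch of dot with illustrative values shows the Series and DataFrame cases and the alignment requirement described in the notes.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
s = pd.Series([1, 1, 2, 1])

# DataFrame . Series -> Series
as_series = df.dot(s)

# DataFrame @ DataFrame -> DataFrame (same as df.dot(other))
other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
as_frame = df @ other

# The columns of df and the index of the other object must hold the
# same labels; a relabelled Series is rejected even with a valid shape.
misaligned = s.copy()
misaligned.index = [10, 11, 12, 13]
try:
    df.dot(misaligned)
    alignment_error = False
except ValueError:
    alignment_error = True
print(as_series, as_frame, alignment_error, sep="\n")
```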
Remove rows or columns by specifying label names and corresponding
axis, or by specifying directly index or column names. When using a
multi-index, labels on different levels can be removed by specifying
the level. See the user guide
for more information about the now unused levels.
Parameters:
labels (single label or list-like) – Index or column labels to drop. A tuple will be used as a single
label and not treated as a list-like.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to drop labels from the index (0 or ‘index’) or
columns (1 or ‘columns’).
index (single label or list-like) – Alternative to specifying axis (labels, axis=0
is equivalent to index=labels).
columns (single label or list-like) – Alternative to specifying axis (labels, axis=1
is equivalent to columns=labels).
level (int or level name, optional) – For MultiIndex, level from which the labels will be removed.
inplace (bool, default False) – If False, return a copy. Otherwise, do operation
inplace and return None.
errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are
dropped.
Returns:
DataFrame without the removed index or column labels or
None if inplace=True.
Return type:
DataFrame or None
Raises:
KeyError – If any of the labels is not found in the selected axis.
See also
DataFrame.loc
Label-location based indexer for selection by label.
DataFrame.dropna
Return DataFrame with labels on given axis omitted where (all or any) data are missing.
DataFrame.drop_duplicates
Return DataFrame with duplicate rows removed, optionally only considering certain columns.
Series.drop
Return Series with specified index labels removed.
Examples
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
Drop columns
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11
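Dropping rows and the errors parameter are not shown in the examples. A hedged sketch on the same kind of frame, where the label 'Z' is deliberately absent:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["A", "B", "C", "D"])

# Drop rows by index label.
rows_dropped = df.drop(index=[0, 1])

# index= and columns= can be combined in one call.
mixed = df.drop(index=[0], columns=["A"])

# errors='ignore' silently skips labels that do not exist.
tolerant = df.drop(columns=["Z"], errors="ignore")

# With the default errors='raise', a missing label raises KeyError.
try:
    df.drop(columns=["Z"])
    raised = False
except KeyError:
    raised = True
print(rows_dropped, mixed, tolerant, raised, sep="\n")
```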
Drop a specific index combination from the MultiIndex
DataFrame, i.e., drop the combination 'falcon' and
'weight', which deletes only the corresponding row
Considering certain columns is optional. Indexes, including time indexes,
are ignored.
Parameters:
subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by
default use all of the columns.
keep ({'first', 'last', False}, default 'first') – Determines which duplicates (if any) to keep.
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
ignore_index (bool, default False) –
If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
Returns:
DataFrame with duplicates removed or None if inplace=True.
Return type:
DataFrame or None
See also
DataFrame.value_counts
Count unique combinations of columns.
Examples
Consider a dataset containing ramen ratings.
>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
By default, it removes duplicate rows based on all columns.
>>> df.drop_duplicates()
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
To remove duplicates on specific column(s), use subset.
>>> df.drop_duplicates(subset=['brand'])
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
To remove duplicates and keep last occurrences, use keep.
>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
     brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
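The keep=False and ignore_index options are not covered by the examples. A small sketch on the same ramen data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
        "style": ["cup", "cup", "cup", "pack", "pack"],
        "rating": [4, 4, 3.5, 15, 5],
    }
)

# keep=False drops every row that has a duplicate, including the first.
no_repeats = df.drop_duplicates(keep=False)

# ignore_index=True relabels the surviving rows 0, 1, ..., n - 1.
relabelled = df.drop_duplicates(ignore_index=True)
print(no_repeats)
print(relabelled)
```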
Return Series/DataFrame with requested index / column level(s) removed.
Parameters:
level (int, str, or list-like) – If a string is given, must be the name of a level.
If list-like, elements must be names or positional indexes
of levels.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Axis along which the level(s) is removed:
0 or ‘index’: remove level(s) from the row index.
1 or ‘columns’: remove level(s) from the column index.
For Series this parameter is unused and defaults to 0.
Returns:
Series/DataFrame with requested index / column level(s) removed.
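A minimal sketch of droplevel on a frame with a MultiIndex on both axes. The level names are made up for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([(1, 2), (5, 6), (9, 10)], names=["a", "b"])
columns = pd.MultiIndex.from_tuples(
    [("c", "e"), ("d", "f")], names=["level_1", "level_2"]
)
df = pd.DataFrame([[3, 4], [7, 8], [11, 12]], index=index, columns=columns)

# axis=0 (default): remove a level from the row index, by name.
rows = df.droplevel("a")

# axis=1: remove a level from the column index.
cols = df.droplevel("level_2", axis=1)
print(rows)
print(cols)
```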
See the User Guide for more on which values are
considered missing, and how to work with missing data.
Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Determine if rows or columns which contain missing values are
removed.
0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing values.
Changed in version 1.0.0: Pass tuple or list to drop on multiple axes.
Only a single axis is allowed.
how ({'any', 'all'}, default 'any') –
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.
’any’ : If any NA values are present, drop that row or column.
’all’ : If all values are NA, drop that row or column.
thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.
subset (column label or sequence of labels, optional) – Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
Returns:
DataFrame with NA entries dropped from it or None if inplace=True.
Return type:
DataFrame or None
See also
DataFrame.isna
Indicate missing values.
DataFrame.notna
Indicate existing (non-missing) values.
DataFrame.fillna
Replace missing values.
Series.dropna
Drop missing values.
Index.dropna
Drop missing indices.
Examples
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
Drop the rows where at least one element is missing.
>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25
Drop the columns where at least one element is missing.
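The axis, how, thresh and subset parameters can be sketched together on the name/toy/born frame, rebuilt here so that the block is self-contained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alfred", "Batman", "Catwoman"],
        "toy": [np.nan, "Batmobile", "Bullwhip"],
        "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT],
    }
)

# Drop the columns where at least one element is missing.
complete_columns = df.dropna(axis="columns")

# how='all' drops a row only if every value in it is missing.
not_all_missing = df.dropna(how="all")

# thresh keeps rows with at least that many non-NA values.
at_least_two = df.dropna(thresh=2)

# subset limits the check to the listed columns.
named_and_born = df.dropna(subset=["name", "born"])
print(complete_columns, not_all_missing, at_least_two, named_and_born, sep="\n")
```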
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
Test whether two objects contain the same elements.
This function allows two Series or DataFrames to be compared against
each other to see if they have the same shape and elements. NaNs in
the same location are considered equal.
The row/column index do not need to have the same type, as long
as the values are considered equal. Corresponding columns must be of
the same dtype.
Parameters:
other (Series or DataFrame) – The other Series or DataFrame to be compared with the first.
Returns:
True if all elements are the same in both objects, False
otherwise.
Return type:
bool
See also
Series.eq
Compare two Series objects of the same length and return a Series where each element is True if the element in each Series is equal, False otherwise.
DataFrame.eq
Compare two DataFrame objects of the same shape and return a DataFrame where each element is True if the respective element in each DataFrame is equal, False otherwise.
testing.assert_series_equal
Raises an AssertionError if left and right are not equal. Provides an easy interface to ignore inequality in dtypes, indexes and precision among others.
testing.assert_frame_equal
Like assert_series_equal, but targets DataFrames.
numpy.array_equal
Return True if two arrays have the same shape and elements, False otherwise.
DataFrames df and different_column_type have the same element
types and values, but have different types for the column labels,
which will still return True.
DataFrames df and different_data_type have different types for the
same values for their elements, and will return False even though
their column labels are the same values and types.
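The two cases described above can be sketched as follows. The frame names follow the text, and the NaN comparison at the end is an extra illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({1: [10], 2: [20]})

# Same values and types everywhere.
exactly_equal = pd.DataFrame({1: [10], 2: [20]})

# Column labels are floats instead of ints, elements are unchanged.
different_column_type = pd.DataFrame({1.0: [10], 2.0: [20]})

# Elements are floats instead of ints, labels are unchanged.
different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})

same = df.equals(exactly_equal)
same_despite_labels = df.equals(different_column_type)
differs_by_dtype = df.equals(different_data_type)

# NaNs in the same location are considered equal.
nan_equal = pd.Series([1.0, np.nan]).equals(pd.Series([1.0, np.nan]))
print(same, same_despite_labels, differs_by_dtype, nan_equal)
```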
Evaluate a string describing operations on DataFrame columns.
Operates on columns only, not specific rows or elements. This allows
eval to run arbitrary code, which can make you vulnerable to code
injection if you pass user input to this function.
Parameters:
expr (str) – The expression string to evaluate.
inplace (bool, default False) – If the expression contains an assignment, whether to perform the
operation inplace and mutate the existing DataFrame. Otherwise,
a new DataFrame is returned.
**kwargs – See the documentation for eval() for complete details
on the keyword arguments accepted by
query().
Returns:
The result of the evaluation or None if inplace=True.
Return type:
ndarray, scalar, pandas object, or None
See also
DataFrame.query
Evaluates a boolean expression to query the columns of a frame.
DataFrame.assign
Can evaluate an expression or function to create new values for a column.
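A short sketch of eval with and without an assignment. The frame and the column name C are illustrative. Because the expression string is executed, it should never be built from untrusted input.

```python
import pandas as pd

df = pd.DataFrame({"A": range(1, 6), "B": range(10, 0, -2)})

# A plain expression returns the computed values.
total = df.eval("A + B")

# An assignment returns a new DataFrame and leaves the original unchanged.
with_c = df.eval("C = A + B")
had_c_before = "C" in df.columns

# inplace=True mutates the frame and returns None.
result = df.eval("C = A + B", inplace=True)
has_c_after = "C" in df.columns
print(total, with_c, result, sep="\n")
```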
Exactly one of com, span, halflife, or alpha must be
provided if times is not provided. If times is provided,
halflife and one of com, span or alpha may be provided.
If times is specified, a timedelta convertible unit over which an
observation decays to half its value. Only applicable to mean(),
and halflife value will not apply to the other functions.
New in version 1.1.0.
alpha (float, optional) –
Specify smoothing factor \(\alpha\) directly
\(0 < \alpha \leq 1\).
min_periods (int, default 0) – Minimum number of observations in window required to have a value;
otherwise, result is np.nan.
adjust (bool, default True) –
Divide by decaying adjustment factor in beginning periods to account
for imbalance in relative weightings (viewing EWMA as a moving average).
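When adjust=True (default), the EW function is calculated using weights
\(w_i = (1 - \alpha)^i\). For example, the EW moving average of the series
[\(x_0, x_1, ..., x_t\)] would be:
\[y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ... + (1 - \alpha)^t x_0}{1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^t}\]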
When ignore_na=False (default), weights are based on absolute positions.
For example, the weights of \(x_0\) and \(x_2\) used in calculating
the final weighted average of [\(x_0\), None, \(x_2\)] are
\((1-\alpha)^2\) and \(1\) if adjust=True, and
\((1-\alpha)^2\) and \(\alpha\) if adjust=False.
When ignore_na=True, weights are based
on relative positions. For example, the weights of \(x_0\) and \(x_2\)
used in calculating the final weighted average of
[\(x_0\), None, \(x_2\)] are \(1-\alpha\) and \(1\) if
adjust=True, and \(1-\alpha\) and \(\alpha\) if adjust=False.
axis ({0, 1}, default 0) –
If 0 or 'index', calculate across the rows.
If 1 or 'columns', calculate across the columns.
For Series this parameter is unused and defaults to 0.
times (str, np.ndarray, Series, default None) –
New in version 1.1.0.
Only applicable to mean().
Times corresponding to the observations. Must be monotonically increasing and
datetime64[ns] dtype.
If 1-D array like, a sequence with the same shape as the observations.
Deprecated since version 1.4.0: If str, the name of the column in the DataFrame representing the times.
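The weighting described for adjust can be checked by hand. In this sketch the helpers manual_adjusted and manual_recursive are hypothetical names written for this example. The first applies the weights \(w_i = (1 - \alpha)^i\) directly. The second assumes the usual recursive form for adjust=False, \(y_0 = x_0\) and \(y_t = (1 - \alpha) y_{t-1} + \alpha x_t\).

```python
import numpy as np
import pandas as pd

alpha = 0.5
x = pd.Series([1.0, 2.0, 4.0])
values = x.to_numpy()


def manual_adjusted(values, alpha):
    """Weighted average with weights (1 - alpha)**i, newest value first."""
    out = []
    for t in range(len(values)):
        # Exponents t, t-1, ..., 0 line up with x_0, x_1, ..., x_t.
        weights = (1 - alpha) ** np.arange(t, -1, -1)
        out.append(np.dot(weights, values[: t + 1]) / weights.sum())
    return np.array(out)


def manual_recursive(values, alpha):
    """Recursive form assumed for adjust=False."""
    out = [values[0]]
    for v in values[1:]:
        out.append((1 - alpha) * out[-1] + alpha * v)
    return np.array(out)


adjusted = x.ewm(alpha=alpha, adjust=True).mean().to_numpy()
recursive = x.ewm(alpha=alpha, adjust=False).mean().to_numpy()
print(adjusted, manual_adjusted(values, alpha))
print(recursive, manual_recursive(values, alpha))
```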
Transform each element of a list-like to a row, replicating index values.
New in version 0.25.0.
Parameters:
column (IndexLabel) –
Column(s) to explode.
For multiple columns, specify a non-empty list in which each element
is a str or tuple; the list-like data of all specified columns
must have matching length on each row of the frame.
New in version 1.3.0: Multi-column explode
ignore_index (bool, default False) –
If True, the resulting index will be labeled 0, 1, …, n - 1.
New in version 1.1.0.
Returns:
Exploded lists to rows of the subset columns;
index will be duplicated for these rows.
Return type:
DataFrame
Raises:
ValueError –
* If columns of the frame are not unique.
* If specified columns to explode is an empty list.
* If specified columns to explode do not have a matching count of
elements rowwise in the frame.
See also
DataFrame.unstack
Pivot a level of the (necessarily hierarchical) index labels.
DataFrame.melt
Unpivot a DataFrame from wide format to long format.
Series.explode
Explode a DataFrame from list-like columns to long format.
Notes
This routine will explode list-likes including lists, tuples, sets,
Series, and np.ndarray. The result dtype of the subset rows will
be object. Scalars will be returned unchanged, and empty list-likes will
result in a np.nan for that row. In addition, the ordering of rows in the
output will be non-deterministic when exploding sets.
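The behaviour described in the notes (scalars unchanged, empty list-likes becoming NaN, duplicated index values) can be sketched on a small made-up frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [[0, 1, 2], "foo", [], [3, 4]],
        "B": 1,
        "C": [["a", "b", "c"], np.nan, [], ["d", "e"]],
    }
)

# Single column: index values are repeated for each exploded element.
single = df.explode("A")

# ignore_index=True relabels the result 0, 1, ..., n - 1.
renumbered = df.explode("A", ignore_index=True)

# Multiple columns: list lengths must match row by row.
multi = df.explode(["A", "C"])
print(single, renumbered, multi, sep="\n")
```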
value (scalar, dict, Series, or DataFrame) – Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for
each index (for a Series) or column (for a DataFrame). Values not
in the dict/Series/DataFrame will not be filled. This value cannot
be a list.
method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) – Method to use for filling holes in reindexed Series:
pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use next valid observation to fill gap.
axis ({0 or 'index', 1 or 'columns'}) – Axis along which to fill missing values. For Series
this parameter is unused and defaults to 0.
inplace (bool, default False) – If True, fill in-place. Note: this will modify any
other views on this object (e.g., a no-copy slice for a column in a
DataFrame).
limit (int, default None) – If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
downcast (dict, default is None) – A dict of item->dtype of what to downcast if possible,
or the string ‘infer’ which will try to downcast to an appropriate
equal type (e.g. float64 to int64 if possible).
Returns:
Object with missing values filled or None if inplace=True.
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0
Replace all NaN elements with 0s.
>>> df.fillna(0)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  0.0
3  0.0  3.0  0.0  4.0
We can also propagate non-null values forward or backward.
>>> df.fillna(method="ffill")
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  4.0 NaN  1.0
3  3.0  3.0 NaN  4.0
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1,
2, and 3 respectively.
>>> values = {"A": 0, "B": 1, "C": 2, "D": 3}
>>> df.fillna(value=values)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  2.0  1.0
2  0.0  1.0  2.0  3.0
3  0.0  3.0  2.0  4.0
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  NaN  1.0
2  NaN  1.0  NaN  3.0
3  NaN  3.0  NaN  4.0
When filling using a DataFrame, replacement happens along
the same column names and same indices.
>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
>>> df.fillna(df2)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  NaN
3  0.0  3.0  0.0  4.0
Note that column D is not affected since it is not present in df2.
Subset the dataframe rows or columns according to the specified index labels.
Note that this routine does not filter a dataframe on its
contents. The filter is applied to the labels of the index.
Parameters:
items (list-like) – Keep labels from axis which are in items.
like (str) – Keep labels from axis for which “like in label == True”.
regex (str (regular expression)) – Keep labels from axis for which re.search(regex, label) == True.
axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – The axis to filter on, expressed either as an index (int)
or axis name (str). By default this is the info axis, ‘columns’ for
DataFrame. For Series this parameter is unused and defaults to None.
Return type:
same type as input object
See also
DataFrame.loc
Access a group of rows and columns by label(s) or a boolean array.
Notes
The items, like, and regex parameters are
enforced to be mutually exclusive.
axis defaults to the info axis that is used when indexing
with [].
Examples
>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6
>>> # select columns by name
>>> df.filter(items=['one', 'three'])
         one  three
mouse     1      3
rabbit    4      6
>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
         one  three
mouse     1      3
rabbit    4      6
>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
         one  two  three
rabbit    4    5      6
Select initial periods of time series data based on a date offset.
For a DataFrame with dates as its index, this function can
select the first few rows based on a date offset.
Parameters:
offset (str, DateOffset or dateutil.relativedelta) – The offset length of the data that will be selected. For instance,
‘1M’ will display all the rows having their index within the first month.
Notice that the data for the first 3 calendar days were returned, not the first
3 days observed in the dataset, and therefore data for 2018-04-13 was
not returned.
Get Integer division of dataframe and other, element-wise (binary operator floordiv).
Equivalent to dataframe//other, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, rfloordiv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
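The effect of fill_value can be seen in a small sketch. The frames `left` and `right` below are hypothetical and not part of the original documentation. A value missing on only one side is replaced by fill_value before the division, while a position missing on both sides stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: (1, "a") is missing only on the left,
# (0, "b") is missing on both sides.
left = pd.DataFrame({"a": [7.0, np.nan], "b": [np.nan, 10.0]})
right = pd.DataFrame({"a": [2.0, 4.0], "b": [np.nan, 3.0]})

# Plain operator: any missing input gives a missing result.
plain = left // right

# Method form: a value missing on one side only is replaced by 1 first.
filled = left.floordiv(right, fill_value=1)
```

Here `filled` holds 0.0 at row 1 of column a (1 // 4), whereas `plain` holds NaN there; row 0 of column b is NaN in both results.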
The “orientation” of the data. If the keys of the passed dict
should be the columns of the resulting DataFrame, pass ‘columns’
(default). Otherwise if the keys should be rows, pass ‘index’.
If ‘tight’, assume a dict with keys [‘index’, ‘columns’, ‘data’,
‘index_names’, ‘column_names’].
New in version 1.4.0: ‘tight’ as an allowed value for the orient argument
dtype (dtype, default None) – Data type to force, otherwise infer.
columns (list, default None) – Column labels to use when orient='index'. Raises a ValueError
if used with orient='columns' or orient='tight'.
Return type:
DataFrame
See also
DataFrame.from_records
DataFrame from structured ndarray, sequence of tuples or dicts, or DataFrame.
DataFrame
DataFrame object creation using constructor.
DataFrame.to_dict
Convert the DataFrame to a dictionary.
Examples
By default the keys of the dict become the DataFrame columns:
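A minimal sketch of both orientations follows. The dictionaries `data` and `rows` and their labels are made up for illustration.

```python
import pandas as pd

# Default orient="columns": each dict key becomes a column label.
data = {"col_1": [3, 2, 1, 0], "col_2": ["a", "b", "c", "d"]}
by_columns = pd.DataFrame.from_dict(data)

# orient="index": each dict key becomes a row label; the columns
# argument is only accepted in this orientation.
rows = {"row_1": [3, 2, 1, 0], "row_2": ["a", "b", "c", "d"]}
by_index = pd.DataFrame.from_dict(
    rows, orient="index", columns=["A", "B", "C", "D"]
)
```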
Convert structured or record ndarray to DataFrame.
Creates a DataFrame object from a structured ndarray, sequence of
tuples or dicts, or DataFrame.
Parameters:
data (structured ndarray, sequence of tuples or dicts, or DataFrame) – Structured input data.
index (str, list of fields, array-like) – Field of array to use as the index, alternately a specific set of
input labels to use.
exclude (sequence, default None) – Columns or fields to exclude.
columns (sequence, default None) – Column names to use. If the passed data do not have names
associated with them, this argument provides names for the
columns. Otherwise this argument indicates the order of the columns
in the result (any names not found in the data will become all-NA
columns).
coerce_float (bool, default False) – Attempt to convert values of non-string, non-numeric objects (like
decimal.Decimal) to floating point, useful for SQL result sets.
nrows (int, default None) – Number of rows to read if data is an iterator.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
Group DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the
object, applying a function, and combining the results. This can be
used to group large amounts of data and compute operations on these
groups.
Parameters:
by (mapping, function, label, or list of labels) – Used to determine the groups for the groupby.
If by is a function, it’s called on each value of the object’s
index. If a dict or Series is passed, the Series or dict VALUES
will be used to determine the groups (the Series’ values are first
aligned; see .align() method). If a list or ndarray of length
equal to the selected axis is passed (see the groupby user guide),
the values are used as-is to determine the groups. A label or list
of labels may be passed to group by the columns in self.
Notice that a tuple is interpreted as a (single) key.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Split along rows (0) or columns (1). For Series this parameter
is unused and defaults to 0.
level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular
level or levels. Do not specify both by and level.
as_index (bool, default True) – For aggregated output, return object with group labels as the
index. Only relevant for DataFrame input. as_index=False is
effectively “SQL-style” grouped output.
sort (bool, default True) – Sort group keys. Get better performance by turning this off.
Note this does not influence the order of observations within each
group. Groupby preserves the order of rows within each group.
group_keys (bool, optional) –
When calling apply and the by argument produces a like-indexed
(i.e. a transform) result, add group keys to
index to identify pieces. By default group keys are not included
when the result’s index (and column) labels match the inputs, and
are included otherwise. This argument has no effect if the result produced
is not like-indexed with respect to the input.
Changed in version 1.5.0: Warns that group_keys will no longer be ignored when the
result from apply is a like-indexed Series or DataFrame.
Specify group_keys explicitly to include the group keys or
not.
squeeze (bool, default False) –
Reduce the dimensionality of the return type if possible,
otherwise return a consistent type.
Deprecated since version 1.1.0.
observed (bool, default False) – This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.
dropna (bool, default True) –
If True, and if group keys contain NA values, NA values together
with row/column will be dropped.
If False, NA values will also be treated as the key in groups.
New in version 1.1.0.
Returns:
Returns a groupby object that contains information about the groups.
Convenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more
detailed usage and examples, including splitting an object into groups,
iterating through groups, selecting a group, aggregation, and more.
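The as_index and dropna parameters described above can be illustrated with a short sketch. The frames and column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
                   "Max Speed": [380.0, 370.0, 24.0, 26.0]})

# Default as_index=True: the group labels become the index of the result.
means = df.groupby("Animal").mean()

# as_index=False: "SQL-style" output, group labels stay in a regular column.
flat = df.groupby("Animal", as_index=False).mean()

# dropna controls whether rows whose key is NA form their own group.
with_na = pd.DataFrame({"key": ["a", None, "a"], "val": [1, 2, 3]})
dropped = with_na.groupby("key")["val"].sum()              # NA key discarded
kept = with_na.groupby("key", dropna=False)["val"].sum()   # NA key is a group
```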
This function returns the first n rows for the object based
on position. It is useful for quickly testing if your object
has the right type of data in it.
For negative values of n, this function returns all rows except
the last |n| rows, equivalent to df[:n].
If n is larger than the number of rows, this function returns all rows.
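The three cases for n can be checked with a small, made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"animal": ["alligator", "bee", "falcon", "lion", "monkey"]})

first_two = df.head(2)          # the first two rows
all_but_last_two = df.head(-2)  # negative n: everything except the last two
everything = df.head(10)        # n larger than the frame: all five rows
```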
A histogram is a representation of the distribution of data.
This function calls matplotlib.pyplot.hist(), on each series in
the DataFrame, resulting in one histogram per column.
Parameters:
data (DataFrame) – The pandas object holding the data.
column (str or sequence, optional) – If passed, will be used to limit data to a subset of columns.
by (object, optional) – If passed, then used to form histograms for separate groups.
grid (bool, default True) – Whether to show axis grid lines.
xlabelsize (int, default None) – If specified changes the x-axis label size.
xrot (float, default None) – Rotation of x axis labels. For example, a value of 90 displays the
x labels rotated 90 degrees clockwise.
ylabelsize (int, default None) – If specified changes the y-axis label size.
yrot (float, default None) – Rotation of y axis labels. For example, a value of 90 displays the
y labels rotated 90 degrees clockwise.
ax (Matplotlib axes object, default None) – The axes to plot the histogram on.
sharex (bool, default True if ax is None else False) – In case subplots=True, share x axis and set some x axis labels to
invisible; defaults to True if ax is None otherwise False if an ax
is passed in.
Note that passing in both an ax and sharex=True will alter all x axis
labels for all subplots in a figure.
sharey (bool, default False) – In case subplots=True, share y axis and set some y axis labels to
invisible.
figsize (tuple, optional) – The size in inches of the figure to create. Uses the value in
matplotlib.rcParams by default.
layout (tuple, optional) – Tuple of (rows, columns) for the layout of the histograms.
bins (int or sequence, default 10) – Number of histogram bins to be used. If an integer is given, bins + 1
bin edges are calculated and returned. If bins is a sequence, gives
bin edges, including left edge of first bin and right edge of last
bin. In this case, bins is returned unmodified.
backend (str, default None) –
Backend to use instead of the backend specified in the option
plotting.backend. For instance, ‘matplotlib’. Alternatively, to
specify the plotting.backend for the whole session, set
pd.options.plotting.backend.
New in version 1.0.0.
legend (bool, default False) –
Whether to show the legend.
New in version 1.1.0.
**kwargs – All other plotting keyword arguments to be passed to
matplotlib.pyplot.hist().
Return type:
matplotlib.AxesSubplot or numpy.ndarray of them
See also
matplotlib.pyplot.hist
Plot a histogram using matplotlib.
Examples
This example draws a histogram based on the length and width of
some animals, displayed in three bins
Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped
columns, leaving non-object and unconvertible
columns unchanged. The inference rules are the
same as during normal Series/DataFrame construction.
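As a sketch of soft conversion, the hypothetical column below starts out with object dtype because it mixes a string with integers. Once the string row is sliced away the dtype is still object, and the method re-infers it.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", 1, 2, 3]})
df = df.iloc[1:]  # only integers remain, but the dtype is still object

converted = df.infer_objects()  # re-infers the column as an integer dtype
```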
This method prints information about a DataFrame including
the index dtype and columns, non-null values and memory usage.
Parameters:
verbose (bool, optional) – Whether to print the full summary. By default, the setting in
pandas.options.display.max_info_columns is followed.
buf (writable buffer, defaults to sys.stdout) – Where to send the output. By default, the output is printed to
sys.stdout. Pass a writable buffer if you need to further process
the output.
max_cols (int, optional) – When to switch from the verbose to the truncated output. If the
DataFrame has more than max_cols columns, the truncated output
is used. By default, the setting in
pandas.options.display.max_info_columns is used.
memory_usage (bool, str, optional) –
Specifies whether total memory usage of the DataFrame
elements (including the index) should be displayed. By default,
this follows the pandas.options.display.memory_usage setting.
True always shows memory usage. False never shows memory usage.
A value of ‘deep’ is equivalent to “True with deep introspection”.
Memory usage is shown in human-readable units (base-2
representation). Without deep introspection a memory estimation is
made based on column dtype and number of rows assuming values
consume the same memory amount for corresponding dtypes. With deep
memory introspection, a real memory usage calculation is performed
at the cost of computational resources. See the
Frequently Asked Questions for more
details.
show_counts (bool, optional) – Whether to show the non-null counts. By default, this is shown
only if the DataFrame is smaller than
pandas.options.display.max_info_rows and
pandas.options.display.max_info_columns. A value of True always
shows the counts, and False never shows the counts.
null_counts (bool, optional) –
Deprecated since version 1.2.0: Use show_counts instead.
Returns:
This method prints a summary of a DataFrame and returns None.
Return type:
None
See also
DataFrame.describe
Generate descriptive statistics of DataFrame columns.
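Because the method prints and returns None, capturing the summary requires the buf parameter. A minimal sketch with an invented frame:

```python
import io

import pandas as pd

df = pd.DataFrame({"int_col": [1, 2, 3], "text_col": ["a", "b", None]})

# Redirect the summary into an in-memory text buffer instead of sys.stdout.
buffer = io.StringIO()
result = df.info(buf=buffer, memory_usage="deep")

summary = buffer.getvalue()  # the text that would otherwise be printed
```

The captured `summary` string can then be logged or written to a file.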
Whether each element in the DataFrame is contained in values.
Parameters:
values (iterable, Series, DataFrame or dict) – The result will only be true at a location if all the
labels match. If values is a Series, that’s the index. If
values is a dict, the keys must be the column names,
which must match. If values is a DataFrame,
then both the index and column labels must match.
Returns:
DataFrame of booleans showing whether each element in the DataFrame
is contained in values.
Return type:
DataFrame
See also
DataFrame.eq
Equality test for DataFrame.
Series.isin
Equivalent method on Series.
Series.str.contains
Test if pattern or regex is contained within a string of a Series or Index.
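How label matching changes with the type of values can be sketched as follows; the frames and labels are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4], "num_wings": [2, 0]},
                  index=["falcon", "dog"])

# List: every cell is tested against the same collection of values.
in_list = df.isin([0, 2])

# Dict: keys are column names; columns not listed yield False everywhere.
in_dict = df.isin({"num_legs": [0, 2]})

# DataFrame: a cell is True only if the index label, the column label
# and the value all match.
other = pd.DataFrame({"num_legs": [8, 3], "num_wings": [0, 2]},
                     index=["spider", "falcon"])
in_frame = df.isin(other)
```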
Return a boolean same-sized object indicating if the values are NA.
NA values, such as None or numpy.NaN, get mapped to True
values.
Everything else gets mapped to False values. Characters such as empty
strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na=True).
Returns:
Mask of bool values for each element in DataFrame that
indicates whether an element is an NA value.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
Iterate over DataFrame rows as namedtuples of the values.
DataFrame.items
Iterate over (column name, Series) pairs.
Notes
Because iterrows returns a Series for each row,
it does not preserve dtypes across the rows (dtypes are
preserved across columns for DataFrames). For example,
To preserve dtypes while iterating over the rows, it is better
to use itertuples() which returns namedtuples of the values
and which is generally faster than iterrows.
You should never modify something you are iterating over.
This is not guaranteed to work in all cases. Depending on the
data types, the iterator returns a copy and not a view, and writing
to it will have no effect.
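The dtype note can be made concrete with a one-row frame (invented for this sketch) that mixes an integer and a float column:

```python
import pandas as pd

df = pd.DataFrame([[1, 1.5]], columns=["int", "float"])

# iterrows packs each row into a single Series, so the integer is upcast.
_, row = next(df.iterrows())

# itertuples keeps each value in its own field, so no upcasting happens.
first_tuple = next(df.itertuples(index=False))
```

The column `df["int"]` is an integer column, but `row["int"]` comes back as a float because the row Series needs one common dtype.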
index (bool, default True) – If True, return the index as the first element of the tuple.
name (str or None, default "Pandas") – The name of the returned namedtuples or None to return regular
tuples.
Returns:
An object to iterate over namedtuples for each row in the
DataFrame with the first field possibly being the index and
following fields being the column values.
Return type:
iterator
See also
DataFrame.iterrows
Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.items
Iterate over (column name, Series) pairs.
Notes
The column names will be renamed to positional names if they are
invalid Python identifiers, repeated, or start with an underscore.
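A short sketch of the index and name parameters and of the renaming rule. The frames are hypothetical, and the renamed field names shown rely on the positional-renaming behaviour of Python's namedtuple.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [4, 2], "num_wings": [0, 2]},
                  index=["dog", "hawk"])

default_rows = list(df.itertuples())                            # index first
animal_rows = list(df.itertuples(index=False, name="Animal"))   # custom name
plain_rows = list(df.itertuples(index=False, name=None))        # plain tuples

# Labels that are not valid identifiers, or that start with an
# underscore, are replaced by positional field names.
odd = pd.DataFrame({"my col": [1], "_hidden": [2]})
renamed = next(odd.itertuples(index=False))
```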
Join columns with other DataFrame either on index or on a key
column. Efficiently join multiple DataFrame objects by index at once by
passing a list.
Parameters:
other (DataFrame, Series, or a list containing any combination of them) – Index should be similar to one of the columns in this one. If a
Series is passed, its name attribute must be set, and that will be
used as the column name in the resulting joined DataFrame.
on (str, list of str, or array-like, optional) – Column or index level name(s) in the caller to join on the index
in other, otherwise joins index-on-index. If multiple
values given, the other DataFrame must have a MultiIndex. Can
pass an array as the join key if it is not already contained in
the calling DataFrame. Like an Excel VLOOKUP operation.
how ({'left', 'right', 'outer', 'inner'}, default 'left') –
How to handle the operation of the two objects.
left: use calling frame’s index (or column if on is specified)
right: use other’s index.
outer: form union of calling frame’s index (or column if on is
specified) with other’s index, and sort it
lexicographically.
inner: form intersection of calling frame’s index (or column if
on is specified) with other’s index, preserving the order
of the calling’s one.
cross: creates the cartesian product from both frames, preserves the order
of the left keys.
New in version 1.2.0.
lsuffix (str, default '') – Suffix to use from left frame’s overlapping columns.
rsuffix (str, default '') – Suffix to use from right frame’s overlapping columns.
sort (bool, default False) – Order result DataFrame lexicographically by the join key. If False,
the order of the join key depends on the join type (how keyword).
validate (str, optional) –
If specified, checks if join is of specified type.
“one_to_one” or “1:1”: check if join keys are unique in both left
and right datasets.
“one_to_many” or “1:m”: check if join keys are unique in left dataset.
“many_to_one” or “m:1”: check if join keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
New in version 1.5.0.
Returns:
A dataframe containing columns from both the caller and other.
Return type:
DataFrame
See also
DataFrame.merge
For column(s)-on-column(s) operations.
Notes
Parameters on, lsuffix, and rsuffix are not supported when
passing a list of DataFrame objects.
Support for specifying index levels as the on parameter was added
in version 0.23.0.
>>> df.join(other, lsuffix='_caller', rsuffix='_other')
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN
If we want to join using the key columns, we need to set key to be
the index in both df and other. The joined DataFrame will have
key as its index.
>>> df.set_index('key').join(other.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN
Another option to join using the key columns is to use the on
parameter. DataFrame.join always uses other’s index but we can use
any column in df. This method preserves the original DataFrame’s
index in the result.
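A sketch of the on approach follows. The definitions of `df` and `other` are reconstructed here from the outputs shown above and should be treated as assumptions.

```python
import pandas as pd

# Assumed inputs, inferred from the printed results of the earlier examples.
df = pd.DataFrame({"key": ["K0", "K1", "K2", "K3", "K4", "K5"],
                   "A": ["A0", "A1", "A2", "A3", "A4", "A5"]})
other = pd.DataFrame({"key": ["K0", "K1", "K2"],
                      "B": ["B0", "B1", "B2"]})

# The caller's "key" column is matched against the index of other;
# the caller's original integer index is kept in the result.
joined = df.join(other.set_index("key"), on="key")
```

Unlike the set_index variant, `joined` keeps key as an ordinary column and retains the row labels 0 to 5.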
Select final periods of time series data based on a date offset.
For a DataFrame with a sorted DatetimeIndex, this function
selects the last few rows based on a date offset.
Parameters:
offset (str, DateOffset, dateutil.relativedelta) – The offset length of the data that will be selected. For instance,
‘3D’ will display all the rows having their index within the last 3 days.
Notice that the data for the last 3 calendar days were returned, not the last
3 observed days in the dataset, and therefore data for 2018-04-11 was
not returned.
Label-based “fancy indexing” function for DataFrame.
Deprecated since version 1.2.0: DataFrame.lookup is deprecated,
use pandas.factorize and NumPy indexing instead.
For further details see
Looking up values by index/column labels.
Given equal-length arrays of row and column labels, return an
array of the values corresponding to each (row, col) pair.
Parameters:
row_labels (sequence) – The row labels to use for lookup.
col_labels (sequence) – The column labels to use for lookup.
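The replacement suggested by the deprecation note can be sketched as below. The frame and the helper column `col`, which names the column to read for each row, are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Each row names, in "col", which column its value should be taken from.
df = pd.DataFrame({"col": ["A", "A", "B", "B"],
                   "A": [80, 23, np.nan, 22],
                   "B": [80, 55, 76, 67]})

# factorize turns the labels into integer codes plus the unique labels.
idx, cols = pd.factorize(df["col"])

# Order the columns by the unique labels, then pick one cell per row
# with NumPy integer indexing.
values = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
```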
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
This function is useful to massage a DataFrame into a format where one
or more columns are identifier variables (id_vars), while all other
columns, considered measured variables (value_vars), are “unpivoted” to
the row axis, leaving just two non-identifier columns, ‘variable’ and
‘value’.
Parameters:
id_vars (tuple, list, or ndarray, optional) – Column(s) to use as identifier variables.
value_vars (tuple, list, or ndarray, optional) – Column(s) to unpivot. If not specified, uses all columns that
are not set as id_vars.
var_name (scalar) – Name to use for the ‘variable’ column. If None it uses
frame.columns.name or ‘variable’.
value_name (scalar, default 'value') – Name to use for the ‘value’ column.
col_level (int or str, optional) – If columns are a MultiIndex then use this level to melt.
ignore_index (bool, default True) –
If True, original index is ignored. If False, the original index is retained.
Index labels will be repeated as necessary.
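The parameters above can be tried on a small wide frame. The data and the names `measure` and `reading` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c"], "B": [1, 3, 5], "C": [2, 4, 6]})

# Keep A as the identifier; unpivot B and C into two custom-named columns.
long = df.melt(id_vars=["A"], value_vars=["B", "C"],
               var_name="measure", value_name="reading")

# ignore_index=False retains the original row labels.
kept = df.melt(id_vars=["A"], value_vars=["B"], ignore_index=False)
```

`long` has six rows, three for each unpivoted column, stacked in the order B then C.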
The memory usage can optionally include the contribution of
the index and elements of object dtype.
This value is displayed in DataFrame.info by default. This can be
suppressed by setting pandas.options.display.memory_usage to False.
Parameters:
index (bool, default True) – Specifies whether to include the memory usage of the DataFrame’s
index in returned Series. If index=True, the memory usage of
the index is the first item in the output.
deep (bool, default False) – If True, introspect the data deeply by interrogating
object dtypes for system-level memory consumption, and include
it in the returned values.
Returns:
A Series whose index is the original column names and whose values
are the memory usage of each column in bytes.
Return type:
Series
See also
numpy.ndarray.nbytes
Total bytes consumed by the elements of an ndarray.
Series.memory_usage
Bytes consumed by a Series.
Categorical
Memory-efficient array for string values with many repeated values.
DataFrame.info
Concise summary of a DataFrame.
Notes
See the Frequently Asked Questions for more
details.
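A brief sketch of the index and deep parameters on a made-up frame follows. The exact byte counts for the text column depend on the pandas version and string storage, so only the integer column is stated exactly.

```python
import pandas as pd

df = pd.DataFrame({"ints": range(1000), "text": ["x" * 10] * 1000})

shallow = df.memory_usage()              # first entry is the index
no_index = df.memory_usage(index=False)  # column entries only
deep = df.memory_usage(deep=True)        # also inspects object values
```

The 1000 64-bit integers account for 8000 bytes; for object-dtype columns the deep figure is typically larger than the shallow one.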
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes will be ignored. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.
When performing a cross merge, no column specifications to merge on are
allowed.
Warning
If both key columns contain rows where the key is a null value, those
rows will be matched against each other. This is different from usual SQL
join behaviour and can lead to unexpected results.
Parameters:
right (DataFrame or named Series) – Object to merge with.
how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'inner') –
Type of merge to be performed.
left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
right: use only keys from right frame, similar to a SQL right outer join;
preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves the order
of the left keys.
New in version 1.2.0.
on (label or list) – Column or index level names to join on. These must be found in both
DataFrames. If on is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
left_on (label or list, or array-like) – Column or index level names to join on in the left DataFrame. Can also
be an array or list of arrays of the length of the left DataFrame.
These arrays are treated as if they are columns.
right_on (label or list, or array-like) – Column or index level names to join on in the right DataFrame. Can also
be an array or list of arrays of the length of the right DataFrame.
These arrays are treated as if they are columns.
left_index (bool, default False) – Use the index from the left DataFrame as the join key(s). If it is a
MultiIndex, the number of keys in the other DataFrame (either the index
or a number of columns) must match the number of levels.
right_index (bool, default False) – Use the index from the right DataFrame as the join key. Same caveats as
left_index.
sort (bool, default False) – Sort the join keys lexicographically in the result DataFrame. If False,
the order of the join keys depends on the join type (how keyword).
suffixes (list-like, default is ("_x", "_y")) – A length-2 sequence where each element is optionally a string
indicating the suffix to add to overlapping column names in
left and right respectively. Pass a value of None instead
of a string to indicate that the column name from left or
right should be left as-is, with no suffix. At least one of the
values must not be None.
copy (bool, default True) – If False, avoid copy if possible.
indicator (bool or str, default False) – If True, adds a column to the output DataFrame called “_merge” with
information on the source of each row. The column can be given a different
name by providing a string argument. The column will have a Categorical
type with the value of “left_only” for observations whose merge key only
appears in the left DataFrame, “right_only” for observations
whose merge key only appears in the right DataFrame, and “both”
if the observation’s merge key is found in both DataFrames.
validate (str, optional) –
If specified, checks if merge is of specified type.
“one_to_one” or “1:1”: check if merge keys are unique in both
left and right datasets.
“one_to_many” or “1:m”: check if merge keys are unique in left
dataset.
“many_to_one” or “m:1”: check if merge keys are unique in right
dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
Returns:
A DataFrame of the two merged objects.
Return type:
DataFrame
See also
merge_ordered
Merge with optional filling/interpolation.
merge_asof
Merge on nearest keys.
DataFrame.join
Similar method using indices.
Notes
Support for specifying index levels as the on, left_on, and
right_on parameters was added in version 0.23.0
Support for merging named Series objects was added in version 0.24.0
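The how, suffixes, indicator and validate parameters can be seen together in a small sketch. The frames `left`, `right` and `dup` are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "value": [20, 30, 40]})

# Default inner merge; the overlapping "value" column gets _x / _y suffixes.
inner = left.merge(right, on="key")

# Outer merge with a source indicator and a uniqueness check on both sides.
outer = left.merge(right, on="key", how="outer",
                   indicator=True, validate="one_to_one")

# validate raises MergeError when the stated relationship does not hold.
dup = pd.DataFrame({"key": ["b", "b"], "value": [1, 2]})
try:
    left.merge(dup, on="key", validate="one_to_one")
    check_failed = False
except pd.errors.MergeError:
    check_failed = True
```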
Get Modulo of dataframe and other, element-wise (binary operator mod).
Equivalent to dataframe%other, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, rmod.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Get the mode(s) of each element along the selected axis.
The mode of a set of values is the value that appears most often.
It can be multiple values.
Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) –
The axis to iterate over while searching for the mode:
0 or ‘index’ : get mode of each column
1 or ‘columns’ : get mode of each row.
numeric_only (bool, default False) – If True, only apply to numeric columns.
dropna (bool, default True) – Don’t consider counts of NaN/NaT.
Returns:
The modes of each column or row.
Return type:
DataFrame
See also
Series.mode
Return the highest frequency value in a Series.
Series.value_counts
Return the counts of values in a Series.
Examples
>>> df = pd.DataFrame([('bird', 2, 2),
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))
>>> df
           species  legs  wings
falcon        bird     2    2.0
horse       mammal     4    NaN
spider   arthropod     8    0.0
ostrich       bird     2    NaN
By default, missing values are not considered, and the modes of wings
are both 0 and 2. Because the resulting DataFrame has two rows,
the second row of species and legs contains NaN.
>>> df.mode()
  species  legs  wings
0    bird   2.0    0.0
1     NaN   NaN    2.0
Setting dropna=False, NaN values are considered and they can be
the mode (like for wings).
>>> df.mode(dropna=False)
  species  legs  wings
0    bird     2    NaN
Setting numeric_only=True, only the mode of numeric columns is
computed, and columns of other types are ignored.
>>> df.mode(numeric_only=True)
   legs  wings
0   2.0    0.0
1   NaN    2.0
To compute the mode over columns and not rows, use the axis parameter:
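A sketch of the row-wise case, rebuilding the frame from the example above so the block runs on its own. Passing numeric_only=True is a choice made here so that the species strings do not take part in the comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([("bird", 2, 2),
                   ("mammal", 4, np.nan),
                   ("arthropod", 8, 0),
                   ("bird", 2, np.nan)],
                  index=("falcon", "horse", "spider", "ostrich"),
                  columns=("species", "legs", "wings"))

# One result row per original row; extra columns hold additional modes.
row_modes = df.mode(axis="columns", numeric_only=True)
```

For spider, legs (8) and wings (0) each occur once, so both values are modes and the result needs two columns; rows with a single mode are padded with NaN.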
Get Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to dataframe*other, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, rmul.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
Return the first n rows ordered by columns in descending order.
Return the first n rows with the largest values in columns, in
descending order. The columns that are not specified are returned as
well, but not used for ordering.
This method is equivalent to
df.sort_values(columns, ascending=False).head(n), but more
performant.
Parameters:
n (int) – Number of rows to return.
columns (label or list of labels) – Column label(s) to order by.
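The stated equivalence with sort_values can be checked on a small frame. This is only a sketch; the frame, its column names, and the choice of distinct scores (so that ties play no role) are assumptions made for the illustration.

```python
import pandas as pd

# Illustrative data with distinct scores so tie handling does not matter.
scores = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [3, 9, 5, 7]})

top_two = scores.nlargest(2, "score")
via_sort = scores.sort_values("score", ascending=False).head(2)

print(top_two)
print(top_two.equals(via_sort))
```

The "name" column is carried along in the result even though it is not used for ordering.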
Return a boolean same-sized object indicating if the values are not NA.
Non-missing values get mapped to True. Characters such as empty
strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na=True).
NA values, such as None or numpy.NaN, get mapped to False
values.
Returns:
Mask of bool values for each element in DataFrame that
indicates whether an element is not an NA value.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True
DataFrame.notnull is an alias for DataFrame.notna.
Return the first n rows ordered by columns in ascending order.
Return the first n rows with the smallest values in columns, in
ascending order. The columns that are not specified are returned as
well, but not used for ordering.
This method is equivalent to
df.sort_values(columns, ascending=True).head(n), but more
performant.
Parameters:
n (int) – Number of items to retrieve.
columns (list or str) – Column name or names to order by.
Percentage change between the current and a prior element.
Computes the percentage change from the immediately previous row by
default. This is useful in comparing the percentage of change in a time
series of elements.
Parameters:
periods (int, default 1) – Periods to shift for forming percent change.
fill_method (str, default 'pad') – How to handle NAs before computing percent changes.
limit (int, default None) – The number of consecutive NAs to fill before stopping.
freq (DateOffset, timedelta, or str, optional) – Increment to use from time series API (e.g. ‘M’ or BDay()).
**kwargs – Additional keyword arguments are passed into
DataFrame.shift or Series.shift.
Returns:
chg – The same type as the calling object.
Return type:
Series or DataFrame
See also
Series.diff
Compute the difference of two elements in a Series.
DataFrame.diff
Compute the difference of two elements in a DataFrame.
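As a rough illustration, the percentage change with the default periods=1 matches dividing each element by the previous one and subtracting 1. The series values below are made up and contain no missing data, so fill_method plays no role.

```python
import pandas as pd

# Illustrative price series without missing values.
prices = pd.Series([100.0, 110.0, 99.0])

changes = prices.pct_change()               # compare with the previous row
manual = prices / prices.shift(1) - 1       # the same computation spelled out
two_back = prices.pct_change(periods=2)     # compare with the row two steps back

print(changes)
print(two_back)
```

The first element has no predecessor, so its change is NaN; with periods=2 the first two elements are NaN.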
Apply chainable functions that expect Series or DataFrames.
Parameters:
func (function) – Function to apply to the Series/DataFrame.
args, and kwargs are passed into func.
Alternatively a (callable, data_keyword) tuple where
data_keyword is a string indicating the keyword of
callable that expects the Series/DataFrame.
args (iterable, optional) – Positional arguments passed into func.
kwargs (mapping, optional) – A dictionary of keyword arguments passed into func.
Returns:
object
Return type:
the return type of func.
See also
DataFrame.apply
Apply a function along input axis of DataFrame.
DataFrame.applymap
Apply a function elementwise on a whole DataFrame.
Series.map
Apply a mapping correspondence on a Series.
Notes
Use .pipe when chaining together functions that expect
Series, DataFrames or GroupBy objects, instead of nesting the
function calls inside one another.
If you have a function that takes the data as (say) the second
argument, pass a tuple indicating which keyword expects the
data. For example, suppose f takes its data as arg2:
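Both calling styles can be sketched as below. The helper functions add_constant and scale, and the keyword name data, are hypothetical and exist only for this illustration.

```python
import pandas as pd


def add_constant(frame, amount):
    """Hypothetical helper that takes the data as its first argument."""
    return frame + amount


def scale(factor, data):
    """Hypothetical helper that takes the data through the keyword `data`."""
    return data * factor


df = pd.DataFrame({"a": [1, 2]})

# Nested form: reads inside-out.
nested = scale(10, data=add_constant(df, 1))

# Chained form: the (callable, data_keyword) tuple tells pipe which keyword
# of `scale` should receive the DataFrame.
chained = df.pipe(add_constant, 1).pipe((scale, "data"), 10)

print(chained)
```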
Return reshaped DataFrame organized by given index / column values.
Reshape data (produce a “pivot” table) based on column values. Uses
unique values from specified index / columns to form axes of the
resulting DataFrame. This function does not support data
aggregation, multiple values will result in a MultiIndex in the
columns. See the User Guide for more on reshaping.
Parameters:
index (str or object or a list of str, optional) –
Column to use to make new frame’s index. If None, uses
existing index.
Changed in version 1.1.0: Also accept list of index names.
columns (str or object or a list of str) –
Column to use to make new frame’s columns.
Changed in version 1.1.0: Also accept list of columns names.
values (str, object or a list of the previous, optional) – Column(s) to use for populating new frame’s values. If not
specified, all remaining columns will be used and the result will
have hierarchically indexed columns.
Returns:
Returns reshaped DataFrame.
Return type:
DataFrame
Raises:
ValueError – When there are any index, columns combinations with multiple
values. Use DataFrame.pivot_table when you need to aggregate.
See also
DataFrame.pivot_table
Generalization of pivot that can handle duplicate values for one index/column pair.
DataFrame.unstack
Pivot based on the index values instead of a column.
wide_to_long
Wide panel to long format. Less flexible but more user-friendly than melt.
Notes
For finer-tuned control, see hierarchical indexing documentation along
with the related stack/unstack methods.
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
...                            'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
>>> df
   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t
>>> df.pivot(index='foo', columns='bar', values='baz')
bar  A  B  C
foo
one  1  2  3
two  4  5  6
>>> df.pivot(index='foo', columns='bar')['baz']
bar  A  B  C
foo
one  1  2  3
two  4  5  6
>>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
    baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t
You could also assign a list of column names or a list of index names.
>>> df.pivot(index=["lev1", "lev2"], columns=["lev3"], values="values")
      lev3    1    2
lev1  lev2
   1     1  0.0  1.0
         2  2.0  NaN
   2     1  4.0  3.0
         2  NaN  5.0
A ValueError is raised if there are any duplicates.
>>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
...                    "bar": ['A', 'A', 'B', 'C'],
...                    "baz": [1, 2, 3, 4]})
>>> df
   foo bar  baz
0  one   A    1
1  one   A    2
2  two   B    3
3  two   C    4
Notice that the first two rows are the same for our index
and columns arguments.
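A short sketch of what happens with such duplicates, using the same four-row frame. Catching the exception and falling back to pivot_table with a sum is one possible way to handle it, chosen here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"foo": ["one", "one", "two", "two"],
                   "bar": ["A", "A", "B", "C"],
                   "baz": [1, 2, 3, 4]})

# ("one", "A") occurs twice, so pivot cannot place a single value there.
try:
    df.pivot(index="foo", columns="bar", values="baz")
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# pivot_table aggregates the duplicated pair instead of raising.
summed = df.pivot_table(index="foo", columns="bar", values="baz", aggfunc="sum")
print(summed)
```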
Create a spreadsheet-style pivot table as a DataFrame.
The levels in the pivot table will be stored in MultiIndex objects
(hierarchical indexes) on the index and columns of the result DataFrame.
Parameters:
values (column to aggregate, optional) –
index (column, Grouper, array, or list of the previous) – If an array is passed, it must be the same length as the data. The
list can contain any of the other types (except list).
Keys to group by on the pivot table index. If an array is passed,
it is used in the same manner as column values.
columns (column, Grouper, array, or list of the previous) – If an array is passed, it must be the same length as the data. The
list can contain any of the other types (except list).
Keys to group by on the pivot table column. If an array is passed,
it is used in the same manner as column values.
aggfunc (function, list of functions, dict, default numpy.mean) – If list of functions passed, the resulting pivot table will have
hierarchical columns whose top level are the function names
(inferred from the function objects themselves)
If dict is passed, the key is column to aggregate and value
is function or list of functions.
fill_value (scalar, default None) – Value to replace missing values with (in the resulting pivot table,
after aggregation).
margins (bool, default False) – Add all row / columns (e.g. for subtotal / grand totals).
dropna (bool, default True) – Do not include columns whose entries are all NaN. If True,
rows with a NaN value in any column will be omitted before
computing margins.
margins_name (str, default 'All') – Name of the row / column that will contain the totals
when margins is True.
observed (bool, default False) –
This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.
Changed in version 0.25.0.
sort (bool, default True) –
Specifies if the result should be sorted.
New in version 1.3.0.
Returns:
An Excel style pivot table.
Return type:
DataFrame
See also
DataFrame.pivot
Pivot without aggregation that can handle non-numeric data.
DataFrame.melt
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
wide_to_long
Wide panel to long format. Less flexible but more user-friendly than melt.
Notes
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
...                          "bar", "bar", "bar", "bar"],
...                    "B": ["one", "one", "one", "two", "two",
...                          "one", "one", "two", "two"],
...                    "C": ["small", "large", "large", "small",
...                          "small", "large", "small", "small",
...                          "large"],
...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
...                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
>>> df
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9
This first example aggregates values by taking the sum.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc=np.sum)
>>> table
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
We can also fill missing values using the fill_value parameter.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc=np.sum, fill_value=0)
>>> table
C        large  small
A   B
bar one      4      5
    two      7      6
foo one      4      1
    two      0      6
The next example aggregates by taking the mean across multiple columns.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
...                        aggfunc={'D': np.mean,
...                                 'E': np.mean})
>>> table
                  D         E
A   C
bar large  5.500000  7.500000
    small  5.500000  8.500000
foo large  2.000000  4.500000
    small  2.333333  4.333333
We can also calculate multiple types of aggregations for any given
value column.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
...                        aggfunc={'D': np.mean,
...                                 'E': [min, max, np.mean]})
>>> table
                  D   E
               mean max      mean  min
A   C
bar large  5.500000   9  7.500000    6
    small  5.500000   9  8.500000    8
foo large  2.000000   5  4.500000    4
    small  2.333333   6  4.333333    2
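The margins and margins_name parameters are not covered by the examples above. A minimal sketch, assuming a small made-up sales frame (the column names and the label "Total" are illustrative):

```python
import pandas as pd

# Illustrative data: one sales figure per (region, product) pair.
sales = pd.DataFrame({"region": ["N", "N", "S", "S"],
                      "product": ["x", "y", "x", "y"],
                      "sales": [1, 2, 3, 4]})

# margins=True appends a totals row and a totals column,
# labelled by margins_name instead of the default 'All'.
table = pd.pivot_table(sales, values="sales", index="region",
                       columns="product", aggfunc="sum",
                       margins=True, margins_name="Total")
print(table)
```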
Get Exponential power of dataframe and other, element-wise (binary operator pow).
Equivalent to dataframe**other, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, rpow.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Return the product of the values over the requested axis.
Parameters:
axis ({index (0), columns (1)}) – Axis for the function to be applied on.
For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
level (int or level name, default None) –
If the axis is a MultiIndex (hierarchical), count along a
particular level, collapsing into a Series.
Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.
numeric_only (bool, default None) –
Include only float, int, boolean columns. If None, will attempt to use
everything, then use only numeric data. Not implemented for Series.
Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be
False in a future version of pandas.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than
min_count non-NA values are present the result will be NA.
**kwargs – Additional keyword arguments to be passed to the function.
Return type:
Series or DataFrame (if level specified)
See also
Series.sum
Return the sum.
Series.min
Return the minimum.
Series.max
Return the maximum.
Series.idxmin
Return the index of the minimum.
Series.idxmax
Return the index of the maximum.
DataFrame.sum
Return the sum over the requested axis.
DataFrame.min
Return the minimum over the requested axis.
DataFrame.max
Return the maximum over the requested axis.
DataFrame.idxmin
Return the index of the minimum over the requested axis.
DataFrame.idxmax
Return the index of the maximum over the requested axis.
Examples
By default, the product of an empty or all-NA Series is 1
>>> pd.Series([], dtype="float64").prod()
1.0
This can be controlled with the min_count parameter
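A brief sketch of how min_count changes the result for empty and all-NA input. The series are made up for the illustration.

```python
import numpy as np
import pandas as pd

empty = pd.Series([], dtype="float64")
all_na = pd.Series([np.nan])
values = pd.Series([2.0, 3.0])

default_empty = empty.prod()               # no valid values required
strict_empty = empty.prod(min_count=1)     # at least one valid value required
strict_all_na = all_na.prod(min_count=1)   # NaN is skipped, so none are valid
strict_values = values.prod(min_count=1)   # two valid values, requirement met

print(default_empty, strict_empty, strict_all_na, strict_values)
```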
This optional parameter specifies the interpolation method to use,
when the desired quantile lies between two data points i and j:
linear: i + (j - i) * fraction, where fraction is the
fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j whichever is nearest.
midpoint: (i + j) / 2.
method ({'single', 'table'}, default 'single') – Whether to compute quantiles per-column (‘single’) or over all columns
(‘table’). When ‘table’, the only allowed interpolation methods are
‘nearest’, ‘lower’, and ‘higher’.
Returns:
If q is an array, a DataFrame will be returned where the
index is q, the columns are the columns of self, and the
values are the quantiles.
If q is a float, a Series will be returned where the
index is the columns of self and the values are the quantiles.
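The interpolation options and the two return shapes can be sketched on a small frame. The data are illustrative; q=0.4 is chosen so that the quantile falls between the second and third values of each column without being exactly halfway.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# A float q gives a Series indexed by the column names.
linear = frame.quantile(0.4)
lower = frame.quantile(0.4, interpolation="lower")
higher = frame.quantile(0.4, interpolation="higher")
nearest = frame.quantile(0.4, interpolation="nearest")
midpoint = frame.quantile(0.4, interpolation="midpoint")

# An array-like q gives a DataFrame indexed by q.
several = frame.quantile([0.25, 0.75])

print(linear)
print(several)
```

For column a the surrounding data points are i = 2 and j = 3, so lower yields 2, higher yields 3, midpoint yields 2.5, and linear lands between them at roughly 2.2.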
Query the columns of a DataFrame with a boolean expression.
Parameters:
expr (str) –
The query string to evaluate.
You can refer to variables
in the environment by prefixing them with an ‘@’ character like
@a + b.
You can refer to column names that are not valid Python variable names
by surrounding them in backticks. Thus, column names containing spaces
or punctuations (besides underscores) or starting with digits must be
surrounded by backticks. (For example, a column named “Area (cm^2)” would
be referenced as `Area (cm^2)`). Column names which are Python keywords
(like “list”, “for”, “import”, etc) cannot be used.
For example, if one of your columns is called a a and you want
to sum it with b, your query should be `a a` + b.
New in version 0.25.0: Backtick quoting introduced.
New in version 1.0.0: Expanding functionality of backtick quoting for more than only spaces.
inplace (bool) – Whether to modify the DataFrame rather than creating a new one.
**kwargs – See the documentation for eval() for complete details
on the keyword arguments accepted by DataFrame.query().
Returns:
DataFrame resulting from the provided query expression or
None if inplace=True.
See also
eval
Evaluate a string describing operations on DataFrame columns.
DataFrame.eval
Evaluate a string describing operations on DataFrame columns.
Notes
The result of the evaluation of this expression is first passed to
DataFrame.loc and if that fails because of a
multidimensional key (e.g., a DataFrame) then the result will be passed
to DataFrame.__getitem__().
This method uses the top-level eval() function to
evaluate the passed query.
The query() method uses a slightly
modified Python syntax by default. For example, the & and |
(bitwise) operators have the precedence of their boolean cousins,
and and or. This is syntactically valid Python,
however the semantics are different.
You can change the semantics of the expression by passing the keyword
argument parser='python'. This enforces the same semantics as
evaluation in Python space. Likewise, you can pass engine='python'
to evaluate an expression using Python itself as a backend. This is not
recommended as it is inefficient compared to using numexpr as the
engine.
The DataFrame.index and
DataFrame.columns attributes of the
DataFrame instance are placed in the query namespace
by default, which allows you to treat both the index and columns of the
frame as a column in the frame.
The identifier index is used for the frame index; you can also
use the name of the index to identify it in a query. Please note that
Python keywords may not be used as identifiers.
For further details and examples see the query documentation in
indexing.
Backtick quoted variables
Backtick quoted variables are parsed as literal Python code and
are converted internally to a Python valid identifier.
This can lead to the following problems.
During parsing a number of disallowed characters inside the backtick
quoted string are replaced by strings that are allowed as a Python identifier.
These characters include all operators in Python, the space character, the
question mark, the exclamation mark, the dollar sign, and the euro sign.
For other characters that fall outside the ASCII range (U+0001..U+007F)
and those that are not further specified in PEP 3131,
the query parser will raise an error.
This excludes whitespace different than the space character,
but also the hashtag (as it is used for comments) and the backtick
itself (backtick can also not be escaped).
In a special case, quotes that make a pair around a backtick can
confuse the parser.
For example, `it's` > `that's` will raise an error,
as it forms a quoted string ('s > `that') with a backtick inside.
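The main query features described above (the @ prefix for environment variables, backtick quoting, and the index identifier) can be sketched as follows. The frame, its column names, and the variable threshold are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [3, 2, 1],
                   "Area (cm^2)": [10, 20, 30]})

threshold = 1  # a plain Python variable, referenced with '@' in the query

by_variable = df.query("a > @threshold")
by_backtick = df.query("`Area (cm^2)` >= 20")   # name is not a valid identifier
by_index = df.query("index > 0")                # the frame index is in scope
combined = df.query("a < b and b > 2")

print(by_variable)
print(combined)
```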
Get Addition of dataframe and other, element-wise (binary operator radd).
Equivalent to other+dataframe, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, add.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
The following example shows how the method behaves with the above
parameters:
default_rank: this is the default behaviour obtained without using
any parameter.
max_rank: setting method='max' the records that have the
same values are ranked using the highest rank (e.g.: since ‘cat’
and ‘dog’ are both in the 2nd and 3rd position, rank 3 is assigned.)
NA_bottom: choosing na_option='bottom', if there are records
with NaN values they are placed at the bottom of the ranking.
pct_rank: when setting pct=True, the ranking is expressed as
percentile rank.
>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN
Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other/dataframe, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, truediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Conform Series/DataFrame to new index with optional filling logic.
Places NA/NaN in locations having no value in the previous index. A new object
is produced unless the new index is equivalent to the current one and
copy=False.
Parameters:
keywords for axes (array-like, optional) – New labels / index to conform to, should be specified using
keywords. Preferably an Index object to avoid duplicating data.
method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) – Method to use for filling holes in reindexed DataFrame.
Please note: this is only applicable to DataFrames/Series with a
monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: Propagate last valid observation forward to next
valid.
backfill / bfill: Use next valid observation to fill gap.
nearest: Use nearest valid observations to fill gap.
copy (bool, default True) – Return a new object, even if the passed indexes are the same.
level (int or name) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any
“compatible” value.
limit (int, default None) – Maximum number of consecutive elements to forward or backward fill.
tolerance (optional) –
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations must
satisfy the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance
to all values, or list-like, which applies variable tolerance per
element. List-like includes list, tuple, array, Series, and must be
the same size as the index and its dtype must exactly match the
index’s type.
Return type:
Series/DataFrame with changed index.
See also
DataFrame.set_index
Set row labels.
DataFrame.reset_index
Remove row labels or move them to new columns.
DataFrame.reindex_like
Change to same indices as other DataFrame.
Examples
DataFrame.reindex supports two calling conventions
(index=index_labels, columns=column_labels, ...)
(labels, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your
intent.
Create a new index and reindex the dataframe. By default
values in the new index that do not have corresponding
records in the dataframe are assigned NaN.
>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...              'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02
We can fill in the missing values by passing a value to
the keyword fill_value. Because the index is not monotonically
increasing or decreasing, we cannot use arguments to the keyword
method to fill the NaN values.
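A sketch of fill_value on a browser-statistics frame. The frame below is an assumed reconstruction that agrees with the output shown above for Safari, IE10, and Chrome; the remaining rows are illustrative.

```python
import pandas as pd

# Assumed starting frame; only Safari, IE10 and Chrome appear in the output above.
df = pd.DataFrame({"http_status": [200, 200, 404, 404, 301],
                   "response_time": [0.04, 0.02, 0.07, 0.08, 1.0]},
                  index=["Firefox", "Chrome", "Safari", "IE10", "Konqueror"])

new_index = ["Safari", "Iceweasel", "Comodo Dragon", "IE10", "Chrome"]

default = df.reindex(new_index)               # new labels get NaN
filled = df.reindex(new_index, fill_value=0)  # new labels get 0 instead

print(filled)
```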
To further illustrate the filling functionality in
reindex, we will create a dataframe with a
monotonically increasing index (for example, a sequence
of dates).
The index entries that did not have a value in the original data frame
(for example, ‘2009-12-29’) are by default filled with NaN.
If desired, we can fill in the missing values using one of several
options.
For example, to back-propagate the last valid value to fill the NaN
values, pass bfill as an argument to the method keyword.
Please note that the NaN value present in the original dataframe
(at index value 2010-01-03) will not be filled by any of the
value propagation schemes. This is because filling while reindexing
does not look at dataframe values, but only compares the original and
desired indexes. If you do want to fill in the NaN values present
in the original dataframe, use the fillna() method.
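The date-based filling described above can be sketched as follows. The price values are made up; the dates follow the ones mentioned in the text, with a NaN placed at 2010-01-03 in the original frame.

```python
import numpy as np
import pandas as pd

date_index = pd.date_range("2010-01-01", periods=6, freq="D")
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
                   index=date_index)

# Wider target index: starts three days earlier and ends one day later.
date_index2 = pd.date_range("2009-12-29", periods=10, freq="D")

default = df2.reindex(date_index2)
bfilled = df2.reindex(date_index2, method="bfill")

# NaN values that were already in the data are only filled by fillna.
cleaned = bfilled["prices"].fillna(0)

print(bfilled)
```

The three leading dates take the next valid observation (100), the trailing date has no later observation and stays NaN, and the NaN at 2010-01-03 is untouched by the reindexing itself.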
Return an object with matching indices as other object.
Conform the object to the same index on all axes. Optional
filling logic, placing NaN in locations having no value
in the previous index. A new object is produced unless the
new index is equivalent to the current one and copy=False.
Parameters:
other (Object of the same data type) – Its row and column indices are used to define the new indices
of this object.
method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) – Method to use for filling holes in reindexed DataFrame.
Please note: this is only applicable to DataFrames/Series with a
monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: propagate last valid observation forward to next
valid
backfill / bfill: use next valid observation to fill gap
nearest: use nearest valid observations to fill gap.
copy (bool, default True) – Return a new object, even if the passed indexes are the same.
limit (int, default None) – Maximum number of consecutive labels to fill for inexact matches.
tolerance (optional) –
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations must
satisfy the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance
to all values, or list-like, which applies variable tolerance per
element. List-like includes list, tuple, array, Series, and must be
the same size as the index and its dtype must exactly match the
index’s type.
Returns:
Same type as caller, but with changed indices on each axis.
Return type:
Series or DataFrame
See also
DataFrame.set_index
Set row labels.
DataFrame.reset_index
Remove row labels or move them to new columns.
DataFrame.reindex
Change to new indices or expand indices.
Notes
Same as calling
.reindex(index=other.index, columns=other.columns, ...).
>>> df2
            temp_celsius windspeed
2014-02-12          28.0       low
2014-02-13          30.0       low
2014-02-15          35.1    medium
>>> df2.reindex_like(df1)
            temp_celsius  temp_fahrenheit windspeed
2014-02-12          28.0              NaN       low
2014-02-13          30.0              NaN       low
2014-02-14           NaN              NaN       NaN
2014-02-15          35.1              NaN    medium
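The equivalence stated in the Notes can be checked directly. The df1 used here is an assumption, since its definition is not shown above; it only needs the four dates and three columns visible in the output.

```python
import pandas as pd

# Assumed df1: four consecutive days and three columns (values illustrative).
df1 = pd.DataFrame([[24.3, 75.7, "high"],
                    [31.0, 87.8, "high"],
                    [22.0, 71.6, "medium"],
                    [35.0, 95.0, "medium"]],
                   columns=["temp_celsius", "temp_fahrenheit", "windspeed"],
                   index=pd.date_range("2014-02-12", "2014-02-15", freq="D"))

df2 = pd.DataFrame([[28.0, "low"], [30.0, "low"], [35.1, "medium"]],
                   columns=["temp_celsius", "windspeed"],
                   index=pd.DatetimeIndex(["2014-02-12", "2014-02-13",
                                           "2014-02-15"]))

result = df2.reindex_like(df1)
same = df2.reindex(index=df1.index, columns=df1.columns)

print(result.equals(same))
```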
Function / dict values must be unique (1-to-1). Labels not contained in
a dict / Series will be left as-is. Extra labels listed don’t throw an
error.
See the user guide for more.
Parameters:
mapper (dict-like or function) – Dict-like or function transformations to apply to
that axis’ values. Use either mapper and axis to
specify the axis to target with mapper, or index and
columns.
index (dict-like or function) – Alternative to specifying axis (mapper, axis=0
is equivalent to index=mapper).
columns (dict-like or function) – Alternative to specifying axis (mapper, axis=1
is equivalent to columns=mapper).
axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to target with mapper. Can be either the axis name
(‘index’, ‘columns’) or number (0, 1). The default is ‘index’.
copy (bool, default True) – Also copy underlying data.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
If True then value of copy is ignored.
level (int or level name, default None) – In case of a MultiIndex, only rename labels in the specified
level.
errors ({'ignore', 'raise'}, default 'ignore') – If ‘raise’, raise a KeyError when a dict-like mapper, index,
or columns contains labels that are not present in the Index
being transformed.
If ‘ignore’, existing keys will be renamed and extra keys will be
ignored.
Returns:
DataFrame with the renamed axis labels or None if inplace=True.
Return type:
DataFrame or None
Raises:
KeyError – If any of the labels is not found in the selected axis and
errors='raise'.
See also
DataFrame.rename_axis
Set the name of the axis.
Examples
DataFrame.rename supports two calling conventions
(index=index_mapper, columns=columns_mapper, ...)
(mapper, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your
intent.
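Both calling conventions, and the errors parameter, can be sketched on a small frame. The frame and the new labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Keyword convention: name the axis through the parameter.
by_columns = df.rename(columns={"A": "a", "B": "c"})
by_index = df.rename(index={0: "x", 1: "y", 2: "z"})

# Mapper plus axis convention, here with a function instead of a dict.
lowered = df.rename(str.lower, axis="columns")

# Unknown labels are ignored by default; errors="raise" turns them into a KeyError.
ignored = df.rename(columns={"A": "a", "Z": "z"})
try:
    df.rename(columns={"A": "a", "Z": "z"}, errors="raise")
    raised = False
except KeyError:
    raised = True

print(by_columns.columns.tolist(), by_index.index.tolist(), raised)
```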
Set the name of the axis for the index or columns.
Parameters:
mapper (scalar, list-like, optional) – Value to set the axis name attribute.
index (scalar, list-like, dict-like or function, optional) –
A scalar, list-like, dict-like or function transformation to
apply to that axis’ values.
Note that the columns parameter is not allowed if the
object is a Series. This parameter only applies to DataFrame
type objects.
Use either mapper and axis to
specify the axis to target with mapper, or index
and/or columns.
columns (scalar, list-like, dict-like or function, optional) –
A scalar, list-like, dict-like or functions transformations to
apply to that axis’ values.
Note that the columns parameter is not allowed if the
object is a Series. This parameter only apply for DataFrame
type objects.
Use either mapper and axis to
specify the axis to target with mapper, or index
and/or columns.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename. For Series this parameter is unused and defaults to 0.
copy (bool, default True) – Also copy underlying data.
inplace (bool, default False) – Modifies the object directly, instead of creating a new Series
or DataFrame.
Returns:
The same type as the caller or None if inplace=True.
Return type:
Series, DataFrame, or None
See also
Series.rename
Alter Series index labels or name.
DataFrame.rename
Alter DataFrame index labels or name.
Index.rename
Set new names on index.
Notes
DataFrame.rename_axis supports two calling conventions:
(index=index_mapper, columns=columns_mapper, ...)
(mapper, axis={'index', 'columns'}, ...)
The first calling convention will only modify the names of
the index and/or the names of the Index object that is the columns.
In this case, the parameter copy is ignored.
The second calling convention will modify the names of the
corresponding index if mapper is a list or a scalar.
However, if mapper is dict-like or a function, it will use the
deprecated behavior of modifying the axis labels.
We highly recommend using keyword arguments to clarify your
intent.
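A minimal sketch of the difference between the two conventions when only axis names are set. The frame and the names "animal" and "limbs" are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"num_legs": [4, 4, 2], "num_arms": [0, 0, 2]},
    index=["dog", "cat", "monkey"],
)

# Scalar mapper: sets the name of the index (axis=0 by default).
named_index = df.rename_axis("animal")

# Scalar mapper with axis='columns': sets the name of the columns Index.
named_columns = named_index.rename_axis("limbs", axis="columns")

# Keyword form: set both names in a single call.
both = df.rename_axis(index="animal", columns="limbs")
```

In all three calls the labels themselves are untouched; only the name attribute of the corresponding Index changes.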
numeric: numeric values equal to to_replace will be
replaced with value
str: string exactly matching to_replace will be replaced
with value
regex: regexes matching to_replace will be replaced with
value
list of str, regex, or numeric:
First, if to_replace and value are both lists, they
must be the same length.
Second, if regex=True then all of the strings in both
lists will be interpreted as regexes; otherwise they will match
directly. This doesn’t matter much for value since there
are only a few possible substitution regexes you can use.
str, regex and numeric rules apply as above.
dict:
Dicts can be used to specify different replacement values
for different existing values. For example,
{'a':'b','y':'z'} replaces the value ‘a’ with ‘b’ and
‘y’ with ‘z’. To use a dict in this way, the optional value
parameter should not be given.
For a DataFrame a dict can specify that different values
should be replaced in different columns. For example,
{'a':1,'b':'z'} looks for the value 1 in column ‘a’
and the value ‘z’ in column ‘b’ and replaces these values
with whatever is specified in value. The value parameter
should not be None in this case. You can treat this as a
special case of passing two lists except that you are
specifying the column to search in.
For a DataFrame nested dictionaries, e.g.,
{'a':{'b':np.nan}}, are read as follows: look in column
‘a’ for the value ‘b’ and replace it with NaN. The optional value
parameter should not be specified to use a nested dict in this
way. You can nest regular expressions as well. Note that
column names (the top-level dictionary keys in a nested
dictionary) cannot be regular expressions.
None:
This means that the regex argument must be a string,
compiled regular expression, or list, dict, ndarray or
Series of such elements. If value is also None then
this must be a nested dictionary or Series.
See the examples section for examples of each of these.
value (scalar, dict, list, str, regex, default None) – Value to replace any values matching to_replace with.
For a DataFrame a dict of values can be used to specify which
value to use for each column (columns not in the dict will not be
filled). Regular expressions, strings and lists or dicts of such
objects are also allowed.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
limit (int, default None) – Maximum size gap to forward or backward fill.
regex (bool or same types as to_replace, default False) – Whether to interpret to_replace and/or value as regular
expressions. If this is True then to_replace must be a
string. Alternatively, this could be a regular expression or a
list, dict, or array of regular expressions in which case
to_replace must be None.
method ({'pad', 'ffill', 'bfill'}) –
The method to use for replacement when to_replace is a
scalar, list or tuple and value is None.
Changed in version 0.23.0: Added to DataFrame.
Returns:
Object after replacement.
Return type:
DataFrame
Raises:
AssertionError –
If regex is not a bool and to_replace is not
None.
TypeError –
If to_replace is not a scalar, array-like, dict, or None.
If to_replace is a dict and value is not a list,
dict, ndarray, or Series.
If to_replace is None and regex is not compilable
into a regular expression or is a list, dict, ndarray, or
Series.
When replacing multiple bool or datetime64 objects and
the arguments to to_replace do not match the type of the
value being replaced.
ValueError –
If a list or an ndarray is passed to to_replace and
value but they are not the same length.
See also
DataFrame.fillna
Fill NA values.
DataFrame.where
Replace values based on boolean condition.
Series.str.replace
Simple string replacement.
Notes
Regex substitution is performed under the hood with re.sub. The
rules for substitution for re.sub are the same.
Regular expressions will only substitute on strings, meaning you
cannot provide, for example, a regular expression matching floating
point numbers and expect the columns in your frame that have a
numeric dtype to be matched. However, if those floating point
numbers are strings, then you can do this.
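The point about numeric dtypes can be checked with a small sketch. The two-column frame is an assumption made for illustration: one column holds floats, the other holds the same values as strings.

```python
import pandas as pd

df = pd.DataFrame({"num": [1.5, 2.5], "txt": ["1.5", "2.5"]})

# The pattern matches the text "1.5". Only the string column is searched,
# so the float column is left as it was.
out = df.replace(to_replace=r"^1\.5$", value="X", regex=True)
```

If the numeric column should be matched as well, one option is to convert it to strings first (for example with astype(str)) and convert back afterwards.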
This method has a lot of options. You are encouraged to experiment
and play with this method to gain intuition about how it works.
When dict is used as the to_replace value, it is like
key(s) in the dict are the to_replace part and
value(s) in the dict are the value parameter.
>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e
Regular expression `to_replace`
>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz
Compare the behavior of s.replace({'a':None}) and
s.replace('a',None) to understand the peculiarities
of the to_replace parameter:
>>> s=pd.Series([10,'a','a','b','a'])
When one uses a dict as the to_replace value, it is like the
value(s) in the dict are equal to the value parameter.
s.replace({'a':None}) is equivalent to
s.replace(to_replace={'a':None},value=None,method=None):
When value is not explicitly passed and to_replace is a scalar, list
or tuple, replace uses the method parameter (default ‘pad’) to do the
replacement. So this is why the ‘a’ values are being replaced by 10
in rows 1 and 2 and ‘b’ in row 4 in this case.
>>> s.replace('a')
0    10
1    10
2    10
3     b
4     b
dtype: object
On the other hand, if None is explicitly passed for value, it will
be respected:
Convenience method for frequency conversion and resampling of time series.
The object must have a datetime-like index (DatetimeIndex, PeriodIndex,
or TimedeltaIndex), or the caller must pass the label of a datetime-like
series/index to the on/level keyword parameter.
Parameters:
rule (DateOffset, Timedelta or str) – The offset string or object representing target conversion.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Which axis to use for up- or down-sampling. For Series this parameter
is unused and defaults to 0. Must be
DatetimeIndex, TimedeltaIndex or PeriodIndex.
closed ({'right', 'left'}, default None) – Which side of bin interval is closed. The default is ‘left’
for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’,
‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
label ({'right', 'left'}, default None) – Which bin edge label to label bucket with. The default is ‘left’
for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’,
‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
convention ({'start', 'end', 's', 'e'}, default 'start') – For PeriodIndex only, controls whether to use the start or
end of rule.
kind ({'timestamp', 'period'}, optional, default None) – Pass ‘timestamp’ to convert the resulting index to a
DatetimeIndex or ‘period’ to convert it to a PeriodIndex.
By default the input representation is retained.
loffset (timedelta, default None) –
Adjust the resampled time labels.
Deprecated since version 1.1.0: You should add the loffset to the df.index after the resample.
See below.
base (int, default 0) –
For frequencies that evenly subdivide 1 day, the “origin” of the
aggregated intervals. For example, for ‘5min’ frequency, base could
range from 0 through 4. Defaults to 0.
Deprecated since version 1.1.0: The new arguments that you should use are ‘offset’ or ‘origin’.
on (str, optional) – For a DataFrame, column to use instead of index for resampling.
Column must be datetime-like.
level (str or int, optional) – For a MultiIndex, level (name or number) to use for
resampling. level must be datetime-like.
origin (Timestamp or str, default 'start_day') –
The timestamp on which to adjust the grouping. The timezone of origin
must match the timezone of the index.
If string, must be one of the following:
’epoch’: origin is 1970-01-01
’start’: origin is the first value of the timeseries
’start_day’: origin is the first day at midnight of the timeseries
New in version 1.1.0.
’end’: origin is the last value of the timeseries
’end_day’: origin is the ceiling midnight of the last day
New in version 1.3.0.
offset (Timedelta or str, default is None) –
An offset timedelta added to the origin.
New in version 1.1.0.
group_keys (bool, optional) –
Whether to include the group keys in the result index when using
.apply() on the resampled object. Not specifying group_keys
will retain values-dependent behavior from pandas 1.4
and earlier (see pandas 1.5.0 Release notes
for examples). In a future version of pandas, the behavior will
default to the same as specifying group_keys=False.
Downsample the series into 3 minute bins as above, but label each
bin using the right edge instead of the left. Please note that the
value in the bucket used as the label is not included in the bucket
which it labels. For example, in the original series the
bucket 2000-01-01 00:03:00 contains the value 3, but the summed
value in the resampled bucket with the label 2000-01-01 00:03:00
does not include 3 (if it did, the summed value would be 6, not 3).
To include this value close the right side of the bin interval as
illustrated in the example below this one.
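The labelling behaviour described above can be reproduced with a short sketch. The nine-point series at one-minute spacing is an assumption chosen to match the values mentioned in the text.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Default for minute frequencies: bins closed on the left, labelled with
# the left edge.
left = series.resample("3min").sum()

# Same bins, but each one is labelled with its right edge. The value
# stored at the label's own timestamp belongs to the next bin.
right_label = series.resample("3min", label="right").sum()

# Closing the right side of each bin moves that value into the bin
# carrying its timestamp as the label.
right_closed = series.resample("3min", label="right", closed="right").sum()
```

Here right_label still sums 0 + 1 + 2 under the label 00:03:00, while right_closed sums 1 + 2 + 3 under the same label.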
In contrast with the start_day, you can use end_day to take the ceiling
midnight of the largest Timestamp as the end of the bins and drop the bins
not containing data:
Reset the index of the DataFrame, and use the default one instead.
If the DataFrame has a MultiIndex, this method can remove one or more
levels.
Parameters:
level (int, str, tuple, or list, default None) – Only remove the given levels from the index. Removes all levels by
default.
drop (bool, default False) – Do not try to insert index into dataframe columns. This resets
the index to the default integer index.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
col_level (int or str, default 0) – If the columns have multiple levels, determines which level the
labels are inserted into. By default it is inserted into the first
level.
col_fill (object, default '') – If the columns have multiple levels, determines how the other
levels are named. If None then the index name is repeated.
names (int, str or 1-dimensional list, default None) –
Using the given string, rename the DataFrame column which contains the
index data. If the DataFrame has a MultiIndex, this has to be a list or
tuple with length equal to the number of levels.
New in version 1.5.0.
Returns:
DataFrame with the new index or None if inplace=True.
Return type:
DataFrame or None
See also
DataFrame.set_index
Opposite of reset_index.
DataFrame.reindex
Change to new indices or expand indices.
DataFrame.reindex_like
Change to same indices as other DataFrame.
Examples
>>> df = pd.DataFrame([('bird', 389.0),
...                    ('bird', 24.0),
...                    ('mammal', 80.5),
...                    ('mammal', np.nan)],
...                   index=['falcon', 'parrot', 'lion', 'monkey'],
...                   columns=('class', 'max_speed'))
>>> df
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN
When we reset the index, the old index is added as a column, and a
new sequential index is used:
>>> df.reset_index()
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
We can use the drop parameter to avoid the old index being added as
a column:
>>> df.reset_index(drop=True)
    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN
You can also use reset_index with MultiIndex.
>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'),
...                                    ('bird', 'parrot'),
...                                    ('mammal', 'lion'),
...                                    ('mammal', 'monkey')],
...                                   names=['class', 'name'])
>>> columns = pd.MultiIndex.from_tuples([('speed', 'max'),
...                                      ('species', 'type')])
>>> df = pd.DataFrame([(389.0, 'fly'),
...                    (24.0, 'fly'),
...                    (80.5, 'run'),
...                    (np.nan, 'jump')],
...                   index=index,
...                   columns=columns)
>>> df
               speed species
                 max    type
class  name
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump
Using the names parameter, choose a name for the index column:
>>> df.reset_index(names=['classes', 'names'])
  classes   names  speed species
                     max    type
0    bird  falcon  389.0     fly
1    bird  parrot   24.0     fly
2  mammal    lion   80.5     run
3  mammal  monkey    NaN    jump
If the index has multiple levels, we can reset a subset of them:
>>> df.reset_index(level='class')
         class  speed species
                  max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump
If we are not dropping the index, by default, it is placed in the top
level. We can place it in another level:
>>> df.reset_index(level='class', col_level=1)
                speed species
         class    max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump
When the index is inserted under another level, we can specify under
which one with the parameter col_fill:
>>> df.reset_index(level='class', col_level=1, col_fill='species')
              species  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump
If we specify a nonexistent level for col_fill, it is created:
>>> df.reset_index(level='class', col_level=1, col_fill='genus')
                genus  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump
Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).
Equivalent to other//dataframe, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, floordiv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
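As a rough illustration of how fill_value interacts with alignment in a reverse operator, the sketch below uses two small frames invented for the purpose. Row 1 is missing on one side only, and row 2 is missing on both sides after alignment.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [2.0, np.nan]}, index=[0, 1])
other = pd.DataFrame({"a": [10.0, 9.0, np.nan]}, index=[0, 1, 2])

# Reverse floor division computes other // df after aligning both frames.
plain = df.rfloordiv(other)

# fill_value=1 stands in for a value that is missing on one side only.
# Row 1 becomes 9.0 // 1; row 2 is missing on both sides and stays NaN.
filled = df.rfloordiv(other, fill_value=1)
```

The same pattern applies to the other reverse wrappers (rmod, rmul, rpow, rsub, rtruediv); only the operator changes.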
Get Modulo of dataframe and other, element-wise (binary operator rmod).
Equivalent to other%dataframe, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, mod.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Get Multiplication of dataframe and other, element-wise (binary operator rmul).
Equivalent to other*dataframe, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, mul.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
If an integer, the fixed number of observations used for
each window.
If an offset, the time period of each window. Each
window will be variable-sized, based on the observations included in
the time period. This is only valid for datetimelike indexes.
To learn more about offsets and frequency strings, see the offset aliases section of the pandas time series user guide.
If a BaseIndexer subclass, the window boundaries
based on the defined get_window_bounds method. Additional rolling
keyword arguments, namely min_periods, center, closed and
step will be passed to get_window_bounds.
min_periods (int, default None) –
Minimum number of observations in window required to have a value;
otherwise, result is np.nan.
For a window that is specified by an offset, min_periods will default to 1.
For a window that is specified by an integer, min_periods will default
to the size of the window.
center (bool, default False) –
If False, set the window labels as the right edge of the window index.
If True, set the window labels as the center of the window index.
Certain Scipy window types require additional parameters to be passed
in the aggregation function. The additional parameters must match
the keywords specified in the Scipy window type method signature.
on (str, optional) –
For a DataFrame, a column label or Index level on which
to calculate the rolling window, rather than the DataFrame’s index.
Provided integer column is ignored and excluded from result since
an integer index is not used to calculate the rolling window.
axis (int or str, default 0) –
If 0 or 'index', roll across the rows.
If 1 or 'columns', roll across the columns.
For Series this parameter is unused and defaults to 0.
closed (str, default None) –
If 'right', the first point in the window is excluded from calculations.
If 'left', the last point in the window is excluded from calculations.
If 'both', no points in the window are excluded from calculations.
If 'neither', the first and last points in the window are excluded
from calculations.
Default None ('right').
Changed in version 1.2.0: The closed parameter with fixed windows is now supported.
step (int, default None) –
New in version 1.5.0.
Evaluate the window at every step result, equivalent to slicing as
[::step]. window must be an integer. Using a step argument other
than None or 1 will produce a result with a different shape than the input.
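A small sketch of how min_periods and center change the result for a fixed-size integer window. The five-element series is an illustrative assumption.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

# Integer window: min_periods defaults to the window size, so the first
# position has too few observations and yields NaN.
strict = s.rolling(2).sum()

# Allow a result as soon as a single observation is available.
relaxed = s.rolling(2, min_periods=1).sum()

# Label each window at its centre instead of its right edge, which
# leaves NaN at both ends for a window of 3.
centered = s.rolling(3, center=True).sum()
```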
Round a DataFrame to a variable number of decimal places.
Parameters:
decimals (int, dict, Series) – Number of decimal places to round each column to. If an int is
given, round each column to the same number of places.
Otherwise dict and Series round to variable numbers of places.
Column names should be in the keys if decimals is a
dict-like, or in the index if decimals is a Series. Any
columns not included in decimals will be left as is. Elements
of decimals which are not columns of the input will be
ignored.
*args – Additional keywords have no effect but might be accepted for
compatibility with numpy.
**kwargs – Additional keywords have no effect but might be accepted for
compatibility with numpy.
Returns:
A DataFrame with the affected columns rounded to the specified
number of decimal places.
Return type:
DataFrame
See also
numpy.around
Round a numpy array to the given number of decimals.
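A brief sketch of per-column rounding with a dict. The column names, including the extra "birds" key that matches no column, are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "dogs": [0.21, 0.01, 0.66],
        "cats": [0.32, 0.67, 0.03],
        "fish": [1.234, 5.678, 9.012],
    }
)

# Different precision per column. "fish" is not listed and is left as is;
# "birds" is not a column of df and is ignored.
per_column = df.round({"dogs": 1, "cats": 0, "birds": 2})

# A single int rounds every column to the same number of places.
uniform = df.round(1)
```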
Get Exponential power of dataframe and other, element-wise (binary operator rpow).
Equivalent to other**dataframe, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, pow.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Get Subtraction of dataframe and other, element-wise (binary operator rsub).
Equivalent to other-dataframe, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, sub.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
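The role of axis when other is a Series can be sketched with reverse subtraction. The frame and the offsets Series are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"angles": [0, 3, 4], "degrees": [360, 180, 360]},
    index=["circle", "triangle", "rectangle"],
)

# Scalar: the same result as writing 10 - df.
from_scalar = df.rsub(10)

# Series matched against the row labels (axis='index'): every value in a
# row is subtracted from that row's entry in the Series.
offsets = pd.Series([1, 2, 3], index=["circle", "triangle", "rectangle"])
by_row = df.rsub(offsets, axis="index")
```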
Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other/dataframe, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, truediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Return a random sample of items from an axis of object.
You can use random_state for reproducibility.
Parameters:
n (int, optional) – Number of items from axis to return. Cannot be used with frac.
Default = 1 if frac = None.
frac (float, optional) – Fraction of axis items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the same row more than once.
weights (str or ndarray-like, optional) – Default ‘None’ results in equal probability weighting.
If passed a Series, will align with target object on index. Index
values in weights not found in sampled object will be ignored and
index values in sampled object not in weights will be assigned
weights of zero.
If called on a DataFrame, will accept the name of a column
when axis = 0.
Unless weights are a Series, weights must be same length as axis
being sampled.
If weights do not sum to 1, they will be normalized to sum to 1.
Missing values in the weights column will be treated as zero.
Infinite values not allowed.
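random_state (int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional) –
If int, array-like, or BitGenerator, seed for random number generator.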
If np.random.RandomState or np.random.Generator, use as given.
Changed in version 1.1.0: array-like and BitGenerator object now passed to np.random.RandomState()
as seed
Changed in version 1.4.0: np.random.Generator objects now accepted
axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – Axis to sample. Accepts axis number or name. Default is stat axis
for given data type. For Series this parameter is unused and defaults to None.
ignore_index (bool, default False) –
If True, the resulting index will be labeled 0, 1, …, n - 1.
New in version 1.3.0.
Returns:
A new object of same type as caller containing n items randomly
sampled from the caller object.
Return type:
Series or DataFrame
See also
DataFrameGroupBy.sample
Generates random samples from each group of a DataFrame object.
SeriesGroupBy.sample
Generates random samples from each group of a Series object.
numpy.random.choice
Generates a random sample from a given 1-D numpy array.
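A minimal sketch of reproducible and weighted sampling. The frame is an illustrative assumption, and ignore_index requires pandas 1.3.0 or later as noted above.

```python
import pandas as pd

df = pd.DataFrame(
    {"num_legs": [2, 4, 8, 0], "num_specimen_seen": [10, 2, 1, 0]},
    index=["falcon", "dog", "spider", "fish"],
)

# The same random_state gives the same draw on every call.
first = df.sample(n=2, random_state=1)
second = df.sample(n=2, random_state=1)

# frac selects a fraction of the rows instead of a fixed count.
half = df.sample(frac=0.5, random_state=1)

# A column name can serve as weights when sampling rows. The weights are
# normalised to sum to 1, and a row with weight zero cannot be drawn.
weighted = df.sample(n=3, weights="num_specimen_seen", random_state=1)

# ignore_index=True relabels the result 0, 1, ..., n - 1.
renumbered = df.sample(n=2, random_state=1, ignore_index=True)
```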
Indexes for column or row labels can be changed by assigning
a list-like or Index.
Parameters:
labels (list-like, Index) – The values for the new index.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to update. The value 0 identifies the rows. For Series
this parameter is unused and defaults to 0.
inplace (bool, default False) –
Whether to return a new DataFrame instance.
Deprecated since version 1.5.0.
copy (bool, default True) –
Whether to make a copy of the underlying data.
New in version 1.5.0.
Returns:
renamed – An object of type DataFrame or None if inplace=True.
Return type:
DataFrame or None
See also
DataFrame.rename_axis
Alter the name of the index or columns.
Examples
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
Change the row labels.
>>> df.set_axis(['a', 'b', 'c'], axis='index')
   A  B
a  1  4
b  2  5
c  3  6
Change the column labels.
>>> df.set_axis(['I', 'II'], axis='columns')
   I  II
0  1   4
1  2   5
2  3   6
Now, update the labels without copying the underlying data.
>>> df.set_axis(['i', 'ii'], axis='columns', copy=False)
   i  ii
0  1   4
1  2   5
2  3   6
allows_duplicate_labels (bool, optional) – Whether the returned object allows duplicate labels.
copy (bool) –
Returns:
The same type as the caller.
Return type:
Series or DataFrame
See also
DataFrame.attrs
Global metadata applying to this dataset.
DataFrame.flags
Global flags applying to this object.
Notes
This method returns a new object that’s a view on the same data
as the input. Mutating the input or the output values will be reflected
in the other.
This method is intended to be used in method chains.
“Flags” differ from “metadata”. Flags reflect properties of the
pandas object (the Series or DataFrame). Metadata refer to properties
of the dataset, and should be stored in DataFrame.attrs.
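A short sketch of setting a flag on a new object while leaving the caller's flags alone. The one-column frame is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2]})

# set_flags returns a new object; the flags of df itself are not changed.
strict = df.set_flags(allows_duplicate_labels=False)

original_flag = df.flags.allows_duplicate_labels
strict_flag = strict.flags.allows_duplicate_labels
```

Because the call returns the new object, it can sit in the middle of a method chain.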
Set the DataFrame index (row labels) using one or more existing
columns or arrays (of the correct length). The index can replace the
existing index or expand on it.
Parameters:
keys (label or array-like or list of labels/arrays) – This parameter can be either a single column key, a
single array of the same length as the calling DataFrame, or a list containing an arbitrary combination
of column keys and arrays. Here, “array” encompasses Series, Index, np.ndarray,
and instances of Iterator.
drop (bool, default True) – Delete columns to be used as the new index.
append (bool, default False) – Whether to append columns to existing index.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
verify_integrity (bool, default False) – Check the new index for duplicates. Otherwise, defer the check until
necessary. Setting to False will improve the performance of this method.
Returns:
Changed row labels or None if inplace=True.
Return type:
DataFrame or None
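The effect of drop and append can be sketched as follows. The sales frame is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "month": [1, 4, 7, 10],
        "year": [2012, 2014, 2013, 2014],
        "sale": [55, 40, 84, 31],
    }
)

# A single column becomes the index and is removed from the columns.
by_month = df.set_index("month")

# drop=False keeps the column in place as well.
kept = df.set_index("month", drop=False)

# A list of columns builds a MultiIndex.
multi = df.set_index(["year", "month"])

# append=True adds the new level to the existing index.
appended = df.set_index("month", append=True)
```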
Shift index by desired number of periods with an optional time freq.
When freq is not passed, shift the index without realigning the data.
If freq is passed (in this case, the index must be date or datetime,
or it will raise a NotImplementedError), the index will be
increased using the periods and the freq. freq can be inferred
when specified as “infer” as long as either freq or inferred_freq
attribute is set in the index.
Parameters:
periods (int) – Number of periods to shift. Can be positive or negative.
freq (DateOffset, tseries.offsets, timedelta, or str, optional) – Offset to use from the tseries module or time rule (e.g. ‘EOM’).
If freq is specified then the index values are shifted but the
data is not realigned. That is, use freq if you would like to
extend the index when shifting and preserve the original data.
If freq is specified as “infer” then it will be inferred from
the freq or inferred_freq attributes of the index. If neither of
those attributes exist, a ValueError is thrown.
axis ({0 or 'index', 1 or 'columns', None}, default None) – Shift direction. For Series this parameter is unused and defaults to 0.
fill_value (object, optional) –
The scalar value to use for newly introduced missing values.
The default depends on the dtype of self.
For numeric data, np.nan is used.
For datetime, timedelta, or period data, NaT is used.
For extension dtypes, self.dtype.na_value is used.
>>> df.shift(periods=3)
            Col1  Col2  Col3
2020-01-01   NaN   NaN   NaN
2020-01-02   NaN   NaN   NaN
2020-01-03   NaN   NaN   NaN
2020-01-04  10.0  13.0  17.0
2020-01-05  20.0  23.0  27.0
>>> df.shift(periods=1, axis="columns")
            Col1  Col2  Col3
2020-01-01   NaN    10    13
2020-01-02   NaN    20    23
2020-01-03   NaN    15    18
2020-01-04   NaN    30    33
2020-01-05   NaN    45    48
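To contrast shifting the data with shifting the index, here is a sketch that rebuilds a frame consistent with the output shown above. The construction of df is an assumption inferred from those values.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Col1": [10, 20, 15, 30, 45],
        "Col2": [13, 23, 18, 33, 48],
        "Col3": [17, 27, 22, 37, 52],
    },
    index=pd.date_range("2020-01-01", "2020-01-05"),
)

# Without freq: values move down three rows and the gap is filled.
moved_data = df.shift(periods=3)

# With freq: the index moves three days forward and the values stay put.
moved_index = df.shift(periods=3, freq="D")

# fill_value replaces the default missing-value marker in the gap.
zero_filled = df.shift(periods=3, fill_value=0)
```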
Returns a new DataFrame sorted by label if inplace argument is
False, otherwise updates the original DataFrame and returns None.
Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis along which to sort. The value 0 identifies the rows,
and 1 identifies the columns.
level (int or level name or list of ints or list of level names) – If not None, sort on values in specified index level(s).
ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the
sort direction can be controlled for each level individually.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more
information. mergesort and stable are the only stable algorithms. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end.
Not implemented for MultiIndex.
sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other
levels too (in order) after sorting by specified level.
ignore_index (bool, default False) –
If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
key (callable, optional) –
If not None, apply the key function to the index values
before sorting. This is similar to the key argument in the
builtin sorted() function, with the notable difference that
this key function should be vectorized. It should expect an
Index and return an Index of the same shape. For MultiIndex
inputs, the key is applied per level.
New in version 1.1.0.
Returns:
The original DataFrame sorted by the labels or None if inplace=True.
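As a rough sketch of the key and ignore_index parameters described above (the frame and the variable names are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=["A", "b", "C", "d"])

# Default ordering compares the raw labels, so uppercase sorts first.
default_order = df.sort_index()

# key receives the whole Index and must return an Index of the same shape.
folded = df.sort_index(key=lambda idx: idx.str.lower())

# ignore_index relabels the sorted axis 0, 1, ..., n - 1.
renumbered = df.sort_index(ascending=False, ignore_index=True)

print(list(default_order.index))
print(list(folded.index))
print(renumbered)
```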
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more
information. mergesort and stable are the only stable algorithms. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the
end.
ignore_index (bool, default False) –
If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
key (callable, optional) –
Apply the key function to the values
before sorting. This is similar to the key argument in the
builtin sorted() function, with the notable difference that
this key function should be vectorized. It should expect a
Series and return a Series with the same shape as the input.
It will be applied to each column in by independently.
New in version 1.1.0.
Returns:
DataFrame with sorted values or None if inplace=True.
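The following sketch illustrates the key and na_position parameters when sorting by values. The column names and data are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["b", "A", "C"], "score": [2.0, np.nan, 1.0]})

# key is applied to the whole column (a Series) before sorting.
by_name = df.sort_values("name", key=lambda s: s.str.lower())

# na_position="first" moves missing values to the top of the result.
nan_first = df.sort_values("score", na_position="first", ignore_index=True)

print(by_name)
print(nan_first)
```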
Series or DataFrames with a single element are squeezed to a scalar.
DataFrames with a single column or a single row are squeezed to a
Series. Otherwise the object is unchanged.
This method is most useful when you don’t know if your
object is a Series or DataFrame, but you do know it has just a single
column. In that case you can safely call squeeze to ensure you have a
Series.
Parameters:
axis ({0 or 'index', 1 or 'columns', None}, default None) – A specific axis to squeeze. By default, all length-1 axes are
squeezed. For Series this parameter is unused and defaults to None.
Returns:
The projection after squeezing axis or all the axes.
Return type:
DataFrame, Series, or scalar
See also
Series.iloc
Integer-location based indexing for selecting scalars.
DataFrame.iloc
Integer-location based indexing for selecting Series.
Series.to_frame
Inverse of DataFrame.squeeze for a single-column DataFrame.
Examples
>>> primes = pd.Series([2, 3, 5, 7])
Slicing might produce a Series with a single value:
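A minimal sketch of the squeezing rules described above. The names even_primes, one_col and two_cols are assumptions introduced for illustration.

```python
import pandas as pd

primes = pd.Series([2, 3, 5, 7])

# Boolean selection leaves a Series holding a single value ...
even_primes = primes[primes % 2 == 0]
# ... which squeeze reduces to a scalar.
scalar = even_primes.squeeze()

# A single-column DataFrame squeezes to a Series along the column axis.
one_col = pd.DataFrame({"a": [1, 2]})
as_series = one_col.squeeze("columns")

# With no length-1 axis, the object is returned unchanged.
two_cols = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
unchanged = two_cols.squeeze()

print(scalar, type(as_series).__name__, type(unchanged).__name__)
```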
Stack the prescribed level(s) from columns to index.
Return a reshaped DataFrame or Series having a multi-level
index with one or more new inner-most levels compared to the current
DataFrame. The new inner-most levels are created by pivoting the
columns of the current dataframe:
if the columns have a single level, the output is a Series;
if the columns have multiple levels, the new index
level(s) is (are) taken from the prescribed level(s) and
the output is a DataFrame.
Parameters:
level (int, str, list, default -1) – Level(s) to stack from the column axis onto the index
axis, defined as one index or label, or a list of indices
or labels.
dropna (bool, default True) – Whether to drop rows in the resulting Frame/Series with
missing values. Stacking a column level onto the index
axis can create combinations of index and column values
that are missing from the original dataframe. See Examples
section.
Returns:
Stacked dataframe or series.
Return type:
DataFrame or Series
See also
DataFrame.unstack
Unstack prescribed level(s) from index axis onto column axis.
DataFrame.pivot
Reshape dataframe from long format to wide format.
DataFrame.pivot_table
Create a spreadsheet-style pivot table as a DataFrame.
Notes
The function is named by analogy with a collection of books
being reorganized from being side by side on a horizontal
position (the columns of the dataframe) to being stacked
vertically on top of each other (in the index of the
dataframe).
It is common to have missing values when stacking a dataframe
with multi-level columns, as the stacked dataframe typically
has more values than the original dataframe. Missing values
are filled with NaNs:
>>> df_multi_level_cols2
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> df_multi_level_cols2.stack()
        height  weight
cat kg     NaN     1.0
    m      2.0     NaN
dog kg     NaN     3.0
    m      4.0     NaN
Prescribing the level(s) to be stacked
The first parameter controls which level or levels are stacked:
>>> df_multi_level_cols2.stack(0)
             kg    m
cat height  NaN  2.0
    weight  1.0  NaN
dog height  NaN  4.0
    weight  3.0  NaN
>>> df_multi_level_cols2.stack([0, 1])
cat  height  m     2.0
     weight  kg    1.0
dog  height  m     4.0
     weight  kg    3.0
dtype: float64
Note that rows where all values are missing are dropped by
default but this behaviour can be controlled via the dropna
keyword parameter:
>>> df_multi_level_cols3
    weight height
        kg      m
cat    NaN    1.0
dog    2.0    3.0
>>> df_multi_level_cols3.stack(dropna=False)
        height  weight
cat kg     NaN     NaN
    m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN
>>> df_multi_level_cols3.stack(dropna=True)
        height  weight
cat m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN
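For comparison with the multi-level cases above, the single-level case mentioned in the description (the output is a Series) might look like this. The frame df_single is a hypothetical example, not taken from the text.

```python
import pandas as pd

# Hypothetical frame with a single level of column labels.
df_single = pd.DataFrame(
    [[0, 1], [2, 3]],
    index=["cat", "dog"],
    columns=["weight", "height"],
)

# The column labels become the innermost level of the row index.
stacked = df_single.stack()

print(stacked)
print(stacked.index.nlevels)
```

Depending on the installed pandas version, this call may emit a warning about upcoming changes to the stack implementation; the result for this simple frame should be the same either way.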
Get Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to dataframe-other, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, rsub.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
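The effect of fill_value can be sketched with two small frames. The frames a and b are illustrative assumptions chosen so that each case described above appears once.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 4.0]})
b = pd.DataFrame({"x": [0.5, 1.0], "y": [np.nan, 1.0]})

# The plain operator propagates NaN from either side.
plain = a - b

# fill_value substitutes 0 where exactly one side is missing;
# positions missing on both sides stay missing.
filled = a.sub(b, fill_value=0)

# The reverse version computes other - dataframe.
reversed_result = a.rsub(10)

print(plain)
print(filled)
print(reversed_result)
```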
Return the sum of the values over the requested axis.
This is equivalent to the method numpy.sum.
Parameters:
axis ({index (0), columns (1)}) – Axis for the function to be applied on.
For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
level (int or level name, default None) –
If the axis is a MultiIndex (hierarchical), count along a
particular level, collapsing into a Series.
Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.
numeric_only (bool, default None) –
Include only float, int, boolean columns. If None, will attempt to use
everything, then use only numeric data. Not implemented for Series.
Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be
False in a future version of pandas.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than
min_count non-NA values are present the result will be NA.
**kwargs – Additional keyword arguments to be passed to the function.
Return type:
Series or DataFrame (if level specified)
See also
Series.sum
Return the sum.
Series.min
Return the minimum.
Series.max
Return the maximum.
Series.idxmin
Return the index of the minimum.
Series.idxmax
Return the index of the maximum.
DataFrame.sum
Return the sum over the requested axis.
DataFrame.min
Return the minimum over the requested axis.
DataFrame.max
Return the maximum over the requested axis.
DataFrame.idxmin
Return the index of the minimum over the requested axis.
DataFrame.idxmax
Return the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.sum()
14
By default, the sum of an empty or all-NA Series is 0.
>>> pd.Series([], dtype="float64").sum()  # min_count=0 is the default
0.0
This can be controlled with the min_count parameter. For example, if
you’d like the sum of an empty series to be NaN, pass min_count=1.
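A short sketch of min_count and skipna, together with the groupby replacement suggested for the deprecated level keyword. The Series s repeats the example data above; the other names are assumptions.

```python
import numpy as np
import pandas as pd

empty = pd.Series([], dtype="float64")
default_sum = empty.sum()             # min_count=0 is the default
strict_sum = empty.sum(min_count=1)   # fewer than 1 valid value gives NaN

with_gap = pd.Series([1.0, np.nan, 2.0])
skipped = with_gap.sum()              # NaN is excluded by default
propagated = with_gap.sum(skipna=False)

# groupby on an index level, in place of the deprecated level keyword.
idx = pd.MultiIndex.from_arrays(
    [["warm", "warm", "cold", "cold"], ["dog", "falcon", "fish", "spider"]],
    names=["blooded", "animal"],
)
s = pd.Series([4, 2, 0, 8], name="legs", index=idx)
per_group = s.groupby(level="blooded").sum()

print(default_sum, strict_sum, skipped, propagated)
print(per_group)
```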
Default is to swap the two innermost levels of the index.
Parameters:
i (int or str) – Levels of the indices to be swapped. Can pass level name as string.
j (int or str) – Levels of the indices to be swapped. Can pass level name as string.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to swap levels on. 0 or ‘index’ for row-wise, 1 or
‘columns’ for column-wise.
Returns:
DataFrame with levels swapped in MultiIndex.
Return type:
DataFrame
Examples
>>> df = pd.DataFrame(
...     {"Grade": ["A", "B", "A", "C"]},
...     index=[
...         ["Final exam", "Final exam", "Coursework", "Coursework"],
...         ["History", "Geography", "History", "Geography"],
...         ["January", "February", "March", "April"],
...     ],
... )
>>> df
                                    Grade
Final exam  History     January      A
            Geography   February     B
Coursework  History     March        A
            Geography   April        C
In the following examples, we swap the levels of the row index (axis=0, the
default). The levels of a column MultiIndex can be swapped in the same way by
passing axis=1.
By not supplying any arguments for i and j, we swap the last and second to
last levels.
>>> df.swaplevel()
                                    Grade
Final exam  January     History         A
            February    Geography       B
Coursework  March       History         A
            April       Geography       C
By supplying one argument, we can choose which index to swap the last
index with. We can for example swap the first index with the last one as
follows.
>>> df.swaplevel(0)
                                    Grade
January     History     Final exam      A
February    Geography   Final exam      B
March       History     Coursework      A
April       Geography   Coursework      C
We can also define explicitly which indices we want to swap by supplying values
for both i and j. Here, we for example swap the first and second indices.
>>> df.swaplevel(0, 1)
                                    Grade
History     Final exam  January         A
Geography   Final exam  February        B
History     Coursework  March           A
Geography   Coursework  April           C
This function returns the last n rows from the object based on
position. It is useful for quickly verifying data, for example,
after sorting or appending rows.
For negative values of n, this function returns all rows except
the first |n| rows, equivalent to df[|n|:].
If n is larger than the number of rows, this function returns all rows.
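The three cases above (positive n, negative n, and n larger than the number of rows) can be sketched with a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"n": range(5)})

last_two = df.tail(2)          # the final two rows
skip_first_two = df.tail(-2)   # everything except the first two rows
everything = df.tail(10)       # n exceeds the row count: all rows

print(last_two)
print(skip_first_two)
print(len(everything))
```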
Return the elements in the given positional indices along an axis.
This means that we are not indexing according to actual values in
the index attribute of the object. We are indexing according to the
actual position of the element in the object.
Parameters:
indices (array-like) – An array of ints indicating which positions to take.
axis ({0 or 'index', 1 or 'columns', None}, default 0) – The axis on which to select elements. 0 means that we are
selecting rows, 1 means that we are selecting columns.
For Series this parameter is unused and defaults to 0.
is_copy (bool) –
Before pandas 1.0, is_copy=False can be specified to ensure
that the return value is an actual copy. Starting with pandas 1.0,
take always returns a copy, and the keyword is therefore
deprecated.
Deprecated since version 1.0.0.
**kwargs – For compatibility with numpy.take(). Has no effect on the
output.
Returns:
taken – An array-like containing the elements taken from the object.
Return type:
same type as caller
See also
DataFrame.loc
Select a subset of a DataFrame by labels.
DataFrame.iloc
Select a subset of a DataFrame by positions.
numpy.take
Take elements from an array along an axis.
Examples
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=['name', 'class', 'max_speed'],
...                   index=[0, 2, 3, 1])
>>> df
     name   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN
Take elements at positions 0 and 3 along the axis 0 (default).
Note how the actual indices selected (0 and 1) do not correspond to
our selected indices 0 and 3. That’s because we are selecting the 0th
and 3rd rows, not rows whose indices equal 0 and 3.
>>> df.take([0, 3])
     name   class  max_speed
0  falcon    bird      389.0
1  monkey  mammal        NaN
Take elements at indices 1 and 2 along the axis 1 (column selection).
>>> df.take([1, 2], axis=1)
    class  max_speed
0    bird      389.0
2    bird       24.0
3  mammal       80.5
1  mammal        NaN
We may also take elements using negative integers, which count positions
from the end of the object, just like with Python lists.
>>> df.take([-1, -2])
     name   class  max_speed
1  monkey  mammal        NaN
3    lion  mammal       80.5
String, path object (implementing os.PathLike[str]), or file-like
object implementing a write() function. If None, the result is
returned as a string. If a non-binary file object is passed, it should
be opened with newline=’’, disabling universal newlines. If a binary
file object is passed, mode might need to contain a ‘b’.
Changed in version 1.2.0: Support for binary file objects was introduced.
sep (str, default ',') – String of length 1. Field delimiter for the output file.
na_rep (str, default '') – Missing data representation.
float_format (str, Callable, default None) – Format string for floating point numbers. If a Callable is given, it takes
precedence over other numeric formatting parameters, like decimal.
columns (sequence, optional) – Columns to write.
header (bool or list of str, default True) – Write out the column names. If a list of strings is given it is
assumed to be aliases for the column names.
index (bool, default True) – Write row names (index).
index_label (str or sequence, or False, default None) – Column label for index column(s) if desired. If None is given, and
header and index are True, then the index names are used. A
sequence should be given if the object uses MultiIndex. If
False do not print fields for index names. Use index_label=False
for easier importing in R.
mode (str, default 'w') – Python write mode. The available write modes are the same as
open().
encoding (str, optional) – A string representing the encoding to use in the output file,
defaults to ‘utf-8’. encoding is not supported if path_or_buf
is a non-binary file object.
compression (str or dict, default 'infer') –
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other
key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
New in version 1.5.0: Added support for .tar files.
Changed in version 1.0.0: May now be a dict with key ‘method’ as compression mode
and other entries as additional compression options if
compression mode is ‘zip’.
Changed in version 1.1.0: Passing compression options as keys in dict is
supported for compression modes ‘gzip’, ‘bz2’, ‘zstd’, and ‘zip’.
Changed in version 1.2.0: Compression is supported for binary file objects.
Changed in version 1.2.0: Previous versions forwarded dict entries for ‘gzip’ to
gzip.open instead of gzip.GzipFile which prevented
setting mtime.
quoting (optional constant from csv module) – Defaults to csv.QUOTE_MINIMAL. If you have set a float_format
then floats are converted to strings and thus csv.QUOTE_NONNUMERIC
will treat them as non-numeric.
quotechar (str, default '"') – String of length 1. Character used to quote fields.
lineterminator (str, optional) –
The newline character or character sequence to use in the output
file. Defaults to os.linesep, which depends on the OS in which
this method is called (e.g. ‘\n’ for Linux, ‘\r\n’ for Windows).
Changed in version 1.5.0: Previously was line_terminator, changed for consistency with
read_csv and the standard library ‘csv’ module.
chunksize (int or None) – Rows to write at a time.
date_format (str, default None) – Format string for datetime objects.
doublequote (bool, default True) – Control quoting of quotechar inside a field.
escapechar (str, default None) – String of length 1. Character used to escape sep and quotechar
when appropriate.
decimal (str, default '.') – Character recognized as decimal separator. E.g. use ‘,’ for
European data.
errors (str, default 'strict') –
Specifies how encoding and decoding errors are to be handled.
See the errors argument for open() for a full list
of options.
New in version 1.1.0.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details; for more examples on storage options, refer to the pandas I/O user guide.
New in version 1.2.0.
Returns:
If path_or_buf is None, returns the resulting csv format as a
string. Otherwise returns None.
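A minimal sketch of two of the behaviours described above: returning the CSV text as a string when no path is given, and passing compression options as a dict. The frame, the file name birds.csv.gz and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["falcon", "parrot"], "max_speed": [389.0, None]})

# With no path_or_buf, the CSV text is returned as a string.
csv_text = df.to_csv(index=False, na_rep="NA", float_format="%.1f")
csv_lines = csv_text.splitlines()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "birds.csv.gz")
    # Dict form: 'method' selects gzip, the other keys go to gzip.GzipFile.
    df.to_csv(
        path,
        index=False,
        compression={"method": "gzip", "compresslevel": 1, "mtime": 1},
    )
    # Writing to a path returns None; read the file back to check it.
    round_trip = pd.read_csv(path)

print(csv_lines)
print(round_trip)
```

Splitting the returned string into lines keeps the check independent of the platform's line terminator.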
To write a single object to an Excel .xlsx file it is only necessary to
specify a target file name. To write to multiple sheets it is necessary to
create an ExcelWriter object with a target file name, and specify a sheet
in the file to write to.
Multiple sheets may be written to by specifying unique sheet_name.
With all data written to the file it is necessary to save the changes.
Note that creating an ExcelWriter object with a file name that already
exists will result in the contents of the existing file being erased.
Parameters:
excel_writer (path-like, file-like, or ExcelWriter object) – File path or existing ExcelWriter.
sheet_name (str, default 'Sheet1') – Name of sheet which will contain DataFrame.
na_rep (str, default '') – Missing data representation.
float_format (str, optional) – Format string for floating point numbers. For example
float_format="%.2f" will format 0.1234 to 0.12.
columns (sequence or list of str, optional) – Columns to write.
header (bool or list of str, default True) – Write out the column names. If a list of strings is given it is
assumed to be aliases for the column names.
index (bool, default True) – Write row names (index).
index_label (str or sequence, optional) – Column label for index column(s) if desired. If not specified, and
header and index are True, then the index names are used. A
sequence should be given if the DataFrame uses MultiIndex.
startrow (int, default 0) – Upper left cell row to dump data frame.
startcol (int, default 0) – Upper left cell column to dump data frame.
engine (str, optional) –
Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also set this
via the options io.excel.xlsx.writer, io.excel.xls.writer, and
io.excel.xlsm.writer.
Deprecated since version 1.2.0: As the xlwt package is no longer
maintained, the xlwt engine will be removed in a future version
of pandas.
merge_cells (bool, default True) – Write MultiIndex and Hierarchical Rows as merged cells.
encoding (str, optional) –
Encoding of the resulting excel file. Only necessary for xlwt,
other writers support unicode natively.
Deprecated since version 1.5.0: This keyword was not used.
inf_rep (str, default 'inf') – Representation for infinity (there is no native representation for
infinity in Excel).
verbose (bool, default True) –
Display more information in the error logs.
Deprecated since version 1.5.0: This keyword was not used.
freeze_panes (tuple of int (length 2), optional) – Specifies the one-based bottommost row and rightmost column that
is to be frozen.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details; for more examples on storage options, refer to the pandas I/O user guide.
To set the library that is used to write the Excel file,
you can pass the engine keyword (the default engine is
automatically chosen depending on the file extension):
path (str, path object, file-like object) – String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function. If a string or a path,
it will be used as Root Directory path when writing a partitioned dataset.
**kwargs –
Additional keywords passed to pyarrow.feather.write_feather().
Starting with pyarrow 0.17, this includes the compression,
compression_level, chunksize and version keywords.
New in version 1.1.0.
Return type:
None
Notes
This function writes the dataframe as a feather file. Requires a default
index. For saving the DataFrame with your custom index use a method that
supports custom indices e.g. to_parquet.
Changed in version 1.5.0: Default value is changed to True. Google has deprecated the
auth_local_webserver=False “out of band” (copy-paste)
flow.
table_schema (list of dicts, optional) –
List of BigQuery table fields to which the corresponding DataFrame
columns conform, e.g. [{'name': 'col1', 'type': 'STRING'},...]. If a schema is not provided, it will be
generated according to the dtypes of the DataFrame columns. See the
BigQuery API documentation on available names of a field.
New in version 0.3.1 of pandas-gbq.
location (str, optional) –
Location where the load job should run. See the BigQuery locations
documentation for a
list of available locations. The location must match that of the
target dataset.
New in version 0.5.0 of pandas-gbq.
progress_bar (bool, default True) –
Use the library tqdm to show the progress bar for the upload,
chunk by chunk.
Credentials for accessing Google APIs. Use this parameter to
override default credentials, such as to use Compute Engine
google.auth.compute_engine.Credentials or Service
Account google.oauth2.service_account.Credentials
directly.
Write the contained data to an HDF5 file using HDFStore.
Hierarchical Data Format (HDF) is self-describing, allowing an
application to interpret the structure and contents of a file with
no outside information. One HDF file can hold a mix of related objects
which can be accessed as a group or as individual objects.
In order to add another DataFrame or Series to an existing HDF file
please use append mode and a different key.
Warning
One can store a subclass of DataFrame or Series to HDF5,
but the type of the subclass is lost upon storing.
For more information see the user guide.
Parameters:
path_or_buf (str or pandas.HDFStore) – File path or HDFStore object.
key (str) – Identifier for the group in the store.
mode ({'a', 'w', 'r+'}, default 'a') –
Mode to open file:
‘w’: write, a new file is created (an existing file with
the same name would be deleted).
‘a’: append, an existing file is opened for reading and
writing, and if the file does not exist it is created.
‘r+’: similar to ‘a’, but the file must already exist.
complevel ({0-9}, default None) – Specifies a compression level for data.
A value of 0 or None disables compression.
complib ({'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib') – Specifies the compression library to be used.
As of v0.20.2 these additional compressors for Blosc are supported
(default if no compressor specified: ‘blosc:blosclz’):
{‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’,
‘blosc:zlib’, ‘blosc:zstd’}.
Specifying a compression library which is not available issues
a ValueError.
append (bool, default False) – For Table formats, append the input data to the existing table.
format ({'fixed', 'table', None}, default 'fixed') –
Possible values:
‘fixed’: Fixed format. Fast writing/reading. Not-appendable,
nor searchable.
‘table’: Table format. Write as a PyTables Table structure
which may perform worse but allow more flexible operations
like searching / selecting subsets of the data.
If None, pd.get_option(‘io.hdf.default_format’) is checked,
followed by fallback to “fixed”.
index (bool, default True) – Write DataFrame index as a column.
min_itemsize (dict or int, optional) – Map column names to minimum string sizes for columns.
nan_rep (Any, optional) – How to represent null values as str.
Not allowed with append=True.
data_columns (list of columns or True, optional) – List of columns to create as indexed data columns for on-disk
queries, or True to use all columns. By default only the axes
of the object are indexed. See
Query via data columns for
more information.
Applicable only to format=’table’.
errors (str, default 'strict') – Specifies how encoding and decoding errors are to be handled.
See the errors argument for open() for a full list
of options.
index (bool, optional, default True) – Whether to print index (row) labels.
na_rep (str, optional, default 'NaN') – String representation of NaN to use.
formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or
name.
The result of each function must be a unicode string.
List/tuple must be of length equal to the number of columns.
Formatter function to apply to columns’ elements if they are
floats. This function must return a unicode string and will be
applied only to the non-NaN elements, with NaN being
handled by na_rep.
Changed in version 1.2.0.
sparsify (bool, optional, default True) – Set to False for a DataFrame with a hierarchical index to print
every multiindex key at each row.
index_names (bool, optional, default True) – Prints the names of the indexes.
justify (str, default None) –
How to justify the column labels. If None uses the option from
the print configuration (controlled by set_option), ‘right’ out
of the box. Valid values are
left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.
max_rows (int, optional) – Maximum number of rows to display in the console.
max_cols (int, optional) – Maximum number of columns to display in the console.
show_dimensions (bool, default False) – Display DataFrame dimensions (number of rows by number of columns).
decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
bold_rows (bool, default True) – Make the row labels bold in the output.
classes (str or list or tuple, default None) – CSS class(es) to apply to the resulting html table.
escape (bool, default True) – Convert the characters <, >, and & to HTML-safe sequences.
notebook ({True, False}, default False) – Whether the generated HTML is for IPython Notebook.
border (int) – A border=border attribute is included in the opening
<table> tag. Default pd.options.display.html.border.
table_id (str, optional) – A css id is included in the opening <table> tag if specified.
render_links (bool, default False) – Convert URLs to HTML links.
encoding (str, default "utf-8") –
Set character encoding.
New in version 1.0.
Returns:
If buf is None, returns the result as a string. Otherwise returns
None.
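The escape, classes and table_id parameters described above can be illustrated with a short sketch. The frame, the class name tbl and the id scores are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"expr": ["a<b", "c&d"], "value": [1.5, 2.0]})

# With buf=None the rendered table is returned as a string.
html = df.to_html(index=False, classes="tbl", table_id="scores")

# escape=False leaves <, > and & untouched in the cell contents.
raw = df.to_html(index=False, escape=False)

print(html)
```

Leaving escape at its default is the safer choice whenever cell contents come from untrusted input.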
Note that NaN and None values are converted to null, and datetime objects
are converted to UNIX timestamps.
Parameters:
path_or_buf (str, path object, file-like object, or None, default None) – String, path object (implementing os.PathLike[str]), or file-like
object implementing a write() function. If None, the result is
returned as a string.
‘records’ : list like [{column -> value}, … , {column -> value}]
‘index’ : dict like {index -> {column -> value}}
‘columns’ : dict like {column -> {index -> value}}
‘values’ : just the values array
‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}}
Describing the data, where data component is like orient='records'.
date_format ({None, 'epoch', 'iso'}) – Type of date conversion. ‘epoch’ = epoch milliseconds,
‘iso’ = ISO8601. The default depends on the orient. For
orient='table', the default is ‘iso’. For all other orients,
the default is ‘epoch’.
double_precision (int, default 10) – The number of decimal places to use when encoding
floating point values.
force_ascii (bool, default True) – Force encoded string to be ASCII.
date_unit (str, default 'ms' (milliseconds)) – The time unit to encode to, governs timestamp and ISO8601
precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond,
microsecond, and nanosecond respectively.
default_handler (callable, default None) – Handler to call if object cannot otherwise be converted to a
suitable format for JSON. Should receive a single argument which is
the object to convert and return a serialisable object.
lines (bool, default False) – If ‘orient’ is ‘records’ write out line-delimited json format. Will
throw ValueError if incorrect ‘orient’ since others are not
list-like.
compression (str or dict, default 'infer') –
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other
key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
New in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
index (bool, default True) – Whether to include the index values in the JSON string. Not
including the index (index=False) is only supported when
orient is ‘split’ or ‘table’.
indent (int, optional) –
Length of whitespace used to indent each record.
New in version 1.0.0.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details; for more examples on storage options, refer to the pandas I/O user guide.
New in version 1.2.0.
Returns:
If path_or_buf is None, returns the resulting json format as a
string. Otherwise returns None.
Return type:
None or str
See also
read_json
Convert a JSON string to pandas object.
Notes
The behavior of indent=0 varies from the stdlib, which does not
indent the output but does insert newlines. Currently, indent=0
and the default indent=None are equivalent in pandas, though this
may change in a future release.
orient='table' contains a ‘pandas_version’ field under ‘schema’.
This stores the version of pandas used in the latest revision of the
schema.
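As a minimal sketch of the parameters above, the following writes a small frame with orient='split' and index=False, first to a string and then to a gzip file using the reproducible compression dict quoted earlier. The frame contents and the temporary file name are made up for illustration.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Hypothetical example data; any small frame would do.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# No path is given, so the JSON text is returned as a string.
# index=False is documented as supported for orient='split' or 'table'.
text = df.to_json(orient="split", index=False)
payload = json.loads(text)

# Write the same JSON through gzip; the extra dict keys are forwarded
# to gzip.GzipFile, as described for the compression parameter.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.json.gz")  # hypothetical file name
    result = df.to_json(
        path,
        orient="split",
        index=False,
        compression={"method": "gzip", "compresslevel": 1, "mtime": 1},
    )
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        roundtrip = json.load(fh)
```

When a path is passed, the call returns None and the content has to be read back from the file, which is what the last two lines do.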
Render object to a LaTeX tabular, longtable, or nested table.
Requires \usepackage{booktabs}. The output can be copy/pasted
into a main LaTeX document or read from an external file
with \input{table.tex}.
Changed in version 1.0.0: Added caption and label arguments.
Changed in version 1.2.0: Added position argument, changed meaning of caption argument.
Parameters:
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
columns (list of label, optional) – The subset of columns to write. Writes all columns by default.
col_space (int, optional) – The minimum width of each column.
header (bool or list of str, default True) – Write out the column names. If a list of strings is given,
it is assumed to be aliases for the column names.
index (bool, default True) – Write row names (index).
na_rep (str, default 'NaN') – Missing data representation.
formatters (list of functions or dict of {str: function}, optional) – Formatter functions to apply to columns’ elements by position or
name. The result of each function must be a unicode string.
List must be of length equal to the number of columns.
float_format (one-parameter function or str, optional, default None) – Formatter for floating point numbers. For example
float_format="%.2f" and float_format="{:0.2f}".format will
both result in 0.1234 being formatted as 0.12.
sparsify (bool, optional) – Set to False for a DataFrame with a hierarchical index to print
every multiindex key at each row. By default, the value will be
read from the config module.
index_names (bool, default True) – Prints the names of the indexes.
bold_rows (bool, default False) – Make the row labels bold in the output.
column_format (str, optional) – The columns format as specified in LaTeX table format e.g. ‘rcl’ for 3
columns. By default, ‘l’ will be used for all columns except
columns of numbers, which default to ‘r’.
longtable (bool, optional) – By default, the value will be read from the pandas config
module. Use a longtable environment instead of tabular. Requires
adding \usepackage{longtable} to your LaTeX preamble.
escape (bool, optional) – By default, the value will be read from the pandas config
module. When set to False, LaTeX special characters in column names
are not escaped.
encoding (str, optional) – A string representing the encoding to use in the output file,
defaults to ‘utf-8’.
decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
multicolumn (bool, default True) – Use multicolumn to enhance MultiIndex columns.
The default will be read from the config module.
multicolumn_format (str, default 'l') – The alignment for multicolumns, similar to column_format
The default will be read from the config module.
multirow (bool, default False) – Use multirow to enhance MultiIndex rows. Requires adding a
\usepackage{multirow} to your LaTeX preamble. Will print
centered labels (instead of top-aligned) across the contained
rows, separating groups via clines. The default will be read
from the pandas config module.
caption (str or tuple, optional) –
Tuple (full_caption, short_caption),
which results in \caption[short_caption]{full_caption};
if a single string is passed, no short caption will be set.
New in version 1.0.0.
Changed in version 1.2.0: Optionally allow caption to be a tuple (full_caption,short_caption).
label (str, optional) –
The LaTeX label to be placed inside \label{} in the output.
This is used with \ref{} in the main .tex file.
New in version 1.0.0.
position (str, optional) –
The LaTeX positional argument for tables, to be placed after
\begin{} in the output.
New in version 1.2.0.
Returns:
If buf is None, returns the result as a string. Otherwise returns
None.
Return type:
str | None
See also
io.formats.style.Styler.to_latex
Render a DataFrame to LaTeX with conditional formatting.
DataFrame.to_string
Render a DataFrame to a console-friendly tabular output.
DataFrame.to_html
Render a DataFrame as an HTML table.
Examples
>>> df = pd.DataFrame(dict(name=['Raphael', 'Donatello'],
...                        mask=['red', 'purple'],
...                        weapon=['sai', 'bo staff']))
>>> print(df.to_latex(index=False))
\begin{tabular}{lll}
 \toprule
      name &    mask &    weapon \\
 \midrule
   Raphael &     red &       sai \\
 Donatello &  purple &  bo staff \\
\bottomrule
\end{tabular}
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
mode (str, optional) – Mode in which file is opened, “wt” by default.
index (bool, optional, default True) –
Add index (row) labels.
New in version 1.1.0.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
New in version 1.2.0.
**kwargs – These parameters will be passed to tabulate.
By default, the dtype of the returned array will be the common NumPy
dtype of all types in the DataFrame. For example, if the dtypes are
float16 and float32, the results dtype will be float32.
This may require copying data and coercing values, which may be
expensive.
Parameters:
dtype (str or numpy.dtype, optional) – The dtype to pass to numpy.asarray().
copy (bool, default False) – Whether to ensure that the returned value is not a view on
another array. Note that copy=False does not ensure that
to_numpy() is no-copy. Rather, copy=True ensures that
a copy is made, even if not strictly necessary.
na_value (Any, optional) –
The value to use for missing values. The default value depends
on dtype and the dtypes of the DataFrame columns.
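The dtype rules above can be checked on a few tiny frames. This is only an illustrative sketch; the column names and values are invented.

```python
import numpy as np
import pandas as pd

# float16 and float32 columns: the common NumPy dtype is float32.
df = pd.DataFrame({
    "a": np.array([1.0, 2.0], dtype="float16"),
    "b": np.array([3.0, 4.0], dtype="float32"),
})
common = df.to_numpy()

# Numbers and strings have no common numeric dtype, so object is used.
mixed = pd.DataFrame({"x": [1, 2], "y": ["u", "v"]}).to_numpy()

# na_value replaces missing entries in the returned array.
holes = pd.DataFrame({"v": [1.0, np.nan]})
filled = holes.to_numpy(na_value=0.0)

# copy=True guarantees an independent array, so writing to it
# does not touch the DataFrame.
snapshot = holes.to_numpy(copy=True)
snapshot[0, 0] = 99.0
```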
path (str, file-like object or None, default None) – If a string, it will be used as Root Directory path
when writing a partitioned dataset. By file-like object,
we refer to objects with a write() method, such as a file handle
(e.g. via builtin open function). If path is None,
a bytes object is returned.
engine (str, default 'pyarrow') – ORC library to use. Pyarrow must be >= 7.0.0.
index (bool, optional) – If True, include the dataframe’s index(es) in the file output.
If False, they will not be written to the file.
If None, similar to True, the dataframe’s index(es)
will be saved. However, instead of being saved as values,
the RangeIndex will be stored as a range in the metadata so it
doesn’t require much space and is faster. Other indexes will
be included as columns in the file output.
engine_kwargs (dict[str, Any] or None, default None) – Additional keyword arguments passed to pyarrow.orc.write_table().
Return type:
bytes if no path argument is provided else None
Raises:
NotImplementedError – Dtype of one or more columns is category, unsigned integers, interval,
period or sparse.
ValueError – engine is not pyarrow.
See also
read_orc
Read an ORC file.
DataFrame.to_parquet
Write a parquet file.
DataFrame.to_csv
Write a csv file.
DataFrame.to_sql
Write to a sql table.
DataFrame.to_hdf
Write to hdf.
Notes
Before using this function you should read the user guide about
ORC and install optional dependencies.
If you want to get a buffer to the ORC content, you can write it to io.BytesIO:
>>> import io
>>> b = io.BytesIO(df.to_orc()) # doctest: +SKIP
>>> b.seek(0) # doctest: +SKIP
0
>>> content = b.read() # doctest: +SKIP
This function writes the dataframe as a parquet file. You can choose different parquet
backends, and have the option of compression. See
the user guide for more details.
String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function. If None, the result is
returned as bytes. If a string or path, it will be used as Root Directory
path when writing a partitioned dataset.
Changed in version 1.2.0.
Previously this was “fname”
engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. If ‘auto’, then the option
io.parquet.engine is used. The default io.parquet.engine
behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if
‘pyarrow’ is unavailable.
compression ({'snappy', 'gzip', 'brotli', None}, default 'snappy') – Name of the compression to use. Use None for no compression.
index (bool, default None) – If True, include the dataframe’s index(es) in the file output.
If False, they will not be written to the file.
If None, similar to True, the dataframe’s index(es)
will be saved. However, instead of being saved as values,
the RangeIndex will be stored as a range in the metadata so it
doesn’t require much space and is faster. Other indexes will
be included as columns in the file output.
partition_cols (list, optional, default None) – Column names by which to partition the dataset.
Columns are partitioned in the order they are given.
Must be None if path is not a string.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
New in version 1.2.0.
**kwargs – Additional arguments passed to the parquet library. See
pandas io for more details.
If you want to get a buffer to the parquet content you can use an io.BytesIO
object, as long as you don’t use partition_cols, which creates multiple files.
path (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function. File path where
the pickled object will be stored.
compression (str or dict, default 'infer') –
For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other
key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
New in version 1.5.0: Added support for .tar files.
protocol (int) –
Int which indicates which protocol should be used by the pickler,
default HIGHEST_PROTOCOL (see [1]_ paragraph 12.1.2). The possible
values are 0, 1, 2, 3, 4, 5. A negative value for the protocol
parameter is equivalent to setting its value to HIGHEST_PROTOCOL.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
New in version 1.2.0.
Return type:
None
See also
read_pickle
Load pickled pandas object (or any object) from file.
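A minimal round-trip sketch, assuming the parameters above belong to DataFrame.to_pickle: the frame is written to a temporary path whose ‘.gz’ suffix triggers the default 'infer' compression, then loaded again with read_pickle. The file name and data are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})

with tempfile.TemporaryDirectory() as tmp:
    # The '.gz' extension lets compression='infer' pick gzip.
    path = os.path.join(tmp, "frame.pkl.gz")
    result = df.to_pickle(path)
    restored = pd.read_pickle(path)
```

Pickle files can execute arbitrary code when loaded, so read_pickle should only be used on files from a trusted source.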
Index will be included as the first field of the record array if
requested.
Parameters:
index (bool, default True) – Include index in resulting record array, stored in ‘index’
field or using the index label, if set.
column_dtypes (str, type, dict, default None) – If a string or type, the data type to store all columns. If
a dictionary, a mapping of column names and indices (zero-indexed)
to specific data types.
index_dtypes (str, type, dict, default None) –
If a string or type, the data type to store all index levels. If
a dictionary, a mapping of index level names and indices
(zero-indexed) to specific data types.
This mapping is applied only if index=True.
Returns:
NumPy ndarray with the DataFrame labels as fields and each row
of the DataFrame as entries.
Return type:
numpy.recarray
See also
DataFrame.from_records
Convert structured or record ndarray to DataFrame.
numpy.recarray
An ndarray that allows field access using attributes, analogous to typed columns in a spreadsheet.
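The sketch below illustrates the index, column_dtypes and index-label behavior described above, assuming the section documents DataFrame.to_records. The frame and the axis name "key" are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with an integer index.
df = pd.DataFrame({"A": [1, 2], "B": [0.5, 0.75]}, index=[10, 20])

# Default: the index becomes the first field, named 'index'.
with_index = df.to_records()

# index=False drops the index field entirely.
without_index = df.to_records(index=False)

# A dict maps individual columns to specific dtypes.
typed = df.to_records(column_dtypes={"A": "int32"})

# If the index has a name, that name is used for the field.
named = df.rename_axis("key").to_records()
```

Fields of the resulting record array can be read as attributes, for example with_index.A.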
Write records stored in a DataFrame to a SQL database.
Databases supported by SQLAlchemy [1]_ are supported. Tables can be
newly created, appended to, or overwritten.
Parameters:
name (str) – Name of SQL table.
con (sqlalchemy.engine.(Engine or Connection) or sqlite3.Connection) –
Using SQLAlchemy makes it possible to use any DB supported by that
library. Legacy support is provided for sqlite3.Connection objects. The user
is responsible for engine disposal and connection closure for the SQLAlchemy
connectable. See here.
schema (str, optional) – Specify the schema (if database flavor supports this). If None, use
default schema.
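if_exists ({'fail', 'replace', 'append'}, default 'fail') –
How to behave if the table already exists.
fail: Raise a ValueError.
replace: Drop the table before inserting new values.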
append: Insert new values to the existing table.
index (bool, default True) – Write DataFrame index as a column. Uses index_label as the column
name in the table.
index_label (str or sequence, default None) – Column label for index column(s). If None is given (default) and
index is True, then the index names are used.
A sequence should be given if the DataFrame uses MultiIndex.
chunksize (int, optional) – Specify the number of rows in each batch to be written at a time.
By default, all rows will be written at once.
dtype (dict or scalar, optional) – Specifying the datatype for columns. If a dictionary is used, the
keys should be the column names and the values should be the
SQLAlchemy types or strings for the sqlite3 legacy mode. If a
scalar is provided, it will be applied to all columns.
method ({None, 'multi', callable}, optional) –
Controls the SQL insertion clause used:
None : Uses standard SQL INSERT clause (one per row).
’multi’: Pass multiple values in a single INSERT clause.
callable with signature (pd_table,conn,keys,data_iter).
Details and a sample callable implementation can be found in the
section insert method.
Returns:
Number of rows affected by to_sql. None is returned if the callable
passed into method does not return an integer number of rows.
The number of returned rows affected is the sum of the rowcount
attribute of sqlite3.Cursor or SQLAlchemy connectable which may not
reflect the exact number of written rows as stipulated in the
sqlite3 or
SQLAlchemy.
New in version 1.4.0.
Return type:
None or int
Raises:
ValueError – When the table already exists and if_exists is ‘fail’ (the
default).
See also
read_sql
Read a DataFrame from a table.
Notes
Timezone aware datetime columns will be written as
Timestamp with timezone type with SQLAlchemy if supported by the
database. Otherwise, the datetimes will be stored as timezone unaware
timestamps local to the original timezone.
Specify the dtype (especially useful for integers with missing values).
Notice that while pandas is forced to store the data as floating point,
the database supports nullable integers. When fetching the data with
Python, we get back integer scalars.
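As a rough sketch of the create, append and overwrite behavior described above, the following uses the legacy sqlite3.Connection support with an in-memory database. The table name "users" and the data are hypothetical, and the if_exists values are those named in this section.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database; the caller owns and closes the connection.
con = sqlite3.connect(":memory:")

first = pd.DataFrame({"name": ["User 1", "User 2"]})
second = pd.DataFrame({"name": ["User 3", "User 4"]})

# Create the table, then append more rows to it.
first.to_sql("users", con, index=False)
second.to_sql("users", con, index=False, if_exists="append")
after_append = con.execute("SELECT name FROM users ORDER BY name").fetchall()

# 'replace' drops the table before inserting the new values.
second.to_sql("users", con, index=False, if_exists="replace")
after_replace = con.execute("SELECT name FROM users ORDER BY name").fetchall()

# With the default if_exists='fail', an existing table raises ValueError.
try:
    first.to_sql("users", con, index=False)
    failed = False
except ValueError:
    failed = True

con.close()
```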
Writes the DataFrame to a Stata dataset file.
“dta” files contain a Stata dataset.
Parameters:
path (str, path object, or buffer) –
String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function.
Changed in version 1.0.0.
Previously this was “fname”
convert_dates (dict) – Dictionary mapping columns containing datetime types to stata
internal format to use when writing the dates. Options are ‘tc’,
‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer
or a name. Datetime columns that do not have a conversion type
specified will be converted to ‘tc’. Raises NotImplementedError if
a datetime column has timezone information.
write_index (bool) – Write the index to Stata dataset.
byteorder (str) – Can be “>”, “<”, “little”, or “big”. default is sys.byteorder.
time_stamp (datetime) – A datetime to use as file creation date. Default is the current
time.
data_label (str, optional) – A label for the data set. Must be 80 characters or smaller.
variable_labels (dict) – Dictionary containing columns as keys and variable labels as
values. Each label must be 80 characters or smaller.
version ({114, 117, 118, 119, None}, default 114) –
Version to use in the output dta file. Set to None to let pandas
decide between 118 or 119 formats depending on the number of
columns in the frame. Version 114 can be read by Stata 10 and
later. Version 117 can be read by Stata 13 or later. Version 118
is supported in Stata 14 and later. Version 119 is supported in
Stata 15 and later. Version 114 limits string variables to 244
characters or fewer while versions 117 and later allow strings
with lengths up to 2,000,000 characters. Versions 118 and 119
support Unicode characters, and version 119 supports more than
32,767 variables.
Version 119 should usually only be used when the number of
variables exceeds the capacity of dta format 118. Exporting
smaller datasets in format 119 may have unintended consequences,
and, as of November 2020, Stata SE cannot read version 119 files.
Changed in version 1.0.0: Added support for formats 118 and 119.
convert_strl (list, optional) – List of column names to convert to string columns to Stata StrL
format. Only available if version is 117. Storing strings in the
StrL format can produce smaller dta files if strings have more than
8 characters and values are repeated.
compression (str or dict, default 'infer') –
For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other
key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
New in version 1.5.0: Added support for .tar files.
New in version 1.1.0.
Changed in version 1.4.0: Zstandard support.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
New in version 1.2.0.
value_labels (dict of dicts) –
Dictionary containing columns as keys and dictionaries of column value
to labels as values. Labels for a single variable must be 32,000
characters or smaller.
New in version 1.4.0.
Raises:
NotImplementedError –
If datetimes contain timezone information
Column dtype is not representable in Stata
ValueError –
Columns listed in convert_dates are neither datetime64[ns]
nor datetime.datetime
Column listed in convert_dates is not in DataFrame
Categorical label contains more than 32,000 characters
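A small round-trip sketch, assuming the parameters above belong to DataFrame.to_stata: a numeric frame is written as a version 118 file with one variable label and read back with read_stata. The file name, data and label text are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical numeric example data.
df = pd.DataFrame({"speed": [350, 18], "weight": [1.5, 0.25]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "animals.dta")  # hypothetical file name
    df.to_stata(
        path,
        write_index=False,                      # do not store the index
        version=118,                            # format readable by Stata 14+
        variable_labels={"speed": "Top speed"},
    )
    restored = pd.read_stata(path)
    # The iterator form exposes file metadata such as variable labels.
    with pd.read_stata(path, iterator=True) as reader:
        labels = reader.variable_labels()
```

The integer column may come back with a narrower integer dtype than it was written with, so the comparison is best done on values rather than dtypes.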
Render a DataFrame to a console-friendly tabular output.
Parameters:
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
columns (sequence, optional, default None) – The subset of columns to write. Writes all columns by default.
col_space (int, list or dict of int, optional) – The minimum width of each column. If a list of ints is given, each integer corresponds to one column. If a dict is given, the key references the column, while the value defines the space to use.
header (bool or sequence of str, optional) – Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.
index (bool, optional, default True) – Whether to print index (row) labels.
na_rep (str, optional, default 'NaN') – String representation of NaN to use.
formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or
name.
The result of each function must be a unicode string.
List/tuple must be of length equal to the number of columns.
float_format (one-parameter function, optional, default None) – Formatter function to apply to columns’ elements if they are
floats. This function must return a unicode string and will be
applied only to the non-NaN elements, with NaN being
handled by na_rep.
Changed in version 1.2.0.
sparsify (bool, optional, default True) – Set to False for a DataFrame with a hierarchical index to print
every multiindex key at each row.
index_names (bool, optional, default True) – Prints the names of the indexes.
justify (str, default None) –
How to justify the column labels. If None uses the option from
the print configuration (controlled by set_option), ‘right’ out
of the box. Valid values are
left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.
max_rows (int, optional) – Maximum number of rows to display in the console.
max_cols (int, optional) – Maximum number of columns to display in the console.
show_dimensions (bool, default False) – Display DataFrame dimensions (number of rows by number of columns).
decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
line_width (int, optional) – Width to wrap a line in characters.
min_rows (int, optional) – The number of rows to display in the console in a truncated repr
(when number of rows is above max_rows).
max_colwidth (int, optional) –
Max width to truncate each column in characters. By default, no limit.
New in version 1.0.0.
encoding (str, default "utf-8") –
Set character encoding.
New in version 1.0.
Returns:
If buf is None, returns the result as a string. Otherwise returns
None.
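The following sketch exercises a few of the formatting parameters above, assuming the section documents DataFrame.to_string. The frame contents are invented, and the exact column padding of the output is not shown because it depends on the data.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4.0, 5.5, 6.25]})

# buf is omitted, so the rendered table is returned as a string.
# float_format is a one-parameter function returning a string.
text = df.to_string(index=False, float_format="{:.2f}".format)
lines = text.splitlines()

# na_rep controls how missing values are printed.
missing = pd.DataFrame({"v": [1.0, None]}).to_string(
    na_rep="-", float_format="{:.0f}".format
)
```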
path_or_buffer (str, path object, file-like object, or None, default None) – String, path object (implementing os.PathLike[str]), or file-like
object implementing a write() function. If None, the result is returned
as a string.
index (bool, default True) – Whether to include index in XML document.
root_name (str, default 'data') – The name of root element in XML document.
row_name (str, default 'row') – The name of row element in XML document.
na_rep (str, optional) – Missing data representation.
attr_cols (list-like, optional) – List of columns to write as attributes in row element.
Hierarchical columns will be flattened with underscore
delimiting the different levels.
elem_cols (list-like, optional) – List of columns to write as children in row element. By default,
all columns output as children of row element. Hierarchical
columns will be flattened with underscore delimiting the
different levels.
namespaces (dict, optional) –
All namespaces to be defined in root element. Keys of dict
should be prefix names and values of dict corresponding URIs.
Default namespaces should be given empty string key. For
example,
namespaces={"":"https://example.com"}
prefix (str, optional) – Namespace prefix to be used for every element and/or attribute
in document. This should be one of the keys in namespaces
dict.
encoding (str, default 'utf-8') – Encoding of the resulting document.
xml_declaration (bool, default True) – Whether to include the XML declaration at start of document.
pretty_print (bool, default True) – Whether output should be pretty printed with indentation and
line breaks.
parser ({'lxml','etree'}, default 'lxml') – Parser module to use for building of tree. Only ‘lxml’ and
‘etree’ are supported. With ‘lxml’, the ability to use XSLT
stylesheet is supported.
stylesheet (str, path object or file-like object, optional) – A URL, file-like object, or a raw string containing an XSLT
script used to transform the raw XML output. Script should use
layout of elements and attributes from original output. This
argument requires lxml to be installed. Only XSLT 1.0
scripts, and not later versions, are currently supported.
compression (str or dict, default 'infer') –
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buffer’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other
key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
New in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
Returns:
If path_or_buffer is None, returns the resulting XML format as a
string. Otherwise returns None.
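A minimal sketch of root_name, row_name and attr_cols, assuming the parameters above belong to DataFrame.to_xml. It passes parser='etree' so that lxml is not needed, and parses the result with the standard library to inspect it. The data and element names are hypothetical.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"shape": ["square", "circle", "triangle"],
                   "sides": [4, 0, 3]})

# Columns become child elements of each row element by default.
xml_text = df.to_xml(index=False, root_name="shapes", row_name="item",
                     parser="etree")
root = ET.fromstring(xml_text.encode("utf-8"))
rows = root.findall("item")

# attr_cols writes the listed columns as attributes of the row element.
attr_text = df.to_xml(index=False, attr_cols=["shape", "sides"],
                      parser="etree")
attr_root = ET.fromstring(attr_text.encode("utf-8"))
first_attrs = attr_root.find("row").attrib
```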
Call func on self producing a DataFrame with the same axis shape as self.
Parameters:
func (function, str, list-like or dict-like) –
Function to use for transforming the data. If a function, must either
work when passed a DataFrame or when passed to DataFrame.apply. If func
is both list-like and dict-like, dict-like behavior takes precedence.
Accepted combinations are:
function
string function name
list-like of functions and/or function names, e.g. [np.exp,'sqrt']
dict-like of axis labels -> functions, function names or list-like of such.
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column.
If 1 or ‘columns’: apply function to each row.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
Returns:
A DataFrame that must have the same length as self.
Return type:
DataFrame
Raises:
ValueError – If the returned DataFrame has a different length than self.
See also
DataFrame.agg
Only perform aggregating type operations.
DataFrame.apply
Invoke function on a DataFrame.
Notes
Functions that mutate the passed object can produce unexpected
behavior or errors and are not supported. See gotchas.udf-mutation
for more details.
Examples
>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
>>> df
   A  B
0  0  1
1  1  2
2  2  3
>>> df.transform(lambda x: x + 1)
   A  B
0  1  2
1  2  3
2  3  4
Even though the resulting DataFrame must have the same length as the
input DataFrame, it is possible to provide several input functions:
>>> df = pd.DataFrame({
...     "c": [1, 1, 1, 2, 2, 2, 2],
...     "type": ["m", "n", "o", "m", "m", "n", "n"]
... })
>>> df
   c type
0  1    m
1  1    n
2  1    o
3  2    m
4  2    m
5  2    n
6  2    n
>>> df['size'] = df.groupby('c')['type'].transform(len)
>>> df
   c type  size
0  1    m     3
1  1    n     3
2  1    o     3
3  2    m     4
4  2    m     4
5  2    n     4
6  2    n     4
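The list-like and dict-like forms of func listed under Parameters can be sketched as follows. The data is invented; the point is that each result keeps the same length as the input.

```python
import numpy as np
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"A": [0.0, 1.0, 4.0], "B": [1.0, 4.0, 9.0]})

# A list of functions and/or function names: every function is applied
# to every column, giving two-level column labels.
both = df.transform([np.sqrt, "exp"])

# A dict of axis labels -> functions: one transformation per column.
per_column = df.transform({"A": "sqrt", "B": lambda x: x + 1})
```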
Reflect the DataFrame over its main diagonal by writing rows as columns
and vice-versa. The property T is an accessor to the method
transpose().
Parameters:
*args (tuple, optional) – Accepted for compatibility with NumPy.
copy (bool, default False) –
Whether to copy the data after transposing, even for DataFrames
with a single dtype.
Note that a copy is always required for mixed dtype DataFrames,
or for DataFrames with any extension types.
Returns:
The transposed DataFrame.
Return type:
DataFrame
See also
numpy.transpose
Permute the dimensions of a given array.
Notes
Transposing a DataFrame with mixed dtypes will result in a homogeneous
DataFrame with the object dtype. In such a case, a copy of the data
is always made.
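The dtype behavior described in the Notes can be seen on two tiny frames, one homogeneous and one mixed. This is an illustrative sketch with invented data.

```python
import pandas as pd

# A single-dtype frame keeps its dtype when transposed.
df_num = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
num_t = df_num.T

# A mixed-dtype frame becomes homogeneous object dtype.
df_mixed = pd.DataFrame({"name": ["Alice", "Bob"], "score": [9.5, 8.0]})
mixed_t = df_mixed.transpose()
```

Transposing num_t again recovers the original frame; for the mixed frame the values survive a second transpose but the columns stay object dtype.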
Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe/other, but with support to substitute a fill_value
for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to
arithmetic operators: +, -, *, /, //, %, **.
Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
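The fill_value rule above (fill when one side is missing, stay missing when both are) can be sketched with two small frames. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, np.nan]})
b = pd.DataFrame({"x": [2.0, 4.0], "y": [5.0, np.nan]})

# Plain division propagates every NaN.
plain = a / b

# With fill_value, a NaN on one side is replaced by 1.0 before dividing;
# where both sides are NaN the result stays NaN.
filled = a.truediv(b, fill_value=1.0)

# The reverse version computes b / a with the same filling rule.
reverse = a.rtruediv(b, fill_value=1.0)
```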
Truncate a Series or DataFrame before and after some index value.
This is a useful shorthand for boolean indexing based on index
values above or below certain thresholds.
Parameters:
before (date, str, int) – Truncate all rows before this index value.
after (date, str, int) – Truncate all rows after this index value.
axis ({0 or 'index', 1 or 'columns'}, optional) – Axis to truncate. Truncates the index (rows) by default.
For Series this parameter is unused and defaults to 0.
copy (bool, default True) – Return a copy of the truncated section.
Returns:
The truncated Series or DataFrame.
Return type:
type of caller
See also
DataFrame.loc
Select a subset of a DataFrame by label.
DataFrame.iloc
Select a subset of a DataFrame by position.
Notes
If the index being truncated contains only datetime values,
before and after may be specified as strings instead of
Timestamps.
Examples
>>> df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
...                    'B': ['f', 'g', 'h', 'i', 'j'],
...                    'C': ['k', 'l', 'm', 'n', 'o']},
...                   index=[1, 2, 3, 4, 5])
>>> df
   A  B  C
1  a  f  k
2  b  g  l
3  c  h  m
4  d  i  n
5  e  j  o
>>> df.truncate(before=2, after=4)
   A  B  C
2  b  g  l
3  c  h  m
4  d  i  n
The columns of a DataFrame can be truncated.
>>> df.truncate(before="A", after="B", axis="columns")
   A  B
1  a  f
2  b  g
3  c  h
4  d  i
5  e  j
For Series, only rows can be truncated.
>>> df['A'].truncate(before=2, after=4)
2    b
3    c
4    d
Name: A, dtype: object
The index values in truncate can be datetimes or string
dates.
Because the index is a DatetimeIndex containing only dates, we can
specify before and after as strings. They will be coerced to
Timestamps before truncation.
Note that truncate assumes a 0 value for any unspecified time
component (midnight). This differs from partial string slicing, which
returns any partially matching dates.
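The difference between truncate and partial string slicing on a DatetimeIndex can be sketched as follows. The twelve-hour spacing and the date range are chosen only to keep the example small.

```python
import pandas as pd

# 20 timestamps, 12 hours apart, starting at 2016-01-01 00:00.
dates = pd.date_range("2016-01-01", periods=20, freq=pd.Timedelta(hours=12))
df = pd.DataFrame({"A": range(20)}, index=dates)

# String bounds are coerced to Timestamps at midnight, so the last
# row kept is exactly 2016-01-07 00:00.
truncated = df.truncate(before="2016-01-05", after="2016-01-07")

# Partial string slicing keeps every timestamp that falls on
# 2016-01-07, including 12:00.
sliced = df.loc["2016-01-05":"2016-01-07"]
```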
Shift the time index, using the index’s frequency if available.
Deprecated since version 1.1.0: Use shift instead.
Parameters:
periods (int) – Number of periods to move, can be positive or negative.
freq (DateOffset, timedelta, or str, default None) – Increment to use from the tseries module
or time rule expressed as a string (e.g. ‘EOM’).
axis ({0 or ‘index’, 1 or ‘columns’, None}, default 0) – Corresponds to the axis that contains the Index.
For Series this parameter is unused and defaults to 0.
Returns:
shifted
Return type:
Series/DataFrame
Notes
If freq is not specified then tries to use the freq or inferred_freq
attributes of the index. If neither of those attributes exists, a
ValueError is raised.
When clocks moved backward due to DST, ambiguous times may arise.
For example in Central European Time (UTC+01), when going from
03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at
00:30:00 UTC and at 01:30:00 UTC. In such a situation, the
ambiguous parameter dictates how ambiguous times should be
handled.
'infer' will attempt to infer fall dst-transition hours based on
order
bool-ndarray where True signifies a DST time, False designates
a non-DST time (note that this flag is only applicable for
ambiguous times)
'NaT' will return NaT where there are ambiguous times
'raise' will raise an AmbiguousTimeError if there are ambiguous
times.
nonexistent (str, default 'raise') –
A nonexistent time does not exist in a particular timezone
where clocks moved forward due to DST. Valid values are:
'shift_forward' will shift the nonexistent time forward to the
closest existing time
'shift_backward' will shift the nonexistent time backward to the
closest existing time
'NaT' will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
'raise' will raise a NonExistentTimeError if there are
nonexistent times.
Returns:
Same type as the input.
Return type:
Series/DataFrame
Raises:
TypeError – If the TimeSeries is tz-aware and tz is not None.
If the DST transition causes nonexistent times, you can shift these
dates forward or backward with a timedelta object or ‘shift_forward’
or ‘shift_backward’.
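The handling of nonexistent and ambiguous times can be sketched as below. The timestamps and the Europe/Warsaw zone are illustrative choices, and the example assumes time zone data is available on the system.

```python
import numpy as np
import pandas as pd

# 02:30 on 2015-03-29 does not exist in Europe/Warsaw
# (clocks jump from 02:00 straight to 03:00).
s = pd.Series(range(2), index=pd.DatetimeIndex(["2015-03-29 02:30:00",
                                                "2015-03-29 03:30:00"]))
forward = s.tz_localize("Europe/Warsaw", nonexistent="shift_forward")
backward = s.tz_localize("Europe/Warsaw", nonexistent="shift_backward")
by_delta = s.tz_localize("Europe/Warsaw", nonexistent=pd.Timedelta(hours=1))
as_nat = s.tz_localize("Europe/Warsaw", nonexistent="NaT")

# 02:30 on 2018-10-28 occurs twice in Europe/Warsaw
# (clocks fall back from 03:00 DST to 02:00 non-DST).
t = pd.Series([0], index=pd.DatetimeIndex(["2018-10-28 02:30:00"]))
dst = t.tz_localize("Europe/Warsaw", ambiguous=np.array([True]))
non_dst = t.tz_localize("Europe/Warsaw", ambiguous=np.array([False]))

print(forward.index[0], by_delta.index[0])
print(dst.index[0], non_dst.index[0])
```

The boolean array passed as ambiguous picks the DST reading (00:30 UTC) or the non-DST reading (01:30 UTC) of the repeated local time.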
Pivot a level of the (necessarily hierarchical) index labels.
Returns a DataFrame having a new level of column labels whose inner-most level
consists of the pivoted index labels.
If the index is not a MultiIndex, the output will be a Series
(the analogue of stack when the columns are not a MultiIndex).
Parameters:
level (int, str, or list of these, default -1 (last level)) – Level(s) of index to unstack, can pass level name.
fill_value (int, str or dict) – Replace NaN with this value if the unstack produces missing values.
Return type:
Series or DataFrame
See also
DataFrame.pivot
Pivot a table based on column values.
DataFrame.stack
Pivot a level of the column labels (inverse operation from unstack).
Notes
See the user guide for more examples.
Examples
>>> index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
...                                    ('two', 'a'), ('two', 'b')])
>>> s = pd.Series(np.arange(1.0, 5.0), index=index)
>>> s
one  a   1.0
     b   2.0
two  a   3.0
     b   4.0
dtype: float64
>>> s.unstack(level=-1)
     a   b
one  1.0  2.0
two  3.0  4.0
>>> s.unstack(level=0)
   one  two
a  1.0   3.0
b  2.0   4.0
>>> df = s.unstack(level=0)
>>> df.unstack()
one  a  1.0
     b  2.0
two  a  3.0
     b  4.0
dtype: float64
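The examples above do not exercise fill_value. A minimal sketch, assuming a hypothetical Series in which one index combination is absent, shows how it replaces the NaN that unstack would otherwise produce.

```python
import pandas as pd

# Hypothetical Series in which the ('two', 'b') combination is missing.
index = pd.MultiIndex.from_tuples([("one", "a"), ("one", "b"), ("two", "a")])
s = pd.Series([1.0, 2.0, 3.0], index=index)

with_nan = s.unstack()              # the ('two', 'b') cell becomes NaN
filled = s.unstack(fill_value=0.0)  # the ('two', 'b') cell becomes 0.0

print(with_nan)
print(filled)
```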
Modify in place using non-NA values from another DataFrame.
Aligns on indices. There is no return value.
Parameters:
other (DataFrame, or object coercible into a DataFrame) – Should have at least one matching index/column label
with the original DataFrame. If a Series is passed,
its name attribute must be set, and that will be
used as the column name to align with the original DataFrame.
join ({'left'}, default 'left') – Only left join is implemented, keeping the index and columns of the
original object.
overwrite (bool, default True) –
How to handle non-NA values for overlapping keys:
True: overwrite original DataFrame’s values
with values from other.
False: only update values that are NA in
the original DataFrame.
filter_func (callable(1d-array) -> bool 1d-array, optional) – Can choose to replace values other than NA. Return True for values
that should be updated.
errors ({'raise', 'ignore'}, default 'ignore') – If ‘raise’, will raise a ValueError if the DataFrame and other
both contain non-NA data in the same place.
Returns:
None. This method directly changes the calling object.
Return type:
None
Raises:
ValueError –
When errors='raise' and there's overlapping non-NA data.
When errors is neither 'ignore' nor 'raise'.
The DataFrame’s length does not increase as a result of the update;
only values at matching index/column labels are updated.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']})
>>> df.update(new_df)
>>> df
   A  B
0  a  d
1  b  e
2  c  f
For Series, its name attribute must be set.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_column = pd.Series(['d', 'e'], name='B', index=[0, 2])
>>> df.update(new_column)
>>> df
   A  B
0  a  d
1  b  y
2  c  e
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e']}, index=[1, 2])
>>> df.update(new_df)
>>> df
   A  B
0  a  x
1  b  d
2  c  e
If other contains NaNs the corresponding values are not updated
in the original dataframe.
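The NaN behaviour, together with the overwrite and errors parameters described earlier, can be sketched as follows. The frames are hypothetical, and float columns are used so that the result does not depend on dtype conversion.

```python
import numpy as np
import pandas as pd

# NaN in `other` leaves the original value untouched (row 1 keeps 500.0).
df = pd.DataFrame({"A": [1, 2, 3], "B": [400.0, 500.0, 600.0]})
new_df = pd.DataFrame({"B": [4.0, np.nan, 6.0]})
df.update(new_df)

# overwrite=False only fills positions that are NA in the original.
partial = pd.DataFrame({"B": [1.0, np.nan, 3.0]})
partial.update(pd.DataFrame({"B": [10.0, 20.0, 30.0]}), overwrite=False)

# errors='raise' refuses to update when both frames hold non-NA data
# at the same position.
try:
    df.update(new_df, errors="raise")
    overlap_error = None
except ValueError as exc:
    overlap_error = exc

print(df)
print(partial)
print(repr(overlap_error))
```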
Return a Series containing counts of unique rows in the DataFrame.
New in version 1.1.0.
Parameters:
subset (list-like, optional) – Columns to use when counting unique combinations.
normalize (bool, default False) – Return proportions rather than frequencies.
sort (bool, default True) – Sort by frequencies.
ascending (bool, default False) – Sort in ascending order.
dropna (bool, default True) –
Don’t include counts of rows that contain NA values.
New in version 1.3.0.
Return type:
Series
See also
Series.value_counts
Equivalent method on Series.
Notes
The returned Series will have a MultiIndex with one level per input
column. By default, rows that contain any NA values are omitted from
the result. By default, the resulting Series will be in descending
order so that the first element is the most frequently-occurring row.
With dropna set to False we can also count rows with NA values.
>>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
...                    'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
>>> df
  first_name middle_name
0       John       Smith
1       Anne        <NA>
2       John        <NA>
3       Beth      Louise
>>> df.value_counts()
first_name  middle_name
Beth        Louise         1
John        Smith          1
dtype: int64
>>> df.value_counts(dropna=False)
first_name  middle_name
Anne        NaN            1
Beth        Louise         1
John        Smith          1
            NaN            1
dtype: int64
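The remaining parameters (normalize, ascending, subset) can be illustrated with a small sketch. The frame below is hypothetical and only serves to show the default ordering and the effect of each option.

```python
import pandas as pd

# Hypothetical frame with one repeated row: (4, 0) appears twice.
df = pd.DataFrame({"num_legs": [2, 4, 4, 6],
                   "num_wings": [2, 0, 0, 0]},
                  index=["falcon", "dog", "cat", "ant"])

counts = df.value_counts()                      # most frequent row first
shares = df.value_counts(normalize=True)        # proportions instead of counts
rarest_first = df.value_counts(ascending=True)  # least frequent row first
legs_only = df.value_counts(subset=["num_legs"])

print(counts)
print(shares)
```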
This method takes a key argument to select data at a particular
level of a MultiIndex.
Parameters:
key (label or tuple of label) – Label contained in the index, or partially in a MultiIndex.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to retrieve cross-section on.
level (object, defaults to first n levels (n=1 or len(key))) – In case of a key partially contained in a MultiIndex, indicate
which levels are used. Levels can be referred by label or position.
drop_level (bool, default True) – If False, returns object with same levels as self.
Returns:
Cross-section from the original Series or DataFrame
corresponding to the selected index levels.
Return type:
Series or DataFrame
See also
DataFrame.loc
Access a group of rows and columns by label(s) or a boolean array.
DataFrame.iloc
Purely integer-location based indexing for selection by position.
Notes
xs cannot be used to set values.
MultiIndex Slicers is a generic way to get/set values on
any level or levels.
It is a superset of xs functionality; see
MultiIndex Slicers.
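A short sketch of the key, level, drop_level and axis parameters, using a hypothetical three-level index:

```python
import pandas as pd

# Hypothetical frame indexed by (class, animal, locomotion).
d = {"num_legs": [4, 4, 2, 2],
     "num_wings": [0, 0, 2, 2],
     "class": ["mammal", "mammal", "mammal", "bird"],
     "animal": ["cat", "dog", "bat", "penguin"],
     "locomotion": ["walks", "walks", "flies", "walks"]}
df = pd.DataFrame(data=d).set_index(["class", "animal", "locomotion"])

mammals = df.xs("mammal")            # key matched on the first level
cats = df.xs("cat", level=1)         # key matched on a given level
cats_full = df.xs("cat", level="animal", drop_level=False)  # keep all levels
wings = df.xs("num_wings", axis=1)   # cross-section of a column

print(mammals)
print(cats)
```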
Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use
at if you only need to get or set a single value in a DataFrame
or Series.
Raises:
KeyError –
If getting a value and ‘label’ does not exist in a DataFrame or
Series.
ValueError –
If row/column label pair is not a tuple or if any label from
the pair is not a scalar for DataFrame.
If label is list-like (excluding NamedTuple) for Series.
See also
DataFrame.at
Access a single value for a row/column pair by label.
DataFrame.iat
Access a single value for a row/column pair by integer position.
DataFrame.loc
Access a group of rows and columns by label(s).
DataFrame.iloc
Access a group of rows and columns by integer position(s).
Series.at
Access a single value by label.
Series.iat
Access a single value by integer position.
Series.loc
Access a group of rows by label(s).
Series.iloc
Access a group of rows by integer position(s).
Notes
See Fast scalar value getting and setting
for more details.
Examples
>>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
...                   index=[4, 5, 6], columns=['A', 'B', 'C'])
>>> df
    A   B   C
4   0   2   3
5   0   4   1
6  10  20  30
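Getting and setting a single value with at on the frame above can be sketched as follows; the missing label 99 is an arbitrary choice to trigger the KeyError described under Raises.

```python
import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=["A", "B", "C"])

value = df.at[4, "B"]          # get the value at row label 4, column "B"
df.at[4, "B"] = 10             # set the value at the same position
row_value = df.loc[5].at["B"]  # get a value from within a Series

# Reading a label that does not exist raises KeyError.
try:
    df.at[99, "B"]
    missing = None
except KeyError as exc:
    missing = exc

print(value, df.at[4, "B"], row_value)
```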
This returns a Series with the data type of each column.
The result’s index is the original DataFrame’s columns. Columns
with mixed types are stored with the object dtype. See
the User Guide for more.
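As a minimal illustration, assuming a hypothetical frame with float, integer, datetime and mixed-type columns:

```python
import pandas as pd

df = pd.DataFrame({"float": [1.0, 2.0],
                   "int": [1, 2],
                   "when": pd.to_datetime(["2018-03-10", "2018-03-11"]),
                   "mixed": [1, "a"]})  # mixed types are stored as object

dtypes = df.dtypes  # a Series indexed by the column names
print(dtypes)
```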
Get the properties associated with this pandas object.
The available flags are
Flags.allows_duplicate_labels
See also
Flags
Flags that apply to pandas objects.
DataFrame.attrs
Global metadata applying to this dataset.
Notes
“Flags” differ from “metadata”. Flags reflect properties of the
pandas object (the Series or DataFrame). Metadata refer to properties
of the dataset, and should be stored in DataFrame.attrs.
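The distinction between flags and metadata can be sketched as below. The attrs key and value are hypothetical placeholders.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2]})

# Flags describe the pandas object itself.
default_flag = df.flags.allows_duplicate_labels

# set_flags returns a new object with the flag changed;
# the original frame keeps its own flag.
strict = df.set_flags(allows_duplicate_labels=False)

# Metadata about the dataset belongs in attrs (hypothetical entry).
df.attrs["source"] = "hypothetical-sensor-export"

print(default_flag, strict.flags.allows_duplicate_labels, df.attrs)
```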
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to
length-1 of the axis), but may also be used with a boolean
array.
Allowed inputs are:
An integer, e.g. 5.
A list or array of integers, e.g. [4,3,0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or
DataFrame) and that returns valid output for indexing (one of the above).
This is useful in method chains, when you don’t have a reference to the
calling object, but would like to base your selection on some value.
A tuple of row and column indexes. The tuple elements consist of one of the
above inputs, e.g. (0,1).
.iloc will raise IndexError if a requested indexer is
out-of-bounds, except slice indexers which allow out-of-bounds
indexing (this conforms with python/numpy slice semantics).
See more at Selection by Position.
See also
DataFrame.iat
Fast integer location scalar accessor.
DataFrame.loc
Purely label-location based indexer for selection by label.
Series.iloc
Purely integer-location based indexing for selection by position.
Examples
>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
...           {'a': 100, 'b': 200, 'c': 300, 'd': 400},
...           {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
>>> df = pd.DataFrame(mydict)
>>> df
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000
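Each of the allowed inputs listed earlier can be tried against this frame. The following is a sketch rather than an exhaustive tour; the out-of-bounds position 10 is an arbitrary choice.

```python
import pandas as pd

mydict = [{"a": 1, "b": 2, "c": 3, "d": 4},
          {"a": 100, "b": 200, "c": 300, "d": 400},
          {"a": 1000, "b": 2000, "c": 3000, "d": 4000}]
df = pd.DataFrame(mydict)

first_row = df.iloc[0]                  # integer -> Series
two_rows = df.iloc[[0, 1]]              # list of integers -> DataFrame
sliced = df.iloc[1:10]                  # out-of-bounds slice is allowed
masked = df.iloc[[True, False, True]]   # boolean array
even_rows = df.iloc[lambda x: x.index % 2 == 0]  # callable
cell = df.iloc[0, 1]                    # row and column positions -> scalar
block = df.iloc[[0, 2], [1, 3]]         # lists for both axes

# A single out-of-bounds integer raises IndexError.
try:
    df.iloc[10]
    out_of_bounds = None
except IndexError as exc:
    out_of_bounds = exc

print(cell)
print(block)
```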
Slice with integer labels for rows. As mentioned above, note that both
the start and stop of the slice are included.
>>> df.loc[7:9]
   max_speed  shield
7          1       2
8          4       5
9          7       8
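To make the inclusive-endpoint rule concrete, the sketch below contrasts a label slice with a positional slice. The frame construction is an assumption reconstructed from the output shown above, not part of the original example.

```python
import pandas as pd

# Assumed reconstruction of the frame behind the output above.
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=[7, 8, 9], columns=["max_speed", "shield"])

by_label = df.loc[7:8]      # labels 7 and 8: both endpoints are included
by_position = df.iloc[0:1]  # position 0 only: the stop is excluded

print(by_label)
print(by_position)
```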
Getting values with a MultiIndex
A number of examples using a DataFrame with a MultiIndex
>>> tuples = [
...     ('cobra', 'mark i'), ('cobra', 'mark ii'),
...     ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
...     ('viper', 'mark ii'), ('viper', 'mark iii')
... ]
>>> index = pd.MultiIndex.from_tuples(tuples)
>>> values = [[12, 2], [0, 4], [10, 20],
...           [1, 4], [7, 1], [16, 36]]
>>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
>>> df
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36
Single label. Note this returns a DataFrame with a single index.
>>> df.loc['cobra']
         max_speed  shield
mark i          12       2
mark ii          0       4
Single index tuple. Note this returns a Series.
>>> df.loc[('cobra', 'mark ii')]
max_speed    0
shield       4
Name: (cobra, mark ii), dtype: int64
Single label for row and column. Similar to passing in a tuple, this
returns a Series.
>>> df.loc['cobra', 'mark i']
max_speed    12
shield        2
Name: (cobra, mark i), dtype: int64
Single tuple. Note using [[]] returns a DataFrame.
>>> df.loc[[('cobra', 'mark ii')]]
               max_speed  shield
cobra mark ii          0       4
Single tuple for the index with a single label for the column
>>> df.loc[('cobra', 'mark i'), 'shield']
2
Slice from index tuple to single label
>>> df.loc[('cobra', 'mark i'):'viper']
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36
Slice from index tuple to index tuple
>>> df.loc[('cobra', 'mark i'):('viper', 'mark ii')]
                    max_speed  shield
cobra      mark i          12       2
           mark ii          0       4
sidewinder mark i          10      20
           mark ii          1       4
viper      mark ii          7       1
Please see the user guide
for more details and explanations of advanced indexing.
Only the values in the DataFrame will be returned; the axes labels
will be removed.
Returns:
The values of the DataFrame.
Return type:
numpy.ndarray
See also
DataFrame.to_numpy
Recommended alternative to this method.
DataFrame.index
Retrieve the index labels.
DataFrame.columns
Retrieve the column names.
Notes
The dtype will be a lower-common-denominator dtype (implicit
upcasting); that is to say if the dtypes (even of numeric types)
are mixed, the one that accommodates all will be chosen. Use this
with care if you are not dealing with the blocks.
For example, if the dtypes are float16 and float32, dtype will be upcast to
float32. If dtypes are int32 and uint8, dtype will be upcast to
int32. By numpy.find_common_type() convention, mixing int64
and uint64 will result in a float64 dtype.
Examples
A DataFrame where all columns are the same type (e.g., int64) results
in an array of the same type.
A DataFrame with mixed type columns (e.g., str/object, int64, float32)
results in an ndarray of the broadest type that accommodates these
mixed types (e.g., object).
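The two cases above, plus the upcasting rules from the Notes, can be sketched as follows. All frames are hypothetical, and the object column is given an explicit dtype so the result does not depend on string-dtype inference.

```python
import numpy as np
import pandas as pd

# All columns share one dtype: the array keeps that dtype.
same = pd.DataFrame({"age": [3, 29], "height": [94, 170]})
same_arr = same.values

# Mixed numeric dtypes are upcast to one that accommodates all columns.
floats = pd.DataFrame({"a": np.array([1.0], dtype="float16"),
                       "b": np.array([1.0], dtype="float32")})
ints = pd.DataFrame({"a": np.array([1], dtype="int32"),
                     "b": np.array([1], dtype="uint8")})

# Object and float columns together fall back to object.
mixed = pd.DataFrame({"name": pd.Series(["parrot", "lion"], dtype=object),
                      "speed": [24.0, 80.5]})

print(same_arr.dtype, floats.values.dtype, ints.values.dtype, mixed.values.dtype)
```

DataFrame.to_numpy, listed under See also as the recommended alternative, returns the same array for the single-dtype frame here.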