Convert Pandas DataFrame to NumPy Array
In data analysis and machine learning tasks, we often work with two popular libraries: Pandas and NumPy. Pandas provides a high-level data structure called the DataFrame, which lets us store and manipulate structured data efficiently. NumPy provides large, multi-dimensional arrays and mathematical functions that operate on them.
Sometimes we need to convert a Pandas DataFrame to a NumPy array, for example to run efficient mathematical computations or to pass data to machine learning libraries that expect arrays. In this article, we will explore various ways to convert a Pandas DataFrame into a NumPy array, along with code examples and their outputs.
Methods to Convert a Pandas DataFrame to a NumPy Array
The simplest way to convert a Pandas DataFrame to a NumPy array is to use the values attribute of the DataFrame, which returns a NumPy array containing the underlying data.
Example 1: Using the values Attribute
Here’s an example:
import pandas as pd
import numpy as np
# Create a DataFrame
data = {'Col1': [1, 2, 3], 'Col2': [4, 5, 6], 'Col3': [7, 8, 9]}
df = pd.DataFrame(data)
# Convert DataFrame to NumPy array
array = df.values
print(array)
Output:
[[1 4 7]
 [2 5 8]
 [3 6 9]]
In this example, we create a simple DataFrame with three columns and three rows. We then use the values attribute to convert the DataFrame to a NumPy array. Finally, we print the resulting array.
Example 2: Using the to_numpy() Method
Another way to convert a Pandas DataFrame to a NumPy array is to use the to_numpy() method. This method has been available since Pandas 0.24.0, and the Pandas documentation recommends it over the values attribute.
Here’s an example:
import pandas as pd
import numpy as np
# Create a DataFrame
data = {'Col1': [1, 1, 1], 'Col2': [2, 2, 2], 'Col3': [3, 3, 3]}
df = pd.DataFrame(data)
# Convert DataFrame to NumPy array
array = df.to_numpy()
print(array)
Output:
[[1 2 3]
 [1 2 3]
 [1 2 3]]
In this example, we create a DataFrame and use the to_numpy() method to convert it to a NumPy array. Finally, we print the resulting array.
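The to_numpy() method also accepts optional dtype and na_value parameters that control the element type and the replacement for missing entries. The following is a minimal sketch, assuming Pandas 1.1 or later (where na_value is available); the column contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: an integer column and a float column with one gap
df = pd.DataFrame({"Col1": [1, 2, 3], "Col2": [0.5, np.nan, 2.5]})

# dtype forces a single element type for the whole array
as_float32 = df.to_numpy(dtype="float32")

# na_value replaces missing entries during the conversion
no_gaps = df.to_numpy(dtype="float64", na_value=0.0)

print(as_float32.dtype)
print(no_gaps)
```

Passing dtype explicitly is useful when a downstream library expects a particular precision, since the default result type depends on the mix of column types.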
Example 3: Using the to_records() Method
The to_records() method of a Pandas DataFrame provides another way to convert it. It returns a NumPy record array (numpy.recarray) in which each row of the DataFrame is one record and each column is a named field. By default, the DataFrame index is included as the first field.
Here’s an example:
import pandas as pd
import numpy as np
# Create a DataFrame
data = {'Col1': [1, 2, 3], 'Col2': [4, 5, 6], 'Col3': [7, 8, 9]}
df = pd.DataFrame(data)
# Convert DataFrame to NumPy array
array = df.to_records()
print(array)
Output:
[(0, 1, 4, 7) (1, 2, 5, 8) (2, 3, 6, 9)]
In this example, we create a DataFrame and use the to_records() method to convert it to a NumPy record array. The first value in each record is the row's index label. Finally, we print the resulting array.
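If the index is not wanted in the records, to_records() takes an index parameter. The sketch below reuses the same DataFrame as Example 3 and shows how the fields of the record array can be read by column name.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2, 3], "Col2": [4, 5, 6], "Col3": [7, 8, 9]})

# index=False leaves the DataFrame index out of each record
records = df.to_records(index=False)

print(records.dtype.names)  # field names taken from the column labels
print(records["Col2"])      # all values of one field
print(records[0])           # the first row as a single record
```

Unlike the plain two-dimensional array from to_numpy(), a record array keeps a separate data type per field, so mixed-type columns are not forced into one common type.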
Converting Specific Columns to Numpy Arrays
In some cases, we may only be interested in converting specific columns of a DataFrame to NumPy arrays. We can achieve this by selecting the desired columns with the indexing operator [] before converting.
Example 4:
import pandas as pd
import numpy as np
# Create a DataFrame
data = {'Col1': [1, 2, 3], 'Col2': [4, 5, 6], 'Col3': [7, 8, 9]}
df = pd.DataFrame(data)
# Convert specific columns to NumPy arrays
col1_array = df['Col1'].values
col2_array = df['Col2'].values
print(col1_array)
print(col2_array)
Output:
[1 2 3]
[4 5 6]
In this example, we create a DataFrame and convert specific columns (Col1 and Col2) to NumPy arrays using the values attribute. Finally, we print the resulting arrays.
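Several columns can also be converted together. The sketch below, using the same DataFrame as Example 4, contrasts selecting a list of columns (which yields a two-dimensional array) with selecting a single column (which yields a one-dimensional array).

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2, 3], "Col2": [4, 5, 6], "Col3": [7, 8, 9]})

# A list of column names selects a sub-DataFrame, so the result is 2-D
subset = df[["Col1", "Col2"]].to_numpy()
print(subset.shape)

# A single column name selects a Series, so the result is 1-D
single = df["Col1"].to_numpy()
print(single.shape)
```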
Handling Missing Values
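When converting a DataFrame with missing values to a NumPy array, the missing values are represented as NaN in the resulting array. Because NaN is a floating-point value, an all-numeric DataFrame that contains missing values converts to a float64 array, even if some columns hold only integers. We can handle missing values using various techniques, such as filling them with a default value or dropping the rows or columns that contain them.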
Example 5:
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
data = {'Col1': [1, np.nan, 3], 'Col2': [4, 5, np.nan], 'Col3': [7, 8, 9]}
df = pd.DataFrame(data)
# Convert DataFrame to NumPy array with missing values
array_with_nan = df.to_numpy()
print(array_with_nan)
Output:
[[ 1.  4.  7.]
 [nan  5.  8.]
 [ 3. nan  9.]]
In this example, we create a DataFrame with missing values (represented as NaN) and convert it to a NumPy array using the to_numpy() method. Finally, we print the resulting array.
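Once the data is in a NumPy array, the missing entries can be located with np.isnan. The following sketch uses the same DataFrame as Example 5; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, np.nan, 3], "Col2": [4, 5, np.nan], "Col3": [7, 8, 9]})
array = df.to_numpy()

# The array is float64 because NaN is a floating-point value
print(array.dtype)

# Boolean mask that is True wherever a value is missing
mask = np.isnan(array)
print(mask.sum())         # number of missing entries
print(np.argwhere(mask))  # (row, column) positions of the missing entries
```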
Filling Missing Values with a Default Value
To fill missing values with a default value before converting the DataFrame to a NumPy array, we can use the fillna() method of the DataFrame.
Example 6:
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
data = {'Col1': [1, np.nan, 3], 'Col2': [4, 5, np.nan], 'Col3': [7, 8, 9]}
df = pd.DataFrame(data)
# Fill missing values with a default value
df_filled = df.fillna(0)
# Convert DataFrame with filled missing values to NumPy array
array_filled = df_filled.to_numpy()
print(array_filled)
Output:
[[1. 4. 7.]
 [0. 5. 8.]
 [3. 0. 9.]]
In this example, we create a DataFrame with missing values and use the fillna() method to fill the missing values with a default value of 0. We then convert the filled DataFrame to a NumPy array using the to_numpy() method. Finally, we print the resulting array.
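A single default such as 0 is not always appropriate. As one possible alternative, the sketch below fills each column with its own mean by passing the result of df.mean() to fillna(); choosing the mean here is an assumption for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, np.nan, 3], "Col2": [4, 5, np.nan], "Col3": [7, 8, 9]})

# df.mean() returns one value per column (NaN entries are skipped);
# fillna() matches those values to columns by label
column_means = df.mean()
array_mean_filled = df.fillna(column_means).to_numpy()

print(array_mean_filled)
```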
Dropping Rows or Columns with Missing Values
Alternatively, we can drop rows or columns containing missing values from the DataFrame before converting it to a NumPy array. We can use the dropna() method of the DataFrame to accomplish this.
Example 7:
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
data = {'Col1': [1, np.nan, 3], 'Col2': [4, 5, np.nan], 'Col3': [7, 8, 9]}
df = pd.DataFrame(data)
# Drop rows or columns with missing values
df_dropped_rows = df.dropna() # Drops rows with missing values
df_dropped_columns = df.dropna(axis=1) # Drops columns with missing values
# Convert DataFrames without missing values to NumPy arrays
array_dropped_rows = df_dropped_rows.to_numpy()
array_dropped_columns = df_dropped_columns.to_numpy()
print(array_dropped_rows)
print(array_dropped_columns)
Output:
[[1. 4. 7.]]
[[7]
 [8]
 [9]]
In this example, we create a DataFrame with missing values and call the dropna() method twice: once to drop the rows that contain missing values and once (with axis=1) to drop the columns that contain them. We then convert the two resulting DataFrames to NumPy arrays using the to_numpy() method. Finally, we print the resulting arrays.
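Dropping every row with any missing value can discard more data than necessary. The sketch below uses the subset parameter of dropna() to drop only the rows where a chosen column is missing; treating Col1 as the required column is an assumption made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, np.nan, 3], "Col2": [4, 5, np.nan], "Col3": [7, 8, 9]})

# Only rows with a missing value in Col1 are dropped;
# missing values in other columns are kept
kept = df.dropna(subset=["Col1"]).to_numpy()

print(kept.shape)
print(kept)
```

The resulting array can still contain NaN in the columns that were not listed in subset, so a later fill or mask step may be needed.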
Converting Object Data Types to Numpy Arrays
A NumPy array has a single data type, so converting a DataFrame combines all of its columns into one common type. When every column is numeric, the result is a numeric type that can hold all of them (for example, int64 and float64 columns together yield float64). However, if the DataFrame mixes numeric columns with columns of object data type (e.g., strings), the resulting array has the object data type. Such an array stores Python objects, which makes numerical operations on it slower.
Example 8:
import pandas as pd
import numpy as np
# Create a DataFrame with object data types
data = {'Col1': ['numpywhere.com', 'geek-docs.com', 'deepinout.com'], 'Col2': [2024, 2023, 2022]}
df = pd.DataFrame(data)
# Convert DataFrame with object data types to NumPy array
array_object = df.to_numpy()
print(array_object)
print(array_object.dtype)
Output:
[['numpywhere.com' 2024]
 ['geek-docs.com' 2023]
 ['deepinout.com' 2022]]
object
In this example, we create a DataFrame with string values in the Col1 column and integer values in the Col2 column. We convert the DataFrame to a NumPy array, which results in an array with object data type because strings and integers have no common numeric type. Finally, we print the resulting array and its data type.
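When only the numeric part of such a DataFrame is needed, the columns can be filtered by type before converting, which avoids the object data type. The following is a minimal sketch using select_dtypes() on the same DataFrame as Example 8.

```python
import numpy as np
import pandas as pd

data = {'Col1': ['numpywhere.com', 'geek-docs.com', 'deepinout.com'], 'Col2': [2024, 2023, 2022]}
df = pd.DataFrame(data)

# Keep only the numeric columns, then convert
numeric_array = df.select_dtypes(include="number").to_numpy()

print(numeric_array)
print(numeric_array.dtype)
```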
Converting Pandas Categorical Data to Numpy Arrays
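Pandas provides a data type called Categorical that represents categorical data in a memory-efficient manner. NumPy has no equivalent data type, so converting a DataFrame with categorical columns yields the category values themselves rather than a categorical array. When those values are strings and other columns are numeric, the resulting array has the object data type.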
Example 9:
import pandas as pd
import numpy as np
# Create a DataFrame with a categorical column
data = {'Col1': ['a', 'b', 'c', 'a'], 'Col2': [1, 2, 3, 4]}
df = pd.DataFrame(data)
df['Col1'] = df['Col1'].astype('category')
# Convert DataFrame with categorical column to NumPy array
array_categorical = df.to_numpy()
print(array_categorical)
print(array_categorical.dtype)
Output:
[['a' 1]
 ['b' 2]
 ['c' 3]
 ['a' 4]]
object
In this example, we create a DataFrame with a categorical column (Col1) and an integer column (Col2). We convert the DataFrame to a NumPy array; the result holds the category labels of Col1 as plain strings in an array of object data type, so the categorical data type itself is not carried over. Finally, we print the resulting array and its data type.
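If a compact numeric representation of the categories is wanted, the integer codes that Pandas stores for a categorical column can be converted instead of the labels. The sketch below uses the cat accessor on the same data as Example 9; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'b', 'c', 'a'], 'Col2': [1, 2, 3, 4]})
df['Col1'] = df['Col1'].astype('category')

# Integer code of each row's category
codes = df['Col1'].cat.codes.to_numpy()

# Lookup table that maps each code back to its label
categories = df['Col1'].cat.categories.to_numpy()

print(codes)
print(categories)
```

Keeping both arrays makes it possible to recover the original labels later with categories[codes].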
Conclusion
Converting a Pandas DataFrame to a NumPy array can be advantageous in various scenarios, especially when working with mathematical computations, machine learning algorithms, or other libraries that expect NumPy arrays as inputs. In this article, we explored several ways to convert a DataFrame to a NumPy array: the values attribute, the to_numpy() method, and the to_records() method.
We also discussed how to handle missing values, convert specific columns, and deal with object and categorical data types during the conversion process. By understanding these conversion techniques, you can use both libraries together to analyze and manipulate your data efficiently.