This tutorial demonstrates how to classify structured data (e.g. tabular data in a CSV). We will use Keras to define the model, and feature columns as a bridge to map from columns in a CSV to features used to train the model. This tutorial contains complete code to:
Build an input pipeline to batch and shuffle the rows using tf.data.
Map from columns in the CSV to features used to train the model using feature columns.
Build, train, and evaluate a model using Keras.
The Dataset
We will use a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. There are several hundred rows in the CSV. Each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task.
Following is a description of this dataset. Notice there are both numeric and categorical columns.
Column
Description
Feature Type
Data Type
Age
Age in years
Numerical
integer
Sex
(1 = male; 0 = female)
Categorical
integer
CP
Chest pain type (0, 1, 2, 3, 4)
Categorical
integer
Trestbpd
Resting blood pressure (in mm Hg on admission to the hospital)
from tensorflow import feature_column from tensorflow.keras import layers from sklearn.model_selection import train_test_split
Use Pandas to create a dataframe
Pandas is a Python library with many helpful utilities for loading and working with structured data. We will use Pandas to download the dataset from a URL, and load it into a dataframe.
Split the dataframe into train, validation, and test
The dataset we downloaded was a single CSV file. We will split this into train, validation, and test sets.
train, test = train_test_split(dataframe, test_size=0.2) train, val = train_test_split(train, test_size=0.2) print(len(train),'train examples') print(len(val),'validation examples') print(len(test),'test examples')
193 train examples
49 validation examples
61 test examples
Create an input pipeline using tf.data
Next, we will wrap the dataframes with tf.data. This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model. If we were working with a very large CSV file (so large that it does not fit into memory), we would use tf.data to read it from disk directly. That is not covered in this tutorial.
# A utility method to create a tf.data dataset from a Pandas Dataframe def df_to_dataset(dataframe, shuffle=True, batch_size=32): dataframe = dataframe.copy() labels = dataframe.pop('target') ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels)) if shuffle: ds = ds.shuffle(buffer_size=len(dataframe)) ds = ds.batch(batch_size) return ds
batch_size =5# A small batch sized is used for demonstration purposes train_ds = df_to_dataset(train, batch_size=batch_size) val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size) test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
Understand the input pipeline
Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable.
for feature_batch, label_batch in train_ds.take(1): print('Every feature:', list(feature_batch.keys())) print('A batch of ages:', feature_batch['age']) print('A batch of targets:', label_batch )
Every feature: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
A batch of ages: tf.Tensor([44 59 43 66 61], shape=(5,), dtype=int32)
A batch of targets: tf.Tensor([0 0 0 0 0], shape=(5,), dtype=int32)
We can see that the dataset returns a dictionary of column names (from the dataframe) that map to column values from rows in the dataframe.
Demonstrate several types of feature column
TensorFlow provides many types of feature columns. In this section, we will create several types of feature columns, and demonstrate how they transform a column from the dataframe.
# We will use this batch to demonstrate several types of feature columns example_batch =next(iter(train_ds))[0]
# A utility method to create a feature column # and to transform a batch of data def demo(feature_column): feature_layer = layers.DenseFeatures(feature_column) print(feature_layer(example_batch).numpy())
Numeric columns
The output of a feature column becomes the input to the model (using the demo function defined above, we will be able to see exactly how each column from the dataframe is transformed). A numeric column is the simplest type of column. It is used to represent real valued features. When using this column, your model will receive the column value from the dataframe unchanged.
age = feature_column.numeric_column("age") demo(age)
[[50.]
[59.]
[69.]
[57.]
[70.]]
In the heart disease dataset, most columns from the dataframe are numeric.
Bucketized columns
Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider raw data that represents a person's age. Instead of representing age as a numeric column, we could split the age into several buckets using a bucketized column. Notice the one-hot values below describe which age range each row matches.
In this dataset, thal is represented as a string (e.g. 'fixed', 'normal', or 'reversible'). We cannot feed strings directly to a model. Instead, we must first map them to numeric values. The categorical vocabulary columns provide a way to represent strings as a one-hot vector (much like you have seen above with age buckets). The vocabulary can be passed as a list using categorical_column_with_vocabulary_list, or loaded from a file using categorical_column_with_vocabulary_file.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4267: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4322: VocabularyListCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
[[0. 0. 1.]
[0. 1. 0.]
[0. 1. 0.]
[0. 1. 0.]
[0. 0. 1.]]
In a more complex dataset, many columns would be categorical (e.g. strings). Feature columns are most valuable when working with categorical data. Although there is only one categorical column in this dataset, we will use it to demonstrate several important types of feature columns that you could use when working with other datasets.
Embedding columns
Suppose instead of having just a few possible strings, we have thousands (or more) values per category. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using one-hot encodings. We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1. The size of the embedding (8, in the example below) is a parameter that must be tuned.
# Notice the input to the embedding column is the categorical column # we previously created thal_embedding = feature_column.embedding_column(thal, dimension=8) demo(thal_embedding)
Another way to represent a categorical column with a large number of values is to use a categorical_column_with_hash_bucket. This feature column calculates a hash value of the input, then selects one of the hash_bucket_size buckets to encode a string. When using this column, you do not need to provide the vocabulary, and you can choose to make the number of hash_buckets significantly smaller than the number of actual categories to save space.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4322: HashedCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
Crossed feature columns
Combining features into a single feature, better known as feature crosses, enables a model to learn separate weights for each combination of features. Here, we will create a new feature that is the cross of age and thal. Note that crossed_column does not build the full table of all possible combinations (which could be very large). Instead, it is backed by a hashed_column, so you can choose how large the table is.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4322: CrossedColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
Choose which columns to use
We have seen how to use several types of feature columns. Now we will use them to train a model. The goal of this tutorial is to show you the complete code (e.g. mechanics) needed to work with feature columns. We have selected a few columns to train our model below arbitrarily.
feature_columns =[]
# numeric cols for header in['age','trestbps','chol','thalach','oldpeak','slope','ca']: feature_columns.append(feature_column.numeric_column(header))
The best way to learn more about classifying structured data is to try it yourself. We suggest finding another dataset to work with, and training a model to classify it using code similar to the above. To improve accuracy, think carefully about which features to include in your model, and how they should be represented.
Pooling is performed in neural networks to reduce variance and computation complexity. Many times, beginners blindly use a pooling method without knowing the reason for using it. Here is a comparison of three basic pooling methods that are widely used. The three types of pooling operations are: Max pooling: The maximum pixel value of the batch is selected. Min pooling: The minimum pixel value of the batch is selected. Average pooling: The average value of all the pixels in the batch is selected. The batch here means a group of pixels of size equal to the filter size which is decided based on the size of the image. In the following example, a filter of 9x9 is chosen. The output of the pooling method varies with the varying value of the filter size. The operations are illustrated through the following figures. Average, Max and Min pooling of size 9x9 applied on an image We cannot say that a particular pooling method is better over others generally. The choice of pooling operation is made...
Percentiles Percentile: the value below which a percentage of data falls. Example: You are the fourth tallest person in a group of 20 80% of people are shorter than you: That means you are at the 80th percentile . If your height is 1.85m then "1.85m" is the 80th percentile height in that group. In Order Have the data in order , so you know which values are above and below. To calculate percentiles of height: have the data in height order (sorted by height). To calculate percentiles of age: have the data in age order. And so on. Grouped Data When the data is grouped: Add up all percentages below the score, plus half the percentage at the score. Example: You Score a B! In the test 12% got D, 50% got C, 30% got B and 8% got A You got a B, so add up all the 12% that got D, all the 50% that got C, half of the 30% that got B, for a total percentile of 12% + 50% + 15% = 77% In other words you...
What is Data Warehouse? A Data Warehouse collects and manages data from varied sources to provide meaningful business insights. It is a collection of data that is separate from the operational systems and supports the decision making of the company. In Data Warehouse, data is stored from a historical perspective. The data in the warehouse is extracted from multiple functional units. It is checked, cleansed and then integrated with the Data warehouse system. Data warehouse used a very fast computer system having large storage capacity. This tool can answer any complex queries relating to data. What is Data Mart? A data mart is a simple form of a Data Warehouse. It is focused on a single subject. Data Mart draws data from only a few sources. These sources may be central Data warehouse, internal operational systems, or external data sources. A Data Mart is an index and extraction system. It is an important subset of a data warehouse. It is subject-oriented, and it is desig...
Comments
Post a Comment