What's New
Vector Database and Embedding Support
You can use Featureform to define and orchestrate data pipelines that generate embeddings, and Featureform can write them into Redis for nearest neighbor lookup. This also lets users version, reuse, and manage embeddings declaratively.
Registering Redis for use as a Vector Store (the process is the same as registering it normally)
redis = ff.register_redis(
    name="redis",
    description="Example inference store",
    team="Featureform",
    host="0.0.0.0",
    port=6379,
)
A Pipeline to Generate Embeddings from Text
docs = spark.register_file(...)

@spark.df_transform(
    inputs=[docs],
)
def embed_docs(docs):
    docs["embedding"] = docs["text"].map(
        lambda txt: openai.Embedding.create(
            model="text-embedding-ada-002",
            input=txt,
        )["data"][0]["embedding"]
    )
    return docs
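For larger corpora, one request per row can be slow; the endpoint also accepts a list of inputs. A minimal batched variant of the same transformation (the embed_docs_batched name and the batch size of 100 are illustrative assumptions, not Featureform or OpenAI requirements):

import openai

@spark.df_transform(
    inputs=[docs],
)
def embed_docs_batched(docs):
    texts = docs["text"].tolist()
    embeddings = []
    for i in range(0, len(texts), 100):  # assumed batch size
        resp = openai.Embedding.create(
            model="text-embedding-ada-002",
            input=texts[i : i + 100],
        )
        # Sort by index so embeddings line up with their input texts.
        data = sorted(resp["data"], key=lambda d: d["index"])
        embeddings.extend(item["embedding"] for item in data)
    docs["embedding"] = embeddings
    return docs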
Defining and Versioning an Embedding
@ff.entity
class Article:
    embedding = ff.Embedding(embed_docs[["id", "embedding"]], dims=1536, vector_db=redis)

The same embedding pinned to an explicit variant:

@ff.entity
class Article:
    embedding = ff.Embedding(
        embed_docs[["id", "embedding"]],
        dims=1536,
        variant="test-variant",
        vector_db=redis,
    )
Performing a Nearest Neighbor Lookup
client.Nearest(Article.embedding, "id_123", 25)
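As a sketch of how this fits into serving code, a small helper around the same call. It assumes Nearest returns the ids of the closest entities; adapt to the actual return payload:

import featureform as ff

client = ff.Client(...)

def related_articles(article_id, k=25):
    # Look up the k entities whose stored embeddings are closest
    # to the embedding stored for article_id.
    return client.Nearest(Article.embedding, article_id, k)

print(related_articles("id_123"))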
Interact with Training Sets as Dataframes
You can already interact with sources as dataframes; this release adds the same functionality to training sets.
Interacting with a training set as a Pandas DataFrame
import featureform as ff
client = ff.Client(...)
df = client.training_set("fraud", "simple").dataframe()
print(df.head())
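Because the result is an ordinary Pandas DataFrame, it plugs straight into the rest of the Python ecosystem. A sketch with scikit-learn, assuming the label sits in the last column and the features are numeric (assumptions about this particular training set, not guarantees):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = client.training_set("fraud", "simple").dataframe()

# Assumption: features first, label in the last column.
X, y = df.iloc[:, :-1], df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))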
Enhanced Scheduling across Offline Stores
Featureform supports cron syntax for scheduling transformations. This release rebuilds that functionality to make it more stable and efficient, and adds more verbose error messages.
A transformation that runs every hour on Snowflake
@snowflake.sql_transform(schedule="0 * * * *")
def avg_transaction_price():
    return "SELECT user, AVG(price) FROM {{transaction}} GROUP BY user"
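The schedule takes any standard five-field cron expression; "0 * * * *" above fires at the top of every hour. A nightly variant (the transformation name and query here are illustrative):

@snowflake.sql_transform(schedule="0 0 * * *")  # every day at midnight
def daily_transaction_count():
    return "SELECT user, COUNT(*) AS txns FROM {{transaction}} GROUP BY user"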
Run Pandas Transformations on K8s with S3
Featureform schedules and runs your transformations for you. When you write a transformation in Pandas, Featureform spins up a Kubernetes job to run it. This isn't a replacement for distributed processing frameworks like Spark (which we also support), but it's a great option for teams already using Pandas in production.
Defining our Pandas on Kubernetes Provider
aws_creds = ff.AWSCredentials(
    aws_access_key_id="<aws_access_key_id>",
    aws_secret_access_key="<aws_secret_access_key>",
)

s3 = ff.register_s3(
    name="s3",
    credentials=aws_creds,
    bucket_path="<s3_bucket_path>",
    bucket_region="<s3_bucket_region>",
)

pandas_k8s = ff.register_k8s(
    name="k8s",
    description="Native featureform kubernetes compute",
    store=s3,
    team="featureform-team",
)
Registering a file in S3 and a Transformation on it
src = pandas_k8s.register_file(...)

@pandas_k8s.df_transform(inputs=[src])
def transform(src):
    return src.groupby("CustomerID")["TransactionAmount"].mean()
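Since the body is plain Pandas, the logic can be sanity-checked locally before Featureform runs it on Kubernetes. A self-contained check with made-up rows:

import pandas as pd

sample = pd.DataFrame({
    "CustomerID": ["c1", "c1", "c2"],
    "TransactionAmount": [10.0, 30.0, 5.0],
})

# Same logic as the transformation above: mean spend per customer.
print(sample.groupby("CustomerID")["TransactionAmount"].mean())
# CustomerID
# c1    20.0
# c2     5.0
# Name: TransactionAmount, dtype: float64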