Earnings call extraction demo #5
base: main
@willkurt I think this is ready for a review!
Good start! Here are a couple of things I'd like to see changed:
- I really want to see this use generate.regex to have this go straight to CSV. There are a couple of reasons for this:
  - This shows off something that simply cannot be done with JSON-mode on other platforms, and can be used as a great example for near term content being created as to why structured generation is not just JSON-mode.
  - The point of LLMs is to not write code, so there's no reason not to go straight to CSV
  - Right now this example mimics your other demos very closely, so there's not a lot of new insight into how to think about structured gen. Our support for Pydantic models is awesome, but not the only way to use Outlines.
  - It's good for you to become more familiar with using regular expressions for structured gen. At the end of the day the heart of structured generation is regular languages, and to help improve Outlines in the future, everyone on the team needs to have a deep understanding of this.
- I think this example could be simpler! We just want to give enough so that the user gets a feel for how they could extend it. Additionally, we're asking the model to do some things here that I'm not really sure it can do (do you earnestly think those are good buy/sell recommendations?). When I think of earnings call transcript extraction, I think mostly about not having to hand-extract certain figures (basically being able to replicate what ycharts has quickly). So the focus should be on a demo that actually works, even if that demo is smaller.
- Related: we absolutely need some sort of simple evaluation for this. People are inherently suspicious of LLMs, and doubly so in the case of financial data. As a reader I want to see that this can earnestly replace reading the earnings transcript.
  - These evals can be stupidly simple, we just need to show that this works.
  - For the demo to be ready, the results need to be good, but that can be achieved by sub-setting the problems to a case that works well.
- This one is optional, but it would be nice to remove the modal dependency so this demo can be easily run locally. Definitely more in the "nice to have" category, and of course depends on your own compute resources.
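One way such a "stupidly simple" evaluation could look, as a minimal sketch: compare each extracted CSV row against a small hand-labeled reference, field by field, and report the share of exact matches. The `field_accuracy` helper, the column names, and the sample values are all hypothetical placeholders, not results from this PR.

```python
import csv
import io


def parse_csv(text: str) -> list[dict[str, str]]:
    """Parse CSV text whose first line is the header into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text.strip())))


def field_accuracy(predicted: list[dict[str, str]], expected: list[dict[str, str]]) -> float:
    """Fraction of hand-labeled fields that the extraction reproduces exactly."""
    total = 0
    correct = 0
    for i, exp_row in enumerate(expected):
        # A missing predicted row counts as every field being wrong.
        pred_row = predicted[i] if i < len(predicted) else {}
        for field, exp_value in exp_row.items():
            total += 1
            pred_value = pred_row.get(field) or ""
            if pred_value.strip().lower() == (exp_value or "").strip().lower():
                correct += 1
    return correct / total if total else 0.0


# Hypothetical labels and model output, for illustration only.
expected_csv = """ticker,year,quarter,revenue_musd
ABC,2022,q1,1000
XYZ,2022,q1,250"""
predicted_csv = """ticker,year,quarter,revenue_musd
ABC,2022,q1,1000
XYZ,2022,q1,205"""

accuracy = field_accuracy(parse_csv(predicted_csv), parse_csv(expected_csv))
print(f"field accuracy: {accuracy:.3f}")
```

Exact string matching is deliberately strict here. For financial figures, a near miss such as 205 instead of 250 should count as an error rather than partial credit.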
Alright cool, thanks for the comments.
My sense is that the desired demo is not this demo, so we should decide how much of this to save elsewhere. I can use the same data stuff but most of this is not particularly applicable to the CSV example, with the exception of a few parts related to data processing. I do think a CSV example is a great idea and I'm happy to pivot towards that, so let me try a few things and see what I can do. I'll open a separate, simpler PR in case we want to do anything with this example as it stands.
Working example of CSV extraction, though maybe a bit verbose.
csv_pattern = r"company_name,company_ticker,year,quarter,quarterly_revenue,quarterly_revenue_growth\n(\w+?),([A-Z]+?),(\d{4}),(q[1-4]),([0-9]+?){1},(\d*|null){1}"
Which yields
Unfortunately, that 123 figure is not correct, since it refers to YoY growth and not quarterly growth. It's been tricky to get that to go away. I tried adding a "null" option for that field, but the model seems to have difficulty using it. Across all firms, this is
None of these are quite correct. There are name problems (the first column is tickers, not names), the revenues are incorrect, and revenue growth is either insane or the YoY growth rate rather than the quarterly growth rate. Here's an example of use:
import outlines
language_model = "microsoft/Phi-3-mini-128k-instruct"
model = outlines.models.transformers(
    language_model,
    device="cuda"
)

from transformers import AutoTokenizer

# Load the tokenizer
TOKENIZER = AutoTokenizer.from_pretrained(language_model)

def to_prompt(user_prompt="", system_prompt=""):
    chat = []
    if len(system_prompt) > 0:
        chat.append({'role':'system', 'content':system_prompt})
    if len(user_prompt) > 0:
        chat.append({'role':'user', 'content':user_prompt})
    tokenized_chat = TOKENIZER.apply_chat_template(
        chat,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    )
    decoded_chat = TOKENIZER.decode(tokenized_chat[0])
    return decoded_chat

# Example
to_prompt(
    user_prompt="Please extract the data from the following text and output it in CSV format:\n\n{data}\n\nYou should have columns for name, age, and occupation.",
    system_prompt="You extract data from text and output it in CSV format."
)
data = """
Thank you, Tejas, and good afternoon. Today, we are proud to announce Apple's biggest quarter ever. Through the busy holiday season, we set an all-time revenue record of nearly $124 billion, up 11% from last year and better than we had expected at the beginning of the quarter. And we are pleased to see that our active installed base of devices is now at a new record with more than 1.8 billion devices.
We set all-time records for both developed and emerging markets and saw revenue growth across all of our product categories, except for iPad, which we said would be supply constrained. As expected, in the aggregate, we experienced supply constraints that were higher than the September quarter. Before I discuss our results in greater detail, I want to first acknowledge the toll that COVID continues to have on communities around the world. In many places, case counts are higher and health systems more strained than at any point throughout the pandemic.
On behalf of all of us at Apple, I want to extend our deep gratitude to the scientists, doctors, nurses, and so many others on the front lines of combating COVID-19. This is our eighth quarter reporting results in the shadow of the pandemic. And while I can't say it gets any easier, I can say I'm incredibly proud of the way our teams have come together and continue to innovate on behalf of our customers. A few weeks ago, we marked the 15th anniversary of the day Steve revealed iPhone to the world.
"""
def prompt_for_csv(data: str) -> str:
    return to_prompt(
        system_prompt="""
You extract data from quarterly earnings call transcripts and output it in CSV format.
The CSV should have columns for company name, company ticker, revenue, and revenue growth.
""",
        user_prompt=f"""
Please extract the data from the following text and output it in CSV
format:\n\n{data}\n\n
You should have columns for company name, company_ticker, revenue, and revenue growth.
Revenue should be in units of millions of dollars, i.e.
- 92,000,000 means 92 million dollars
- 114 billion should be 114,000 million dollars
Revenue growth should be in units of percentage. Extract the quarterly growth, not the year-over-year growth.
When a value is not mentioned in the transcript, use "null" for that value. For example if the transcript
says "year over year revenue was up 10%" and quarterly revenue growth is not mentioned, then the
revenue growth should be set to null, as it is not mentioned in the transcript.
Be exact as possible. Use what is mentioned in the transcript.
"""
    )

print(prompt_for_csv(data))

csv_pattern = r"company_name,company_ticker,year,quarter,quarterly_revenue,quarterly_revenue_growth\n(\w+?),([A-Z]+?),(\d{4}),(q[1-4]),([0-9]+?){1},(\d*|null){1}"
csv_extractor = outlines.generate.regex(
    model,
    csv_pattern,
    sampler=outlines.samplers.multinomial()
)

def extract_csv(data: str) -> str:
    result = csv_extractor(prompt_for_csv(data), max_tokens=100)
    return result

print(extract_csv(data))
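A pattern like the one above can be checked without loading a model. The sketch below builds a comparable pattern from per-column fragments and tests it with the standard `re` module. The fragments are illustrative choices, not the pattern used in this PR. It also assumes the regex engine behind generate.regex accepts the same constructs, which should be verified before relying on it. Two differences from the pattern above: the name column allows spaces, and the growth column must be either a number or the literal null (`\d*|null` also accepts an empty string).

```python
import re

# Per-column fragments (hypothetical, chosen for illustration).
COMPANY_NAME = r"[A-Za-z][A-Za-z0-9&. ]{0,40}"  # allows multi-word names, no commas
TICKER = r"[A-Z]{1,5}"
YEAR = r"(19|20)\d{2}"
QUARTER = r"q[1-4]"
REVENUE = r"[1-9][0-9]{0,8}"  # whole millions of dollars
GROWTH = r"(-?[0-9]{1,3}(\.[0-9]{1,2})?|null)"  # signed percent, or the literal null

HEADER = "company_name,company_ticker,year,quarter,quarterly_revenue,quarterly_revenue_growth"
ROW = ",".join([COMPANY_NAME, TICKER, YEAR, QUARTER, REVENUE, GROWTH])
csv_pattern_v2 = HEADER + r"\n" + ROW


def is_valid(row: str) -> bool:
    """Check a single data row (without the header) against the full pattern."""
    return re.fullmatch(csv_pattern_v2, HEADER + "\n" + row) is not None


print(is_valid("Apple Inc,AAPL,2022,q1,124000,null"))  # multi-word name, null growth
print(is_valid("Apple Inc,AAPL,2022,q1,124000,"))      # empty growth field
print(is_valid("Apple Inc,AAPL,2022,q5,124000,null"))  # invalid quarter
```

Checking the pattern this way separates two failure modes: a pattern that admits malformed rows, and a model that picks wrong values inside a well-formed row. Only the first can be fixed by editing the regex.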
Question: I think the approach here in general is kind of clunky. The class-based approach I have above works well for extracting a large number of complicated, possibly optional fields with different units. Directly translating it to the CSV approach to extract headline numbers like revenue, growth rate, etc. doesn't really showcase how to handle this, largely because we would usually only get one row from an earnings transcript. Alternatives:
Could also try just extracting all available metrics in a long format using columns
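As a rough illustration of the long-format idea, here is a sketch of a multi-row pattern plus a parser. The column set (`metric,value,unit`) and the unit vocabulary are hypothetical, since the intended columns are not shown above. The sample rows use figures stated in the Apple transcript excerpt (nearly $124 billion in revenue, up 11% from last year, more than 1.8 billion active devices) and are hand-written, not model output.

```python
import csv
import io
import re

# Hypothetical long-format schema: one metric per row.
HEADER = "metric,value,unit"
ROW = r"[a-z_]{1,40},-?[0-9]{1,12}(\.[0-9]{1,4})?,(usd_millions|percent|count)"
# Header, then between 1 and 20 data rows.
long_pattern = HEADER + r"\n" + ROW + r"(\n" + ROW + r"){0,19}"

# Hand-written sample based on figures quoted in the transcript excerpt.
sample = """metric,value,unit
revenue,124000,usd_millions
revenue_growth_yoy,11,percent
active_devices,1800000000,count"""

assert re.fullmatch(long_pattern, sample) is not None

# Turn the long format into a lookup table keyed by metric name.
metrics = {
    row["metric"]: (float(row["value"]), row["unit"])
    for row in csv.DictReader(io.StringIO(sample))
}
print(metrics["revenue"])
```

A long format lets one transcript yield many rows, so the fixed unit vocabulary does the work that per-field types do in the class-based approach.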
Still needs a little polish, but the idea is that the model acts as an analyst. The analyst will extract all relevant information from an earnings call, including