LLM-Powered Parsing and Analysis of Semi-Structured & Unstructured Documents | by Umair Ali Khan

The code to implement this entire workflow is available on GitHub.

Let’s go through these steps one by one.

1. Text Extraction

The documents used in this example are the AI advisory feedback forms that we provide to companies after an advisory session. These companies include startups and established companies that want to integrate AI into their business or advance their existing AI solutions. The feedback document is semi-structured, and its format is shown below. The names and other information in this document have been changed for privacy reasons.

Example document of our AI advisory feedback (image by the author)

The AI experts provide their analysis for each field. However, with hundreds of such documents, extracting insights from the data becomes a challenging task. To gain insights into this data, it needs to be converted into a concise, structured format that can be analyzed using existing statistical or machine learning methods. Performing this conversion manually is not only labor-intensive and time-consuming but also prone to errors.

In addition to the readily visible information in the document, such as the company name, consultation date, and expert(s) involved, I aimed to extract specific details for further analysis. These included the major industry or domain each company operates in, a concise description of the current solutions offered, the AI topic(s), company type, AI maturity level, aim, and a brief summary of the recommendations. This extraction needed to be performed on the detailed text associated with each field. Additionally, the feedback template has evolved over time, which has resulted in documents with inconsistent formats.

Before we discuss text extraction from the documents, note that the following libraries need to be installed to run the complete code used in this article.

# Install the required libraries
!pip install tqdm  # For displaying a progress bar for document processing
!pip install requests  # For making HTTP requests
!pip install pandas  # For data manipulation and analysis
!pip install python-docx  # For processing Word documents
!pip install plotly  # For creating interactive visualizations
!pip install numpy  # For numerical computations
!pip install scikit-learn  # For machine learning algorithms and tools
!pip install matplotlib  # For creating static, animated, and interactive plots
!pip install openai  # For interacting with the OpenAI API
!pip install seaborn  # For statistical data visualization

The following code extracts text from a document (.docx format) using the python-docx library. It is important to extract text from every part of the document, including paragraphs, tables, headers, and footers.

import docx  # Provided by the python-docx package


def extract_text_from_docx(docx_path: str) -> str:
    """
    Extract text content from a Word (.docx) file.
    """
    doc = docx.Document(docx_path)
    full_text = []

    # Extract text from paragraphs
    for para in doc.paragraphs:
        full_text.append(para.text)

    # Extract text from tables
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                full_text.append(cell.text)

    # Extract text from headers and footers
    for section in doc.sections:
        header = section.header
        footer = section.footer
        for para in header.paragraphs:
            full_text.append(para.text)
        for para in footer.paragraphs:
            full_text.append(para.text)

    return '\n'.join(full_text).strip()
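
One caveat with the table loop: python-docx can report the text of a merged table cell once for each grid position it spans, so the same string may appear several times in a row. Below is a minimal sketch of a post-processing step, assuming that collapsing consecutive duplicates is acceptable for your documents. The helper name drop_consecutive_duplicates and the sample values are my own and are not part of the original code or of python-docx.

```python
def drop_consecutive_duplicates(lines):
    """Collapse runs of identical, consecutive lines into a single line."""
    deduped = []
    for line in lines:
        # Keep the line unless it repeats the one just before it
        if not deduped or line != deduped[-1]:
            deduped.append(line)
    return deduped


# Made-up sample mimicking a merged cell that was reported three times
sample = ["Company name", "Acme Oy", "Acme Oy", "Acme Oy", "Country", "Finland"]
print(drop_consecutive_duplicates(sample))
# ['Company name', 'Acme Oy', 'Country', 'Finland']
```

If used, it could be applied to full_text just before the final join. Note that it also collapses consecutive empty lines, which is usually harmless for this purpose.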

2. Set LLM Prompts

We need to instruct the LLM on how to extract the required information from the documents. We also need to explain the meaning of each field of interest so that it can extract the semantically matching information from the documents. This is particularly important because a required field comprising one or more words can be interpreted in several ways. For instance, we need to explain what we mean by “aim”, which refers to the company’s plans for AI integration or how it wants to advance its current solution. Therefore, crafting the right prompt for this purpose is very important.

I set the instructions in the system prompt to guide the LLM’s behavior. The input (user) prompt contains the data to be processed by the LLM. The system prompt is shown below.

# System prompt with extraction instructions
system_message = """
You are an expert in analyzing and extracting information from the feedback forms written by AI experts after AI advisory sessions with companies.  
Please carefully read the provided feedback form and extract the following 15 key information. Make sure that the key names are exactly the same as 
given below. Do not create any additional key names other than these 15. 
Key names and their descriptions:
1. Company name: name of the company seeking AI advisory
2. Country: Company's country [output 'N/A' if not available]
3. Consultation Date [output 'N/A' if not available]
4. Experts: persons providing AI consultancy [output 'N/A' if not available]
5. Consultation type: Regular or pop-up [output 'N/A' if not available]
6. Area/domain: Field of the company’s operations. Some examples: healthcare, industrial manufacturing, business development, education, etc. 
7. Current Solution: description of the current solution offered by the company. The company could be currently in ideation phase. Some examples of ‘Current Solution’ field include i) Recommendation system for cars, houses, and other items, ii) Professional guidance system, iii) AI-based matchmaking service for educational peer-to-peer support. [Be very specific and concise]
8. AI field: AI's sub-field in use or required. Some examples: image processing, large language models, computer vision, natural language processing, predictive modeling, speech recognition, etc. [This field is not explicitly available in the document. Extract it by the semantic understanding of the overall document.]
9. AI maturity level: low, moderate, high [output 'N/A' if not available].
10. Company type: ‘startup’ or ‘established company’
11. Aim: The AI tasks the company is looking for. Some examples: i) Enhance AI-driven systems for diagnosing heart diseases, ii) to automate identification of key variable combinations in customer surveys, iii) to develop AI-based system for automatic quotation generation from engineering drawings, iv) to building and managing enterprise-grade LLM applications. [Be very specific and concise]
12. Identified target market: The targeted customers. Some examples: healthcare professionals, construction firms, hospitality, educational institutions, etc. 
13. Data Requirement Assessment: The type of data required for the intended AI integration? Some examples: Transcripts of therapy sessions, patient data, textual data, image data, videos, etc. 
14. FAIR Services Sought: The services expected from FAIR. For instance, technical advice, proof of concept. 
15. Recommendations: A brief summary of the recommendations in the form of key words or phrase list. Some examples: i) Focus on data balance, monitor for bias, prioritize transparency, ii) Explore machine learning algorithms, implement decision trees, gradient boosting. [Be very specific and concise] 
Guidelines:
- Very important: do not make up anything. If the information of a required field is not available, output ‘N/A’ for it.
- Output in JSON format. The JSON should contain the above 15 keys.
"""

It is important to emphasize what the LLM should focus on: for instance, the number of key elements to extract, the requirement to use exactly the field names specified, and the instruction not to invent information that is unavailable. An explanation of each field and, where possible, some examples of the required information are also important. It is worth mentioning that an optimal prompt may not be crafted on the first attempt, so some iteration is usually needed.
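
Even with an explicit prompt, it can be worth checking the model’s output against the expected field names. The sketch below assumes the reply has already been parsed into a dictionary. EXPECTED_KEYS simply repeats the 15 names from the system prompt above, and check_extracted_keys is a hypothetical helper, not part of the original code.

```python
EXPECTED_KEYS = [
    "Company name", "Country", "Consultation Date", "Experts",
    "Consultation type", "Area/domain", "Current Solution", "AI field",
    "AI maturity level", "Company type", "Aim", "Identified target market",
    "Data Requirement Assessment", "FAIR Services Sought", "Recommendations",
]


def check_extracted_keys(record: dict):
    """Return (missing, unexpected) key names for one parsed LLM response."""
    missing = [key for key in EXPECTED_KEYS if key not in record]
    unexpected = [key for key in record if key not in EXPECTED_KEYS]
    return missing, unexpected


# Made-up record for illustration: the "model" misspelled one key
record = {key: "N/A" for key in EXPECTED_KEYS if key != "Aim"}
record["Aims"] = "N/A"

missing, unexpected = check_extracted_keys(record)
print(missing, unexpected)
# ['Aim'] ['Aims']
```

A non-empty result is a hint that the prompt needs tightening or that the document should be re-processed.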

3. Process Documents

Processing the documents means sending their text to an LLM for parsing. I used OpenAI’s gpt-4o-mini, an affordable small model suited to fast, lightweight tasks. GPT-4o mini is cheaper and more capable than GPT-3.5 Turbo. However, lightweight versions of open LLMs such as Llama, Mistral, or Phi-3 can also be tested for this purpose.

The following code walks through a directory and its sub-directories to find the AI advisory documents (.docx format), extracts the text from each document, and sends that text to gpt-4o-mini via an API call.

import json
import os

import requests
from tqdm import tqdm


def process_files(directory_path: str, api_key: str, system_message: str):
    """
    Process all .docx files in the given directory and its subdirectories,
    send their content to the LLM, and store the JSON responses.
    """
    json_outputs = []
    docx_files = []

    # Walk through the directory and its subdirectories to find .docx files
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            if file.endswith(".docx"):
                docx_files.append(os.path.join(root, file))

    if not docx_files:
        print("No .docx files found in the specified directory or sub-directories.")
        return json_outputs

    # Iterate through all .docx files with a progress bar
    for file_path in tqdm(docx_files, desc="Processing files...", unit="file"):
        filename = os.path.basename(file_path)
        extracted_text = extract_text_from_docx(file_path)

        # Prepare the user message with the extracted text
        input_message = extracted_text

        # Prepare the API request headers and payload
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        }
        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": system_message},
                {"role": "user", "content": input_message}
            ],
            "max_tokens": 2000,
            "temperature": 0.2
        }

        # Send the request to the LLM API and fail early on HTTP errors
        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
        response.raise_for_status()

        # Extract the message content from the API response
        json_response = response.json()
        content = json_response['choices'][0]['message']['content'].strip()

        # Remove a Markdown code fence if the model wrapped its JSON in one.
        # str.strip("```json\n") would strip a set of characters rather than a
        # prefix, so removeprefix/removesuffix (Python 3.9+) are used instead.
        content = content.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
        parsed_json = json.loads(content)

        # Normalize the parsed JSON output
        normalized_json = normalize_json_output(parsed_json)

        # Append the normalized JSON output to the list
        json_outputs.append(normalized_json)

    return json_outputs
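
The directory walk can also be written with pathlib. The sketch below is an alternative under my own assumptions, not the author’s code: find_docx_files is a hypothetical helper, and it additionally skips the "~$..." lock files that Word creates next to a document while it is open, since those are not valid .docx archives.

```python
import tempfile
from pathlib import Path


def find_docx_files(directory_path: str):
    """Return sorted paths of .docx files under directory_path (recursive)."""
    return sorted(
        str(path)
        for path in Path(directory_path).rglob("*.docx")
        # Skip the "~$..." lock files Word creates while a document is open
        if path.is_file() and not path.name.startswith("~$")
    )


# Small demonstration on a throwaway directory tree with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2024").mkdir()
    (root / "feedback_a.docx").touch()
    (root / "2024" / "feedback_b.docx").touch()
    (root / "2024" / "~$feedback_b.docx").touch()  # Word lock file
    (root / "notes.txt").touch()
    found = [Path(p).name for p in find_docx_files(tmp)]

print(found)
```

Only the two real feedback documents are returned; the lock file and the text file are ignored.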

In the request payload, I set max_tokens to 2000. This parameter caps the number of tokens the model may generate in its reply (it does not count the input), which leaves ample room for the 15-field JSON output. I set a relatively low temperature (0.2) because this task calls for faithful extraction rather than creativity. A high temperature makes the output more random and may lead to hallucinations, where the LLM invents new information.

The LLM’s response is received in a JSON object and is further parsed and normalized as discussed in the next section.
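
The model is not guaranteed to return bare JSON: it may wrap the object in a Markdown code fence or add a short sentence before it. The following is a minimal sketch of a more tolerant parser, assuming the reply contains exactly one JSON object. The helper name parse_llm_json and the sample reply are my own illustrations.

```python
import json


def parse_llm_json(reply: str) -> dict:
    """Parse a JSON object from an LLM reply, tolerating a Markdown code fence."""
    text = reply.strip()
    # Keep only the span from the first "{" to the last "}"; this drops
    # wrappers such as a ```json fence or a short sentence before the JSON.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the LLM reply")
    return json.loads(text[start:end + 1])


# Made-up reply wrapped in a Markdown fence
reply = '```json\n{"Company name": "Example Ltd", "Country": "N/A"}\n```'
print(parse_llm_json(reply))
# {'Company name': 'Example Ltd', 'Country': 'N/A'}
```

If the text between the braces is still not valid JSON, json.loads raises json.JSONDecodeError, which the caller can catch to log or retry the affected document.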

4. Parse LLM Output

As shown in the above code, the response from the API is received in a JSON object (parsed_json) which is further normalized using the following function.

def normalize_json_output(json_output):
    """
    Normalize the keys and convert list values to comma-separated strings.
    """
    normalized_output = {}
    for key, value in json_output.items():
        # "Company name" -> "company_name"
        normalized_key = key.lower().replace(" ", "_")
        if isinstance(value, list):
            # str() guards against non-string list items such as numbers
            normalized_output[normalized_key] = ', '.join(map(str, value))
        else:
            normalized_output[normalized_key] = value
    return normalized_output

This function standardizes the keys of the JSON object by converting them to lowercase and replacing spaces with underscores. Additionally, it converts any list values into comma-separated strings to make the data easier to work with and analyze.
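
Because only spaces are replaced, a key such as "Area/domain" becomes "area/domain" and keeps its slash. If column names without punctuation are preferred, one possible variant (my assumption, not the author’s code) collapses every run of non-alphanumeric characters into a single underscore. The helper name normalize_key is hypothetical.

```python
import re


def normalize_key(key: str) -> str:
    """Lowercase a key and replace runs of non-alphanumeric characters with '_'."""
    return re.sub(r"[^a-z0-9]+", "_", key.lower()).strip("_")


for key in ["Company name", "Area/domain", "FAIR Services Sought"]:
    print(key, "->", normalize_key(key))
# Company name -> company_name
# Area/domain -> area_domain
# FAIR Services Sought -> fair_services_sought
```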

The list of normalized JSON objects (json_outputs), containing the key information extracted from all the documents, is finally saved to an Excel file.

import pandas as pd


def save_json_to_excel(json_outputs, output_file_path: str):
    """
    Save the list of JSON objects to an Excel file with a SNO. column.
    """
    # Convert the list of JSON objects to a DataFrame
    df = pd.DataFrame(json_outputs)

    # Add a Serial Number (SNO.) column
    df.insert(0, 'SNO.', range(1, len(df) + 1))

    # Save the DataFrame to an Excel file
    # (writing .xlsx requires an Excel engine such as openpyxl to be installed)
    df.to_excel(output_file_path, index=False)

A snapshot of the Excel file is shown below. LLM-powered parsing produced precise information pertaining to the required fields. The “N/A” in the snapshot represents the data unavailable in the documents (old feedback templates missing this information).

LLM-Powered Parsing and Analysis of Semi-Structured & Unstructured Documents | by Umair Ali Khan | Aug, 2024